Josh Wills created CRUNCH-247:
---------------------------------
Summary: Planner should take advantage of to-be-materialized
outputs during planning
Key: CRUNCH-247
URL: https://issues.apache.org/jira/browse/CRUNCH-247
Project: Crunch
Issue Type: Bug
Components: Core
Reporter: Josh Wills
Assignee: Josh Wills
Fix For: 0.8.0
In the following pipeline, the Crunch planner runs the "op1" step twice, once
in each of two independent map-only jobs, instead of running a single job that
executes the op1 step followed by a second job that consumes that output and
runs the op2 step:
    PCollection<String> in = p.read(From.textFile(inputPath));
    PTable<String, String> op = in.parallelDo("op1",
        new DoFn<String, Pair<String, String>>() {
          @Override
          public void process(String input, Emitter<Pair<String, String>> emitter) {
            if (input.length() > 5) {
              emitter.emit(Pair.of(input.substring(0, 3), input));
            }
          }
        }, tableOf(strings(), strings()));
    SourceTarget src = (SourceTarget)
        ((MaterializableIterable<Pair<String, String>>) op.materialize()).getSource();
    op = op.parallelDo("op2", IdentityFn.<Pair<String, String>>getInstance(),
        tableOf(strings(), strings()),
        ParallelDoOptions.builder().sourceTargets(src).build());
    PCollection<String> output = op.values();
    output.write(To.textFile(out));
The planner should be able to take advantage of the materialized output from
op1 so that it does not re-run that step in the op2 job.
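
The difference between the current and the desired plan can be illustrated with a toy model. This is only a sketch under stated assumptions: ToyPlanner, plan, parentOf and reuseMaterialized are hypothetical names invented for illustration, each step is assumed to have at most one upstream step, and none of this is the actual Crunch planner code. Each job is represented as the ordered list of steps it executes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of job planning; hypothetical, not Crunch's planner. */
final class ToyPlanner {

    /**
     * Returns one job per target; each job is the ordered list of steps it runs.
     * parentOf maps a step to its single upstream step (no entry means the step
     * reads the original input). When reuseMaterialized is true, the upstream
     * walk stops at a materialized parent, so the job reads that parent's
     * output instead of recomputing it.
     */
    static List<List<String>> plan(Map<String, String> parentOf,
                                   List<String> targets,
                                   Set<String> materialized,
                                   boolean reuseMaterialized) {
        List<List<String>> jobs = new ArrayList<>();
        for (String target : targets) {
            List<String> steps = new ArrayList<>();
            String step = target;
            while (step != null) {
                steps.add(step);
                String parent = parentOf.get(step);
                if (reuseMaterialized && parent != null
                        && materialized.contains(parent)) {
                    break; // consume the materialized output of the parent
                }
                step = parent;
            }
            Collections.reverse(steps); // upstream steps run first
            jobs.add(steps);
        }
        return jobs;
    }

    public static void main(String[] args) {
        Map<String, String> parentOf = new HashMap<>();
        parentOf.put("op2", "op1"); // op2 consumes op1
        List<String> targets = Arrays.asList("op1", "op2");
        Set<String> materialized = new HashSet<>(Arrays.asList("op1"));

        // Behavior described in this issue: op1 appears in both jobs.
        System.out.println(plan(parentOf, targets, materialized, false));
        // prints [[op1], [op1, op2]]

        // Desired behavior: the op2 job starts from op1's materialized output.
        System.out.println(plan(parentOf, targets, materialized, true));
        // prints [[op1], [op2]]
    }
}
```

In the second plan the op2 job depends on the op1 job having finished, so the two jobs run sequentially rather than independently; the toy model does not track that ordering.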