Chao Shi created CRUNCH-284:
-------------------------------
Summary: Optimize for minimal disk i/o rather than the number of
stages?
Key: CRUNCH-284
URL: https://issues.apache.org/jira/browse/CRUNCH-284
Project: Crunch
Issue Type: Bug
Reporter: Chao Shi
I have a pipeline as follows:
PCollection in = pipeline.read(...)
PCollection part1 = f1(in)
PCollection part2 = f2(in)
pipeline.write(part1.groupByKey...)
pipeline.write(part2.groupByKey...)
where f1 extracts a small portion of "in" and f2 returns the rest. Crunch
optimizes the pipeline into two independent MR jobs, both of which read the
full input.
I think the ideal plan would be a map-only job that reads the input once and
splits it into two outputs, followed by two MR jobs that each read one of
those outputs.
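To make the difference concrete, here is a minimal plain-Java sketch (not
Crunch API code) of the two strategies. The class SplitSketch, its methods, and
the isPart1 predicate are hypothetical names used only for illustration. The
sketch counts how many input records each strategy scans and nothing else; in
particular, it ignores the cost of writing and re-reading the intermediate
outputs that a real map-only job would add.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration only: models input scans, not real MR jobs.
class SplitSketch {
    /** The two outputs plus the number of input records scanned to build them. */
    static final class Result {
        final List<String> part1 = new ArrayList<>();
        final List<String> part2 = new ArrayList<>();
        long inputRecordsRead = 0;
    }

    /** Models the current plan: two independent jobs, each scanning the full input. */
    static Result twoFullScans(List<String> in, Predicate<String> isPart1) {
        Result r = new Result();
        for (String rec : in) {          // first job: keeps only f1's records
            r.inputRecordsRead++;
            if (isPart1.test(rec)) r.part1.add(rec);
        }
        for (String rec : in) {          // second job: keeps only f2's records
            r.inputRecordsRead++;
            if (!isPart1.test(rec)) r.part2.add(rec);
        }
        return r;
    }

    /** Models the proposed plan: one scan that routes each record to one output. */
    static Result singleScanSplit(List<String> in, Predicate<String> isPart1) {
        Result r = new Result();
        for (String rec : in) {
            r.inputRecordsRead++;
            (isPart1.test(rec) ? r.part1 : r.part2).add(rec);
        }
        return r;
    }

    public static void main(String[] args) {
        List<String> in = Arrays.asList("a1", "b2", "a3", "c4");
        Predicate<String> isPart1 = s -> s.startsWith("a");
        System.out.println("two scans read:  " + twoFullScans(in, isPart1).inputRecordsRead);
        System.out.println("single scan read: " + singleScanSplit(in, isPart1).inputRecordsRead);
    }
}
```

Both strategies produce the same two outputs; the single-scan version touches
each input record once instead of twice. Whether that wins overall depends on
how large the intermediate outputs are relative to the input.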
The problem is that Crunch minimizes the number of MR stages, which is optimal
in most cases but not in this one.
What do you think of this, folks?