Chao Shi created CRUNCH-284:
-------------------------------

             Summary: Optimize for minimal disk i/o rather than the number of 
stages?
                 Key: CRUNCH-284
                 URL: https://issues.apache.org/jira/browse/CRUNCH-284
             Project: Crunch
          Issue Type: Bug
            Reporter: Chao Shi


I have a pipeline as follows:

PCollection in = pipeline.read(...)
PCollection part1 = f1(in)
PCollection part2 = f2(in)
pipeline.write(part1.groupByKey...)
pipeline.write(part2.groupByKey...)

where f1 extracts a small portion of "in" and f2 returns the rest. Crunch 
optimizes the pipeline into two independent MR jobs, both of which read the 
full input.

I think the ideal plan would be a map-only job that reads the input once and 
splits it into two outputs, followed by two MR jobs that each read one of 
those outputs.
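
As a rough illustration of that first stage, here is a sketch in plain Java. 
It is not Crunch API: the class, method, and field names are made up for this 
example, and the predicate merely stands in for f1. It only shows the input 
being scanned once while each record is routed to exactly one of two outputs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Plain-Java illustration only; none of these names are Crunch API.
final class SplitSketch {

    /** Holds the two outputs of the single-pass split plus a read counter. */
    static final class SplitResult {
        final List<String> part1 = new ArrayList<>();
        final List<String> part2 = new ArrayList<>();
        int inputRecordsRead = 0;
    }

    /** One pass over the input: a record goes to part1 if it matches, else to part2. */
    static SplitResult split(List<String> input, Predicate<String> belongsToPart1) {
        SplitResult result = new SplitResult();
        for (String record : input) {
            result.inputRecordsRead++;
            if (belongsToPart1.test(record)) {
                result.part1.add(record);
            } else {
                result.part2.add(record);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical records; "rare" marks the small portion f1 would extract.
        List<String> in = Arrays.asList("a:1", "b:2", "rare:3", "c:4");
        SplitResult r = split(in, rec -> rec.startsWith("rare"));
        System.out.println("part1 = " + r.part1);
        System.out.println("part2 = " + r.part2);
        System.out.println("input records read = " + r.inputRecordsRead);
    }
}
```

Here the input is scanned once instead of twice. The two outputs would still 
have to be written out and read back by the downstream jobs, so the actual 
saving depends on how large they are relative to the input.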

The problem is that Crunch minimizes the number of MR stages, which is optimal 
for most cases but not for this one.

What do you think of this, folks?



--
This message was sent by Atlassian JIRA
(v6.1#6144)
