Ron created CRUNCH-305:
--------------------------

             Summary: Multiuse between parallelDos which share the same input
                 Key: CRUNCH-305
                 URL: https://issues.apache.org/jira/browse/CRUNCH-305
             Project: Crunch
          Issue Type: Wish
            Reporter: Ron


  Since I started using Crunch, many of my jobs have followed this pattern: I
have five different parallelDo functions, all of which work on the same input.
Currently, I read the input first with "pipeline.readTextFile()" and then apply
each parallelDo function to the resulting PCollection. However, I find that
Crunch breaks my plan into five separate MapReduce jobs, each of which reads the
input itself, so the input is read five times. Based on the FlumeJava paper,
from which Crunch originates, I suggest an optimization in which the input is
read only once and the five parallelDo functions are then applied to it. Since
the input is large and the I/O cost is high, this optimization could help a lot
for Crunch jobs with patterns similar to mine.
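
To illustrate the idea, here is a minimal sketch in plain Java. It does not use
the Crunch API and is not how the Crunch planner works; the class FusedReadSketch,
the readInput() stand-in, and the scans counter are all hypothetical names made
up for this example. It only contrasts one scan per function with a single
shared scan.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class FusedReadSketch {
    // Hypothetical counter: how many times the input has been scanned.
    static int scans = 0;

    // Hypothetical stand-in for an expensive read of a large input.
    static List<String> readInput() {
        scans++;
        return List.of("a b", "c d e", "f");
    }

    // Current behaviour as described above: each function triggers its own scan.
    static List<List<String>> runSeparately(List<Function<String, String>> fns) {
        List<List<String>> outputs = new ArrayList<>();
        for (Function<String, String> fn : fns) {
            List<String> out = new ArrayList<>();
            for (String line : readInput()) {
                out.add(fn.apply(line));
            }
            outputs.add(out);
        }
        return outputs;
    }

    // Requested behaviour: scan once and apply every function to each record.
    static List<List<String>> runFused(List<Function<String, String>> fns) {
        List<List<String>> outputs = new ArrayList<>();
        for (int i = 0; i < fns.size(); i++) {
            outputs.add(new ArrayList<>());
        }
        for (String line : readInput()) {
            for (int i = 0; i < fns.size(); i++) {
                outputs.get(i).add(fns.get(i).apply(line));
            }
        }
        return outputs;
    }
}
```

With five functions, runSeparately scans the input five times, while runFused
scans it once and produces the same per-function outputs.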



--
This message was sent by Atlassian JIRA
(v6.1#6144)
