Ron created CRUNCH-305:
--------------------------
Summary: Multiuse between parallelDos which share the same input
Key: CRUNCH-305
URL: https://issues.apache.org/jira/browse/CRUNCH-305
Project: Crunch
Issue Type: Wish
Reporter: Ron
Since I started using Crunch, many of my jobs have followed the same pattern: I
have five different parallelDo functions, and all of them work on the same input.
Currently, I read the input first with "pipeline.readTextFile()" and then
apply each parallelDo function to the resulting PCollection. However, I find that
Crunch breaks my plan into five separate MapReduce jobs, each of which reads the
input and runs its own MapReduce, so the input is read five times. Referring to
the FlumeJava paper, on which Crunch is based, I suggest an optimization so that
the input is read only once and the five different parallelDo functions are then
applied to it. Since the input is large and the cost of I/O is high, this
optimization could help a lot for Crunch jobs that follow a pattern similar to mine.
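
To make the difference concrete, here is a minimal sketch in plain Java. It does not use Crunch or Hadoop at all: the class SharedInputSketch, the readInput() method, the scan counter, and the five sample functions are all hypothetical stand-ins, with readInput() playing the role of "pipeline.readTextFile()" and each Function playing the role of one parallelDo. It only illustrates the I/O pattern I observe (one scan per function) against the one I am asking for (one scan shared by all functions), not how the Crunch planner works internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class SharedInputSketch {
    // Counts how many times the (hypothetical) input source is scanned.
    public static int scans = 0;

    // Stand-in for reading the large input; every call is one full scan.
    public static List<String> readInput() {
        scans++;
        return Arrays.asList("alpha beta", " gamma ", "delta");
    }

    // Five illustrative per-record functions, one per "parallelDo".
    public static List<Function<String, String>> fiveFunctions() {
        List<Function<String, String>> fns = new ArrayList<>();
        fns.add(line -> line.toUpperCase());
        fns.add(line -> line.trim());
        fns.add(line -> String.valueOf(line.length()));
        fns.add(line -> new StringBuilder(line).reverse().toString());
        fns.add(line -> line.replace(' ', '_'));
        return fns;
    }

    // Observed behaviour: each function triggers its own scan of the input.
    public static List<List<String>> separateJobs(List<Function<String, String>> fns) {
        List<List<String>> outputs = new ArrayList<>();
        for (Function<String, String> fn : fns) {
            List<String> out = new ArrayList<>();
            for (String line : readInput()) {
                out.add(fn.apply(line));
            }
            outputs.add(out);
        }
        return outputs;
    }

    // Requested behaviour: scan once, apply every function to each record.
    public static List<List<String>> fusedJob(List<Function<String, String>> fns) {
        List<List<String>> outputs = new ArrayList<>();
        for (int i = 0; i < fns.size(); i++) {
            outputs.add(new ArrayList<>());
        }
        for (String line : readInput()) {
            for (int i = 0; i < fns.size(); i++) {
                outputs.get(i).add(fns.get(i).apply(line));
            }
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<Function<String, String>> fns = fiveFunctions();

        scans = 0;
        List<List<String>> separate = separateJobs(fns);
        System.out.println("separate jobs, input scans: " + scans);

        scans = 0;
        List<List<String>> fused = fusedJob(fns);
        System.out.println("fused job, input scans: " + scans);

        System.out.println("same outputs: " + separate.equals(fused));
    }
}
```

Both variants produce the same five outputs; only the number of passes over the input differs, which is the cost that matters when the input is large.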