[
https://issues.apache.org/jira/browse/CRUNCH-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13832407#comment-13832407
]
Ron commented on CRUNCH-305:
----------------------------
I carefully read the Crunch future-work page at
http://crunch.apache.org/future-work.html and found that this is already
listed as future work for Crunch: combining related groupByKey operations
into a single MR job, as FlumeJava does.
> Multiuse between parallelDos which share the same input
> -------------------------------------------------------
>
> Key: CRUNCH-305
> URL: https://issues.apache.org/jira/browse/CRUNCH-305
> Project: Crunch
> Issue Type: Wish
> Reporter: Ron
>
> When I started using Crunch, many of my jobs followed this pattern: I have
> five different parallelDo functions, all of which work on the same input.
> Currently, I read the input first using "pipeline.readTextFile()" and
> then apply each parallelDo function to the resulting PCollection. However,
> I find that Crunch breaks my plan into five different MR jobs, each of
> which reads the input and runs its own MR, so the input is read five times.
> Based on the FlumeJava paper, the origin of Crunch, I suggest an
> optimization in which the input is read only once and the five different
> parallelDo functions are then applied to it. Since the input is large and
> the I/O cost is high, this optimization may help a lot for Crunch jobs
> that follow patterns similar to mine.
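
To make the requested optimization concrete, here is a minimal plain-Java
sketch of the difference between scanning the input once per function and
scanning it once for all functions. It does not use the Crunch API and says
nothing about how the Crunch planner actually works; the class
SharedScanSketch, its methods, and the sample functions are all hypothetical
names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration only: this is NOT Crunch code.
public class SharedScanSketch {

    /** Outputs of every function, plus how many record reads were needed. */
    public static final class Result {
        public final List<List<String>> outputs;
        public final long recordReads;

        Result(List<List<String>> outputs, long recordReads) {
            this.outputs = outputs;
            this.recordReads = recordReads;
        }
    }

    /** One full pass over the input per function (the pattern described in the issue). */
    public static Result separateScans(List<String> input,
                                       List<Function<String, String>> fns) {
        List<List<String>> outputs = new ArrayList<>();
        long reads = 0;
        for (Function<String, String> fn : fns) {
            List<String> out = new ArrayList<>();
            for (String line : input) {
                reads++; // every function re-reads every record
                out.add(fn.apply(line));
            }
            outputs.add(out);
        }
        return new Result(outputs, reads);
    }

    /** A single pass over the input; each record is handed to all functions. */
    public static Result sharedScan(List<String> input,
                                    List<Function<String, String>> fns) {
        List<List<String>> outputs = new ArrayList<>();
        for (int i = 0; i < fns.size(); i++) {
            outputs.add(new ArrayList<>());
        }
        long reads = 0;
        for (String line : input) {
            reads++; // each record is read exactly once
            for (int i = 0; i < fns.size(); i++) {
                outputs.get(i).add(fns.get(i).apply(line));
            }
        }
        return new Result(outputs, reads);
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a b", "c d e", "f");

        // Five stand-ins for the five parallelDo functions.
        List<Function<String, String>> fns = new ArrayList<>();
        fns.add(String::toUpperCase);
        fns.add(String::trim);
        fns.add(s -> String.valueOf(s.length()));
        fns.add(s -> s.replace(' ', '_'));
        fns.add(s -> s.split(" ")[0]);

        Result separate = separateScans(input, fns);
        Result shared = sharedScan(input, fns);

        System.out.println("separate scans read " + separate.recordReads + " records"); // 15
        System.out.println("shared scan read " + shared.recordReads + " records");      // 3
        System.out.println("same outputs: " + separate.outputs.equals(shared.outputs)); // true
    }
}
```

With three input records and five functions, the separate-scan version
performs 15 record reads and the shared-scan version performs 3, while both
produce the same outputs. On a large input stored on disk, that factor of
five is the I/O cost the issue asks to avoid.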
--
This message was sent by Atlassian JIRA
(v6.1#6144)