[ 
https://issues.apache.org/jira/browse/CRUNCH-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13832407#comment-13832407
 ] 

Ron commented on CRUNCH-305:
----------------------------

I read through the Crunch future work page at 
http://crunch.apache.org/future-work.html and found that this is already 
listed there: combining related groupByKey operations into a single MR job, 
as FlumeJava does. 

> Multiuse between parallelDos sharing the same input
> ---------------------------------------------------
>
>                 Key: CRUNCH-305
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-305
>             Project: Crunch
>          Issue Type: Wish
>            Reporter: Ron
>
>   When I started using Crunch, many of my jobs followed this pattern: I have 
> five different parallelDo functions, and all of them work on the same input. 
> Currently, I read the input first with "pipeline.readTextFile()" and then 
> apply each parallelDo function to the resulting PCollection. However, I find 
> that Crunch breaks my plan into five different MR jobs, each of which reads 
> the input and runs its own MR, so the input is read five times. Based on the 
> FlumeJava paper, the origin of Crunch, I suggest an optimization in which 
> the input is read only once and the five parallelDo functions are then 
> applied to it. Since the input is large and the I/O cost is high, this 
> optimization could help a lot for Crunch jobs with patterns similar to mine.
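
To make the request above concrete, here is a minimal plain-Java sketch of the 
difference between scanning the input once per function and scanning it once 
for all functions. It does not use Crunch or Hadoop, and it is not how the 
Crunch planner is implemented; the class name SharedScanSketch, the methods 
separatePasses and fusedPass, and the scans counter are all hypothetical names 
invented for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Illustration only: plain Java, no Crunch or Hadoop classes involved.
final class SharedScanSketch {

    // Counts full passes over the input, standing in for the I/O cost.
    static int scans = 0;

    // Plan described in the issue: one full pass over the input per function.
    static <T, R> List<List<R>> separatePasses(List<T> input, List<Function<T, R>> fns) {
        List<List<R>> outputs = new ArrayList<>();
        for (Function<T, R> fn : fns) {
            scans++; // every function re-reads the whole input
            List<R> out = new ArrayList<>();
            for (T record : input) {
                out.add(fn.apply(record));
            }
            outputs.add(out);
        }
        return outputs;
    }

    // Requested plan: a single pass in which each record goes to all functions.
    static <T, R> List<List<R>> fusedPass(List<T> input, List<Function<T, R>> fns) {
        List<List<R>> outputs = new ArrayList<>();
        for (int i = 0; i < fns.size(); i++) {
            outputs.add(new ArrayList<>());
        }
        scans++; // the input is read exactly once
        for (T record : input) {
            for (int i = 0; i < fns.size(); i++) {
                outputs.get(i).add(fns.get(i).apply(record));
            }
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("first line");
        lines.add("second line");

        // Five stand-ins for the five parallelDo functions in the issue.
        List<Function<String, String>> fns = new ArrayList<>();
        fns.add(String::toUpperCase);
        fns.add(String::toLowerCase);
        fns.add(String::trim);
        fns.add(s -> s + "!");
        fns.add(s -> new StringBuilder(s).reverse().toString());

        scans = 0;
        List<List<String>> separate = separatePasses(lines, fns);
        System.out.println("separate passes, scans = " + scans);

        scans = 0;
        List<List<String>> fused = fusedPass(lines, fns);
        System.out.println("fused pass, scans = " + scans);

        System.out.println("same results: " + separate.equals(fused));
    }
}
```

Both methods produce the same per-function outputs; only the number of passes 
over the input differs, which is the cost the issue is concerned with.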



--
This message was sent by Atlassian JIRA
(v6.1#6144)
