[jira] [Commented] (CRUNCH-284) Optimize for minimal disk i/o rather than the number of stages?

Gabriel Reid (JIRA) Mon, 21 Oct 2013 04:17:26 -0700

    [ 
https://issues.apache.org/jira/browse/CRUNCH-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13800552#comment-13800552
 ]


Gabriel Reid commented on CRUNCH-284:
-------------------------------------

Yes, definitely would be useful -- this kind of thing has come up before too. I 
think that the main question is how to expose something like this via the API. 

The first way (which is currently possible, but not very nice or intuitive) is 
just to explicitly write part1 and part2 and then call pipeline.run(). This 
could also probably be exposed in a nicer way, like by adding a "checkpoint" 
method to PCollection or Pipeline.

The second way is to have the planner transparently make a decision like this 
based on the calculated size of the PCollections that will be output (based on 
input size and the combined scaleFactors of all DoFns). This implicit approach 
always makes me a bit nervous, as it's very dependent on the correctness of the 
reporting of the input size (which is an issue when working with something like 
HBase), as well as the correctness of the return value of DoFn#scaleFactor 
(which is also very difficult to get right).

Any thoughts on how you would want to expose this kind of thing in the API?

> Optimize for minimal disk i/o rather than the number of stages?
> ---------------------------------------------------------------
>
>                 Key: CRUNCH-284
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-284
>             Project: Crunch
>          Issue Type: Bug
>            Reporter: Chao Shi
>
> I have a pipeline as follows:
> PCollection in = pipeline.read(...)
> PCollection part1 = f1(in)
> PCollection part2 = f2(in)
> pipelien.write(part1.groupByKey...)
> pipeline.write(part2.groupByKey...)
> where f1 extracts a small potion from "in" and f2 returns the rest. Crunch 
> optimizes the pipeline into two independent MR jobs, both of which fully read 
> the input.
> I think the ideal MRs should be a map-only job reads the input and split them 
> to two outputs, and then two MRs read them respectively.
> The problem is that Crunch minimizes the number of MR stages, which is 
> optimal for most cases, but not optimal in this case. 
> What do you think of this folks?



--
This message was sent by Atlassian JIRA
(v6.1#6144)

[jira] [Commented] (CRUNCH-284) Optimize for minimal disk i/o rather than the number of stages?

Reply via email to