[ https://issues.apache.org/jira/browse/CRUNCH-294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13824943#comment-13824943 ]

Gabriel Reid commented on CRUNCH-294:
-------------------------------------

Yes, that all sounds very right to me, and I think that sticking with the 
simple rules for now sounds like a good plan.

One thing I'm thinking is that the planner may also need information on the 
size of records in a PCollection somehow. A lot of operations have a roughly 
constant CPU footprint per record, independent of the size of the record -- 
so having an indicator of the mean record size would allow estimating the 
number of records in a PCollection.
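The arithmetic behind that idea is simple enough to sketch. Everything below is hypothetical (`estimateRecordCount` is not a Crunch API); it just shows how a total byte size plus a mean record size yields an estimated record count:

```java
// Sketch only: if the planner knows a PCollection's total size in bytes and
// the mean size of a single record, it can estimate the record count.
// These names are illustrative, not part of the Crunch API.
public class RecordCountEstimate {

  /** Estimated number of records given total bytes and mean record size. */
  static long estimateRecordCount(long totalBytes, double meanRecordSizeBytes) {
    if (meanRecordSizeBytes <= 0) {
      throw new IllegalArgumentException("mean record size must be positive");
    }
    return Math.round(totalBytes / meanRecordSizeBytes);
  }

  public static void main(String[] args) {
    // 1 GiB of data with ~256-byte records => ~4 million records.
    System.out.println(estimateRecordCount(1L << 30, 256.0));  // 4194304
  }
}
```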

Once we get to the point where both IO and CPU are taken into account by the 
planner, it could also be interesting to allow configuring some kind of 
thresholds per job, so that you can say, for example, "don't worry about 
optimizing IO because I'm running on SSDs" or something like that.

BTW, maybe memoryFootprint() and cpuFootprint() would be better method names on 
DoFn.
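To make the naming suggestion concrete, here is a hedged sketch of what such per-DoFn resource hints might look like. The base class is a minimal stand-in (not the real org.apache.crunch.DoFn), and both method names are the proposal from this comment, not existing API:

```java
// Stand-in for DoFn so the sketch compiles without Crunch on the classpath.
// memoryFootprint()/cpuFootprint() are the names proposed in this thread.
abstract class SketchDoFn {
  /** Relative per-record memory cost; 1.0f = "typical". Hypothetical. */
  public float memoryFootprint() { return 1.0f; }

  /** Relative per-record CPU cost; 1.0f = "typical". Hypothetical. */
  public float cpuFootprint() { return 1.0f; }
}

// A parse-heavy function could advertise high CPU cost but low memory cost,
// letting a cost-based planner weight CPU and IO separately.
class ParseFn extends SketchDoFn {
  @Override public float cpuFootprint() { return 4.0f; }
  @Override public float memoryFootprint() { return 0.5f; }
}
```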

I also find the idea of a "learning" planner really interesting, although I 
worry a bit about the implications of a pipeline that might use a different 
plan in development and in production. That being said, I think that something 
like this could just be disabled if needed.

> Cost-based job planning
> -----------------------
>
>                 Key: CRUNCH-294
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-294
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>         Attachments: CRUNCH-294.patch, jobplan-default-new.png, 
> jobplan-default-old.png, jobplan-large_s2_s3.png, jobplan-lopsided.png
>
>
> A bug report on the user list drove me to revisit some of the core planning 
> logic, particularly around how we decide where to split up DoFns between two 
> dependent MapReduce jobs.
> I found an old TODO about using the scale factor from a DoFn to decide where 
> to split up the nodes between dependent GBKs. I implemented a new version of 
> the split algorithm that takes advantage of the support we've propagated for 
> multiple outputs on both the map and reduce sides of a job, doing 
> finer-grained splits that use information from the scaleFactor calculations 
> to make smarter split decisions.
> One high-level change along with this: I changed the default scaleFactor() 
> value in DoFn to 0.99f to slightly prefer writes that occur later in a 
> pipeline flow by default.
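scaleFactor() is a real method on Crunch's DoFn (the expected output size as a fraction of input size); the base class below is only a stand-in so the sketch compiles without Crunch on the classpath, with the 0.99f default taken from the description above:

```java
// Stand-in for Crunch's DoFn, using the new default from this issue:
// 0.99f slightly prefers writes later in the pipeline flow.
abstract class PlannerDoFn {
  /** Expected output size as a fraction of input size. */
  public float scaleFactor() { return 0.99f; }
}

// A filter that keeps roughly 10% of its input can tell the planner so,
// letting the split algorithm favor writing data after the filter runs.
class SelectiveFilterFn extends PlannerDoFn {
  @Override public float scaleFactor() { return 0.1f; }
}
```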



--
This message was sent by Atlassian JIRA
(v6.1#6144)