[ 
https://issues.apache.org/jira/browse/CRUNCH-294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13824566#comment-13824566
 ] 

Gabriel Reid commented on CRUNCH-294:
-------------------------------------

Good point about the original issue that was brought up, i.e. minimizing CPU 
costs.

I'm thinking that adding cpuCost() and memoryCost() methods to DoFn might be 
the easiest and most flexible approach for now, with both methods returning 
1.0f by default. Initially we could treat any cpuCost above 1.0f as 
"attempt to run only once", but using costs instead of a flag will probably 
leave more room in the future if we want the planner to make more advanced 
decisions.
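
A rough sketch of what that could look like, purely for illustration. Nothing 
here is existing Crunch API: CostedFn is a hypothetical stand-in for DoFn, and 
ExpensiveParseFn and PlannerHints are made-up names.

```java
// Hypothetical sketch: CostedFn stands in for Crunch's DoFn. cpuCost() and
// memoryCost() are the methods proposed in this comment, not released API.
abstract class CostedFn {
    static final float DEFAULT_COST = 1.0f;

    // Relative CPU cost of running this function; 1.0f means "nothing special".
    float cpuCost() {
        return DEFAULT_COST;
    }

    // Relative memory cost of running this function.
    float memoryCost() {
        return DEFAULT_COST;
    }
}

// A made-up function that declares itself CPU-heavy by overriding cpuCost().
class ExpensiveParseFn extends CostedFn {
    @Override
    float cpuCost() {
        return 5.0f;
    }
}

// Hypothetical planner helper implementing the initial rule:
// any cpuCost above the default means "attempt to run only once".
final class PlannerHints {
    private PlannerHints() {}

    static boolean shouldRunOnce(CostedFn fn) {
        return fn.cpuCost() > CostedFn.DEFAULT_COST;
    }
}
```

With this shape, the planner only compares against the default today, but the 
numeric value stays available if finer-grained decisions are wanted later.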

That actually makes me think of something else to consider: should cpuCost 
be interpreted independently, or as scaleFactor * cpuCost?
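
To make the two readings concrete, here is a small sketch. The class and 
method names are hypothetical, and it assumes scaleFactor is the estimated 
ratio of a function's output size to its input size.

```java
// Hypothetical sketch contrasting the two possible readings of cpuCost.
final class CpuCostReadings {
    private CpuCostReadings() {}

    // Reading 1: cpuCost stands on its own; scaleFactor is ignored.
    static float independent(float scaleFactor, float cpuCost) {
        return cpuCost;
    }

    // Reading 2: cpuCost is weighted by scaleFactor, so the size estimate
    // and the CPU cost combine into a single number.
    static float scaled(float scaleFactor, float cpuCost) {
        return scaleFactor * cpuCost;
    }
}
```

For a function with scaleFactor 0.5f and cpuCost 4.0f, the first reading 
yields 4.0f and the second yields 2.0f, so under the second reading a 
CPU-heavy function with a small enough scaleFactor could drop to 1.0f or 
below and no longer trigger a "cost above 1.0f" rule.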

> Cost-based job planning
> -----------------------
>
>                 Key: CRUNCH-294
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-294
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>         Attachments: CRUNCH-294.patch, jobplan-default-new.png, 
> jobplan-default-old.png, jobplan-large_s2_s3.png, jobplan-lopsided.png
>
>
> A bug report on the user list drove me to revisit some of the core planning 
> logic, particularly around how we decide where to split up DoFns between two 
> dependent MapReduce jobs.
> I found an old TODO about using the scale factor from a DoFn to decide where 
> to split up the nodes between dependent GBKs, so I implemented a new version 
> of the split algorithm that takes advantage of how we've propagated support 
> for multiple outputs on both the map and reduce sides of a job to do 
> finer-grained splits that use information from the scaleFactor calculations 
> to make smarter split decisions.
> One high-level change along with this: I changed the default scaleFactor() 
> value in DoFn to 0.99f to slightly prefer writes that occur later in a 
> pipeline flow by default.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
