[ https://issues.apache.org/jira/browse/CRUNCH-294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13824541#comment-13824541 ]

Josh Wills commented on CRUNCH-294:
-----------------------------------

The thumbnails looked cool from my perspective-- thanks!

IIRC, the original issue was concerned w/the fact that S2 and S3 were 
computationally expensive operations that should only be run once, which is 
why the user was annoyed that the planner (which is focused on minimizing disk 
IO) was running them twice: once in the reducer of one job, and then again in 
the mapper of the second job-- she was less concerned w/disk IO, and more 
concerned with overall throughput. So perhaps the issue is that we don't have 
a concept of a CPU-intensive or computationally intensive DoFn-- DoFns can 
only signal (via scaleFactor) their relative IO costs.
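To make that concrete, here's a self-contained toy model (the class names are invented for illustration; only the scaleFactor() idea mirrors Crunch's DoFn API, and the split heuristic is an assumption about how an IO-only planner would reason):

```java
// Toy model of an IO-only planner. The class names here are invented;
// only the scaleFactor() idea mirrors Crunch's DoFn API.
abstract class Fn {
    // Estimated output size relative to input size: the only cost signal
    // a DoFn can give the planner.
    float scaleFactor() { return 1.0f; }
}

class ExpandingCpuHeavyFn extends Fn {
    // Output is three times the input, so splitting *before* this fn and
    // re-running it in both jobs moves less data than writing its output...
    @Override
    float scaleFactor() { return 3.0f; }
    // ...but the planner has no way to see that each run is also very
    // expensive in CPU terms.
}

public class IoOnlyPlanner {
    // IO-only heuristic: re-run the fn on both sides of the job boundary
    // whenever its output is larger than its input.
    static boolean recomputeInBothJobs(Fn fn) {
        return fn.scaleFactor() > 1.0f;
    }
}
```

Under a heuristic like this, the CPU-heavy fn gets run twice--once per job--which is exactly the behavior the user was complaining about for S2 and S3.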

What about a new DoFn method, something like runAtMostOnce(), which would 
ensure that a DoFn was only ever run once, even if it cost more IO to do so? 
You could also argue that if you had 2+ memory-intensive DoFns, you should try 
to run them in separate jobs so that their combined memory usage wouldn't 
overwhelm the JVM's limits; that could be something else worth signaling to 
the planner.
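A hypothetical sketch of how that hint could work (runAtMostOnce() is the proposed method, not an existing Crunch API; the surrounding classes and the planner logic are invented for the example):

```java
// Hypothetical sketch: runAtMostOnce() is the *proposed* hint, not an
// existing Crunch API. The classes here are invented to show how a
// planner could consult the hint before deciding to recompute a fn.
abstract class HintedFn {
    float scaleFactor() { return 1.0f; }
    // Proposed hint: "materialize my output rather than run me twice,
    // even if that costs extra disk IO."
    boolean runAtMostOnce() { return false; }
}

class ExpensiveScoringFn extends HintedFn {
    @Override
    float scaleFactor() { return 3.0f; }       // large output...
    @Override
    boolean runAtMostOnce() { return true; }   // ...but costly CPU per run
}

public class HintAwarePlanner {
    static boolean recomputeInBothJobs(HintedFn fn) {
        // The hint overrides the pure disk-IO heuristic.
        if (fn.runAtMostOnce()) {
            return false;
        }
        return fn.scaleFactor() > 1.0f;
    }
}
```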

I think those are the major dimensions we care about, right? Disk IO 
primarily, then CPU/memory usage? So we mainly want to optimize for disk IO, 
except when one of these exceptional conditions applies to a particular DoFn? 
Or is there a more elegant way to do this?

> Cost-based job planning
> -----------------------
>
>                 Key: CRUNCH-294
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-294
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>         Attachments: CRUNCH-294.patch, jobplan-default-new.png, 
> jobplan-default-old.png, jobplan-large_s2_s3.png, jobplan-lopsided.png
>
>
> A bug report on the user list drove me to revisit some of the core planning 
> logic, particularly around how we decide where to split up DoFns between two 
> dependent MapReduce jobs.
> I found an old TODO about using the scale factor from a DoFn to decide where 
> to split up the nodes between dependent GBKs, so I implemented a new version 
> of the split algorithm. It takes advantage of the support we've propagated 
> for multiple outputs on both the map and reduce sides of a job to do 
> finer-grained splits, using information from the scaleFactor calculations to 
> make smarter split decisions.
> One high-level change along with this: I changed the default scaleFactor() 
> value in DoFn to 0.99f to slightly prefer writes that occur later in a 
> pipeline flow by default.
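A toy calculation of why a default just under 1.0 prefers later writes (this assumes the planner multiplies scale factors along a chain of DoFns to estimate the data size at each candidate write point, which is a simplification of the actual cost model):

```java
// Toy calculation: with a default scaleFactor just under 1.0, the
// estimated data size shrinks slightly after every default DoFn, so a
// write later in the chain looks a little cheaper than an earlier one.
public class DefaultScaleDemo {
    static float estimatedSizeAfter(int numDefaultFns, float defaultScale) {
        float size = 1.0f;
        for (int i = 0; i < numDefaultFns; i++) {
            size *= defaultScale;
        }
        return size;
    }

    public static void main(String[] args) {
        // After five default DoFns at 0.99f, the estimate is 0.99^5,
        // roughly 95% of the original size.
        System.out.println(estimatedSizeAfter(5, 0.99f));
        System.out.println(estimatedSizeAfter(5, 1.0f));
    }
}
```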



--
This message was sent by Atlassian JIRA
(v6.1#6144)
