[ https://issues.apache.org/jira/browse/CRUNCH-294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Josh Wills updated CRUNCH-294:
------------------------------
Attachment: CRUNCH-294b.patch
[~gabriel.reid] I took another pass at this by adding a breakpoint() method to
PCollection that lets the client mark where on the path between two GBK
operations a split should occur. I was hacking on the cpuFootprint()
approach, but it felt too far removed from what the client actually wants to
do in this case.
I also modified the split logic to first ignore any node paths that already
contain materialized SourceTargets, and then to choose between a) the smallest
single collection that covers all of the node paths between two GBKs and
b) the smallest individual PCollections along each node path, whichever
option is smaller.
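To make the selection rule above concrete, here is a minimal standalone sketch.
It is not the code in the patch: the SplitSketch and Node names are
hypothetical, node sizes are plain numbers rather than planner estimates, and
every path is assumed to be non-empty.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class SplitSketch {
  /** Hypothetical stand-in for a planner node on a path between two GBKs. */
  static final class Node {
    final String name;
    final long size;
    final boolean materialized;

    Node(String name, long size, boolean materialized) {
      this.name = name;
      this.size = size;
      this.materialized = materialized;
    }
  }

  /** Returns the nodes to split at; each path is assumed to be non-empty. */
  static Set<Node> chooseSplit(List<List<Node>> paths) {
    // Step 1: ignore paths that already contain a materialized node.
    List<List<Node>> open = new ArrayList<>();
    for (List<Node> path : paths) {
      boolean hasMaterialized = false;
      for (Node n : path) {
        if (n.materialized) {
          hasMaterialized = true;
        }
      }
      if (!hasMaterialized) {
        open.add(path);
      }
    }
    Set<Node> result = new LinkedHashSet<>();
    if (open.isEmpty()) {
      return result;
    }

    // Option (a): the smallest single node that lies on every remaining path.
    Set<Node> shared = new LinkedHashSet<>(open.get(0));
    for (List<Node> path : open) {
      shared.retainAll(path);
    }
    Node bestShared = null;
    for (Node n : shared) {
      if (bestShared == null || n.size < bestShared.size) {
        bestShared = n;
      }
    }

    // Option (b): the smallest node on each remaining path, counted once each.
    Set<Node> perPath = new LinkedHashSet<>();
    for (List<Node> path : open) {
      Node best = path.get(0);
      for (Node n : path) {
        if (n.size < best.size) {
          best = n;
        }
      }
      perPath.add(best);
    }
    long perPathTotal = 0;
    for (Node n : perPath) {
      perPathTotal += n.size;
    }

    // Keep whichever option writes less data; ties go to the single node.
    if (bestShared != null && bestShared.size <= perPathTotal) {
      result.add(bestShared);
    } else {
      result.addAll(perPath);
    }
    return result;
  }
}
```

With a shared node of size 100 and per-path nodes of sizes 30 and 40, the
sketch picks the two per-path nodes (70 in total); if the shared node shrinks
to 50, it picks the shared node instead.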
> Cost-based job planning
> -----------------------
>
> Key: CRUNCH-294
> URL: https://issues.apache.org/jira/browse/CRUNCH-294
> Project: Crunch
> Issue Type: Improvement
> Components: Core
> Reporter: Josh Wills
> Assignee: Josh Wills
> Attachments: CRUNCH-294.patch, CRUNCH-294b.patch,
> jobplan-default-new.png, jobplan-default-old.png, jobplan-large_s2_s3.png,
> jobplan-lopsided.png
>
>
> A bug report on the user list drove me to revisit some of the core planning
> logic, particularly around how we decide where to split up DoFns between two
> dependent MapReduce jobs.
> I found an old TODO about using the scale factor from a DoFn to decide where
> to split up the nodes between dependent GBKs, so I implemented a new version
> of the split algorithm that takes advantage of how we've propagated support
> for multiple outputs on both the map and reduce sides of a job to do
> finer-grained splits that use information from the scaleFactor calculations
> to make smarter split decisions.
> One high-level change along with this: I changed the default scaleFactor()
> value in DoFn to 0.99f to slightly prefer writes that occur later in a
> pipeline flow by default.
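
One way to see why a scale factor just under 1 favors later writes is the toy
model below. It is only a sketch under stated assumptions: the
ScaleFactorSketch class is hypothetical, each stage's estimated size is taken
to be the previous size times the scale factor, and the 1.0f comparison value
and the earliest-wins tie-break are illustrative choices, not taken from the
patch.

```java
final class ScaleFactorSketch {
  /** Estimated output size after each of `stages` DoFns chained in sequence. */
  static double[] estimateSizes(double inputSize, float scaleFactor, int stages) {
    double[] sizes = new double[stages];
    double size = inputSize;
    for (int i = 0; i < stages; i++) {
      size *= scaleFactor; // assumption: size shrinks or grows by the factor per DoFn
      sizes[i] = size;
    }
    return sizes;
  }

  /** Index of the smallest estimate; ties go to the earliest stage (assumption). */
  static int indexOfSmallest(double[] sizes) {
    int best = 0;
    for (int i = 1; i < sizes.length; i++) {
      if (sizes[i] < sizes[best]) {
        best = i;
      }
    }
    return best;
  }
}
```

In this model a factor of 1.0f leaves every stage tied, so the first stage
wins, while 0.99f makes each later stage slightly smaller, so the last stage
becomes the cheapest place to write.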