Josh Wills created CRUNCH-294:
---------------------------------
Summary: Cost-based job planning
Key: CRUNCH-294
URL: https://issues.apache.org/jira/browse/CRUNCH-294
Project: Crunch
Issue Type: Improvement
Components: Core
Reporter: Josh Wills
Assignee: Josh Wills
Attachments: CRUNCH-294.patch
A bug report on the user list drove me to revisit some of the core planning
logic, particularly around how we decide where to split up DoFns between two
dependent MapReduce jobs.
I found an old TODO about using the scale factor from a DoFn to decide where to
split up the nodes between dependent GBKs, so I implemented a new version of
the split algorithm. It takes advantage of the support for multiple outputs
that we've propagated to both the map and reduce sides of a job, which allows
finer-grained splits, and it uses the scaleFactor calculations to make smarter
split decisions.
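
To make the idea concrete, here is a minimal sketch of a cost-based split
choice. It is not the code in CRUNCH-294.patch: the class SplitPointSketch and
the method chooseSplit are hypothetical names, and the cost model (multiply the
scale factors along a linear chain of DoFns and write where the estimated data
size is smallest) is an assumption made for illustration only.

```java
public final class SplitPointSketch {

    /**
     * Hypothetical helper: given the scale factors of a linear chain of DoFns
     * that sits between two dependent GBKs, returns how many of those DoFns
     * to run before writing the intermediate output (a value in [0, n]).
     */
    public static int chooseSplit(float[] scaleFactors) {
        double size = 1.0;      // relative size of the data leaving the first GBK
        double bestSize = size; // smallest estimated size seen so far
        int best = 0;           // 0 means: write before running any DoFn
        for (int i = 0; i < scaleFactors.length; i++) {
            size *= scaleFactors[i]; // estimated size after DoFn i has run
            if (size < bestSize) {   // strict comparison keeps the earliest point on ties
                bestSize = size;
                best = i + 1;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // An expanding DoFn, then a strongly filtering one, then another expanding one.
        float[] chain = {3.0f, 0.1f, 2.0f};
        System.out.println(chooseSplit(chain)); // prints 2
    }
}
```

In this example the write lands after the second DoFn, where the estimated
data size (about 0.3 of the input) is smallest. A real planner also has to
handle branching and multiple outputs, which this sketch ignores.
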
One high-level change along with this: I changed the default scaleFactor()
value in DoFn to 0.99f, so that the planner slightly prefers writes that occur
later in a pipeline flow.
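
The arithmetic behind that preference can be sketched as follows. This is an
illustration only: the class DefaultScaleFactorSketch and the method
relativeSizeAfter are hypothetical, and the sketch assumes the size estimate
is simply the product of the scale factors of the DoFns that have run.

```java
public final class DefaultScaleFactorSketch {

    /**
     * Hypothetical helper: estimated data size, relative to the input, after
     * running n DoFns that all report the same scale factor.
     */
    public static double relativeSizeAfter(int n, float scaleFactor) {
        double size = 1.0;
        for (int i = 0; i < n; i++) {
            size *= scaleFactor;
        }
        return size;
    }

    public static void main(String[] args) {
        // With a scale factor just under 1, every additional DoFn lowers the
        // estimate a little, so a later write always looks slightly cheaper.
        for (int n = 0; n <= 3; n++) {
            System.out.println(n + " DoFns: " + relativeSizeAfter(n, 0.99f));
        }
    }
}
```

Under this assumption, a chain of DoFns that do not override scaleFactor()
gets a strictly decreasing estimate, which breaks ties in favor of the later
write.
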
--
This message was sent by Atlassian JIRA
(v6.1#6144)