Dmitriy Lyubimov created MAHOUT-1573:
----------------------------------------
Summary: More explicit parallelism adjustments in math-scala DRM
apis; elements of automatic re-adjustments
Key: MAHOUT-1573
URL: https://issues.apache.org/jira/browse/MAHOUT-1573
Project: Mahout
Issue Type: Task
Affects Versions: 0.9
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
Fix For: 1.0
(1) Add a minSplit parameter pass-through to drmFromHDFS so that parallelism
can be increased explicitly.
(2) Add a parallelism readjustment parameter to the checkpoint() call. If
specified, this implies a shuffle-less coalesce() translation applied to the
data set before it is requested to be cached.
Going forward, we should probably try to figure out how to automate this, at
least a little bit. For example, the simplest automatic adjustment might
re-adjust parallelism on load to simply fit the cluster size (95% or 180%
of cluster size, for example), with some rule-of-thumb safeguards, e.g. we
cannot exceed a factor of, say, 8 (or whatever we configure) in splitting each
original HDFS split. We should be able to get reasonable parallelism
performance out of the box with simple heuristics like that.
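
A minimal sketch of the load-time heuristic described above, in plain Scala
with no Mahout or Spark dependency. Everything here is hypothetical and for
illustration only: the object name ParallelismHeuristic, the function
targetParallelism, and its parameters (clusterSlots, loadFactor,
maxSplitFactor) are not part of any existing API. It also assumes that a
minSplit-style hint can only raise the partition count on load, so the
result is never below the original number of HDFS splits.

```scala
// Hypothetical helper; not part of Mahout or Spark.
object ParallelismHeuristic {

  /**
   * Suggests a partition count for loading a data set.
   *
   * @param originalSplits number of splits HDFS would produce by default
   * @param clusterSlots   total number of task slots available in the cluster
   * @param loadFactor     fraction of the cluster size to target (e.g. 0.95 or 1.8)
   * @param maxSplitFactor safeguard: maximum number of pieces each original
   *                       split may be divided into
   */
  def targetParallelism(originalSplits: Int,
                        clusterSlots: Int,
                        loadFactor: Double = 0.95,
                        maxSplitFactor: Int = 8): Int = {
    require(originalSplits > 0, "originalSplits must be positive")
    require(clusterSlots > 0, "clusterSlots must be positive")
    require(loadFactor > 0.0, "loadFactor must be positive")
    require(maxSplitFactor >= 1, "maxSplitFactor must be at least 1")

    // Partition count that would "fit" the cluster at the requested load.
    val desired: Long = math.max(1L, math.round(clusterSlots * loadFactor))

    // Upper bound imposed by the per-split safeguard.
    val cap: Long = originalSplits.toLong * maxSplitFactor

    if (desired <= originalSplits) {
      // Assumption: the hint cannot reduce parallelism on load.
      originalSplits
    } else {
      math.min(math.min(desired, cap), Int.MaxValue.toLong).toInt
    }
  }
}
```

For instance, with 10 original splits on a 100-slot cluster and the default
95% target, the desired count of 95 is limited by the factor-of-8 safeguard,
so the sketch suggests 80 partitions. With 20 original splits the safeguard
does not bind and the suggestion is 95.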
--
This message was sent by Atlassian JIRA
(v6.2#6252)