Dmitriy Lyubimov created MAHOUT-1573:
----------------------------------------

             Summary: More explicit parallelism adjustments in math-scala DRM 
apis; elements of automatic re-adjustments
                 Key: MAHOUT-1573
                 URL: https://issues.apache.org/jira/browse/MAHOUT-1573
             Project: Mahout
          Issue Type: Task
    Affects Versions: 0.9
            Reporter: Dmitriy Lyubimov
            Assignee: Dmitriy Lyubimov
             Fix For: 1.0


(1) Add a minSplit pass-through parameter to drmFromHDFS so that callers can 
explicitly increase parallelism on load.

(2) Add a parallelism readjustment parameter to the checkpoint() call. If 
specified, this implies applying a shuffle-less coalesce() transformation to 
the data set before it is requested to be cached.
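
To illustrate what "shuffle-less" means in (2), here is a minimal sketch in 
plain Python, not Mahout or Spark code. It assumes a data set is modeled as a 
list of partitions (each a list of rows); the function name 
coalesce_without_shuffle is hypothetical. Whole neighboring partitions are 
merged, so no row is re-hashed or redistributed, and the partition count can 
only go down, never up. A real engine would also take data locality into 
account when choosing which partitions to merge; this sketch does not.

```python
def coalesce_without_shuffle(partitions, target):
    """Merge existing partitions into at most `target` contiguous groups.

    Hypothetical illustration only: rows are never split out of their
    original partition, so no shuffle is needed.
    """
    if target < 1:
        raise ValueError("target must be at least 1")
    n = len(partitions)
    # Without a shuffle we cannot create more partitions than we started with.
    target = min(target, n)
    merged = []
    for i in range(target):
        # Contiguous slice of original partitions assigned to group i.
        start = i * n // target
        end = (i + 1) * n // target
        merged.append([row for part in partitions[start:end] for row in part])
    return merged


if __name__ == "__main__":
    parts = [[1], [2, 3], [4], [5, 6], [7]]
    print(coalesce_without_shuffle(parts, 2))
```

Asking for more partitions than exist leaves the layout unchanged, which is 
why increasing parallelism has to happen at load time as in (1).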

Going forward, we should probably try to figure out how to automate this, at 
least a little. For example, the simplest automatic adjustment might 
readjust parallelism on load to fit the cluster size (95% or 180% of cluster 
size, for example), with some rule-of-thumb safeguards: e.g. we cannot exceed 
a factor of, say, 8 (or whatever we configure) when splitting each original 
HDFS split. Simple heuristics like that should give us reasonable parallelism 
performance out of the box.
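
The heuristic above can be sketched as follows. This is only an illustration 
in plain Python under stated assumptions, not a proposed implementation: the 
function and parameter names (suggest_parallelism, cluster_slots, 
load_percent, max_split_factor) are hypothetical, "cluster size" is taken to 
mean the number of available task slots, and the result is assumed never to 
drop below the original split count, since reducing parallelism would be the 
job of the checkpoint() adjustment rather than of the load step.

```python
def suggest_parallelism(original_splits, cluster_slots,
                        load_percent=95, max_split_factor=8):
    """Suggest a partition count for loading a data set.

    All names are hypothetical. `load_percent` is the share of the cluster
    to target (e.g. 95 or 180); `max_split_factor` caps how many pieces
    each original HDFS split may be divided into.
    """
    if original_splits < 1 or cluster_slots < 1:
        raise ValueError("split and slot counts must be positive")
    # Integer arithmetic (round half up) avoids floating-point surprises.
    target = max(1, (cluster_slots * load_percent + 50) // 100)
    # Safeguard: never split an original split into more than the cap.
    cap = original_splits * max_split_factor
    # Assumption: loading only ever increases parallelism.
    return max(original_splits, min(target, cap))


if __name__ == "__main__":
    print(suggest_parallelism(10, 100))       # capped by the split factor
    print(suggest_parallelism(10, 40, 180))   # fits 180% of the cluster
    print(suggest_parallelism(200, 40))       # already more splits than slots
```

With 10 original splits and 100 slots, the 95% target of 95 partitions 
exceeds the cap of 10 * 8 = 80, so the safeguard wins and 80 is suggested.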



--
This message was sent by Atlassian JIRA
(v6.2#6252)
