[
https://issues.apache.org/jira/browse/MAHOUT-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020512#comment-14020512
]
ASF GitHub Bot commented on MAHOUT-1573:
----------------------------------------
GitHub user dlyubimov opened a pull request:
https://github.com/apache/mahout/pull/13
MAHOUT-1573: explicit parallelism
Per issue https://issues.apache.org/jira/browse/MAHOUT-1573.
Does something like
(A + B) exact_|| 200
or
(A + B) min_|| 200
look too ugly?
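
For readers who want to see how such operators parse, here is a minimal sketch in plain Scala. It is not Mahout's implementation: ToyDrm, its partitions field, and the semantics given to the two operators (exact_|| forces the partition count, min_|| only raises it) are assumptions made for illustration.

```scala
// Toy stand-in for a distributed matrix. NOT Mahout's DRM type; hypothetical.
final case class ToyDrm(name: String, partitions: Int) {

  // In this toy model a sum keeps the larger partition count of its operands.
  def +(that: ToyDrm): ToyDrm =
    ToyDrm(s"($name + ${that.name})", math.max(partitions, that.partitions))

  // Assumed semantics: request exactly n partitions.
  def exact_||(n: Int): ToyDrm = {
    require(n > 0, "parallelism must be positive")
    copy(partitions = n)
  }

  // Assumed semantics: request at least n partitions, never fewer than now.
  def min_||(n: Int): ToyDrm = {
    require(n > 0, "parallelism must be positive")
    copy(partitions = math.max(partitions, n))
  }
}

object ParOpsDemo {
  val A = ToyDrm("A", 50)
  val B = ToyDrm("B", 300)

  // Identifiers such as exact_|| are legal in Scala: letters, then '_',
  // then operator characters. The parentheses are needed because a name
  // starting with a letter has the lowest infix precedence.
  val exact = (A + B) exact_|| 200
  val atLeast = (A + B) min_|| 200
}
```

Under these assumptions ParOpsDemo.exact ends up with 200 partitions, while ParOpsDemo.atLeast keeps the 300 it already had.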
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dlyubimov/mahout MAHOUT-1573
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/mahout/pull/13.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #13
----
commit 2f785109b9a52e748626ba46f5bc0a35ffc98e2c
Author: Dmitriy Lyubimov <[email protected]>
Date: 2014-06-06T20:02:11Z
Refactoring drmFromHDFS()
commit cf7f18b4af4ad043d2bdcefeeda15031fa018543
Author: Dmitriy Lyubimov <[email protected]>
Date: 2014-06-06T20:21:19Z
docs
commit 2733002f4b5db3d5114a440b03967d954a3738e9
Author: Dmitriy Lyubimov <[email protected]>
Date: 2014-06-06T22:56:26Z
explicit parallelism adjustment levers exact_|| and min_||
----
> More explicit parallelism adjustments in math-scala DRM apis; elements of
> automatic re-adjustments
> --------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-1573
> URL: https://issues.apache.org/jira/browse/MAHOUT-1573
> Project: Mahout
> Issue Type: Task
> Affects Versions: 0.9
> Reporter: Dmitriy Lyubimov
> Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> (1) Add a minSplit parameter pass-through to drmFromHDFS so that parallelism
> can be increased explicitly.
> (2) Add a parallelism readjustment parameter to the checkpoint() call. This
> implies applying a shuffle-less coalesce() to the data set before it is
> requested to be cached (if specified).
> Going forward, we should probably try to figure out how to automate this, at
> least a little bit. For example, the simplest automatic adjustment might
> readjust parallelism on load to simply fit the cluster size (95% or 180%
> of cluster size, for example), with some rule-of-thumb safeguards, e.g.
> we cannot exceed a factor of, say, 8 (or whatever we configure) in splitting
> each original HDFS split. We should be able to get reasonable parallelism
> performance out of the box with simple heuristics like that.
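
The load-time heuristic described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the object and parameter names (ParallelismHeuristic, targetParallelism, clusterSlots, fitFactor, maxSplitFactor) are hypothetical, and the choice never to go below the original number of splits on load is an assumption, not something the issue specifies.

```scala
// Hypothetical helper, not part of Mahout.
object ParallelismHeuristic {

  /** Pick a partition count for a data set that arrives in originalSplits
    * HDFS splits on a cluster with clusterSlots parallel task slots.
    *
    * fitFactor:      fraction of the cluster to fill, e.g. 0.95 or 1.8.
    * maxSplitFactor: safeguard, each original split is cut into at most
    *                 this many pieces (8 in the issue's example).
    */
  def targetParallelism(originalSplits: Int,
                        clusterSlots: Int,
                        fitFactor: Double = 0.95,
                        maxSplitFactor: Int = 8): Int = {
    require(originalSplits > 0 && clusterSlots > 0, "sizes must be positive")
    require(fitFactor > 0.0 && maxSplitFactor >= 1, "factors must be positive")

    // Partition count that would fill the requested share of the cluster.
    val desired = math.max(1L, math.round(clusterSlots * fitFactor))

    // Upper bound imposed by the per-split safeguard.
    val cap = originalSplits.toLong * maxSplitFactor

    // Assumption: loading only ever splits, so keep at least originalSplits.
    val clamped = math.min(math.max(desired, originalSplits.toLong), cap)
    math.min(clamped, Int.MaxValue.toLong).toInt
  }
}
```

For example, with 10 original splits and 200 slots the safeguard wins and the result is 80 (10 * 8); with 100 original splits and 200 slots the result is 190 at a fit factor of 0.95, or 360 at 1.8.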