[ 
https://issues.apache.org/jira/browse/MAHOUT-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14036694#comment-14036694
 ] 

ASF GitHub Bot commented on MAHOUT-1573:
----------------------------------------

Github user tdunning commented on the pull request:

    https://github.com/apache/mahout/pull/13#issuecomment-46509902
  
    On Tue, Jun 17, 2014 at 10:43 PM, Dmitriy Lyubimov wrote:
    
    > Ted, are you ready to help with a concrete alternative? This is a very
    > small issue compared to even the patch, let's build a list of alternatives
    > and vote. But let's get it done.
    >
    > My additional variants
    >
    > minSplits,...
    > minPar, exactPar, autoPar (consistent with Scala's collection.par())
    >
    > To give something to vote down for Ted
    > >=|| :=||
    > :||=
    >
    > Not ok with me
    >
    > minParts
    > minParallelism
    > minPartitions
    > repartition
    > reshuffle
    > and others of the do-something kind
    >
    
    minSplits is fine by me.  I strongly discourage abbreviations because they
    are hard for non-native English speakers to generate well and hard for
    non-native English speakers to understand.
    
    Actually, they are often very hard for me to understand and I claim to be a
    native English speaker some days.
    
    If more than nine letters is too hard to type (even with an IDE to help
    you) then minSplits seems to be reasonable common ground.


> More explicit parallelism adjustments in math-scala DRM apis; elements of 
> automatic parallelism management
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1573
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1573
>             Project: Mahout
>          Issue Type: Task
>    Affects Versions: 0.9
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: 1.0
>
>
> (1) add minSplit parameter pass-thru to drmFromHDFS to be able to explicitly 
> increase parallelism. 
> (2) add parallelism readjustment parameter to a checkpoint() call. This 
> implies shuffle-less coalesce() translation to the data set before it is 
> requested to be cached (if specified).
> Going forward, we probably should try and figure out how we can automate it, at 
> least a little bit. For example, the simplest automatic adjustment might 
> include re-adjusting parallelism on load to simply fit cluster size (95% or 180% 
> of cluster size, for example), with some rule-of-thumb safeguards here, e.g. 
> we cannot exceed a factor of say 8 (or whatever we configure) in splitting 
> each original hdfs split. We should be able to get a reasonable parallelism 
> performance out of the box on simple heuristics like that.
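
The load-time heuristic described in the issue can be sketched roughly as below. This is only an illustration under stated assumptions, not Mahout code: the object ParallelismHeuristic, the method suggestPartitions and all of its parameter names are hypothetical, "cluster size" is taken to mean the number of available task slots, and the lower bound (never fewer partitions than the original HDFS splits) is an added assumption based on the idea that a minSplits-style parameter only increases parallelism.

```scala
// Hypothetical helper for illustration only; not part of Mahout's API.
object ParallelismHeuristic {

  /**
   * Suggests how many partitions to request when loading a data set.
   *
   * @param originalSplits number of HDFS splits the input has on its own
   * @param clusterSlots   assumed measure of cluster size (task slots)
   * @param loadFactor     fraction or multiple of the cluster to fill, e.g. 0.95 or 1.8
   * @param maxSplitFactor largest number of pieces one original split may be cut into
   */
  def suggestPartitions(
      originalSplits: Int,
      clusterSlots: Int,
      loadFactor: Double = 0.95,
      maxSplitFactor: Int = 8): Int = {
    require(originalSplits > 0, "originalSplits must be positive")
    require(clusterSlots > 0, "clusterSlots must be positive")
    require(loadFactor > 0.0, "loadFactor must be positive")
    require(maxSplitFactor >= 1, "maxSplitFactor must be at least 1")

    // Target: a fixed fraction (or multiple) of the cluster's task slots.
    val target: Long = math.max(1L, math.round(clusterSlots * loadFactor))

    // Safeguard: each original split is cut into at most maxSplitFactor pieces.
    val upperBound: Long = originalSplits.toLong * maxSplitFactor

    // Assumption: loading only increases parallelism, so never go below
    // the number of original splits.
    val clamped = math.min(math.max(target, originalSplits.toLong), upperBound)
    math.min(clamped, Int.MaxValue.toLong).toInt
  }
}
```

With these defaults, 10 original splits on a 200-slot cluster give a target of 190, which the factor-of-8 safeguard caps at 80; 100 original splits on the same cluster give 190.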



--
This message was sent by Atlassian JIRA
(v6.2#6252)
