What you say is true from the command line, but currently there is no
way to control this from Java drivers except via explicit arguments.
The run() methods get a Configuration from AbstractJob via getConf(),
but this returns null when they are called from Java. I guess we could
change the job/run methods to accept a Configuration argument in place
of the numReducers argument.
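
Something like the following is what I have in mind (a sketch only; the
class and method names are hypothetical, not the current driver
signatures):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public final class DriverSketch {
      private DriverSketch() { }

      // The caller supplies the Configuration in place of an explicit
      // numReducers argument, so settings such as mapred.reduce.tasks
      // travel with conf instead of with the parameter list.
      public static void runJob(Configuration conf, Path input, Path output)
          throws IOException, InterruptedException, ClassNotFoundException {
        Job job = new Job(conf, "driver-sketch");
        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, output);
        job.waitForCompletion(true);
      }
    }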
The clustering drivers create a new Configuration in those methods
right now (not calling getConf()), setting the job parameters from
explicit arguments. I'll take a look at refactoring this and see if
there is time to do it by the end of next week. There probably is, if
this stays at the top of my list, but I will check.
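
One possible shape for that refactoring (again just a sketch, and it
assumes the drivers become instance methods on a Configured subclass):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;

    public class ClusteringDriverSketch extends Configured {
      // Prefer the Configuration injected by ToolRunner via setConf();
      // fall back to a fresh one when invoked directly from Java, where
      // getConf() currently returns null.
      protected Configuration prepareConf() {
        Configuration conf = getConf();
        return conf != null ? conf : new Configuration();
      }
    }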
Actually, you changed all the clustering driver methods back to static
methods while fixing PMD/Checkstyle issues (r990892), so getConf()
cannot even be called from them!
On 9/22/10 3:12 AM, Sean Owen (JIRA) wrote:
[
https://issues.apache.org/jira/browse/MAHOUT-414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913434#action_12913434
]
Sean Owen commented on MAHOUT-414:
----------------------------------
I tend to think this is, in fact, a Hadoop-level configuration. At times a job
may wish to force a particular concurrency -- one reducer only when it knows
there is no parallelism available, or 2x more reducers than mappers when
that's known to be good.
Users can already control this via Hadoop. Letting them control it via
duplicate command-line parameters doesn't add anything beyond that. I agree
that it's sometimes hard to know how to set parallelism, though Hadoop's
guesses are good.
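For example, for any driver wired up through ToolRunner, the generic -D
option already does this (the jar name below is just a placeholder):

    # Assumes the driver runs through ToolRunner, so Hadoop's generic
    # options are parsed ahead of the driver's own arguments.
    hadoop jar mahout-examples.jar \
        org.apache.mahout.clustering.kmeans.KMeansDriver \
        -Dmapred.reduce.tasks=20 [driver-specific options]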
When I see Hadoop's guesses come out too low, it's because the input is too
small to create enough input shards. That is a different issue.
So I guess I'm wondering what the concrete change here could be, for
discussion, since it's marked for 0.4.
Usability: Mahout applications need a consistent API to allow users to specify
desired map/reduce concurrency
-------------------------------------------------------------------------------------------------------------
Key: MAHOUT-414
URL: https://issues.apache.org/jira/browse/MAHOUT-414
Project: Mahout
Issue Type: Bug
Affects Versions: 0.3
Reporter: Jeff Eastman
Fix For: 0.4
If specifying the number of mappers and reducers is a common activity that
users need to perform when running Mahout applications on Hadoop clusters,
then we need a standard way of specifying them in our APIs without exposing
the full set of Hadoop options, especially for our non-power-users. This is
already the case for some applications, but others require Hadoop-level -D
arguments to achieve reasonable out-of-the-box parallelism, even when running
our examples. The usability defect is that some of our algorithms won't scale
without this and that we don't have a standard way to express it in our APIs.
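As a purely hypothetical strawman of what such a standard hook might look
like (the class and method here are invented for illustration, not existing
Mahout code):

    import org.apache.hadoop.mapreduce.Job;

    public final class ParallelismOptions {
      private ParallelismOptions() { }

      // Apply a user-requested reducer count before job submission;
      // when none is given, leave Hadoop's default in place.
      public static void applyNumReducers(Job job, Integer numReducers) {
        if (numReducers != null) {
          job.setNumReduceTasks(numReducers);
        }
      }
    }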