What you say is true from the command line, but currently there is no way to control this from Java drivers other than through explicit arguments. The run() methods get a Configuration from AbstractJob via getConf(), but this returns null when they are invoked from Java. I suppose we could change the job/run methods to accept a Configuration argument in place of numReducers.
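For reference, the command-line route goes through Hadoop's GenericOptionsParser, which lets a user set any job property with -D. A sketch of that usage (the kmeans option letters and the property name mapred.reduce.tasks are assumptions from the Hadoop 0.20.x / Mahout 0.3 era, not verified against trunk):

```shell
# Set the reducer count through Hadoop's generic -D option rather than a
# Mahout-specific argument. Property and option names here are illustrative
# of the 0.20-era conventions, not a verified trunk invocation.
bin/mahout kmeans \
    -Dmapred.reduce.tasks=10 \
    -i /user/me/input -o /user/me/output -c /user/me/clusters -x 10
```

This is exactly the control that is unavailable from Java today, since nothing parses generic options on that path.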

The clustering drivers currently create a new Configuration in those methods (rather than calling getConf()), setting the job parameters from explicit arguments. I'll take a look at refactoring this and see if it can be done by the end of next week. It probably can, if it stays at the top of my list, but I will check.

Actually, you changed all the clustering driver methods back to statics while fixing the PMD/Checkstyle issues (r990892), so getConf() cannot even be called from them!
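To make the shape of the refactoring concrete, here is a minimal sketch. It uses tiny stand-in classes in place of Hadoop's org.apache.hadoop.conf.Configuration and Mahout's AbstractJob so it is self-contained; the method names and the overload taking a Configuration are my assumption of what the change could look like, not the actual Mahout API:

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for org.apache.hadoop.conf.Configuration, just enough for the sketch.
class Configuration {
  private final Map<String, String> props = new HashMap<>();
  void set(String key, String value) { props.put(key, value); }
  String get(String key) { return props.get(key); }
}

// Stand-in for Mahout's AbstractJob (which extends Configured): getConf() is an
// instance method, so a static driver method has no way to reach it.
abstract class AbstractJob {
  private Configuration conf;
  void setConf(Configuration conf) { this.conf = conf; }
  Configuration getConf() { return conf; }  // null unless a runner set it
}

class ClusterDriver extends AbstractJob {
  // Current style: the static method builds its own Configuration and takes
  // job parameters such as numReducers as explicit arguments.
  static Configuration runJob(int numReducers) {
    Configuration conf = new Configuration();
    conf.set("mapred.reduce.tasks", String.valueOf(numReducers));
    return conf;
  }

  // Sketched refactoring: accept a caller-supplied Configuration so Java
  // clients can control any Hadoop parameter, not only the ones exposed
  // as explicit arguments.
  static Configuration runJob(Configuration conf) {
    if (conf == null) {
      conf = new Configuration();  // fall back to defaults
    }
    return conf;
  }
}
```

A Java caller would then build and tune a Configuration itself and hand it to runJob(), instead of being limited to whatever parameters the driver happens to expose.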

On 9/22/10 3:12 AM, Sean Owen (JIRA) wrote:
    [ https://issues.apache.org/jira/browse/MAHOUT-414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913434#action_12913434 ]

Sean Owen commented on MAHOUT-414:
----------------------------------

I tend to think this is, in fact, a Hadoop-level configuration. At times a job 
may wish to force concurrency -- 1 job only when it knows there is no 
parallelism available, or 2x more reducers than mappers when that's known to be 
good.

Users can control this already via Hadoop. Letting them control it via 
duplicate command line parameters doesn't add that. I agree, it's sometimes 
hard to know how to set parallelism, though Hadoop's guesses are good.

When I see Hadoop's guesses are too low, it's because input is too small to 
create enough input shards. This is a different issue.

So I guess I'm wondering, for discussion, what the concrete change here could be, 
since it's marked for 0.4.

Usability: Mahout applications need a consistent API to allow users to specify 
desired map/reduce concurrency
-------------------------------------------------------------------------------------------------------------

                 Key: MAHOUT-414
                 URL: https://issues.apache.org/jira/browse/MAHOUT-414
             Project: Mahout
          Issue Type: Bug
    Affects Versions: 0.3
            Reporter: Jeff Eastman
             Fix For: 0.4


If specifying the number of mappers and reducers is a common activity that users 
need to perform when running Mahout applications on Hadoop clusters, then we need 
a standard way of specifying them in our APIs without exposing the full set of 
Hadoop options, especially for our non-power-users. Some applications already 
support this, but others require Hadoop-level -D arguments to achieve reasonable 
out-of-the-box parallelism, even when running our examples. The usability defect 
is that some of our algorithms won't scale without these settings, and we don't 
have a standard way to express them in our APIs.
