The clustering drivers all call new Configuration() in their implementations. Other
Mahout jobs call getConf(), which is where the -D arguments get pulled in (right?),
but that only works when they are run from the CLI. So there is no way to set the
Hadoop parameters when calling the static driver methods from Java programs:
getConf() is an instance method, so the statics cannot call it at all, and even in
the instance versions of the methods it would return null unless they were invoked
from the CLI.
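
Roughly, the situation looks like this (a sketch with made-up names, using plain
Hadoop classes rather than any real driver):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public final class DriverSketch extends Configured implements Tool {

      @Override
      public int run(String[] args) throws Exception {
        // Invoked via ToolRunner from the CLI: GenericOptionsParser has already
        // folded any -D key=value arguments into the conf returned here.
        Configuration conf = getConf();
        System.out.println(conf.get("mapred.reduce.tasks"));
        return 0;
      }

      // The kind of static entry point a Java program calls today: it builds a
      // fresh Configuration, so -D settings never reach it.
      public static void runJob() {
        Configuration conf = new Configuration();
        // job parameters would come only from explicit method arguments
      }

      public static void main(String[] args) throws Exception {
        ToolRunner.run(new DriverSketch(), args);
      }
    }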

And what was the PMD/Checkstyle problem with instance methods on the drivers that motivated the regression to statics? I hate statics.

On 9/22/10 10:18 AM, Sean Owen wrote:
Oh this smells like a solvable problem for sure.

The Job eventually has a Configuration object; what exactly is the
flow where it doesn't? Surely that is fixable. The Configuration should
travel around with the Job, and within it you can set whatever you like.
Shouldn't need more API changes.
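
Something like this fragment (illustrative names, Hadoop 0.20 API):

    Configuration conf = new Configuration();
    conf.setInt("mapred.reduce.tasks", 8);   // or any other Hadoop setting
    Job job = new Job(conf, "some-clustering-step");
    // the Job carries that conf from here on; no new API needed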

I don't see what the static-ness has to do with it then?

On Wed, Sep 22, 2010 at 2:52 PM, Jeff Eastman
<[email protected]>  wrote:
  What you say is true from the command line, but currently there is no way
except via explicit arguments to control this from Java drivers. The run()
commands get a Configuration from AbstractJob via getConf(), but this returns
null when they are called from Java. I guess we could change the job/run
methods to accept a Configuration argument in place of the numReducers
argument.
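
Concretely, the signatures might move from the first form to the second
(hypothetical names, not committed code):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    public final class DriverApiSketch {

      // Today: explicit tuning arguments, with the conf created inside.
      public static void runJob(Path input, Path output, int numReducers) {
        Configuration conf = new Configuration();
        conf.setInt("mapred.reduce.tasks", numReducers);
        // ... build and submit the job from conf
      }

      // Proposed: the caller supplies the Configuration, and numReducers
      // (plus any other Hadoop setting) simply rides along inside it.
      public static void runJob(Configuration conf, Path input, Path output) {
        // ... build and submit the job from conf, leaving its settings alone
      }
    }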

The clustering drivers create a new Configuration in those methods (not
calling getConf()) right now, setting the job parameters from explicit
arguments. I'll take a look at refactoring this and see if there is time to
do it by the end of next week. There probably is, if this is at the top of my
list, but I will check.

Actually, you changed all the clustering driver methods back to statics
while fixing PMD/Checkstyle issues (r990892) and so getConf() cannot even be
called from them!

On 9/22/10 3:12 AM, Sean Owen (JIRA) wrote:
    [ https://issues.apache.org/jira/browse/MAHOUT-414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913434#action_12913434 ]

Sean Owen commented on MAHOUT-414:
----------------------------------

I tend to think this is, in fact, a Hadoop-level configuration. At times a
job may wish to force a particular concurrency -- a single reducer when it
knows there is no parallelism available, or 2x more reducers than mappers
when that's known to be good.
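
Both are one-liners at the Hadoop level (a sketch; job is an
org.apache.hadoop.mapreduce.Job, and numMapTasks/serial stand for whatever
the caller knows):

    // one reducer when no parallelism is available,
    // otherwise 2x reducers per mapper when that's known to help
    job.setNumReduceTasks(serial ? 1 : 2 * numMapTasks);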

Users can control this already via Hadoop. Letting them control it via
duplicate command-line parameters doesn't add anything to that. I agree it's
sometimes hard to know how to set parallelism, though Hadoop's guesses are
good.
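
For example (assuming the standard driver script and the 0.20 property name;
other options elided):

    bin/mahout kmeans -Dmapred.reduce.tasks=10 ...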

When I see that Hadoop's guesses are too low, it's because the input is too
small to create enough input shards. This is a different issue.

So I guess I'm wondering what the concrete change here could be, for
discussion, since it's marked for 0.4.

Usability: Mahout applications need a consistent API to allow users to
specify desired map/reduce concurrency

-------------------------------------------------------------------------------------------------------------

                 Key: MAHOUT-414
                 URL: https://issues.apache.org/jira/browse/MAHOUT-414
             Project: Mahout
          Issue Type: Bug
    Affects Versions: 0.3
            Reporter: Jeff Eastman
             Fix For: 0.4


If specifying the number of mappers and reducers is a common activity
which users need to perform when running Mahout applications on Hadoop
clusters, then we need a standard way of specifying them in our APIs
without exposing the full set of Hadoop options, especially for our
non-power-users. This is already the case for some applications, but others
require the use of Hadoop-level -D arguments to achieve reasonable
out-of-the-box parallelism, even when running our examples. The usability
defect is that some of our algorithms won't scale without this and that we
don't have a standard way to express it in our APIs.

