What you say is true from the command line, but currently there is no
way to control this from Java drivers except via explicit arguments.
The run() methods get a Configuration from AbstractJob via getConf(),
but this returns null when they are called from Java. I guess we could
change the job/run methods to accept a Configuration argument in place
of the numReducers argument.
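
Something like the following is what I have in mind (a sketch only; the
class and method names are hypothetical, not the current driver
signatures):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public final class DriverSketch {
      private DriverSketch() { }

      // The caller supplies the Configuration in place of an explicit
      // numReducers argument, so settings such as mapred.reduce.tasks
      // travel with conf instead of with the parameter list.
      public static void runJob(Configuration conf, Path input, Path output)
          throws IOException, InterruptedException, ClassNotFoundException {
        Job job = new Job(conf, "driver-sketch");
        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, output);
        job.waitForCompletion(true);
      }
    }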
The clustering drivers create a new Configuration in those methods
right now (not calling getConf()), setting the job parameters from
explicit arguments. I'll take a look at refactoring this and see if
there is time to do it by the end of next week. There probably is, if
this stays at the top of my list, but I will check.
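
One possible shape for that refactoring (again just a sketch, and it
assumes the drivers become instance methods on a Configured subclass):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;

    public class ClusteringDriverSketch extends Configured {
      // Prefer the Configuration injected by ToolRunner via setConf();
      // fall back to a fresh one when invoked directly from Java, where
      // getConf() currently returns null.
      protected Configuration prepareConf() {
        Configuration conf = getConf();
        return conf != null ? conf : new Configuration();
      }
    }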
Actually, you changed all the clustering driver methods back to static
methods while fixing PMD/Checkstyle issues (r990892), so getConf()
cannot even be called from them!
On 9/22/10 3:12 AM, Sean Owen (JIRA) wrote:
[
https://issues.apache.org/jira/browse/MAHOUT-414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913434#action_12913434
]
Sean Owen commented on MAHOUT-414:
----------------------------------
I tend to think this is, in fact, a Hadoop-level configuration. At times a job
may wish to force a particular concurrency -- one reducer only when it knows
there is no parallelism available, or 2x more reducers than mappers when
that's known to be good.
Users can already control this via Hadoop. Letting them control it via
duplicate command-line parameters doesn't add anything beyond that. I agree
that it's sometimes hard to know how to set parallelism, though Hadoop's
guesses are good.
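For example, for any driver wired up through ToolRunner, the generic -D
option already does this (the jar name below is just a placeholder):

    # Assumes the driver runs through ToolRunner, so Hadoop's generic
    # options are parsed ahead of the driver's own arguments.
    hadoop jar mahout-examples.jar \
        org.apache.mahout.clustering.kmeans.KMeansDriver \
        -Dmapred.reduce.tasks=20 [driver-specific options]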
When I see Hadoop's guesses come out too low, it's because the input is too
small to create enough input shards. That is a different issue.
So I guess I'm wondering what the concrete change here could be, for
discussion, since it's marked for 0.4.
Usability: Mahout applications need a consistent API to allow users to specify
desired map/reduce concurrency
-------------------------------------------------------------------------------------------------------------
Key: MAHOUT-414
URL: https://issues.apache.org/jira/browse/MAHOUT-414
Project: Mahout
Issue Type: Bug
Affects Versions: 0.3
Reporter: Jeff Eastman
Fix For: 0.4
If specifying the number of mappers and reducers is a common activity that
users need to perform when running Mahout applications on Hadoop clusters,
then we need a standard way of specifying them in our APIs without exposing
the full set of Hadoop options, especially for our non-power-users. This is
already the case for some applications, but others require Hadoop-level -D
arguments to achieve reasonable out-of-the-box parallelism, even when running
our examples. The usability defect is that some of our algorithms won't scale
without this and that we don't have a standard way to express it in our APIs.
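As a purely hypothetical strawman of what such a standard hook might look
like (the class and method here are invented for illustration, not existing
Mahout code):

    import org.apache.hadoop.mapreduce.Job;

    public final class ParallelismOptions {
      private ParallelismOptions() { }

      // Apply a user-requested reducer count before job submission;
      // when none is given, leave Hadoop's default in place.
      public static void applyNumReducers(Job job, Integer numReducers) {
        if (numReducers != null) {
          job.setNumReduceTasks(numReducers);
        }
      }
    }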