Oh this smells like a solvable problem for sure.

The Job eventually has a Configuration object; what exactly is the
flow where it doesn't? Surely that is fixable. The Configuration
should travel with the Job, and within it you can set whatever you
like. That shouldn't need more API changes.
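For concreteness, here is a minimal sketch of the flow I mean, assuming the Hadoop 0.20 mapreduce API and the pre-2.x property name; the class and method names are made up for illustration:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ConfFlowSketch {

  // Hypothetical driver entry point, for illustration only: the caller
  // seeds a Configuration, and the driver builds its Job from that same
  // Configuration, so any Hadoop property travels with the Job.
  public static void run(Configuration conf) throws IOException {
    Job job = new Job(conf, "clustering-step"); // Hadoop 0.20 mapreduce API
    // ... driver sets mapper/reducer classes and input/output paths on job ...
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Same effect as passing -Dmapred.reduce.tasks=4 on the command line
    // (pre-Hadoop-2.x property name).
    conf.setInt("mapred.reduce.tasks", 4);
    run(conf);
  }
}
```

The point is that nothing beyond the Configuration itself needs to change hands between the caller and the driver.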

I don't see what the static-ness has to do with it then?

On Wed, Sep 22, 2010 at 2:52 PM, Jeff Eastman
<[email protected]> wrote:
>  What you say is true from the command line, but currently there is no way
> except via explicit arguments to control this from Java drivers. The run()
> commands get a Configuration from AbstractJob via getConf() but this returns
> null when calling from Java. I guess we could change the job/run methods to
> accept a Configuration argument in place of the numReducers.
>
> The clustering drivers create a new configuration in those methods (not
> calling getConf()) right now, setting the job parameters from explicit
> arguments. I'll take a look at refactoring this and see if there is time to
> do it by the end of next week. There probably is, if this is at the top of my
> list, but I will check.
>
> Actually, you changed all the clustering driver methods back to statics
> while fixing PMD/Checkstyle issues (r990892) and so getConf() cannot even be
> called from them!
>
> On 9/22/10 3:12 AM, Sean Owen (JIRA) wrote:
>>
>>     [
>> https://issues.apache.org/jira/browse/MAHOUT-414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913434#action_12913434
>> ]
>>
>> Sean Owen commented on MAHOUT-414:
>> ----------------------------------
>>
>> I tend to think this is, in fact, a Hadoop-level configuration. At times a
>> job may wish to force concurrency -- 1 job only when it knows there is no
>> parallelism available, or 2x more reducers than mappers when that's known to
>> be good.
>>
>> Users can control this already via Hadoop. Letting them control it via
>> duplicate command-line parameters doesn't add anything to that. I agree it's
>> sometimes hard to know how to set parallelism, though Hadoop's guesses are
>> usually good.
>>
>> When I see Hadoop's guesses are too low, it's because input is too small
>> to create enough input shards. This is a different issue.
>>
>> So I guess I'm wondering what the concrete change here could be, for
>> discussion, since it's marked for 0.4.
>>
>>> Usability: Mahout applications need a consistent API to allow users to
>>> specify desired map/reduce concurrency
>>>
>>> -------------------------------------------------------------------------------------------------------------
>>>
>>>                 Key: MAHOUT-414
>>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-414
>>>             Project: Mahout
>>>          Issue Type: Bug
>>>    Affects Versions: 0.3
>>>            Reporter: Jeff Eastman
>>>             Fix For: 0.4
>>>
>>>
>>> If specifying the number of mappers and reducers is a common activity
>>> that users need to perform when running Mahout applications on Hadoop
>>> clusters, then we need a standard way of specifying them in our APIs
>>> without exposing the full set of Hadoop options, especially for our
>>> non-power-users. This is the case for some applications already but others
>>> require the use of Hadoop-level -D arguments to achieve reasonable
>>> out-of-the-box parallelism even when running our examples. The usability
>>> defect is that some of our algorithms won't scale without it and that we
>>> don't have a standard way to express this in our APIs.