Let me try:
On Wed, Sep 22, 2010 at 3:32 PM, Jeff Eastman
<[email protected]> wrote:
> The clustering drivers all call new Configuration() in their
> implementations. When run only from the CLI, other Mahout jobs call
> getConf() which is where the -D arguments get pulled in (right?). So there
This comes from using ToolRunner.run(). It sets up all those args and
then calls Tool.run(). So when you implement Tool, by the time run()
executes, the Configuration returned by getConf() already has all that
stuff in it.
Inside, it's org.apache.hadoop.util.GenericOptionsParser that does that work.
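To make that flow concrete, here is a plain-Java sketch of the relevant part of what GenericOptionsParser does (this is illustrative, not Hadoop's actual code; DArgSketch and parseDArgs are made-up names): it pulls -Dkey=value pairs out of the command-line args, and those values are what later come back from getConf().

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: mimics how GenericOptionsParser turns
// "-Dkey=value" arguments into Configuration entries, so that a Tool
// invoked via ToolRunner.run() sees them through getConf().
public class DArgSketch {

  static Map<String, String> parseDArgs(String[] args) {
    Map<String, String> conf = new HashMap<>();
    for (String arg : args) {
      if (arg.startsWith("-D")) {
        String pair = arg.substring(2);
        int eq = pair.indexOf('=');
        if (eq > 0) {
          conf.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
      }
    }
    return conf;
  }

  public static void main(String[] args) {
    Map<String, String> conf =
        parseDArgs(new String[] {"-Dmapred.reduce.tasks=4", "input", "output"});
    System.out.println(conf.get("mapred.reduce.tasks")); // prints 4
  }
}
```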
I think your point is that this doesn't hold up for the case of
invoking from some arbitrary Java calling code. Yes, in that case, the
caller might have to populate a Configuration object (or be able to
modify it) to pass this sort of setting. At least that's how I'd play
it.
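For the embedded case, the calling pattern might look something like this sketch, where ClusterDriver and its run() signature are hypothetical stand-ins for a Mahout driver (a plain Map plays the role of Hadoop's Configuration): the caller populates the settings itself, since no ToolRunner/-D parsing ever happens.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the embedded-Java calling pattern under
// discussion. ClusterDriver and its signature are illustrative, not
// Mahout's actual API: the caller builds the configuration itself and
// hands it to the driver, because getConf() is null outside ToolRunner.
public class EmbeddedCallerSketch {

  static class ClusterDriver {
    // The driver reads settings from the supplied configuration
    // rather than calling getConf().
    static int run(Map<String, String> conf, String input, String output) {
      return Integer.parseInt(conf.getOrDefault("mapred.reduce.tasks", "1"));
    }
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put("mapred.reduce.tasks", "8"); // what -D would have set on the CLI
    int reducers = ClusterDriver.run(conf, "in", "out");
    System.out.println(reducers); // prints 8
  }
}
```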
But then the question of adding a new command-line argument doesn't
help this use case anyway.
Am I following?
> And what was the PMD/Checkstyle problem with instance methods on the drivers
> that motivated the regression to statics? I hate statics.
The reasoning was simply that the methods used no instance methods or
members. They were already "really" static methods.
I have little problem with the hard-line OO approach that even such
Driver classes ought to be full of instance methods anyway, and
perhaps have this bit of glue to the non-object-oriented world at the
end:
public static void main(String[] args) {
  new Foo().doIt();
}
... but I guess I'm saying it did not seem to be written that way?
Things were passed around as method args when they could otherwise be
instance members. So it looked like the intent was a static method
anyhow.
>
> On 9/22/10 10:18 AM, Sean Owen wrote:
>>
>> Oh this smells like a solvable problem for sure.
>>
>> The Job eventually has a Configuration object; what exactly is the
>> flow where it doesn't? Surely that is fixable. That should run around
>> with the Job, and within that you can set whatever you like. Shouldn't
>> need more API changes.
>>
>> I don't see what the static-ness has to do with it then?
>>
>> On Wed, Sep 22, 2010 at 2:52 PM, Jeff Eastman
>> <[email protected]> wrote:
>>>
>>> What you say is true from the command line, but currently there is
>>> no way except via explicit arguments to control this from Java
>>> drivers. The run() commands get a Configuration from AbstractJob via
>>> getConf(), but this returns null when calling from Java. I guess we
>>> could change the job/run methods to accept a configuration argument
>>> in place of the numReducers.
>>>
>>> The clustering drivers create a new configuration in those methods
>>> (not calling getConf()) right now, setting the job parameters from
>>> explicit arguments. I'll take a look at refactoring this and see if
>>> there is time to do it by end of next week. Probably is, if this is
>>> at the top of my list, but I will check.
>>>
>>> Actually, you changed all the clustering driver methods back to
>>> statics while fixing PMD/Checkstyle issues (r990892), and so
>>> getConf() cannot even be called from them!
>>>
>>> On 9/22/10 3:12 AM, Sean Owen (JIRA) wrote:
>>>>
>>>> [ https://issues.apache.org/jira/browse/MAHOUT-414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913434#action_12913434 ]
>>>>
>>>> Sean Owen commented on MAHOUT-414:
>>>> ----------------------------------
>>>>
>>>> I tend to think this is, in fact, a Hadoop-level configuration. At
>>>> times a job may wish to force concurrency -- 1 job only when it
>>>> knows there is no parallelism available, or 2x more reducers than
>>>> mappers when that's known to be good.
>>>>
>>>> Users can control this already via Hadoop. Letting them control it
>>>> via duplicate command line parameters doesn't add that. I agree,
>>>> it's sometimes hard to know how to set parallelism, though Hadoop's
>>>> guesses are good.
>>>>
>>>> When I see Hadoop's guesses are too low, it's because input is too
>>>> small to create enough input shards. This is a different issue.
>>>>
>>>> So I guess I'm wondering what the concrete change here could be,
>>>> for discussion? Since it's marked as 0.4.
>>>>
>>>>> Usability: Mahout applications need a consistent API to allow
>>>>> users to specify desired map/reduce concurrency
>>>>> -------------------------------------------------------------------------------------------------------------
>>>>>
>>>>>                 Key: MAHOUT-414
>>>>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-414
>>>>>             Project: Mahout
>>>>>          Issue Type: Bug
>>>>>    Affects Versions: 0.3
>>>>>            Reporter: Jeff Eastman
>>>>>             Fix For: 0.4
>>>>>
>>>>> If specifying the number of mappers and reducers is a common
>>>>> activity which users need to perform in running Mahout
>>>>> applications on Hadoop clusters, then we need to have a standard
>>>>> way of specifying them in our APIs without exposing the full set
>>>>> of Hadoop options, especially for our non-power-users. This is the
>>>>> case for some applications already, but others require the use of
>>>>> Hadoop-level -D arguments to achieve reasonable out-of-the-box
>>>>> parallelism even when running our examples. The usability defect
>>>>> is that some of our algorithms won't scale without it and that we
>>>>> don't have a standard way to express this in our APIs.