Exactly my point. The current clustering drivers, and a lot of others,
don't call ToolRunner in their main methods; they do new Driver().run().
This needs to be changed everywhere. The job() methods currently create new
Configuration objects since they are invoked mostly from Java in unit
tests and layered jobs (e.g. synthetic control). I've got a version of
Canopy that does call ToolRunner and it does return a populated
Configuration from getConf() but, since the job methods are now static,
they can't call it; it needs to be an explicit argument. So, I've added
conf as the first parameter to job() (and left a convenience version
without it), and that seems to work.
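A minimal sketch of that shape, for illustration only. It is not the actual Canopy driver code: java.util.Properties stands in for Hadoop's Configuration so the example runs without Hadoop on the classpath, and the class name, the job() parameters, and the string return value are all hypothetical.

```java
import java.util.Properties;

public class CanopyJobSketch {

  // Static job method that takes the configuration as an explicit first
  // argument, because a static method cannot call the instance getConf().
  // Properties is a stand-in for org.apache.hadoop.conf.Configuration.
  static String job(Properties conf, String input, String output) {
    String value = conf.getProperty("foo.bar.baz", "unset");
    return input + " -> " + output + " (foo.bar.baz=" + value + ")";
  }

  // Convenience version without conf, for unit tests and layered jobs
  // that do not care about configuration settings.
  static String job(String input, String output) {
    return job(new Properties(), input, output);
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty("foo.bar.baz", "11");
    System.out.println(job(conf, "in", "out")); // caller supplies the conf
    System.out.println(job("in", "out"));       // convenience overload
  }
}
```

The convenience overload keeps existing Java callers working unchanged, while the CLI path can hand over whatever configuration it was given.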
Now I'm trying to use a -D argument to set a configuration parameter but
the parser won't accept it. I've tried -D foo.bar.baz=11 and
-Dfoo.bar.baz=11 with no joy on either. What is the correct syntax?
On the separate question of explicit numReducers arguments to the Java
methods and the CLI I'm all for doing it consistently. It's more work
for Java callers to create and set the conf parameter than it is with an
explicit argument but most current callers would use the convenience
method anyway.
On the static conversions themselves, new Foo().run() is how they used
to do it but, as you noted earlier, it should be ToolRunner.run(conf,
new Foo(), args) anyway. Since run() *is* an instance method it seemed more
correct to have the methods it called also be instance methods. In
clustering, the methods used to be static when I wrote them so I can't
claim to be an OO purist, though I still don't like them. Just trying to
sort out the motivation for the change: was this PMD, Checkstyle, or
Seanstyle <g>?
On 9/22/10 1:53 PM, Sean Owen wrote:
Let me try
On Wed, Sep 22, 2010 at 3:32 PM, Jeff Eastman wrote:
The clustering drivers all call new Configuration() in their
implementations. When run only from the CLI, other Mahout jobs call
getConf() which is where the -D arguments get pulled in (right?). So there
This comes from using ToolRunner.run(). It sets up all those args, and
then calls Tool.run(). So when you implement Tool, in run(), the
result of getConf() has all that stuff.
Inside, it's org.apache.hadoop.util.GenericOptionsParser that does that work.
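The flow described here can be modeled roughly as follows. This is a standard-library-only sketch of the idea, not Hadoop's GenericOptionsParser: the class and method names are hypothetical, Properties stands in for Configuration, and the sketch accepts both the two-token and one-token -D spellings purely for illustration, without claiming which forms the real parser accepts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class GenericOptionsSketch {

  // Moves every -D key=value pair from args into conf and returns the
  // arguments that are left over for the tool itself.
  static String[] parseGenericOptions(Properties conf, String[] args) {
    List<String> remaining = new ArrayList<>();
    for (int i = 0; i < args.length; i++) {
      String arg = args[i];
      String pair = null;
      if (arg.equals("-D") && i + 1 < args.length) {
        pair = args[++i];          // two tokens: -D key=value
      } else if (arg.startsWith("-D") && arg.length() > 2) {
        pair = arg.substring(2);   // one token: -Dkey=value
      }
      if (pair == null) {
        remaining.add(arg);        // not a generic option, pass it through
        continue;
      }
      int eq = pair.indexOf('=');
      if (eq <= 0) {
        throw new IllegalArgumentException("Expected key=value but got: " + pair);
      }
      conf.setProperty(pair.substring(0, eq), pair.substring(eq + 1));
    }
    return remaining.toArray(new String[0]);
  }

  // Plays the role of Tool.run(): by the time it is called, conf already
  // holds the -D settings and args holds only the tool's own options.
  static int run(Properties conf, String[] args) {
    System.out.println("foo.bar.baz=" + conf.getProperty("foo.bar.baz"));
    System.out.println("tool args: " + String.join(" ", args));
    return 0;
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    String[] toolArgs = parseGenericOptions(conf, args);
    System.exit(run(conf, toolArgs));
  }
}
```

The point of the split is that run() never sees the -D arguments; it only reads the populated configuration.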
I think your point is that this doesn't hold up for the case of
invoking from some arbitrary Java calling code. Yes, in that case, the
caller might have to populate a Configuration object (or be able to
modify it) to pass this sort of setting. At least that's how I'd play
it.
But then the question of adding a new command-line argument doesn't
help this use case anyway.
Am I following?
And what was the PMD/Checkstyle problem with instance methods on the drivers
that motivated the regression to statics? I hate statics.
The reasoning was simply that the methods used no instance methods or
members. It was already "really" a static method.
I have little problem with the hard-line OO approach that even such
Driver classes ought to be full of instance methods anyway, and
perhaps have this bit of glue to the non-object-oriented world at the
end:
  public static void main(String[] args) {
    new Foo().doIt();
  }
... but I guess I'm saying it did not seem to be written that way?
Things were passed around as method args when they could otherwise be
instance members. So it looked like the intent was a static method
anyhow.