I've converted all the clustering to this model and am about to
commit. I added Configuration argument to all the Java methods and
removed numReducers. I also deprecated the
DefaultOptionsCreator.numReducersOption. I'm actually starting to like
putting all the Hadoop arguments into Configuration because the Java
methods now have only algorithm-specific arguments.
On 9/22/10 5:14 PM, Sean Owen wrote:
You probably know more about this than I, but conceptually, all the
params for a Hadoop job are in a Configuration. This includes
command-line Hadoop args and anything else the code tosses in.
One way or another the Job has to get such a Configuration. ToolRunner
takes care of this in the command line case. A Java caller would have
to also find a way to construct a Configuration.
That says to me that something somewhere takes a Configuration as
input in order to run the job. And that something can be called from a
main() method plus ToolRunner for the command line case, or directly
as a Java method.
I think that's about the right idea in theory, don't know how it
meshes with practice?
No the static method thing is something Findbugs and PMD flag, as well
as IntelliJ. It's not always true that a method should be static
because it can (for example if you design for inheritance) but I had
thought this was not the nature of a Driver.
On Wed, Sep 22, 2010 at 7:32 PM, Jeff Eastman
<[email protected]> wrote:
Exactly my point. Current clustering and a lot of other drivers don't call
ToolRunner in their main method; they do new Driver().run(). This needs to
be changed everywhere. The job() methods currently create new Configuration
objects since they are invoked mostly from Java in unit tests and layered
jobs (e.g. synthetic control). I've got a version of Canopy that does call
ToolRunner and it does return a populated Configuration from getConf() but,
since the job methods are now static, they can't call it; it needs to be an
explicit argument. So, I've added conf as the first parameter to job() (and
left a convenience version without it), and that seems to work.
Now I'm trying to use a -D argument to set a configuration parameter but the
parser won't accept it. I've tried -D foo.bar.baz=11 and -Dfoo.bar.baz=11
with no joy on either. What is the correct syntax?
On the separate question of explicit numReducers arguments to the Java
methods and the CLI I'm all for doing it consistently. It's more work for
Java callers to create and set the conf parameter than it is with an
explicit argument but most current callers would use the convenience method
anyway.
On the static conversions themselves, new Foo().run() is how they used to do
it but, as you noted earlier, it should be ToolRunner.run(class, conf, args)
anyway. Since run() *is* an instance method it seemed more correct to have
the methods it called also be instance methods. In clustering, the methods
used to be static when I wrote them so I can't claim to be an OO purist,
though I still don't like them. Just trying to sort out the motivation for
the change: was this PMD, Checkstyle, or Seanstyle<g>?
On 9/22/10 1:53 PM, Sean Owen wrote:
Let me try
On Wed, Sep 22, 2010 at 3:32 PM, Jeff Eastman
<[email protected]> wrote:
The clustering drivers all call new Configuration() in their
implementations. When run only from the CLI, other Mahout jobs call
getConf() which is where the -D arguments get pulled in (right?). So
there
This comes from using ToolRunner.run(). It sets up all those args, and
then calls Tool.run(). So when you implement Tool, in run(), the
result of getConf() has all that stuff.
Inside, it's org.apache.hadoop.util.GenericOptionsParser that does that
work.
I think your point is that this doesn't hold up for the case of
invoking from some arbitrary Java calling code. Yes, in that case, the
caller might have to populate a Configuration object (or be able to
modify it) to pass this sort of setting. At least that's how I'd play
it.
But then the question of adding a new command-line argument doesn't
help this use case anyway.
Am I following?
And what was the PMD/Checkstyle problem with instance methods on the
drivers
that motivated the regression to statics? I hate statics.
The reasoning was simply that the methods used no instance methods or
members. It was already "really" a static method.
I have little problem with the hard-line OO approach that even such
Driver classes ought to be full of instance methods anyway, and
perhaps have this bit of glue to the non-object-oriented world at the
end:
public static void main(String[] args) {
new Foo().doIt();
}
... but I guess I'm saying it did not seem to be written that way?
Things were passed around as method args when they could otherwise be
instance members. So it looked like the intent was a static method
anyhow.