On 26 October 2011 21:54, Jeff Eastman <[email protected]> wrote:
> I use the CLI about half the time. For many applications, esp. well-baked
> clustering & seq2sparse, I use the CLI. Other times I need to use a Java API
> because I'm building a custom job in Java and it is just easier than doing
> the scripting.
>
> Is your point that there is a CLI epic that is desired by Mahout users?
I'm not sure what Mahout users desire! Might be interesting to ask
somehow. The project has different faces for different contexts, and
that seems fine. The reason I asked: I was a bit surprised the other day
to find one of the commandline jobs ('rowid' I think) just plain
failing, which led me to wonder whether the tool saw a lot of use. I
didn't even notice it for several months of using Mahout via API.
But mostly it was sheer curiosity. It's good to know how these tools
are actually being used (and which parts).
Also I've been thinking in very fuzzy terms about how to compose
larger tasks from smaller pieces, and wondering what might be a more
principled way of doing this than running each bin/mahout job by hand.
Obviously coding it up is one way, but also little shell scripts or
makefiles or (if forced at gunpoint) maybe Ant ...?
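For the shell-script route, here's a minimal sketch of what I have in mind (the `run_step` helper is my own invention, and the `true` commands are stubs standing in for real invocations like `bin/mahout seqdirectory -i ... -o ...`, so the sketch runs anywhere):

```shell
#!/bin/sh
# Sketch: chain Mahout CLI jobs, aborting the whole pipeline if one fails.
set -u

# Run one named stage; stop everything on a non-zero exit status.
run_step() {
  step_name=$1; shift
  echo "running: $step_name"
  if ! "$@"; then
    echo "step failed: $step_name" >&2
    exit 1
  fi
}

# Stubs here; a real chain would call bin/mahout seqdirectory, seq2sparse, etc.
run_step "seqdirectory" true
run_step "seq2sparse"   true
echo "pipeline finished"
```

It's not much, but it at least gives each sub-job a place to fail loudly, which is the part I'm unsure Pig gives me for free.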
Oh, also I poked a little into using Apache Ant for that layer wrapped
around mahout jobs, see notes in
http://mail-archives.apache.org/mod_mbox/mahout-user/201109.mbox/raw/%3CCAFNgM+YaPgFzEJP+sDNXPKfxi0AmAY7_sN79FN2=1cvjl8c...@mail.gmail.com%3E
...but hit a wall there. I managed to blend in some jobs, e.g. as Pig
script mixing Mahout-ese collocations with Pig Latin for filters:
  reuters_phrases = collocations('/user/danbri/migtest/reuters-out-seqdir', IGNORE);
  political_phrases = FILTER reuters_phrases BY phrase MATCHES '.*(president|government|election).*' AND score > (float)10;
...but when I tried integrating seqdirectory the same way, I couldn't
get it working. I'm also not fluent enough with Pig to be sure it's the
right environment for error handling, i.e. figuring out what to do when
some sub-job fails. I've been meaning to check out
https://issues.apache.org/jira/browse/MAHOUT-612 as that seems to
offer some more principled hope in same direction.
In my last (bibliographic dataset) experiments my initial db was small
enough that I could do some of the pre- and post-processing in nasty
Ruby scripts; however I want to move that work to a much larger
dataset. This might be a good excuse to see whether Python UDFs in Pig
are any use for that kind of glue/hacking/integration work. All of
which is a long-winded way of wondering out loud how much of the
'small things loosely-joined' unix-y philosophy makes sense on the
Mahout commandline...
cheers,
Dan