[GSOC] Array of questions

Shannon Quinn Tue, 25 May 2010 20:11:02 -0700

Hi all,

Had a few questions regarding my project; I've been watching the mailing
list carefully, so I apologize if the answers to these questions snuck by
me.


1) There was a thread early on about having the students keep blogs about
their projects. Should we use our own blogs for this, post somewhere on the
Mahout wiki, or do something else?

2) In getting a feel for Mahout, I've been running a few of the examples on
my own, and have noticed that if I supply the "-h" argument by itself to
some of the available programs, I get an exception, followed by the list of
available options for that program. For instance, in running "./bin/mahout
seq2sparse -h":

May 25, 2010 10:21:09 PM org.slf4j.impl.JCLLoggerAdapter error
SEVERE: Exception
org.apache.commons.cli2.OptionException: Missing required option --output
at org.apache.commons.cli2.option.DefaultOption.validate(DefaultOption.java:172)
at org.apache.commons.cli2.option.GroupImpl.validate(GroupImpl.java:265)
at org.apache.commons.cli2.commandline.Parser.parse(Parser.java:104)
at org.apache.mahout.text.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:123)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:171)
Usage:

[...snip...]

Running "kmeans" and "seqdirectory" will give me the same error. Running
"clusterdump", however, gave me only the "Usage" output, not an exception. I
tested this on the latest Mahout build from the trunk. If I supply all the
needed arguments to run some sort of test using data, it works just fine.
Just wondering if I'm doing something wrong?
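
One guess, and it is only a guess, is that the required options get
validated before the "-h" flag is ever looked at. The sketch below is a
self-contained illustration of that ordering issue. It is not Mahout's or
commons-cli2's actual code; the class and method names are hypothetical, and
"--output" is borrowed from the exception message above.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; not Mahout's or commons-cli2's code. */
public class HelpFlagSketch {

    /** Validates required options first, so "-h" alone fails. */
    public static String parseValidateFirst(String[] args) {
        List<String> argList = Arrays.asList(args);
        if (!argList.contains("--output")) {
            throw new IllegalArgumentException("Missing required option --output");
        }
        if (argList.contains("-h")) {
            return "Usage";
        }
        return "Run";
    }

    /** Checks for the help flag first, so "-h" alone just yields usage. */
    public static String parseHelpFirst(String[] args) {
        List<String> argList = Arrays.asList(args);
        if (argList.contains("-h")) {
            return "Usage";
        }
        if (!argList.contains("--output")) {
            throw new IllegalArgumentException("Missing required option --output");
        }
        return "Run";
    }

    public static void main(String[] args) {
        String[] helpOnly = {"-h"};
        try {
            System.out.println(parseValidateFirst(helpOnly));
        } catch (IllegalArgumentException e) {
            // Validation ran before the help check, so we end up here.
            System.out.println("Exception: " + e.getMessage());
        }
        // The help check ran first, so no exception is thrown.
        System.out.println(parseHelpFirst(helpOnly));
    }
}
```

If that is what is going on, it would also be consistent with "clusterdump"
printing only the usage text, but I have not checked its source.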

3) Saving the best for last :) I'm having some trouble getting off the
ground with my code, and was hoping I could bounce some ideas around with
the experts who are intimately familiar with the architecture (I'm working
to implement MAHOUT-363).

I see that there are drivers associated with each program, specifically to
handle initializing the algorithm and processing inputs from the command line.
It looks like the functions main() and runJob() are always present, along
with other high-level methods (though private) that invoke the other classes
in the respective clustering folders. I also see all the clustering
algorithms have mapper classes, responsible for implementing the map/reduce
portion.

I suppose my question here is, is there a "standard" architecture for
implementing a new algorithm aside from what I've just described (i.e. am I
missing anything)? Are there any superclasses/interfaces I should pay
particular attention to in terms of extending/implementing? Also, my
algorithm depends in no small part on calculating eigenvalues and
eigenvectors of affinity/Markov transition matrices, and I understand that
Mahout has a parallel eigensolver implemented. How can I make use of this?
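
To make the eigenvector part of the question concrete, here is a minimal
single-machine sketch of the kind of computation I mean. It row-normalizes a
small affinity matrix into a Markov transition matrix and then uses plain
power iteration to approximate the dominant left eigenvector (the stationary
distribution). This is not Mahout's parallel eigensolver and uses no Mahout
classes; the class name, method names, and the 2x2 example matrix are all
made up for illustration, and power iteration only recovers the dominant
eigenvector, not the full decomposition.

```java
import java.util.Arrays;

/** Hypothetical in-memory sketch; not Mahout's eigensolver. */
public class MarkovEigenSketch {

    /** Row-normalizes a non-negative affinity matrix (assumes positive row sums). */
    public static double[][] toTransitionMatrix(double[][] affinity) {
        int n = affinity.length;
        double[][] p = new double[n][n];
        for (int i = 0; i < n; i++) {
            double rowSum = 0.0;
            for (int j = 0; j < n; j++) {
                rowSum += affinity[i][j];
            }
            for (int j = 0; j < n; j++) {
                p[i][j] = affinity[i][j] / rowSum;
            }
        }
        return p;
    }

    /** Power iteration pi <- pi * P, starting from the uniform distribution. */
    public static double[] stationaryDistribution(double[][] p, int iterations) {
        int n = p.length;
        double[] pi = new double[n];
        Arrays.fill(pi, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    next[j] += pi[i] * p[i][j];
                }
            }
            // Rows of P sum to 1, so pi keeps summing to 1 without renormalizing.
            pi = next;
        }
        return pi;
    }

    public static void main(String[] args) {
        double[][] affinity = {{1.0, 1.0}, {1.0, 3.0}}; // made-up symmetric example
        double[][] p = toTransitionMatrix(affinity);
        double[] pi = stationaryDistribution(p, 100);
        System.out.println(Arrays.toString(pi));
    }
}
```

For this symmetric example the result should be proportional to the row sums
of the affinity matrix, i.e. approximately (1/3, 2/3). What I need is the
distributed equivalent of this for large matrices.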

And as always, any other "general" pointers regarding the overall Mahout
architecture? :)

I apologize if any of these questions are pedantic...and I apologize in
advance, as I'll likely be a regular fixture here with more questions as the
summer continues :) Thank you very much for your help!

Regards,
Shannon
