Hi all,

I had a few questions regarding my project. I've been watching the mailing list carefully, so I apologize if the answers to these questions snuck by me.

1) There was a thread early on regarding having the students keep blogs about their projects. Should we be using our own blogs for this purpose, or do this somewhere on the Mahout wiki, or somewhere else?

2) In getting a feel for Mahout, I've been running a few of the examples on my own, and have noticed that if I supply the "-h" argument by itself to some of the available programs, I get an exception, followed by the list of available options for that program. For instance, running "./bin/mahout seq2sparse -h" gives:

May 25, 2010 10:21:09 PM org.slf4j.impl.JCLLoggerAdapter error
SEVERE: Exception
org.apache.commons.cli2.OptionException: Missing required option --output
    at org.apache.commons.cli2.option.DefaultOption.validate(DefaultOption.java:172)
    at org.apache.commons.cli2.option.GroupImpl.validate(GroupImpl.java:265)
    at org.apache.commons.cli2.commandline.Parser.parse(Parser.java:104)
    at org.apache.mahout.text.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:123)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:171)
Usage:
[...snip...]

Running "kmeans" and "seqdirectory" gives me the same error. Running "clusterdump", however, gave me only the "Usage" output, not an exception. I tested this on the latest Mahout build from the trunk. If I supply all the needed arguments to run some sort of test using data, it works just fine. Am I doing something wrong?
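
For what it's worth, my reading of the stack trace is that the required options get validated during parsing, before anything checks for "-h", and that the usage text is then printed from the error handler. I haven't confirmed this in the source, so this is only a guess. A minimal sketch of the two orderings follows; every name in it is hypothetical and it uses neither the commons-cli2 nor the Mahout classes:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these names come from Mahout or commons-cli2.
class HelpFlagSketch {

    // Stand-in for a parser that validates required options while parsing.
    static void parseStrict(List<String> args) {
        if (!args.contains("--output")) {
            throw new IllegalArgumentException("Missing required option --output");
        }
    }

    // Ordering I suspect: parse (and validate) first, so "-h" alone fails
    // validation and the usage text only appears via the catch block.
    static String validateFirst(String[] argv) {
        List<String> args = Arrays.asList(argv);
        try {
            parseStrict(args);
            if (args.contains("-h")) {
                return "Usage: ...";
            }
            return "run job";
        } catch (IllegalArgumentException e) {
            return "SEVERE: " + e.getMessage() + "\nUsage: ...";
        }
    }

    // Alternative ordering: look for the help flag before validating.
    static String helpFirst(String[] argv) {
        List<String> args = Arrays.asList(argv);
        if (args.contains("-h")) {
            return "Usage: ...";
        }
        try {
            parseStrict(args);
            return "run job";
        } catch (IllegalArgumentException e) {
            return "SEVERE: " + e.getMessage() + "\nUsage: ...";
        }
    }

    public static void main(String[] args) {
        String[] helpOnly = {"-h"};
        System.out.println(validateFirst(helpOnly));
        System.out.println(helpFirst(helpOnly));
    }
}
```

If that guess is right, the programs that print only "Usage" would be the ones following something like the second ordering.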

3) Saving the best for last :) I'm having some trouble getting off the ground with my code, and was hoping I could bounce some ideas around with the experts who are intimately familiar with the architecture (I'm working to implement MAHOUT-363).

I see that there are drivers associated with each program, specifically to handle initializing the algorithm and processing inputs from the command line. It looks like the functions main() and runJob() are always present, along with other high-level methods (though private) that invoke the other classes in the respective clustering folders. I also see that all the clustering algorithms have mapper classes, responsible for implementing the map/reduce portion.

I suppose my question here is: is there a "standard" architecture for implementing a new algorithm aside from what I've just described (i.e. am I missing anything)? Are there any superclasses/interfaces I should pay particular attention to in terms of extending/implementing?

Also, my algorithm depends in no small part on calculating eigenvalues and eigenvectors of affinity/Markov transition matrices, and I understand that Mahout has a parallel eigensolver implemented. How can I make use of this?

And as always, any other "general" pointers regarding the overall Mahout architecture? :)

I apologize if any of these questions are pedantic...and I apologize in advance, as I'll likely be a regular fixture here with more questions as the summer continues :)

Thank you very much for your help!

Regards,
Shannon
