RE: MahoutDriver (get it?) Perhaps we should have called it MahoutMahout?
At any rate, very good stuff.

On Mar 2, 2010, at 3:10 PM, Jake Mannix wrote:

> Hey all,
>
> Just an update on the new-and-improved command-line "UI" we have now.
> After a ton of iterations back and forth with Drew (thanks!), MAHOUT-301
> has been committed. It makes it easy to trim down the long command lines
> for most of our *Driver main() methods: you save your default command-line
> arguments for each driver in a properties file (overridable on the command
> line), and this works either locally or on Hadoop. The feature set is as
> follows (usage after that):
>
> Either from the binary distribution or from source (after having done "mvn
> install", naturally), this is the setup. There are a bunch of properties
> files with a kludgey format (because I didn't want to dig into the XML
> rathole, and while a flexible schema is nice, I opted to follow the
> YAGNI principle):
>
> *) there is a new directory "conf" at the top level (of the binary dist,
> as well as source), which contains a bunch of *.props files. One special
> one, driver.classes.props, maps the fully-qualified name of a class that
> has a main() method (the keys) to a "short-name" and brief description
> (the values).
> The current file is just the following:
>
> ###
> org.apache.mahout.utils.vectors.VectorDumper = vectordump : Dump vectors from a sequence file to text
> org.apache.mahout.utils.clustering.ClusterDumper = clusterdump : Dump cluster output to text
> org.apache.mahout.utils.SequenceFileDumper = seqdumper : Generic Sequence File dumper
> org.apache.mahout.clustering.kmeans.KMeansDriver = kmeans : K-means clustering
> org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver = fkmeans : Fuzzy K-means clustering
> org.apache.mahout.clustering.lda.LDADriver = lda : Latent Dirichlet Allocation
> org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver = fpg : Frequent Pattern Growth
> org.apache.mahout.clustering.dirichlet.DirichletDriver = dirichlet : Dirichlet Clustering
> org.apache.mahout.clustering.meanshift.MeanShiftCanopyDriver = meanshift : Mean Shift clustering
> org.apache.mahout.clustering.canopy.CanopyDriver = canopy : Canopy clustering
> org.apache.mahout.utils.vectors.lucene.Driver = lucene.vector : Generate Vectors from a Lucene index
> org.apache.mahout.text.SequenceFilesFromDirectory = seqdirectory : Generate sequence files (of Text) from a directory
> org.apache.mahout.text.SparseVectorsFromSequenceFiles = seq2sparse: Sparse Vector generation from Text sequence files
> org.apache.mahout.text.WikipediaToSequenceFile = seqwiki : Wikipedia xml dump to sequence file
> org.apache.mahout.classifier.bayes.TestClassifier = testclassifier : Test Bayes Classifier
> org.apache.mahout.classifier.bayes.TrainClassifier = trainclassifier : Train Bayes Classifier
> org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver = svd : Lanczos Singular Value Decomposition
> org.apache.mahout.math.hadoop.decomposer.EigenVerificationJob = cleansvd : Cleanup and verification of SVD output
> ###
>
> It's meant to be read into java.util.Properties, where the values on the
> right hand side are further split by ":" into the short-name (to be used
> on the command-line) and the description (printed to stdout if invalid
> input is given or "-h" is used with no class name to run). *If there are
> missing classes from this list, please add them!*
>
> *) there are also a bunch of files in conf/ which are named
> <shortName>.props, where <shortName> is one of the short names from
> driver.classes.props above. These files are mostly empty now (well,
> commented out), but for example, conf/svd.props is currently:
>
> #i|input =
> #o|output =
> #nr|numRows =
> #nc|numCols =
> #r|rank =
> #t|tempDir =
>
> The format of these props files is that the key is of the form
> "singleDashCmdLineOpt|doubleDashCmdLineOpt" (if there is no "|" in the key,
> the short and long form will be assumed to be the same) and the value is
> whatever you would want that option to be (options with no value are not
> currently supported; this is a TODO). So for example if you had a
> command line such as:
>
>     $MAHOUT_HOME/bin/mahout svd --input /path/to/input -o /path/to/output -nr <numRows> --numCols <numCols> -r <rank> -t /tmp/svd
>
> You could just uncomment the lines in conf/svd.props as
>
> i|input = /path/to/input
> o|output = /path/to/output
> nr|numRows = <numRows>
> nc|numCols = <numCols>
> r|rank = <rank>
> t|tempDir = /tmp/svd
>
> and run as
>
>     $MAHOUT_HOME/bin/mahout svd
>
> If you wanted to run a second time, but you didn't want to overwrite your
> old results, you could then do
>
>     $MAHOUT_HOME/bin/mahout svd -o /path/to/newOutput
>
> which would override /path/to/output and instead use /path/to/newOutput,
> with all the other properties coming from svd.props.
>
> *) the $MAHOUT_HOME/conf directory is just a template - the mahout shell
> script adds $MAHOUT_CONF_DIR to the classpath (or $MAHOUT_HOME/conf if
> $MAHOUT_CONF_DIR is not defined), and MahoutDriver reads the properties
> files from the classpath.
>
> *) running on Hadoop: if your $HADOOP_HOME and $HADOOP_CONF_DIR are set,
> the mahout shell script automatically launches your requested main method
> on your Hadoop cluster, otherwise it's run locally.
>
> *) if your main() isn't defined in driver.classes.props, that's ok,
> it'll still run via:
>
>     $MAHOUT_HOME/bin/mahout org.apache.mahout.blah.blah.SomeOtherDriver [remaining args]
>
> and in fact, if you put "org.apache.mahout.blah.blah.SomeOtherDriver.props"
> on your classpath, and it has the format for the <shortName>.props listed
> above, it will be used for default properties for this class.
>
> --------
>
> I'll put this up in some nicer form for the wiki in the next couple of days.
>
> Try out various driver classes that you use - we all use different ones, so
> getting some dev/user manual test coverage would be nice, because it's kinda
> tricky to unit test shell scripts and command line args and env variables
> (and running on a real cluster, etc...). We should try to fix any bugs
> before release.
>
> Feedback welcome. It's hacky, but it adds some useful functionality, and we
> can clean up the props-file syntax (or ditch it for xml/yaml/json/whatever)
> as needed later.
>
> -jake

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
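
P.S. For anyone trying to picture the props-file semantics Jake describes above, here is a minimal Python sketch of the two rules: splitting a driver.classes.props value on ":" into short name and description, and treating "short|long" keys as defaults that the command line overrides. This is not MahoutDriver's actual code (that is Java and reads the files with java.util.Properties); the function names parse_props, split_driver_entry and merge_args are hypothetical, and the parser only handles plain "key = value" lines.

```python
def parse_props(text):
    """Parse simple 'key = value' lines; skip blanks and '#' comments.

    Assumption: a simplified stand-in for java.util.Properties, which
    accepts more syntax (other separators, escapes, continuations).
    """
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def split_driver_entry(value):
    """Split a driver.classes.props value into (short_name, description)."""
    short_name, _, description = value.partition(":")
    return short_name.strip(), description.strip()


def merge_args(props, cli_args):
    """Append props defaults for every option not given on the command line.

    Keys look like 'short|long'; with no '|', short and long are the same.
    Assumption: any command-line token starting with '-' is an option name.
    """
    given = {arg for arg in cli_args if arg.startswith("-")}
    merged = list(cli_args)
    for key, value in props.items():
        short, sep, long_name = key.partition("|")
        if not sep:
            long_name = short
        if "-" + short in given or "--" + long_name in given:
            continue  # the command line wins over the props file
        merged += ["--" + long_name, value]
    return merged


SVD_PROPS = """\
i|input = /path/to/input
o|output = /path/to/output
t|tempDir = /tmp/svd
"""

print(split_driver_entry("svd : Lanczos Singular Value Decomposition"))
# ('svd', 'Lanczos Singular Value Decomposition')

print(merge_args(parse_props(SVD_PROPS), ["-o", "/path/to/newOutput"]))
# ['-o', '/path/to/newOutput', '--input', '/path/to/input', '--tempDir', '/tmp/svd']
```

The second call mirrors the "mahout svd -o /path/to/newOutput" example: the output path comes from the command line, and everything else falls back to the props defaults.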