[ https://issues.apache.org/jira/browse/MAHOUT-180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jake Mannix updated MAHOUT-180:
-------------------------------

    Attachment: MAHOUT-180.patch

Ok, ugly, dirty patch which needs to be cleaned up, but it does work, in some circumstances, for some inputs (on my cluster). *cough*

This patch also makes some extensions to the DocumentVectorizer. Let's say you already have a SequenceFile<Text,Text> of your corpus (living at text_path); then you can produce some good output by doing:

$HADOOP_HOME/bin/hadoop jar examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.text.SparseVectorsFromSequenceFiles -i text_path -o corpus_as_vectors_path -seq true -w tfidf -chunk 1000 --minSupport 1 --minDF 5 --maxDFPercent 50 --norm 2

Now I've got some SequentialAccessSparseVectors in corpus_as_vectors_path: tfidf-weighted, L2-normalized, with terms that occur in more than half of the documents stripped out, etc. (If you don't have such a corpus yet, see the first sketch below.)

Now for the fun part: you need to know the dimension of the vectors you just spat out. You can do this by guessing and getting it wrong, in which case a slightly more helpful CardinalityException will show up in the logs/console, or you can get it from the corpus_as_vectors entries themselves (second sketch below). If the value you find is numFeatures, then try this hadoop job:

$HADOOP_HOME/bin/hadoop jar examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver -i corpus_as_vectors_path -o corpus_svd_path -nr 1 -nc <numFeatures> --rank 100

This will zip along, making 100 passes over your data, then do a decomposition of a nice and small (100x100) matrix in memory, and produce a SequenceFile<IntWritable,VectorWritable> where the values are DenseVectors of dimension numFeatures (so numFeatures should not be MAX_VALUE!), and where the "name" of each vector contains a string which is not actually the eigenvalue, but is proportional to it - I'm still working on that part. (The third sketch below shows how to poke at this output.)

There's also a unit test - DistributedLanczosSolverTest, which currently takes about a minute on my laptop - that validates accuracy.

TODO: cleanup, the stuff mentioned above, a job which validates correctness explicitly after the fact, and some utilities for taking the eigenvectors and doing useful stuff with them.

NOTE: Lanczos spits out desiredRank - 1 orthogonal vectors which are pretty close to being eigenvectors of the square of your matrix (i.e. they are right singular vectors of the input corpus), but they span the spectrum: the first few are the ones with the highest singular values, the last few are the ones with the lowest singular values. If you really want, e.g., the highest 100 singular vectors, ask Lanczos for *300* as the rank and keep only the top 100; this gives you 100 "of the largest" singular vectors, but with no guarantee that you haven't missed part of the top of the spectrum. For most cases this isn't a worry, but you should keep it in mind.
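First sketch: if you don't already have your corpus in SequenceFile<Text,Text> form, writing one with the stock Hadoop API looks something like this (the path and documents here are made up, and this is not part of the patch):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class CorpusWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Key is a document id, value is the raw document text.
    SequenceFile.Writer writer = new SequenceFile.Writer(
        fs, conf, new Path("text_path/part-00000"), Text.class, Text.class);
    try {
      writer.append(new Text("doc1"), new Text("the quick brown fox"));
      writer.append(new Text("doc2"), new Text("jumped over the lazy dog"));
    } finally {
      writer.close();
    }
  }
}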
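Second sketch: getting numFeatures out of the corpus_as_vectors entries can look something like this - a sketch only, assuming the vectorizer writes Text keys; every vector in the corpus shares one cardinality, so reading the first one is enough:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.VectorWritable;

public class PrintNumFeatures {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Point at one of the part files the vectorizer wrote.
    Path vectors = new Path("corpus_as_vectors_path/part-00000");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, vectors, conf);
    try {
      Text key = new Text();
      VectorWritable value = new VectorWritable();
      if (reader.next(key, value)) {
        // The vector's cardinality is the numFeatures to pass via -nc.
        System.out.println("numFeatures = " + value.get().size());
      }
    } finally {
      reader.close();
    }
  }
}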
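Third sketch: to poke at the decomposition output, something along these lines should do - again just a sketch, and it assumes the old Vector.getName() accessor and that the name string survives serialization:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class InspectEigenvectors {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path svd = new Path("corpus_svd_path/part-00000");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, svd, conf);
    try {
      IntWritable key = new IntWritable();
      VectorWritable value = new VectorWritable();
      while (reader.next(key, value)) {
        Vector v = value.get();
        // getName() holds the string that is (for now) only
        // proportional to the eigenvalue, per the caveat above.
        System.out.println(key.get() + ": dim=" + v.size()
            + " name=" + v.getName());
      }
    } finally {
      reader.close();
    }
  }
}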
> port Hadoop-ified Lanczos SVD implementation from decomposer
> ------------------------------------------------------------
>
>                 Key: MAHOUT-180
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-180
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>    Affects Versions: 0.2
>            Reporter: Jake Mannix
>            Assignee: Jake Mannix
>            Priority: Minor
>             Fix For: 0.3
>
>         Attachments: MAHOUT-180.patch, MAHOUT-180.patch, MAHOUT-180.patch
>
>
> I wrote up a hadoop version of the Lanczos algorithm for performing SVD on sparse matrices, available at http://decomposer.googlecode.com/, which is Apache-licensed, and I'm willing to donate it. I'll have to port the implementation over to use Mahout vectors, or else add in these vectors as well.
> Current issues with the decomposer implementation include: if your matrix is really big, you need to re-normalize before decomposition: find the largest eigenvalue first, and divide all your rows by that value, then decompose, or else you'll blow over Double.MAX_VALUE once you've run too many iterations (the L^2 norm of intermediate vectors grows roughly as (largest-eigenvalue)^(num-eigenvalues-found-so-far), so losing precision on the lower end is better than blowing over MAX_VALUE). When this is ported to Mahout, we should add in the capability to do this automatically (run a couple of iterations to find the largest eigenvalue, save that, then iterate while scaling vectors by 1/max_eigenvalue).
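The scaling step in that last idea is the easy part - a sketch against the Mahout Vector API (finding max_eigenvalue via the preliminary iterations is the real work, and none of this is in the patch yet):

import org.apache.mahout.math.Vector;

/**
 * Sketch: scale each corpus row by 1/maxEigenvalue before the full
 * decomposition, so the L^2 norms of intermediate vectors (which grow
 * roughly like maxEigenvalue^k after k eigenvectors) stay well below
 * Double.MAX_VALUE, trading away precision at the low end instead.
 */
public final class RowScaler {
  private RowScaler() {}

  public static Vector scale(Vector row, double maxEigenvalue) {
    // Vector.divide returns a new, scaled copy of the row.
    return row.divide(maxEigenvalue);
  }
}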