[ https://issues.apache.org/jira/browse/MAHOUT-180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jake Mannix updated MAHOUT-180:
-------------------------------

    Attachment: MAHOUT-180.patch

Ok, an ugly, dirty patch which needs to be cleaned up, but it does work, in some 
circumstances, for some inputs (on my cluster).  *cough*

This patch extends the DocumentVectorizer as well.  Let's say you already have a 
SequenceFile<Text,Text> of your corpus (living at text_path); then you can 
produce some good output by doing: 

  $HADOOP_HOME/bin/hadoop jar examples/target/mahout-examples-0.3-SNAPSHOT.job 
org.apache.mahout.text.SparseVectorsFromSequenceFiles -i text_path -o 
corpus_as_vectors_path -seq true -w tfidf -chunk 1000 --minSupport 1 --minDF 5 
--maxDFPercent 50 --norm 2
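
The command above assumes the SequenceFile<Text,Text> corpus already exists at 
text_path.  If you need to produce one by hand, a minimal sketch using the plain 
Hadoop SequenceFile API looks roughly like this (the class name, document ids and 
part file name are purely illustrative, not part of the patch):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  public class WriteTextCorpus {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      // any part file under the directory you pass as -i will do
      Path out = new Path("text_path/part-00000");
      SequenceFile.Writer writer =
          SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
      try {
        // key = document id, value = raw document text
        writer.append(new Text("doc-1"), new Text("the quick brown fox jumps over the lazy dog"));
        writer.append(new Text("doc-2"), new Text("singular value decompositions of large sparse corpora"));
      } finally {
        writer.close();
      }
    }
  }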

Now I've got some SequentialAccessSparseVectors in corpus_as_vectors_path: 
tf-idf weighted, L2 normalized, with terms that occur in more than half of the 
documents stripped out, etc.  Now for the fun part: you need to know the 
dimension of the vectors you spat out (you can find it by guessing and getting 
it wrong, in which case a slightly more helpful CardinalityException will be 
spit out in the logs/console, or you can read it from the corpus_as_vectors_path 
entries themselves - a minimal reading sketch follows).  If the value you find 
is numFeatures, then try the hadoop job just after the sketch:
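
Something like this (assuming the stock Hadoop SequenceFile.Reader and Mahout's 
VectorWritable; the class name and part file name are illustrative) prints that 
dimension:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.util.ReflectionUtils;
  import org.apache.mahout.math.VectorWritable;

  public class PrintNumFeatures {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      // point this at any part file produced by SparseVectorsFromSequenceFiles
      Path vectors = new Path("corpus_as_vectors_path/part-00000");
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, vectors, conf);
      try {
        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        VectorWritable value = new VectorWritable();
        if (reader.next(key, value)) {
          // the size() of any one vector is the corpus dimension (numFeatures)
          System.out.println("numFeatures = " + value.get().size());
        }
      } finally {
        reader.close();
      }
    }
  }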

 $HADOOP_HOME/bin/hadoop jar examples/target/mahout-examples-0.3-SNAPSHOT.job 
org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver -i 
corpus_as_vectors_path -o corpus_svd_path -nr 1 -nc <numFeatures> --rank 100 

This will zip along, making 100 passes over your data, then decomposing a nice 
and small (100x100) matrix in memory, and producing a 
SequenceFile<IntWritable,VectorWritable> (where the values are DenseVectors of 
dimension numFeatures - so numFeatures should not be Integer.MAX_VALUE!), and 
where the "name" of each vector contains a string which is not actually the 
eigenvalue, but is proportional to it - I'm working on that part still.
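
For what it's worth, iterating over that output with the plain SequenceFile API 
looks something like the sketch below (class name and part file name are 
illustrative; whether the eigenvalue-proportional "name" is exposed via a 
getName() on the vector depends on the vector implementation, so it's only noted 
in a comment here):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.mahout.math.Vector;
  import org.apache.mahout.math.VectorWritable;

  public class ListEigenvectors {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      // use whatever part file the -o directory (corpus_svd_path) contains
      SequenceFile.Reader reader =
          new SequenceFile.Reader(fs, new Path("corpus_svd_path/part-00000"), conf);
      try {
        IntWritable key = new IntWritable();
        VectorWritable value = new VectorWritable();
        while (reader.next(key, value)) {
          Vector v = value.get();
          // the vector's "name" carries the eigenvalue-proportional string
          // mentioned above; here we only print the index and the dimension
          System.out.println(key.get() + " -> dimension " + v.size());
        }
      } finally {
        reader.close();
      }
    }
  }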

There's also a unit test (which currently takes about a minute on my laptop) - 
DistributedLanczosSolverTest, which validates accuracy.

TODO: cleanup, the stuff mentioned above, a job which validates correctness 
explicitly after the fact, and some utilities for taking the eigenvectors and 
doing useful stuff with them.

NOTE: Lanczos spits out desiredRank - 1 orthogonal vectors which are pretty 
close to being eigenvectors of the square of your matrix (i.e. they are 
right-singular vectors of the input corpus), but they span the spectrum: the 
first few are the ones with the highest singular values, the last few are the 
ones with the lowest singular values.  If you really want, e.g., the highest 
100 singular vectors, ask Lanczos for *300* as the rank and then keep only the 
top 100; this will give you 100 "of the largest" singular vectors, but there's 
no guarantee that you don't miss part of the top of the spectrum.  For most 
cases this isn't a worry, but you should keep it in mind.
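
To make that last point concrete: once you've read the ~300 output vectors and 
pulled out a numeric score proportional to each eigenvalue (from the vector 
names, as described above), keeping the top 100 is just a sort.  A minimal 
sketch, where the Entry class and the score extraction are assumptions of mine:

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.Comparator;
  import java.util.List;

  public class TopKSpectrum {
    // one entry per Lanczos output vector: its index in the output file and the
    // eigenvalue-proportional score parsed from its "name" (obtaining that
    // score as a double is assumed here)
    static class Entry {
      final int index;
      final double score;
      Entry(int index, double score) { this.index = index; this.score = score; }
    }

    // keep the k entries with the largest scores, e.g. k = 100 out of ~300
    static List<Entry> topK(List<Entry> all, int k) {
      List<Entry> sorted = new ArrayList<Entry>(all);
      Collections.sort(sorted, new Comparator<Entry>() {
        public int compare(Entry a, Entry b) {
          return Double.compare(b.score, a.score); // descending by score
        }
      });
      return sorted.subList(0, Math.min(k, sorted.size()));
    }
  }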



> port Hadoop-ified Lanczos SVD implementation from decomposer
> ------------------------------------------------------------
>
>                 Key: MAHOUT-180
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-180
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>    Affects Versions: 0.2
>            Reporter: Jake Mannix
>            Assignee: Jake Mannix
>            Priority: Minor
>             Fix For: 0.3
>
>         Attachments: MAHOUT-180.patch, MAHOUT-180.patch, MAHOUT-180.patch
>
>
> I wrote up a hadoop version of the Lanczos algorithm for performing SVD on 
> sparse matrices available at http://decomposer.googlecode.com/, which is 
> Apache-licensed, and I'm willing to donate it.  I'll have to port over the 
> implementation to use Mahout vectors, or else add in these vectors as well.
> Current issues with the decomposer implementation include: if your matrix is 
> really big, you need to re-normalize before decomposition: find the largest 
> eigenvalue first, and divide all your rows by that value, then decompose, or 
> else you'll blow over Double.MAX_VALUE once you've run too many iterations 
> (the L^2 norm of intermediate vectors grows roughly as 
> (largest-eigenvalue)^(num-eigenvalues-found-so-far), so losing precision on 
> the lower end is better than blowing over MAX_VALUE).  When this is ported to 
> Mahout, we should add in the capability to do this automatically (run a 
> couple iterations to find the largest eigenvalue, save that, then iterate 
> while scaling vectors by 1/max_eigenvalue).
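
As a footnote to the re-normalization idea in the description above: the per-row 
scaling itself is just a scalar multiply with Mahout's Vector API.  A minimal 
sketch (class and method names are illustrative, and finding maxEigenvalue is 
assumed to have happened already):

  import org.apache.mahout.math.Vector;

  public class ScaleByMaxEigenvalue {
    // scale a corpus row by 1/maxEigenvalue so that intermediate vector norms
    // during Lanczos iteration stay well below Double.MAX_VALUE
    static Vector scaleRow(Vector row, double maxEigenvalue) {
      return row.times(1.0 / maxEigenvalue);
    }
  }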

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
