[jira] Commented: (MAHOUT-180) port Hadoop-ified Lanczos SVD implementation from decomposer

Jake Mannix (JIRA) Tue, 23 Feb 2010 09:16:57 -0800

    [ 
https://issues.apache.org/jira/browse/MAHOUT-180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837324#action_12837324
 ]


Jake Mannix commented on MAHOUT-180:
------------------------------------

Hi Danny, thanks for trying this out!  

You have indeed found some testing code which snuck in. I was trying to add 
the EigenVerificationJob as the final step of Lanczos, to save people the 
trouble of having to "clean" their eigenvectors afterwards, but I checked it 
in before it was finished.  

The clue in the code is that I still have a line:
{code}
 // TODO ack!
{code}
That should have been a hint that I should not have checked that file in just yet. :)

I've removed it now - svn up and try again!  

If you want to see what your eigen-spectrum is like, after you've run the 
DistributedLanczosSolver, the EigenVerificationJob can be run next (it cleans 
out eigenvectors with too high error or too low eigenvalue):

{code}
$HADOOP_HOME/bin/hadoop jar \
  $MAHOUT_HOME/examples/target/mahout-examples-{version}.job \
  org.apache.mahout.math.hadoop.decomposer.EigenVerificationJob \
  --eigenInput path/for/svd-output --corpusInput path/to/corpus \
  --output path/for/cleanOutput --maxError 0.1 --minEigenvalue 10.0
{code}
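
To make the "cleaning" idea concrete, here is a minimal Python sketch of that kind of filter. It is not the Mahout code: the function names are hypothetical, the matrix is assumed to be small, dense and symmetric, the eigenvalue is estimated with the Rayleigh quotient, and the error is assumed to be 1 - |cos| of the angle between v and A v. The real job may define both differently.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    return math.sqrt(dot(v, v))

def mat_vec(rows, v):
    """Multiply a dense matrix (given as a list of rows) by a vector."""
    return [dot(row, v) for row in rows]

def verify(matrix, candidates, max_error, min_eigenvalue):
    """Keep only candidates that look like good eigenvectors of `matrix`.

    Assumed definitions (illustrative, not taken from Mahout):
      eigenvalue = (v . A v) / (v . v)        (Rayleigh quotient)
      error      = 1 - |cos(angle(v, A v))|   (0 for an exact eigenvector)
    """
    kept = []
    for v in candidates:
        av = mat_vec(matrix, v)
        nv, nav = norm(v), norm(av)
        if nv == 0.0 or nav == 0.0:
            continue  # degenerate candidate, nothing to verify
        eigenvalue = dot(v, av) / (nv * nv)
        error = 1.0 - abs(dot(v, av)) / (nv * nav)
        if error < max_error and eigenvalue > min_eigenvalue:
            kept.append((eigenvalue, error, v))
    return kept
```

With a diagonal matrix with entries 20 and 1, `--maxError 0.1 --minEigenvalue 10.0` style thresholds would keep the vector [1, 0], drop [0, 1] for having too low an eigenvalue, and drop [1, 1] for having too high an error.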

Thanks for the bug report!

> port Hadoop-ified Lanczos SVD implementation from decomposer
> ------------------------------------------------------------
>
>                 Key: MAHOUT-180
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-180
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>    Affects Versions: 0.2
>            Reporter: Jake Mannix
>            Assignee: Jake Mannix
>            Priority: Minor
>             Fix For: 0.3
>
>         Attachments: MAHOUT-180.patch, MAHOUT-180.patch, MAHOUT-180.patch, 
> MAHOUT-180.patch, MAHOUT-180.patch
>
>
> I wrote up a hadoop version of the Lanczos algorithm for performing SVD on 
> sparse matrices available at http://decomposer.googlecode.com/, which is 
> Apache-licensed, and I'm willing to donate it.  I'll have to port over the 
> implementation to use Mahout vectors, or else add in these vectors as well.
> Current issues with the decomposer implementation include: if your matrix is 
> really big, you need to re-normalize before decomposition: find the largest 
> eigenvalue first, and divide all your rows by that value, then decompose, or 
> else you'll blow over Double.MAX_VALUE once you've run too many iterations 
> (the L^2 norm of intermediate vectors grows roughly as 
> (largest-eigenvalue)^(num-eigenvalues-found-so-far), so losing precision on 
> the lower end is better than blowing over MAX_VALUE).  When this is ported to 
> Mahout, we should add in the capability to do this automatically (run a 
> couple iterations to find the largest eigenvalue, save that, then iterate 
> while scaling vectors by 1/max_eigenvalue).
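
The rescaling idea in the description above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the decomposer or Mahout implementation: the function names are hypothetical, the matrix is small, dense and symmetric, and plain power iteration stands in for "run a couple iterations to find the largest eigenvalue".

```python
import math

def mat_vec(rows, v):
    """Multiply a dense matrix (given as a list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in rows]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def estimate_largest_eigenvalue(rows, iterations=50):
    """Power iteration, used here as a stand-in for a few Lanczos steps."""
    v = [1.0] * len(rows)
    eigenvalue = 0.0
    for _ in range(iterations):
        length = norm(v)
        v = [x / length for x in v]   # renormalize so nothing overflows
        w = mat_vec(rows, v)
        eigenvalue = norm(w)          # |A v| with |v| = 1
        v = w
    return eigenvalue

def scale_rows(rows, factor):
    """Return a copy of the matrix with every entry multiplied by `factor`."""
    return [[a * factor for a in row] for row in rows]

def norm_after(rows, steps):
    """Norm of A^steps applied to the all-ones vector, never renormalized."""
    v = [1.0] * len(rows)
    for _ in range(steps):
        v = mat_vec(rows, v)
    return norm(v)

matrix = [[2000.0, 1000.0], [1000.0, 3000.0]]
largest = estimate_largest_eigenvalue(matrix)
scaled = scale_rows(matrix, 1.0 / largest)
```

Here `norm_after(matrix, 200)` overflows to infinity, because the norm grows roughly like the largest eigenvalue raised to the number of steps, while `norm_after(scaled, 200)` stays bounded, since the largest eigenvalue of the scaled matrix is about 1.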

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
