[ https://issues.apache.org/jira/browse/MAHOUT-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jake Mannix updated MAHOUT-369:
-------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Committed revision 1088831.

> Issues with DistributedLanczosSolver output
> -------------------------------------------
>
>                 Key: MAHOUT-369
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-369
>             Project: Mahout
>          Issue Type: Bug
>          Components: Math
>    Affects Versions: 0.3, 0.4
>            Reporter: Danny Leshem
>            Assignee: Jake Mannix
>              Labels: DistributedLanczosSolver, decomposer
>             Fix For: 0.5
>
>         Attachments: MAHOUT-369.diff, MAHOUT-369.patch
>
>
> DistributedLanczosSolver (line 99) claims to persist eigenVectors.numRows() 
> vectors.
> {code}
>     log.info("Persisting " + eigenVectors.numRows() + " eigenVectors and eigenValues to: " + outputPath);
> {code}
> However, a few lines later (line 106) we have
> {code}
>     for(int i=0; i<eigenVectors.numRows() - 1; i++) {
>         ...
>     }
> {code}
> which only persists eigenVectors.numRows()-1 vectors.
> It seems the most significant eigenvector (i.e. the one with the largest 
> eigenvalue) is omitted. Is this an off-by-one bug?
> Also, I think it would be better if the eigenvectors are persisted in 
> *reverse* order, meaning the most significant vector is marked "0", the 2nd 
> most significant is marked "1", etc.
> There are two reasons for this:
> 1) When performing another PCA on the same corpus (say, with more principal 
> components), corresponding eigenvalues can be easily matched and compared.
> 2) Makes it easier to discard the least significant principal components, 
> which for Lanczos decomposition are usually garbage.
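
For readers following the description above, here is a minimal, self-contained sketch of the two index patterns being discussed. It is not the committed patch and does not use any Mahout classes: the class EigenOrderSketch and both helper methods are hypothetical names invented for illustration, and it assumes (as the report suggests) that the last row holds the most significant eigenvector.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not Mahout code and not the committed fix.
public class EigenOrderSketch {

    // Row indices visited by the loop quoted in the report.
    // The bound "numRows - 1" means the last row is never visited.
    static List<Integer> indicesWithOffByOne(int numRows) {
        List<Integer> visited = new ArrayList<>();
        for (int i = 0; i < numRows - 1; i++) {
            visited.add(i);
        }
        return visited;
    }

    // Row indices visited when every row is persisted in reverse order.
    // Position 0 of the result is the last row, which is assumed here
    // to be the most significant eigenvector.
    static List<Integer> indicesReversed(int numRows) {
        List<Integer> visited = new ArrayList<>();
        for (int i = numRows - 1; i >= 0; i--) {
            visited.add(i);
        }
        return visited;
    }

    public static void main(String[] args) {
        int numRows = 4; // stand-in for eigenVectors.numRows()
        System.out.println("off by one: " + indicesWithOffByOne(numRows));
        System.out.println("reversed:   " + indicesReversed(numRows));
    }
}
```

With four rows, the first loop visits rows 0 through 2 and skips row 3, while the reversed loop visits all four rows starting from row 3, so the output slot marked "0" would hold the vector the report calls most significant.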

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
