[
https://issues.apache.org/jira/browse/MAHOUT-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12983592#action_12983592
]
Sean Owen commented on MAHOUT-369:
----------------------------------
This one's also been on the shelf for about 4 months. Is it ready to go, or
should it be archived?
> Issues with DistributedLanczosSolver output
> -------------------------------------------
>
> Key: MAHOUT-369
> URL: https://issues.apache.org/jira/browse/MAHOUT-369
> Project: Mahout
> Issue Type: Bug
> Components: Math
> Affects Versions: 0.3, 0.4
> Reporter: Danny Leshem
> Assignee: Jake Mannix
> Fix For: 0.5
>
> Attachments: MAHOUT-369.patch
>
>
> DistributedLanczosSolver (line 99) claims to persist eigenVectors.numRows()
> vectors.
> {code}
> log.info("Persisting " + eigenVectors.numRows() + " eigenVectors and eigenValues to: " + outputPath);
> {code}
> However, a few lines later (line 106) we have
> {code}
> for(int i=0; i<eigenVectors.numRows() - 1; i++) {
> ...
> }
> {code}
> which only persists eigenVectors.numRows()-1 vectors.
> It seems the most significant eigenvector (i.e. the one with the largest
> eigenvalue) is omitted. Is this an off-by-one bug?
> Also, I think it would be better if the eigenvectors were persisted in
> *reverse* order, meaning the most significant vector is marked "0", the 2nd
> most significant is marked "1", etc.
> This would help for two reasons:
> 1) When performing another PCA on the same corpus (say, with more principal
> components), corresponding eigenvalues can be easily matched and compared.
> 2) Makes it easier to discard the least significant principal components,
> which for Lanczos decomposition are usually garbage.
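
For illustration, the two changes requested above (persist every row, and number the most significant vector "0") can be sketched in plain Java. This is only a sketch under stated assumptions: it uses a bare double[][] instead of Mahout's matrix types, the class and method names are hypothetical and not part of Mahout, and it assumes, as the report implies, that the last row holds the most significant eigenvector.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of Mahout. */
final class EigenOutputOrderSketch {

    /**
     * Returns every row of eigenVectors in reverse order, so that output
     * index 0 holds the last input row. Assumption: the last input row is
     * the most significant eigenvector, as the report above implies.
     */
    static List<double[]> mostSignificantFirst(double[][] eigenVectors) {
        int numRows = eigenVectors.length;
        List<double[]> ordered = new ArrayList<>(numRows);
        // The bound is numRows, not numRows - 1, so no row is dropped.
        for (int i = 0; i < numRows; i++) {
            ordered.add(eigenVectors[numRows - 1 - i]);
        }
        return ordered;
    }
}
```

With this ordering, discarding the least significant components amounts to truncating the tail of the returned list, and a vector's index stays the same when a later run asks for more components.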
--
This message is automatically generated by JIRA.