Jake?

On Tue, Jan 18, 2011 at 11:50 PM, Sean Owen (JIRA) <[email protected]> wrote:
>
> [
> https://issues.apache.org/jira/browse/MAHOUT-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12983592#action_12983592]
>
> Sean Owen commented on MAHOUT-369:
> ----------------------------------
>
> This one's also been on the shelf for about 4 months. Is it ready to go, or
> should it be archived?
>
> > Issues with DistributedLanczosSolver output
> > -------------------------------------------
> >
> > Key: MAHOUT-369
> > URL: https://issues.apache.org/jira/browse/MAHOUT-369
> > Project: Mahout
> > Issue Type: Bug
> > Components: Math
> > Affects Versions: 0.3, 0.4
> > Reporter: Danny Leshem
> > Assignee: Jake Mannix
> > Fix For: 0.5
> >
> > Attachments: MAHOUT-369.patch
> >
> >
> > DistributedLanczosSolver (line 99) claims to persist
> > eigenVectors.numRows() vectors.
> > {code}
> > log.info("Persisting " + eigenVectors.numRows() + " eigenVectors and eigenValues to: " + outputPath);
> > {code}
> > However, a few lines later (line 106) we have
> > {code}
> > for(int i=0; i<eigenVectors.numRows() - 1; i++) {
> > ...
> > }
> > {code}
> > which only persists eigenVectors.numRows()-1 vectors.
> > Seems like the most significant eigenvector (i.e. the one with the
> > largest eigenvalue) is omitted... off by one bug?
> > Also, I think it would be better if the eigenvectors are persisted in
> > *reverse* order, meaning the most significant vector is marked "0", the 2nd
> > most significant is marked "1", etc.
> > This, for two reasons:
> > 1) When performing another PCA on the same corpus (say, with more
> > principal components), corresponding eigenvalues can be easily matched and
> > compared.
> > 2) Makes it easier to discard the least significant principal components,
> > which for Lanczos decomposition are usually garbage.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
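
For anyone skimming the thread, here is a minimal sketch of the two points in the report, using plain Java with no Mahout classes. The class and method names are hypothetical and are not taken from DistributedLanczosSolver or from MAHOUT-369.patch. It also assumes, as the reporter suspects, that the last row holds the eigenvector with the largest eigenvalue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the Mahout code or the attached patch.
public class EigenOrderSketch {

    // Row indices visited by the loop quoted in the report:
    // for (int i = 0; i < numRows - 1; i++)
    static List<Integer> rowsVisitedByQuotedLoop(int numRows) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < numRows - 1; i++) {
            rows.add(i);
        }
        return rows;
    }

    // Row indices visited when every row is written, last row first.
    // Position 0 in the result then corresponds to the last row, which
    // is assumed here to be the most significant eigenvector.
    static List<Integer> rowsInReverseOrder(int numRows) {
        List<Integer> rows = new ArrayList<>();
        for (int i = numRows - 1; i >= 0; i--) {
            rows.add(i);
        }
        return rows;
    }

    public static void main(String[] args) {
        int numRows = 4;
        // Prints [0, 1, 2]: row 3 is never visited.
        System.out.println(rowsVisitedByQuotedLoop(numRows));
        // Prints [3, 2, 1, 0]: all rows, last row first.
        System.out.println(rowsInReverseOrder(numRows));
    }
}
```

With four rows the quoted loop stops at row 2, so the log message and the number of vectors actually written disagree by one. The reversed loop covers all four rows and gives the last row output position 0.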
