[ https://issues.apache.org/jira/browse/MAHOUT-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jake Mannix updated MAHOUT-369:
-------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Committed revision 1088831.

> Issues with DistributedLanczosSolver output
> -------------------------------------------
>
>                 Key: MAHOUT-369
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-369
>             Project: Mahout
>          Issue Type: Bug
>          Components: Math
>    Affects Versions: 0.3, 0.4
>            Reporter: Danny Leshem
>            Assignee: Jake Mannix
>              Labels: DistributedLanczosSolver, decomposer
>             Fix For: 0.5
>
>         Attachments: MAHOUT-369.diff, MAHOUT-369.patch
>
>
> DistributedLanczosSolver (line 99) claims to persist
> eigenVectors.numRows() vectors:
> {code}
> log.info("Persisting " + eigenVectors.numRows() + " eigenVectors and eigenValues to: " + outputPath);
> {code}
> However, a few lines later (line 106) we have
> {code}
> for(int i=0; i<eigenVectors.numRows() - 1; i++) {
>     ...
> }
> {code}
> which persists only eigenVectors.numRows()-1 vectors.
> It seems that the most significant eigenvector (i.e. the one with the
> largest eigenvalue) is omitted. Is this an off-by-one bug?
>
> Also, I think it would be better if the eigenvectors were persisted in
> *reverse* order, meaning the most significant vector is marked "0", the
> 2nd most significant is marked "1", etc.
> This is for two reasons:
> 1) When performing another PCA on the same corpus (say, with more
> principal components), corresponding eigenvalues can be easily matched
> and compared.
> 2) It makes it easier to discard the least significant principal
> components, which for Lanczos decomposition are usually garbage.
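
For illustration only, the two requests in the report (persist every row, and number the most significant eigenvector "0") can be sketched as below. This is not the change committed in revision 1088831. It assumes, as the report implies, that the last row holds the eigenvector with the largest eigenvalue; the class EigenvectorOrdering, the method toPersistOrder, and the plain double[][] standing in for Mahout's matrix type are all hypothetical names chosen for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Mahout: a plain double[][] stands in for
// the eigenVectors matrix, with one eigenvector per row.
final class EigenvectorOrdering {

    /**
     * Returns all rows in the order the report asks for: output index 0 is
     * the last input row, which is ASSUMED to be the most significant one.
     */
    static List<double[]> toPersistOrder(double[][] eigenVectors) {
        int numRows = eigenVectors.length;
        List<double[]> ordered = new ArrayList<>(numRows);
        // Loop bound is numRows, not numRows - 1, so no vector is dropped.
        for (int i = 0; i < numRows; i++) {
            ordered.add(eigenVectors[numRows - 1 - i]);
        }
        return ordered;
    }

    public static void main(String[] args) {
        // Made-up values, only to show the resulting numbering.
        double[][] eigenVectors = { {0.1, 0.9}, {0.7, 0.3}, {1.0, 0.0} };
        List<double[]> ordered = toPersistOrder(eigenVectors);
        System.out.println("Persisting " + ordered.size() + " eigenVectors");
        for (int i = 0; i < ordered.size(); i++) {
            System.out.println(i + " -> " + Arrays.toString(ordered.get(i)));
        }
    }
}
```

With this numbering, a later run that requests more components would still write its leading vectors at indices 0, 1, 2, and so on, which is what makes the comparison in reason 1) straightforward.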