[
https://issues.apache.org/jira/browse/MAHOUT-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859855#action_12859855
]
Danny Leshem commented on MAHOUT-369:
-------------------------------------
Okay, please ignore my previous patch.
I think I finally understand what's going on with the code. There are two
seemingly unrelated issues:
1) The code returns one less eigenvector/eigenvalue than requested.
LanczosSolver.java (157) returns one less eigenvector/eigenvalue than requested-
{code}
for (int i = 0; i < basis.numRows() - 1; i++) {
{code}
DistributedLanczosSolver.java (106) serializes one less eigenvector than
requested-
{code}
for(int i=0; i<eigenVectors.numRows() - 1; i++) {
{code}
2) When asked for N eigenvectors, the code returns (serializes) the most
important N-1 eigenvectors in *descending* order (meaning, important ones are
serialized first - which, other than the "N-1" part, is excellent!), but
associates them with the bottom N-1 out of the top N most important eigenvalues
in *ascending* order (meaning, important ones are serialized last).
In other words, the #1 important eigenvalue is not serialized at all, the #2
important eigenvalue is associated with the #N-1 important eigenvector, ... and
finally the #N most important eigenvalue is associated with the #1 important
eigenvector.
Vector names changed drastically during the last few days (following Sean's
patch for MAHOUT-379), so issue #2 is less obvious now. Previously the
eigenvalues where serialized with the vector. Now they are only mentioned in
log prints.
Here's some code to verify these claims. The code creates a 1000-rows
100-columns matrix. For each row, the first element is +1/-1 with equal
probabilities, and the rest are +0.001/-0.001 with equal probabilities. The
Elements are uncorrelated. So decomposing this matrix should reveal that the
most important PC is the 100-dimensional vector (1, 0, 0, ... , 0), and it
should be associated with a *much* higher eigenvalue than the rest. To test my
claims, remove the "-1" from the two loops mentioned above, and run the
following code:
[add to TestDistributedRowMatrix.java]
{code}
public static DistributedRowMatrix distributedMatrix(final Matrix matrix,
String baseTmpDir)
throws IOException {
baseTmpDir = TESTDATA + baseTmpDir;
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
ClusteringTestUtils.writePointsToFile(new Iterable<VectorWritable>() {
@Override
public Iterator<VectorWritable> iterator() {
final Iterator<MatrixSlice> it = matrix.iterator();
return new Iterator<VectorWritable>() {
@Override
public boolean hasNext() { return it.hasNext(); }
@Override
public VectorWritable next() {
MatrixSlice slice = it.next();
return new VectorWritable(slice.vector());
}
@Override
public void remove() { it.remove(); }
};
}
}, true, baseTmpDir + "/distMatrix/part-00000", fs, conf);
DistributedRowMatrix distMatrix = new DistributedRowMatrix(baseTmpDir +
"/distMatrix",
baseTmpDir +
"/tmpOut",
matrix.numRows(),
matrix.numCols());
distMatrix.configure(new JobConf(conf));
return distMatrix;
}
{code}
[add to TestDistributedLanczosSolver.java, and run the test]
{code}
private Random rand = new Random();
private double[] randomRow(int numColumns) {
final double[] values = new double[numColumns];
// Only the first column's value is "important". Columns are uncorrelated.
values[0] = (rand.nextBoolean() ? 1 : -1);
for (int i = 1; i < numColumns; ++i) {
values[i] = (rand.nextBoolean() ? 0.001 : -0.001);
}
return values;
}
private Matrix randomMatrix(int numRows, int numColumns) {
final Matrix matrix = new DenseMatrix(numRows, numColumns);
for (int row = 0; row < numRows; ++row) {
matrix.set(row, randomRow(numColumns));
}
return matrix;
}
public void testDistributedLanczosSolverSanity() throws Exception {
final File testData = new File("testdata");
if (!testData.exists()) {
testData.mkdir();
}
final Matrix matrix = randomMatrix(1000, 100);
final DistributedRowMatrix distMatrix =
TestDistributedRowMatrix.distributedMatrix(matrix, "testdata");
distMatrix.configure(new JobConf());
final DistributedLanczosSolver solver = new DistributedLanczosSolver();
final Matrix eigenVectors = new DenseMatrix(30, 100);
final List<Double> eigenValues = new ArrayList<Double>();
solver.solve(distMatrix, 30, eigenVectors, eigenValues, false);
System.out.println("--- Eigenvalues ---");
printDoubleList(eigenValues);
System.out.println("--- Eigenvectors ---");
printMatrix(eigenVectors);
}
private void printDoubleList(List<Double> values) {
for (double value : values) {
System.out.print(value + "\t");
}
System.out.println("");
}
private void printMatrix(Matrix matrix) {
for (int row = 0; row < matrix.numRows(); ++row) {
for (int col = 0; col < matrix.numCols(); ++col) {
System.out.print(matrix.get(row, col) + "\t");
}
System.out.println("");
}
}
{code}
The test does not fail the current code, but you can see what's wrong by its
prints. To make this a real test, one can replace the printing with verifying
that 1) the first eigenvalue is, say, at least 100 times bigger than the
second; and 2) the returned eigenvectors in fact correspond to their
eigenvalues. This would fail the current code.
I considered submitting another patch, but I'm no longer sure about the best
way to fix this. However, I am positive now that this is a real issue that
needs to be fixed.
> Issues with DistributedLanczosSolver output
> -------------------------------------------
>
> Key: MAHOUT-369
> URL: https://issues.apache.org/jira/browse/MAHOUT-369
> Project: Mahout
> Issue Type: Bug
> Components: Math
> Affects Versions: 0.3, 0.4
> Reporter: Danny Leshem
> Fix For: 0.4
>
> Attachments: MAHOUT-369.patch
>
>
> DistributedLanczosSolver (line 99) claims to persist eigenVectors.numRows()
> vectors.
> {code}
> log.info("Persisting " + eigenVectors.numRows() + " eigenVectors and
> eigenValues to: " + outputPath);
> {code}
> However, a few lines later (line 106) we have
> {code}
> for(int i=0; i<eigenVectors.numRows() - 1; i++) {
> ...
> }
> {code}
> which only persists eigenVectors.numRows()-1 vectors.
> Seems like the most significant eigenvector (i.e. the one with the largest
> eigenvalue) is omitted... off by one bug?
> Also, I think it would be better if the eigenvectors are persisted in
> *reverse* order, meaning the most significant vector is marked "0", the 2nd
> most significant is marked "1", etc.
> This, for two reasons:
> 1) When performing another PCA on the same corpus (say, with more principal
> componenets), corresponding eigenvalues can be easily matched and compared.
> 2) Makes it easier to discard the least significant principal components,
> which for Lanczos decomposition are usually garbage.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.