[ 
https://issues.apache.org/jira/browse/MAHOUT-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859855#action_12859855
 ] 

Danny Leshem commented on MAHOUT-369:
-------------------------------------

Okay, please ignore my previous patch.

I think I finally understand what's going on with the code. There are two 
seemingly unrelated issues:

1) The code returns one less eigenvector/eigenvalue than requested.

LanczosSolver.java (157) returns one less eigenvector/eigenvalue than requested-
{code}
    for (int i = 0; i < basis.numRows() - 1; i++) {
{code}

DistributedLanczosSolver.java (106) serializes one less eigenvector than 
requested-
{code}
    for(int i=0; i<eigenVectors.numRows() - 1; i++) {
{code}

2) When asked for N eigenvectors, the code returns (serializes) the most 
important N-1 eigenvectors in *descending* order (meaning, important ones are 
serialized first - which, other than the "N-1" part, is excellent!), but 
associates them with the bottom N-1 out of the top N most important eigenvalues 
in *ascending* order (meaning, important ones are serialized last).

In other words, the #1 important eigenvalue is not serialized at all, the #2 
important eigenvalue is associated with the #N-1 important eigenvector, ... and 
finally the #N most important eigenvalue is associated with the #1 important 
eigenvector.

Vector names changed drastically during the last few days (following Sean's 
patch for MAHOUT-379), so issue #2 is less obvious now. Previously the 
eigenvalues where serialized with the vector. Now they are only mentioned in 
log prints.

Here's some code to verify these claims. The code creates a 1000-rows 
100-columns matrix. For each row, the first element is +1/-1 with equal 
probabilities, and the rest are +0.001/-0.001 with equal probabilities. The 
Elements are uncorrelated. So decomposing this matrix should reveal that the 
most important PC is the 100-dimensional vector (1, 0, 0, ... , 0), and it 
should be associated with a *much* higher eigenvalue than the rest. To test my 
claims, remove the "-1" from the two loops mentioned above, and run the 
following code:

[add to TestDistributedRowMatrix.java]
{code}
  public static DistributedRowMatrix distributedMatrix(final Matrix matrix,
                                                       String baseTmpDir) 
throws IOException {
    baseTmpDir = TESTDATA + baseTmpDir;
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    ClusteringTestUtils.writePointsToFile(new Iterable<VectorWritable>() {
      @Override
      public Iterator<VectorWritable> iterator() {
        final Iterator<MatrixSlice> it = matrix.iterator();
        return new Iterator<VectorWritable>() {
          @Override
          public boolean hasNext() { return it.hasNext(); }
          @Override
          public VectorWritable next() {
            MatrixSlice slice = it.next();
            return new VectorWritable(slice.vector());
          }
          @Override
          public void remove() { it.remove(); }
        };
      }
    }, true, baseTmpDir + "/distMatrix/part-00000", fs, conf);

    DistributedRowMatrix distMatrix = new DistributedRowMatrix(baseTmpDir + 
"/distMatrix",
                                                               baseTmpDir + 
"/tmpOut",
                                                               matrix.numRows(),
                                                               
matrix.numCols());
    distMatrix.configure(new JobConf(conf));

    return distMatrix;
  }
{code}

[add to TestDistributedLanczosSolver.java, and run the test]
{code}
  private Random rand = new Random();

  private double[] randomRow(int numColumns) {
      final double[] values = new double[numColumns];

      // Only the first column's value is "important". Columns are uncorrelated.
      values[0] = (rand.nextBoolean() ? 1 : -1);
      for (int i = 1; i < numColumns; ++i) {
          values[i] = (rand.nextBoolean() ? 0.001 : -0.001);
      }

      return values;
  }

  private Matrix randomMatrix(int numRows, int numColumns) {
      final Matrix matrix = new DenseMatrix(numRows, numColumns);
      for (int row = 0; row < numRows; ++row) {
          matrix.set(row, randomRow(numColumns));
      }
      return matrix;
  }

  public void testDistributedLanczosSolverSanity() throws Exception {
    final File testData = new File("testdata");
    if (!testData.exists()) {
      testData.mkdir();
    }

    final Matrix matrix = randomMatrix(1000, 100);
    final DistributedRowMatrix distMatrix =
            TestDistributedRowMatrix.distributedMatrix(matrix, "testdata");
    distMatrix.configure(new JobConf());

    final DistributedLanczosSolver solver = new DistributedLanczosSolver();
    final Matrix eigenVectors = new DenseMatrix(30, 100);
    final List<Double> eigenValues = new ArrayList<Double>();
    solver.solve(distMatrix, 30, eigenVectors, eigenValues, false);

    System.out.println("--- Eigenvalues ---");
    printDoubleList(eigenValues);
    
    System.out.println("--- Eigenvectors ---");
    printMatrix(eigenVectors);
  }

  private void printDoubleList(List<Double> values) {
    for (double value : values) {
        System.out.print(value + "\t");
    }
    System.out.println("");
  }

  private void printMatrix(Matrix matrix) {
      for (int row = 0; row < matrix.numRows(); ++row) {
          for (int col = 0; col < matrix.numCols(); ++col) {
              System.out.print(matrix.get(row, col) + "\t");
          }
          System.out.println("");
      }
  }
{code}

The test does not fail the current code, but you can see what's wrong by its 
prints. To make this a real test, one can replace the printing with verifying 
that 1) the first eigenvalue is, say, at least 100 times bigger than the 
second; and 2) the returned eigenvectors in fact correspond to their 
eigenvalues. This would fail the current code.

I considered submitting another patch, but I'm no longer sure about the best 
way to fix this. However, I am positive now that this is a real issue that 
needs to be fixed.

> Issues with DistributedLanczosSolver output
> -------------------------------------------
>
>                 Key: MAHOUT-369
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-369
>             Project: Mahout
>          Issue Type: Bug
>          Components: Math
>    Affects Versions: 0.3, 0.4
>            Reporter: Danny Leshem
>             Fix For: 0.4
>
>         Attachments: MAHOUT-369.patch
>
>
> DistributedLanczosSolver (line 99) claims to persist eigenVectors.numRows() 
> vectors.
> {code}
>     log.info("Persisting " + eigenVectors.numRows() + " eigenVectors and 
> eigenValues to: " + outputPath);
> {code}
> However, a few lines later (line 106) we have
> {code}
>     for(int i=0; i<eigenVectors.numRows() - 1; i++) {
>         ...
>     }
> {code}
> which only persists eigenVectors.numRows()-1 vectors.
> Seems like the most significant eigenvector (i.e. the one with the largest 
> eigenvalue) is omitted... off by one bug?
> Also, I think it would be better if the eigenvectors are persisted in 
> *reverse* order, meaning the most significant vector is marked "0", the 2nd 
> most significant is marked "1", etc.
> This, for two reasons:
> 1) When performing another PCA on the same corpus (say, with more principal 
> componenets), corresponding eigenvalues can be easily matched and compared.  
> 2) Makes it easier to discard the least significant principal components, 
> which for Lanczos decomposition are usually garbage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to