+1 Ok, this makes so much sense now. It had been bothering me that U D would not have the same number of rows as A in the non-square case, which is our typical case. It's interesting that the DistributedLanczosSolver.serializeOutput() method does not actually serialize the eigenvalues. I've fixed that locally using NamedVector, but something like WeightedVectorWritable (not itself a Vector) would be better. There are also EigenVector objects which might be used here.

On 9/11/10 3:57 PM, Ted Dunning wrote:
Well, firstly the code is doing a singular value decomposition.  This is
similar to an eigenvector decomposition except that there is no assumption
that A is square.  With the SVD, we have

      A = U D V'

This is in contrast to the eigenvector decomposition which has

     A = U D U'

The difference lies in having both left and right singular vectors, U and V.

Practically speaking, with the SVD decomposition, we can consider A or any
other appropriate matrix to be either a pile of row vectors or a bunch of
column vectors and we can project either into the reduced dimensional
representation using U' on the left or V on the right.  In Mahout, the SVD
only gives us one of U or V (I forget which).  Most commonly, I consider A
to be row vectors.  This means that I can use the fact that V' V = I to do
this:

      A V = U D

Now, if I wanted to do this with some new matrix B, I would need to keep V around
in order to do the multiplication.  But since I just did the SVD of A, all I
need is U and D, which is what I think the Mahout eigen-solver gives us.
  As you say, multiplying by D is trivial because it is diagonal.
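To make that concrete, here is a minimal pure-Python sketch (illustrative hand-picked values, not tied to Mahout's API) that builds A = U D V' from orthonormal factors and checks that projecting with V recovers U D exactly:

```python
# Pure-Python sketch: construct A = U D V' from hand-picked orthonormal
# factors, then verify that A V == U D (because V' V = I).
import math

def matmul(X, Y):
    """Multiply two matrices represented as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

c, s = math.cos(0.3), math.sin(0.3)
U = [[c, -s], [s, c], [0.0, 0.0]]   # 3x2 with orthonormal columns (left singular vectors)
c2, s2 = math.cos(1.1), math.sin(1.1)
V = [[c2, -s2], [s2, c2]]           # 2x2 rotation (right singular vectors)
D = [[5.0, 0.0], [0.0, 2.0]]        # diagonal matrix of singular values

A = matmul(matmul(U, D), transpose(V))   # A = U D V', a tall non-square matrix

lhs = matmul(A, V)   # project the rows of A using V ...
rhs = matmul(U, D)   # ... which equals U D, and has the same row count as A
assert all(abs(lhs[i][j] - rhs[i][j]) < 1e-12
           for i in range(3) for j in range(2))
```

Note that lhs has as many rows as A, which is the point of the projection: each row of A gets a reduced-dimensional representation.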

Does that make sense?

This is similar to what you said, but differs in some details.

Note also that it is common to split D into two factors:

     A = U sqrt(D) sqrt(D) V'

and then use this scaled projection of A:

     A V sqrt(D)^-1 = U sqrt(D)

The multiplication by the inverse of sqrt(D) is trivial since D is diagonal.
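A quick pure-Python check of that identity (illustrative values, not Mahout code), which also shows why applying sqrt(D)^-1 is trivial for diagonal D -- it is just a per-column rescale:

```python
# Sketch: for diagonal D, A V sqrt(D)^-1 = U sqrt(D), and multiplying by
# sqrt(D)^-1 is only a column-wise rescale, not a full matrix multiply.
import math

D = [5.0, 2.0]                             # diagonal entries of D
U = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # toy 3x2 U with orthonormal columns
UD = [[U[i][j] * D[j] for j in range(2)] for i in range(3)]   # U D, i.e. A V

# apply sqrt(D)^-1 column by column -- trivial because D is diagonal
scaled = [[UD[i][j] / math.sqrt(D[j]) for j in range(2)] for i in range(3)]
expected = [[U[i][j] * math.sqrt(D[j]) for j in range(2)] for i in range(3)]
assert all(abs(scaled[i][j] - expected[i][j]) < 1e-12
           for i in range(3) for j in range(2))
```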


On Sat, Sep 11, 2010 at 3:35 PM, Jeff Eastman <[email protected]> wrote:

  Ted,

Is this because, for any matrix A with eigenvector matrix P and diagonal
eigenvalue matrix D, we have A P = P D? And because multiplying by D does not
require a full matrix multiplication, since it is diagonal? Could you please elaborate?
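For what it's worth, that identity is easy to check numerically in a few lines of pure Python (illustrative values only):

```python
# Check: if A = P D P' with orthonormal P and diagonal D, then A P = P D.
import math

def matmul(X, Y):
    """Multiply two matrices represented as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

c, s = math.cos(0.7), math.sin(0.7)
P = [[c, -s], [s, c]]          # orthonormal eigenvector matrix (2x2 rotation)
D = [[3.0, 0.0], [0.0, 1.0]]   # diagonal eigenvalue matrix
Pt = [list(r) for r in zip(*P)]

A = matmul(matmul(P, D), Pt)   # symmetric A with eigendecomposition P D P'
AP = matmul(A, P)
PD = matmul(P, D)
assert all(abs(AP[i][j] - PD[i][j]) < 1e-12
           for i in range(2) for j in range(2))
```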

Jeff



On 9/11/10 2:50 PM, Ted Dunning wrote:

Should be close.  The matrixMult step may be redundant if you want to
cluster the same data that you decomposed.  That would make the second
transpose unnecessary as well.

On Sat, Sep 11, 2010 at 2:43 PM, Grant Ingersoll <[email protected]> wrote:
To put this in bin/mahout speak, this would look like the following, munging some names
and taking liberties with the actual arguments to be passed in:

bin/mahout svd (original -> svdOut)
bin/mahout cleansvd ...
bin/mahout transpose svdOut -> svdT
bin/mahout transpose original -> originalT
bin/mahout matrixmult originalT svdT -> newMatrix
bin/mahout kmeans newMatrix

Is that about right?


On Sep 3, 2010, at 11:19 AM, Jeff Eastman wrote:

Ok, the transposed computation seems to work and the cast exception was
caused by my unit test writing LongWritable keys to the testdata file. The
following test produces a comparable answer to the non-distributed case. I
still want to rename the method to transposeTimes for clarity. And better,
implement timesTranspose to make this particular computation more efficient:

  public void testKmeansDSVD() throws Exception {
    DistanceMeasure measure = new EuclideanDistanceMeasure();
    Path output = getTestTempDirPath("output");
    Path tmp = getTestTempDirPath("tmp");
    Path eigenvectors = new Path(output, "eigenvectors");
    int desiredRank = 13;
    DistributedLanczosSolver solver = new DistributedLanczosSolver();
    Configuration config = new Configuration();
    solver.setConf(config);
    Path testData = getTestTempDirPath("testdata");
    int sampleDimension = sampleData.get(0).get().size();
    solver.run(testData, tmp, eigenvectors, sampleData.size(),
        sampleDimension, false, desiredRank);

    // now multiply the testdata matrix and the eigenvector matrix
    DistributedRowMatrix svdT = new DistributedRowMatrix(eigenvectors,
        tmp, desiredRank - 1, sampleDimension);
    JobConf conf = new JobConf(config);
    svdT.configure(conf);
    DistributedRowMatrix a = new DistributedRowMatrix(testData, tmp,
        sampleData.size(), sampleDimension);
    a.configure(conf);
    DistributedRowMatrix sData = a.transpose().times(svdT.transpose());
    sData.configure(conf);

    // now run the Canopy job to prime kMeans canopies
    CanopyDriver.runJob(sData.getRowPath(), output, measure, 8, 4, false, false);

    // now run the KMeans job
    KMeansDriver.runJob(sData.getRowPath(), new Path(output, "clusters-0"),
        output, measure, 0.001, 10, 1, true, false);

    // run ClusterDumper
    ClusterDumper clusterDumper = new ClusterDumper(new Path(output, "clusters-2"),
        new Path(output, "clusteredPoints"));
    clusterDumper.printClusters(termDictionary);
  }

On 9/3/10 7:54 AM, Jeff Eastman wrote:

Looking at the single unit test of DRM.times(), it seems to be
implementing Matrix expected = inputA.transpose().times(inputB), and not
inputA.times(inputB.transpose()), so the bounds checking is correct as
implemented. But the method still has the wrong name and AFAICT is not
useful for performing this particular computation. Should I use this
instead?

DistributedRowMatrix sData = a.transpose().t[ransposeT]imes(svdT.transpose())

ugh! And it still fails with:
java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable
    at org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:1)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
--------------------------
Grant Ingersoll
http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8



