Looking at the single unit test of DMR.times(), it seems to be implementing Matrix expected = inputA.transpose().times(inputB), and not inputA.times(inputB.transpose()), so the bounds checking is correct as implemented. But the method still has the wrong name and, AFAICT, is not useful for performing this particular computation. Should I use this instead?
DistributedRowMatrix sData = a.transpose().t[ransposeT]imes(svdT.transpose())
ugh! And it still fails with:
java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable
    at org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:1)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
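Setting the Writable issue aside, the shape bookkeeping behind the suggested a.transpose().t[ransposeT]imes(svdT.transpose()) call can be sketched in plain Java (illustrative only, not the Mahout implementation; TransposeTimesShapes and its helpers are made-up names). If times(B) really has A' * B semantics, then a.transpose().times(svdT.transpose()) works out to (A')' * svdT' = A * svdT', a [15x39] by [39x12] product giving the desired [15x12] projection:

```java
public class TransposeTimesShapes {

    // A' * B: requires rows(a) == rows(b); result is [cols(a) x cols(b)]
    static double[][] transposeTimes(double[][] a, double[][] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                "CardinalityException: " + a.length + " != " + b.length);
        }
        double[][] c = new double[a[0].length][b[0].length];
        for (int k = 0; k < a.length; k++)
            for (int i = 0; i < a[0].length; i++)
                for (int j = 0; j < b[0].length; j++)
                    c[i][j] += a[k][i] * b[k][j];
        return c;
    }

    static double[][] transpose(double[][] m) {
        double[][] t = new double[m[0].length][m.length];
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < m[0].length; j++)
                t[j][i] = m[i][j];
        return t;
    }

    public static void main(String[] args) {
        double[][] a = new double[15][39];    // data matrix A
        double[][] svdT = new double[12][39]; // eigenvectors as rows
        // a.transpose().times(svdT.transpose()), with times == transposeTimes:
        double[][] sData = transposeTimes(transpose(a), transpose(svdT));
        System.out.println(sData.length + "x" + sData[0].length); // 15x12
    }
}
```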
On 9/3/10 7:15 AM, Jeff Eastman wrote:
Ah, ok, that would explain some of the cardinality checking differences. It's kind of confusing to have the times() method doing something different from Matrix in the distributed case. Maybe renaming it to transposeTimes() would be clearer? At least adding some comments for the unwary?
Here's my revised test using DistributedRowMatrix for the computations. It fails in DMR.times() with a CardinalityException. Looking at the source of the exception (again), it is comparing numRows != other.numRows(). Since matrix A is [15x39] and svdT is [12x39], shouldn't A.t[ransposeT]imes(svdT) be comparing the numCols instead? What exactly is DMR.times() doing?
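The cardinality check as DMR.times() apparently implements it can be mimicked in plain Java (a sketch under the assumption that times() has A' * B semantics; CardinalityCheck is a made-up name). With A' * B, the row counts must match, so a [15x39] A against a [12x39] svdT fails with 15 != 12, while the projection the test wants, A * svdT' = transposeTimes(A', svdT'), lines up:

```java
public class CardinalityCheck {

    // A' * B is [aCols x aRows] * [bRows x bCols]: defined only when aRows == bRows
    static void checkTransposeTimes(int aRows, int aCols, int bRows, int bCols) {
        if (aRows != bRows) {
            throw new IllegalArgumentException(
                "CardinalityException: " + aRows + " != " + bRows);
        }
        System.out.println("result is [" + aCols + "x" + bCols + "]");
    }

    public static void main(String[] args) {
        try {
            checkTransposeTimes(15, 39, 12, 39); // A.times(svdT) as in the test
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // CardinalityException: 15 != 12
        }
        // the projection actually needs A * svdT', i.e. transposeTimes(A', svdT'),
        // whose row counts (39 and 39) do match:
        checkTransposeTimes(39, 15, 39, 12); // result is [15x12]
    }
}
```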
@Test
public void testKmeansDSVD() throws Exception {
  DistanceMeasure measure = new EuclideanDistanceMeasure();
  Path output = getTestTempDirPath("output");
  Path tmp = getTestTempDirPath("tmp");
  Path eigenvectors = new Path(output, "eigenvectors");
  int desiredRank = 13;
  DistributedLanczosSolver solver = new DistributedLanczosSolver();
  Configuration config = new Configuration();
  solver.setConf(config);
  Path testData = getTestTempDirPath("testdata");
  int sampleDimension = sampleData.get(0).get().size();
  solver.run(testData, tmp, eigenvectors, sampleData.size(),
      sampleDimension, false, desiredRank);
  // now multiply the testdata matrix and the eigenvector matrix
  DistributedRowMatrix svdT =
      new DistributedRowMatrix(eigenvectors, tmp, desiredRank - 1, sampleDimension);
  JobConf conf = new JobConf(config);
  svdT.configure(conf);
  DistributedRowMatrix a =
      new DistributedRowMatrix(testData, tmp, sampleData.size(), sampleDimension);
  a.configure(conf);
  // DMR.times() is really transposeTimes()? Then this should work.
  DistributedRowMatrix sData = a.times(svdT);
  sData.configure(conf);
  // now run the Canopy job to prime kMeans canopies
  CanopyDriver.runJob(sData.getRowPath(), output, measure, 8, 4, false, false);
  // now run the KMeans job
  KMeansDriver.runJob(sData.getRowPath(), new Path(output, "clusters-0"),
      output, measure, 0.001, 10, 1, true, false);
  // run ClusterDumper
  ClusterDumper clusterDumper = new ClusterDumper(new Path(output, "clusters-2"),
      new Path(output, "clusteredPoints"));
  clusterDumper.printClusters(termDictionary);
}
On 9/2/10 10:10 PM, Ted Dunning wrote:
I think that the solver actually does an SVD, but most of what you say follows.
There is one strangeness, I think, in that DistributedRowMatrix.times is doing a transposeTimes operation, not the normal times.
Jake should comment.
On Thu, Sep 2, 2010 at 8:28 PM, Jeff Eastman <[email protected]> wrote:
On 9/2/10 7:41 PM, Jeff Eastman wrote:
Hopefully answering my own question here but ending up with another. The svd matrix I'd built from the eigenvectors is the wrong shape as I built it. Taking Jake's "column space" literally and building a matrix where each of the columns is one of the eigenvectors does give a matrix of the correct shape. The math works with DenseMatrix, producing a new data matrix which is 15x7; a significant dimensionality reduction from 15x39.
In this example, with 15 samples having 39 terms and 7 eigenvectors:
A = [15x39]
P = [39x7]
A P = [15x7]
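Those shapes can be verified with a throwaway plain-Java multiply (a sketch with zero-filled matrices, not the DenseMatrix code; ProjectionShapes is a made-up name):

```java
public class ProjectionShapes {

    // ordinary matrix product A * P: requires cols(a) == rows(p)
    static double[][] times(double[][] a, double[][] p) {
        if (a[0].length != p.length) {
            throw new IllegalArgumentException(a[0].length + " != " + p.length);
        }
        double[][] c = new double[a.length][p[0].length];
        for (int i = 0; i < a.length; i++)
            for (int k = 0; k < p.length; k++)
                for (int j = 0; j < p[0].length; j++)
                    c[i][j] += a[i][k] * p[k][j];
        return c;
    }

    public static void main(String[] args) {
        double[][] a = new double[15][39]; // 15 samples, 39 terms
        double[][] p = new double[39][7];  // 7 eigenvectors as columns
        double[][] ap = times(a, p);
        System.out.println(ap.length + "x" + ap[0].length); // 15x7
    }
}
```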
<snip>
Representing the eigen decomposition math in the above notation, A P is the projection of the data set onto the eigenvector basis:
If:
A = original data matrix
P = eigenvector column matrix
D = eigenvalue diagonal matrix
Then:
A P = P D => A = P D P'
Since we have A, and P is already calculated by DistributedLanczosSolver, it is easy to compute A P and we don't need the eigenvalues at all. This is good because the DLS does not output them. Is this why it doesn't bother?
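For a square symmetric A with orthonormal eigenvector columns (the case where the A P = P D identity as written applies), the identity can be checked numerically with a tiny 2x2 example (plain Java with hand-picked illustrative values, not Mahout code; EigenIdentity is a made-up name):

```java
public class EigenIdentity {

    // ordinary matrix product
    static double[][] times(double[][] x, double[][] y) {
        double[][] c = new double[x.length][y[0].length];
        for (int i = 0; i < x.length; i++)
            for (int k = 0; k < y.length; k++)
                for (int j = 0; j < y[0].length; j++)
                    c[i][j] += x[i][k] * y[k][j];
        return c;
    }

    public static void main(String[] args) {
        double s = Math.sqrt(0.5);
        double[][] a = {{2, 1}, {1, 2}};  // symmetric matrix, eigenvalues 3 and 1
        double[][] p = {{s, s}, {s, -s}}; // orthonormal eigenvector columns
        double[][] d = {{3, 0}, {0, 1}};  // eigenvalue diagonal matrix
        double[][] ap = times(a, p);
        double[][] pd = times(p, d);
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++)
                if (Math.abs(ap[i][j] - pd[i][j]) > 1e-12)
                    throw new AssertionError("A P != P D at " + i + "," + j);
        System.out.println("A P == P D holds");
    }
}
```

Since A = P D P' follows only by multiplying both sides by P', the projection A P needs just A and P, which is consistent with the observation that the eigenvalues are never required.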