To put this in bin/mahout speak, munging some names and taking liberties
with the actual arguments to be passed in, this would look like:
bin/mahout svd (original -> svdOut)
bin/mahout cleansvd ...
bin/mahout transpose svdOut -> svdT
bin/mahout transpose original -> originalT
bin/mahout matrixmult originalT svdT -> newMatrix
bin/mahout kmeans newMatrix
Is that about right?
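If I'm reading it right, the net effect of the transpose/matrixmult steps is newMatrix = original * svdOut^T, i.e. each input row dotted against each eigenvector row. A minimal non-distributed sketch with plain arrays (class and method names here are illustrative stand-ins, not Mahout API):

```java
// Non-distributed sketch of what the pipeline computes end to end:
// newMatrix = original * svdOut^T, i.e. row i of the data dotted with
// row j of the eigenvector matrix. All names here are illustrative.
public class ProjectionSketch {

    // a is m x n (data rows), b is k x n (eigenvector rows); result is m x k.
    static double[][] timesTranspose(double[][] a, double[][] b) {
        double[][] out = new double[a.length][b.length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < b.length; j++) {
                double dot = 0.0;
                for (int t = 0; t < a[i].length; t++) {
                    dot += a[i][t] * b[j][t];
                }
                out[i][j] = dot;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] original = {{1, 2}, {3, 4}, {5, 6}};  // 3 points in 2-D
        double[][] svdOut = {{1, 0}, {0, 1}};            // 2 (trivial) eigenvector rows
        double[][] newMatrix = timesTranspose(original, svdOut);
        // 3 points projected into the 2-D eigenspace
        System.out.println(newMatrix.length + " x " + newMatrix[0].length);  // prints "3 x 2"
    }
}
```

With real eigenvectors from the Lanczos run, newMatrix is what the kmeans step would then cluster.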
On Sep 3, 2010, at 11:19 AM, Jeff Eastman wrote:
Ok, the transposed computation seems to work and the cast exception was
caused by my unit test writing LongWritable keys to the testdata file. The
following test produces a comparable answer to the non-distributed case. I
still want to rename the method to transposeTimes for clarity. Better
still, implementing timesTranspose would make this particular computation
more efficient:
public void testKmeansDSVD() throws Exception {
  DistanceMeasure measure = new EuclideanDistanceMeasure();
  Path output = getTestTempDirPath("output");
  Path tmp = getTestTempDirPath("tmp");
  Path eigenvectors = new Path(output, "eigenvectors");
  int desiredRank = 13;
  DistributedLanczosSolver solver = new DistributedLanczosSolver();
  Configuration config = new Configuration();
  solver.setConf(config);
  Path testData = getTestTempDirPath("testdata");
  int sampleDimension = sampleData.get(0).get().size();
  solver.run(testData, tmp, eigenvectors, sampleData.size(),
      sampleDimension, false, desiredRank);
  // now multiply the testdata matrix and the eigenvector matrix
  DistributedRowMatrix svdT = new DistributedRowMatrix(eigenvectors,
      tmp, desiredRank - 1, sampleDimension);
  JobConf conf = new JobConf(config);
  svdT.configure(conf);
  DistributedRowMatrix a = new DistributedRowMatrix(testData, tmp,
      sampleData.size(), sampleDimension);
  a.configure(conf);
  DistributedRowMatrix sData = a.transpose().times(svdT.transpose());
  sData.configure(conf);
  // now run the Canopy job to prime kMeans canopies
  CanopyDriver.runJob(sData.getRowPath(), output, measure, 8, 4, false,
      false);
  // now run the KMeans job
  KMeansDriver.runJob(sData.getRowPath(), new Path(output, "clusters-0"),
      output, measure, 0.001, 10, 1, true, false);
  // run ClusterDumper
  ClusterDumper clusterDumper = new ClusterDumper(new Path(output,
      "clusters-2"), new Path(output, "clusteredPoints"));
  clusterDumper.printClusters(termDictionary);
}
On 9/3/10 7:54 AM, Jeff Eastman wrote:
Looking at the single unit test of DRM.times(), it seems to be
implementing Matrix expected = inputA.transpose().times(inputB), not
inputA.times(inputB.transpose()), so the bounds checking is correct as
implemented. But the method still has the wrong name and AFAICT is not
useful for performing this particular computation. Should I use this
instead?
DistributedRowMatrix sData =
a.transpose().t[ransposeT]imes(svdT.transpose())
ugh! And it still fails with:
java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable
  at org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:1)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
  at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
  at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
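A toy way to see the naming problem: a multiply whose bounds check requires the two operands to have the same row count can only be computing A^T * B. A plain-array sketch of that semantics (helper names are mine, not Mahout's):

```java
public class TransposeTimesSketch {

    // What DRM.times() apparently implements: out = a^T * b.
    // a is m x n, b is m x p (note: same row count m), result is n x p.
    static double[][] transposeTimes(double[][] a, double[][] b) {
        int m = a.length, n = a[0].length, p = b[0].length;
        double[][] out = new double[n][p];
        for (int t = 0; t < m; t++) {           // stream the m shared rows
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < p; j++) {
                    out[i][j] += a[t][i] * b[t][j];
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}, {5, 6}};  // 3 x 2
        double[][] b = {{1, 0}, {0, 1}, {1, 1}};  // 3 x 2: row counts match
        double[][] out = transposeTimes(a, b);    // a^T * b is 2 x 2
        System.out.println(out[0][0] + " " + out[0][1]);  // prints "6.0 8.0"
    }
}
```

So for the projection in the test above, where the data and eigenvector matrices share a *column* count instead, the method as named (and bounds-checked) doesn't apply without the extra transposes.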
--------------------------
Grant Ingersoll
http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8