Ok, the transposed computation seems to work and the cast exception
was caused by my unit test writing LongWritable keys to the testdata
file. The following test produces a comparable answer to the
non-distributed case. I still want to rename the method to
transposeTimes for clarity and, better yet, to implement timesTranspose
so this particular computation can be done more efficiently:
public void testKmeansDSVD() throws Exception {
  DistanceMeasure measure = new EuclideanDistanceMeasure();
  Path output = getTestTempDirPath("output");
  Path tmp = getTestTempDirPath("tmp");
  Path eigenvectors = new Path(output, "eigenvectors");
  int desiredRank = 13;
  DistributedLanczosSolver solver = new DistributedLanczosSolver();
  Configuration config = new Configuration();
  solver.setConf(config);
  Path testData = getTestTempDirPath("testdata");
  int sampleDimension = sampleData.get(0).get().size();
  solver.run(testData, tmp, eigenvectors, sampleData.size(),
      sampleDimension, false, desiredRank);

  // now multiply the testdata matrix and the eigenvector matrix
  DistributedRowMatrix svdT = new DistributedRowMatrix(eigenvectors,
      tmp, desiredRank - 1, sampleDimension);
  JobConf conf = new JobConf(config);
  svdT.configure(conf);
  DistributedRowMatrix a = new DistributedRowMatrix(testData, tmp,
      sampleData.size(), sampleDimension);
  a.configure(conf);
  DistributedRowMatrix sData = a.transpose().times(svdT.transpose());
  sData.configure(conf);

  // now run the Canopy job to prime kMeans canopies
  CanopyDriver.runJob(sData.getRowPath(), output, measure, 8, 4,
      false, false);

  // now run the KMeans job
  KMeansDriver.runJob(sData.getRowPath(),
      new Path(output, "clusters-0"), output, measure, 0.001, 10, 1,
      true, false);

  // run the ClusterDumper
  ClusterDumper clusterDumper = new ClusterDumper(
      new Path(output, "clusters-2"),
      new Path(output, "clusteredPoints"));
  clusterDumper.printClusters(termDictionary);
}
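As an aside, the transposeTimes / timesTranspose distinction can be sketched in plain Java (the helper names below mirror the proposed method names; none of this is Mahout code): a method that computes A^T * B earns the name transposeTimes, while A * B^T would be timesTranspose, and for non-square inputs the two don't even produce the same shape:

```java
public class TransposeTimesDemo {

  // A^T * B : what DRM.times() apparently computes today, hence "transposeTimes"
  static double[][] transposeTimes(double[][] a, double[][] b) {
    int k = a.length, n = a[0].length, m = b[0].length; // a is k x n, b is k x m
    double[][] c = new double[n][m];
    for (int i = 0; i < n; i++)
      for (int j = 0; j < m; j++)
        for (int t = 0; t < k; t++)
          c[i][j] += a[t][i] * b[t][j];
    return c;
  }

  // A * B^T : the variant proposed for this computation, "timesTranspose"
  static double[][] timesTranspose(double[][] a, double[][] b) {
    int n = a.length, m = b.length, k = a[0].length; // a is n x k, b is m x k
    double[][] c = new double[n][m];
    for (int i = 0; i < n; i++)
      for (int j = 0; j < m; j++)
        for (int t = 0; t < k; t++)
          c[i][j] += a[i][t] * b[j][t];
    return c;
  }

  public static void main(String[] args) {
    double[][] a = {{1, 2, 3}, {4, 5, 6}};   // 2 x 3
    double[][] b = {{1, 0, 1}, {0, 1, 1}};   // 2 x 3
    double[][] atb = transposeTimes(a, b);
    double[][] abt = timesTranspose(a, b);
    System.out.println(atb.length + "x" + atb[0].length); // prints 3x3
    System.out.println(abt.length + "x" + abt[0].length); // prints 2x2
  }
}
```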
On 9/3/10 7:54 AM, Jeff Eastman wrote:
Looking at the single unit test of DRM.times(), it seems to be
implementing Matrix expected = inputA.transpose().times(inputB), and
not inputA.times(inputB.transpose()), so the bounds checking is
correct as implemented. But the method still has the wrong name and,
AFAICT, is not useful for performing this particular computation.
Should I use this instead?
DistributedRowMatrix sData = a.transpose().t[ransposeT]imes(svdT.transpose())
ugh! And it still fails with:
java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable
    at org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:1)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
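For the record, the failure mode behind that trace can be mimicked without Hadoop at all (the two key classes below are stand-ins for LongWritable/IntWritable, not Hadoop types): the mapper blindly casts its incoming key, so test data written with the wrong key class fails at the cast, not at write time:

```java
public class CastDemo {
  // Minimal stand-ins for Hadoop's Writable key types (assumptions, not Hadoop classes)
  static class LongKey { final long v; LongKey(long v) { this.v = v; } }
  static class IntKey  { final int v;  IntKey(int v)  { this.v = v; } }

  // Mimics TransposeJob's mapper, which casts the incoming key to IntWritable
  static int mapKey(Object key) {
    return ((IntKey) key).v; // ClassCastException if the data was written with LongKey
  }

  public static void main(String[] args) {
    System.out.println(mapKey(new IntKey(7))); // prints 7
    try {
      mapKey(new LongKey(7L));                 // what the buggy test data triggered
    } catch (ClassCastException e) {
      System.out.println("ClassCastException, as in the stack trace");
    }
  }
}
```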