To put this in bin/mahout speak, it would look something like this, munging
some names and taking liberties with the actual arguments to be passed in:
bin/mahout svd (original -> svdOut)
bin/mahout cleansvd ...
bin/mahout transpose svdOut -> svdT
bin/mahout transpose original -> originalT
bin/mahout matrixmult originalT svdT -> newMatrix
bin/mahout kmeans newMatrix
Is that about right?
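To sanity-check the shapes, here is a small plain-Java sketch of what I think the two transposes plus the multiply end up computing. It uses no Mahout classes: transpose() and transposeTimes() below are made-up helpers, and it assumes (per Jeff's note below) that the current times() really computes this^T * other. Under that assumption, the result is original * svd^T, i.e. one row per input point and one column per eigenvector.

```java
import java.util.Arrays;

// Plain-Java illustration only; these helpers are hypothetical, not Mahout APIs.
class ProjectionSketch {

  // Returns the transpose of m (rows become columns).
  static double[][] transpose(double[][] m) {
    double[][] t = new double[m[0].length][m.length];
    for (int i = 0; i < m.length; i++) {
      for (int j = 0; j < m[0].length; j++) {
        t[j][i] = m[i][j];
      }
    }
    return t;
  }

  // Computes x^T * y. This mirrors what the existing times() is assumed to do;
  // x and y must have the same number of rows.
  static double[][] transposeTimes(double[][] x, double[][] y) {
    if (x.length != y.length) {
      throw new IllegalArgumentException("row counts differ");
    }
    double[][] out = new double[x[0].length][y[0].length];
    for (int k = 0; k < x.length; k++) {
      for (int i = 0; i < x[0].length; i++) {
        for (int j = 0; j < y[0].length; j++) {
          out[i][j] += x[k][i] * y[k][j];
        }
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // 2 points in 3 dimensions (stands in for "original").
    double[][] original = {{1, 2, 3}, {4, 5, 6}};
    // 2 basis vectors stored as rows (stands in for "svdOut").
    double[][] svd = {{1, 0, 0}, {0, 1, 0}};

    // (original^T)^T * svd^T == original * svd^T: a 2 x 2 projected matrix.
    double[][] projected = transposeTimes(transpose(original), transpose(svd));
    System.out.println(Arrays.deepToString(projected)); // [[1.0, 2.0], [4.0, 5.0]]
  }
}
```

If that reading is right, newMatrix has as many rows as the original and (rank - 1) columns, which is what kmeans should be fed.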
On Sep 3, 2010, at 11:19 AM, Jeff Eastman wrote:
> Ok, the transposed computation seems to work and the cast exception was
> caused by my unit test writing LongWritable keys to the testdata file. The
> following test produces a comparable answer to the non-distributed case. I
> still want to rename the method to transposeTimes for clarity. And better,
> implement timesTranspose to make this particular computation more efficient:
>
> public void testKmeansDSVD() throws Exception {
>   DistanceMeasure measure = new EuclideanDistanceMeasure();
>   Path output = getTestTempDirPath("output");
>   Path tmp = getTestTempDirPath("tmp");
>   Path eigenvectors = new Path(output, "eigenvectors");
>   int desiredRank = 13;
>   DistributedLanczosSolver solver = new DistributedLanczosSolver();
>   Configuration config = new Configuration();
>   solver.setConf(config);
>   Path testData = getTestTempDirPath("testdata");
>   int sampleDimension = sampleData.get(0).get().size();
>   solver.run(testData, tmp, eigenvectors, sampleData.size(),
>       sampleDimension, false, desiredRank);
>
>   // now multiply the testdata matrix and the eigenvector matrix
>   DistributedRowMatrix svdT = new DistributedRowMatrix(eigenvectors, tmp,
>       desiredRank - 1, sampleDimension);
>   JobConf conf = new JobConf(config);
>   svdT.configure(conf);
>   DistributedRowMatrix a = new DistributedRowMatrix(testData, tmp,
>       sampleData.size(), sampleDimension);
>   a.configure(conf);
>   DistributedRowMatrix sData = a.transpose().times(svdT.transpose());
>   sData.configure(conf);
>
>   // now run the Canopy job to prime kMeans canopies
>   CanopyDriver.runJob(sData.getRowPath(), output, measure, 8, 4, false,
>       false);
>   // now run the KMeans job
>   KMeansDriver.runJob(sData.getRowPath(), new Path(output, "clusters-0"),
>       output, measure, 0.001, 10, 1, true, false);
>   // run ClusterDumper
>   ClusterDumper clusterDumper = new ClusterDumper(new Path(output,
>       "clusters-2"), new Path(output, "clusteredPoints"));
>   clusterDumper.printClusters(termDictionary);
> }
>
> On 9/3/10 7:54 AM, Jeff Eastman wrote:
>> Looking at the single unit test of DRM.times() it seems to be implementing
>> Matrix expected = inputA.transpose().times(inputB), and not
>> inputA.times(inputB.transpose()), so the bounds checking is correct as
>> implemented. But the method still has the wrong name and AFAICT is not
>> useful for performing this particular computation. Should I use this instead?
>>
>> DistributedRowMatrix sData = a.transpose().t[ransposeT]imes(svdT.transpose())
>>
>> ugh! And it still fails with:
>>
>> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be
>> cast to org.apache.hadoop.io.IntWritable
>> at org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:1)
>> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
--------------------------
Grant Ingersoll