Ok, the transposed computation seems to work and the cast exception
was caused by my unit test writing LongWritable keys to the testdata
file. The following test produces a comparable answer to the
non-distributed case. I still want to rename the method to
transposeTimes for clarity and, better yet, to implement timesTranspose
so this particular computation can be done more efficiently:
public void testKmeansDSVD() throws Exception {
  DistanceMeasure measure = new EuclideanDistanceMeasure();
  Path output = getTestTempDirPath("output");
  Path tmp = getTestTempDirPath("tmp");
  Path eigenvectors = new Path(output, "eigenvectors");
  int desiredRank = 13;
  DistributedLanczosSolver solver = new DistributedLanczosSolver();
  Configuration config = new Configuration();
  solver.setConf(config);
  Path testData = getTestTempDirPath("testdata");
  int sampleDimension = sampleData.get(0).get().size();
  solver.run(testData, tmp, eigenvectors, sampleData.size(),
      sampleDimension, false, desiredRank);

  // now multiply the testdata matrix and the eigenvector matrix
  DistributedRowMatrix svdT = new DistributedRowMatrix(eigenvectors,
      tmp, desiredRank - 1, sampleDimension);
  JobConf conf = new JobConf(config);
  svdT.configure(conf);
  DistributedRowMatrix a = new DistributedRowMatrix(testData, tmp,
      sampleData.size(), sampleDimension);
  a.configure(conf);
  DistributedRowMatrix sData = a.transpose().times(svdT.transpose());
  sData.configure(conf);

  // now run the Canopy job to prime kMeans canopies
  CanopyDriver.runJob(sData.getRowPath(), output, measure, 8, 4,
      false, false);

  // now run the KMeans job
  KMeansDriver.runJob(sData.getRowPath(),
      new Path(output, "clusters-0"), output, measure, 0.001, 10, 1,
      true, false);

  // run the ClusterDumper
  ClusterDumper clusterDumper = new ClusterDumper(
      new Path(output, "clusters-2"),
      new Path(output, "clusteredPoints"));
  clusterDumper.printClusters(termDictionary);
}
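As an aside, the transposeTimes / timesTranspose distinction can be sketched in plain Java (the helper names below mirror the proposed method names; none of this is Mahout code): a method that computes A^T * B earns the name transposeTimes, while A * B^T would be timesTranspose, and for non-square inputs the two don't even produce the same shape:

```java
public class TransposeTimesDemo {

  // A^T * B : what DRM.times() apparently computes today, hence "transposeTimes"
  static double[][] transposeTimes(double[][] a, double[][] b) {
    int k = a.length, n = a[0].length, m = b[0].length; // a is k x n, b is k x m
    double[][] c = new double[n][m];
    for (int i = 0; i < n; i++)
      for (int j = 0; j < m; j++)
        for (int t = 0; t < k; t++)
          c[i][j] += a[t][i] * b[t][j];
    return c;
  }

  // A * B^T : the variant proposed for this computation, "timesTranspose"
  static double[][] timesTranspose(double[][] a, double[][] b) {
    int n = a.length, m = b.length, k = a[0].length; // a is n x k, b is m x k
    double[][] c = new double[n][m];
    for (int i = 0; i < n; i++)
      for (int j = 0; j < m; j++)
        for (int t = 0; t < k; t++)
          c[i][j] += a[i][t] * b[j][t];
    return c;
  }

  public static void main(String[] args) {
    double[][] a = {{1, 2, 3}, {4, 5, 6}};   // 2 x 3
    double[][] b = {{1, 0, 1}, {0, 1, 1}};   // 2 x 3
    double[][] atb = transposeTimes(a, b);
    double[][] abt = timesTranspose(a, b);
    System.out.println(atb.length + "x" + atb[0].length); // prints 3x3
    System.out.println(abt.length + "x" + abt[0].length); // prints 2x2
  }
}
```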
On 9/3/10 7:54 AM, Jeff Eastman wrote:
Looking at the single unit test of DRM.times(), it seems to be
implementing Matrix expected = inputA.transpose().times(inputB), and
not inputA.times(inputB.transpose()), so the bounds checking is
correct as implemented. But the method still has the wrong name and,
AFAICT, is not useful for performing this particular computation.
Should I use this instead?
DistributedRowMatrix sData = a.transpose().t[ransposeT]imes(svdT.transpose())
ugh! And it still fails with:
java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable
    at org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:1)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
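For the record, the failure mode behind that trace can be mimicked without Hadoop at all (the two key classes below are stand-ins for LongWritable/IntWritable, not Hadoop types): the mapper blindly casts its incoming key, so test data written with the wrong key class fails at the cast, not at write time:

```java
public class CastDemo {
  // Minimal stand-ins for Hadoop's Writable key types (assumptions, not Hadoop classes)
  static class LongKey { final long v; LongKey(long v) { this.v = v; } }
  static class IntKey  { final int v;  IntKey(int v)  { this.v = v; } }

  // Mimics TransposeJob's mapper, which casts the incoming key to IntWritable
  static int mapKey(Object key) {
    return ((IntKey) key).v; // ClassCastException if the data was written with LongKey
  }

  public static void main(String[] args) {
    System.out.println(mapKey(new IntKey(7))); // prints 7
    try {
      mapKey(new LongKey(7L));                 // what the buggy test data triggered
    } catch (ClassCastException e) {
      System.out.println("ClassCastException, as in the stack trace");
    }
  }
}
```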