I think you were translating. But the last multiply is still redundant, I think.
On Sat, Sep 11, 2010 at 4:55 PM, Grant Ingersoll <[email protected]> wrote:

> On Sep 11, 2010, at 5:50 PM, Ted Dunning wrote:
>
> > Should be close. The matrixMult step may be redundant if you want to
> > cluster the same data that you decomposed. That would make the second
> > transpose unnecessary as well.
>
> Hmm, I thought I was just translating what Jeff had done below,
> specifically:
>
> >>> DistributedRowMatrix sData = a.transpose().times(svdT.transpose());
> >>> sData.configure(conf);
> >>>
> >>> // now run the Canopy job to prime kMeans canopies
> >>> CanopyDriver.runJob(sData.getRowPath(), output, measure, 8, 4, false,
> >>>     false);
> >>> // now run the KMeans job
> >>> KMeansDriver.runJob(sData.getRowPath(), new Path(output,
>
> > On Sat, Sep 11, 2010 at 2:43 PM, Grant Ingersoll <[email protected]> wrote:
> >
> >> To put this in bin/mahout speak, this would look like, munging some names
> >> and taking liberties with the actual arguments to be passed in:
> >>
> >> bin/mahout svd (original -> svdOut)
> >> bin/mahout cleansvd ...
> >> bin/mahout transpose svdOut -> svdT
> >> bin/mahout transpose original -> originalT
> >> bin/mahout matrixmult originalT svdT -> newMatrix
> >> bin/mahout kmeans newMatrix
> >>
> >> Is that about right?
> >>
> >> On Sep 3, 2010, at 11:19 AM, Jeff Eastman wrote:
> >>
> >>> Ok, the transposed computation seems to work and the cast exception was
> >>> caused by my unit test writing LongWritable keys to the testdata file.
> >>> The following test produces a comparable answer to the non-distributed
> >>> case. I still want to rename the method to transposeTimes for clarity.
> >>> And better, implement timesTranspose to make this particular computation
> >>> more efficient:
> >>>
> >>> public void testKmeansDSVD() throws Exception {
> >>>   DistanceMeasure measure = new EuclideanDistanceMeasure();
> >>>   Path output = getTestTempDirPath("output");
> >>>   Path tmp = getTestTempDirPath("tmp");
> >>>   Path eigenvectors = new Path(output, "eigenvectors");
> >>>   int desiredRank = 13;
> >>>   DistributedLanczosSolver solver = new DistributedLanczosSolver();
> >>>   Configuration config = new Configuration();
> >>>   solver.setConf(config);
> >>>   Path testData = getTestTempDirPath("testdata");
> >>>   int sampleDimension = sampleData.get(0).get().size();
> >>>   solver.run(testData, tmp, eigenvectors, sampleData.size(),
> >>>       sampleDimension, false, desiredRank);
> >>>
> >>>   // now multiply the testdata matrix and the eigenvector matrix
> >>>   DistributedRowMatrix svdT = new DistributedRowMatrix(eigenvectors,
> >>>       tmp, desiredRank - 1, sampleDimension);
> >>>   JobConf conf = new JobConf(config);
> >>>   svdT.configure(conf);
> >>>   DistributedRowMatrix a = new DistributedRowMatrix(testData, tmp,
> >>>       sampleData.size(), sampleDimension);
> >>>   a.configure(conf);
> >>>   DistributedRowMatrix sData = a.transpose().times(svdT.transpose());
> >>>   sData.configure(conf);
> >>>
> >>>   // now run the Canopy job to prime kMeans canopies
> >>>   CanopyDriver.runJob(sData.getRowPath(), output, measure, 8, 4, false,
> >>>       false);
> >>>   // now run the KMeans job
> >>>   KMeansDriver.runJob(sData.getRowPath(), new Path(output,
> >>>       "clusters-0"), output, measure, 0.001, 10, 1, true, false);
> >>>   // run ClusterDumper
> >>>   ClusterDumper clusterDumper = new ClusterDumper(new Path(output,
> >>>       "clusters-2"), new Path(output, "clusteredPoints"));
> >>>   clusterDumper.printClusters(termDictionary);
> >>> }
> >>>
> >>> On 9/3/10 7:54 AM, Jeff Eastman wrote:
> >>>> Looking at the single unit test of DRM.times() it seems to be
> >>>> implementing Matrix expected = inputA.transpose().times(inputB), and
> >>>> not inputA.times(inputB.transpose()), so the bounds checking is correct
> >>>> as implemented. But the method still has the wrong name and AFAICT is
> >>>> not useful for performing this particular computation. Should I use
> >>>> this instead?
> >>>>
> >>>> DistributedRowMatrix sData =
> >>>>     a.transpose().t[ransposeT]imes(svdT.transpose())
> >>>>
> >>>> ugh! And it still fails with:
> >>>>
> >>>> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot
> >>>> be cast to org.apache.hadoop.io.IntWritable
> >>>>   at org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:1)
> >>>>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> >>>>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> >>>>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> >>>>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> >>
> >> --------------------------
> >> Grant Ingersoll
> >> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
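[For readers following the naming confusion above: the thread's point is that DistributedRowMatrix.times(B) behaves as thisᵀ·B, i.e. "transposeTimes". A minimal in-memory sketch of that semantics — the class and method names here are hypothetical, with no Mahout or Hadoop dependencies, purely to illustrate why `a.transpose().times(...)` ends up computing A·B:]

```java
// Plain-Java sketch (hypothetical class, no Mahout/Hadoop code) of the
// semantics discussed in the thread: times(B) computes this^T * B.
public class TransposeTimesDemo {

    // Computes A^T * B: out[i][j] = sum_k A[k][i] * B[k][j].
    static double[][] transposeTimes(double[][] a, double[][] b) {
        int rows = a[0].length;  // columns of A become rows of A^T
        int cols = b[0].length;
        double[][] out = new double[rows][cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                for (int k = 0; k < a.length; k++) {
                    out[i][j] += a[k][i] * b[k][j];
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        double[][] c = transposeTimes(a, b);
        // A^T * B = [[26, 30], [38, 46]] — not A * B = [[19, 22], [43, 50]]
        System.out.println(c[0][0] + " " + c[0][1] + " " + c[1][0] + " " + c[1][1]);
    }
}
```

[With this reading, Jeff's expression `a.transpose().times(svdT.transpose())` reduces to (Aᵀ)ᵀ·svdTᵀ = A·svdTᵀ, which is why renaming the method to transposeTimes would make the computation far less surprising.]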

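[And for the projection itself: the test above builds sData = A·Vᵀ, where the rows of V are the eigenvectors produced by the Lanczos step, so each input row is reduced from sampleDimension features down to desiredRank − 1 before Canopy/k-means runs. A tiny in-memory illustration — hypothetical class name, toy dimensions, no Mahout dependencies:]

```java
// Plain-Java sketch of the projection in the thread's test: each row of A
// (samples x features) is projected onto the eigenvector rows of V
// (rank x features), giving sData (samples x rank) for clustering.
public class ProjectionDemo {

    static double[][] multiply(double[][] a, double[][] b) {
        double[][] out = new double[a.length][b[0].length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < b[0].length; j++)
                for (int k = 0; k < b.length; k++)
                    out[i][j] += a[i][k] * b[k][j];
        return out;
    }

    static double[][] transpose(double[][] m) {
        double[][] t = new double[m[0].length][m.length];
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < m[0].length; j++)
                t[j][i] = m[i][j];
        return t;
    }

    public static void main(String[] args) {
        // A: 2 samples x 3 features; V: 2 eigenvectors x 3 features.
        double[][] a = {{1, 0, 2}, {0, 3, 0}};
        double[][] v = {{1, 0, 0}, {0, 1, 0}};
        // sData = A * V^T: 2 samples x 2 reduced dimensions.
        double[][] sData = multiply(a, transpose(v));
        System.out.println(sData.length + "x" + sData[0].length);
    }
}
```

[This is also the shape of Ted's remark at the top of the thread: if the goal is to cluster the very data that was decomposed, the explicit matrixmult (and the transposes feeding it) may be avoidable.]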