I think you were translating. But the last multiply is still redundant, I think.
On Sat, Sep 11, 2010 at 4:55 PM, Grant Ingersoll <[email protected]> wrote:

> On Sep 11, 2010, at 5:50 PM, Ted Dunning wrote:
>
> > Should be close. The matrixMult step may be redundant if you want to
> > cluster the same data that you decomposed. That would make the second
> > transpose unnecessary as well.
>
> Hmm, I thought I was just translating what Jeff had done below,
> specifically:
>
> >>> DistributedRowMatrix sData = a.transpose().times(svdT.transpose());
> >>> sData.configure(conf);
> >>>
> >>> // now run the Canopy job to prime kMeans canopies
> >>> CanopyDriver.runJob(sData.getRowPath(), output, measure, 8, 4, false,
> >>>     false);
> >>> // now run the KMeans job
> >>> KMeansDriver.runJob(sData.getRowPath(), new Path(output,
>
> > On Sat, Sep 11, 2010 at 2:43 PM, Grant Ingersoll <[email protected]> wrote:
> >
> >> To put this in bin/mahout speak, this would look like, munging some names
> >> and taking liberties with the actual arguments to be passed in:
> >>
> >> bin/mahout svd (original -> svdOut)
> >> bin/mahout cleansvd ...
> >> bin/mahout transpose svdOut -> svdT
> >> bin/mahout transpose original -> originalT
> >> bin/mahout matrixmult originalT svdT -> newMatrix
> >> bin/mahout kmeans newMatrix
> >>
> >> Is that about right?
> >>
> >> On Sep 3, 2010, at 11:19 AM, Jeff Eastman wrote:
> >>
> >>> Ok, the transposed computation seems to work and the cast exception was
> >>> caused by my unit test writing LongWritable keys to the testdata file.
> >>> The following test produces a comparable answer to the non-distributed
> >>> case. I still want to rename the method to transposeTimes for clarity.
> >>> And better, implement timesTranspose to make this particular computation
> >>> more efficient:
> >>>
> >>> public void testKmeansDSVD() throws Exception {
> >>>   DistanceMeasure measure = new EuclideanDistanceMeasure();
> >>>   Path output = getTestTempDirPath("output");
> >>>   Path tmp = getTestTempDirPath("tmp");
> >>>   Path eigenvectors = new Path(output, "eigenvectors");
> >>>   int desiredRank = 13;
> >>>   DistributedLanczosSolver solver = new DistributedLanczosSolver();
> >>>   Configuration config = new Configuration();
> >>>   solver.setConf(config);
> >>>   Path testData = getTestTempDirPath("testdata");
> >>>   int sampleDimension = sampleData.get(0).get().size();
> >>>   solver.run(testData, tmp, eigenvectors, sampleData.size(),
> >>>       sampleDimension, false, desiredRank);
> >>>
> >>>   // now multiply the testdata matrix and the eigenvector matrix
> >>>   DistributedRowMatrix svdT = new DistributedRowMatrix(eigenvectors,
> >>>       tmp, desiredRank - 1, sampleDimension);
> >>>   JobConf conf = new JobConf(config);
> >>>   svdT.configure(conf);
> >>>   DistributedRowMatrix a = new DistributedRowMatrix(testData, tmp,
> >>>       sampleData.size(), sampleDimension);
> >>>   a.configure(conf);
> >>>   DistributedRowMatrix sData = a.transpose().times(svdT.transpose());
> >>>   sData.configure(conf);
> >>>
> >>>   // now run the Canopy job to prime kMeans canopies
> >>>   CanopyDriver.runJob(sData.getRowPath(), output, measure, 8, 4, false,
> >>>       false);
> >>>   // now run the KMeans job
> >>>   KMeansDriver.runJob(sData.getRowPath(), new Path(output,
> >>>       "clusters-0"), output, measure, 0.001, 10, 1, true, false);
> >>>   // run ClusterDumper
> >>>   ClusterDumper clusterDumper = new ClusterDumper(new Path(output,
> >>>       "clusters-2"), new Path(output, "clusteredPoints"));
> >>>   clusterDumper.printClusters(termDictionary);
> >>> }
> >>>
> >>> On 9/3/10 7:54 AM, Jeff Eastman wrote:
> >>>> Looking at the single unit test of DRM.times() it seems to be
> >>>> implementing Matrix expected = inputA.transpose().times(inputB), and
> >>>> not inputA.times(inputB.transpose()), so the bounds checking is correct
> >>>> as implemented. But the method still has the wrong name and AFAICT is
> >>>> not useful for performing this particular computation. Should I use
> >>>> this instead?
> >>>>
> >>>> DistributedRowMatrix sData =
> >>>>     a.transpose().t[ransposeT]imes(svdT.transpose())
> >>>>
> >>>> ugh! And it still fails with:
> >>>>
> >>>> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot
> >>>> be cast to org.apache.hadoop.io.IntWritable
> >>>>   at org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:1)
> >>>>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> >>>>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> >>>>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> >>>>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> >>
> >> --------------------------
> >> Grant Ingersoll
> >> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
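[For readers following the naming confusion above: the thread's point is that DistributedRowMatrix.times(B) behaves as thisᵀ·B, i.e. "transposeTimes". A minimal in-memory sketch of that semantics — the class and method names here are hypothetical, with no Mahout or Hadoop dependencies, purely to illustrate why `a.transpose().times(...)` ends up computing A·B:]

```java
// Plain-Java sketch (hypothetical class, no Mahout/Hadoop code) of the
// semantics discussed in the thread: times(B) computes this^T * B.
public class TransposeTimesDemo {

    // Computes A^T * B: out[i][j] = sum_k A[k][i] * B[k][j].
    static double[][] transposeTimes(double[][] a, double[][] b) {
        int rows = a[0].length;  // columns of A become rows of A^T
        int cols = b[0].length;
        double[][] out = new double[rows][cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                for (int k = 0; k < a.length; k++) {
                    out[i][j] += a[k][i] * b[k][j];
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        double[][] c = transposeTimes(a, b);
        // A^T * B = [[26, 30], [38, 46]] — not A * B = [[19, 22], [43, 50]]
        System.out.println(c[0][0] + " " + c[0][1] + " " + c[1][0] + " " + c[1][1]);
    }
}
```

[With this reading, Jeff's expression `a.transpose().times(svdT.transpose())` reduces to (Aᵀ)ᵀ·svdTᵀ = A·svdTᵀ, which is why renaming the method to transposeTimes would make the computation far less surprising.]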

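[And for the projection itself: the test above builds sData = A·Vᵀ, where the rows of V are the eigenvectors produced by the Lanczos step, so each input row is reduced from sampleDimension features down to desiredRank − 1 before Canopy/k-means runs. A tiny in-memory illustration — hypothetical class name, toy dimensions, no Mahout dependencies:]

```java
// Plain-Java sketch of the projection in the thread's test: each row of A
// (samples x features) is projected onto the eigenvector rows of V
// (rank x features), giving sData (samples x rank) for clustering.
public class ProjectionDemo {

    static double[][] multiply(double[][] a, double[][] b) {
        double[][] out = new double[a.length][b[0].length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < b[0].length; j++)
                for (int k = 0; k < b.length; k++)
                    out[i][j] += a[i][k] * b[k][j];
        return out;
    }

    static double[][] transpose(double[][] m) {
        double[][] t = new double[m[0].length][m.length];
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < m[0].length; j++)
                t[j][i] = m[i][j];
        return t;
    }

    public static void main(String[] args) {
        // A: 2 samples x 3 features; V: 2 eigenvectors x 3 features.
        double[][] a = {{1, 0, 2}, {0, 3, 0}};
        double[][] v = {{1, 0, 0}, {0, 1, 0}};
        // sData = A * V^T: 2 samples x 2 reduced dimensions.
        double[][] sData = multiply(a, transpose(v));
        System.out.println(sData.length + "x" + sData[0].length);
    }
}
```

[This is also the shape of Ted's remark at the top of the thread: if the goal is to cluster the very data that was decomposed, the explicit matrixmult (and the transposes feeding it) may be avoidable.]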