First, the code is doing a singular value decomposition (SVD). This is
similar to an eigenvector decomposition, except that there is no assumption
that A is square. With the SVD, we have
A = U D V'
This is in contrast to the eigenvector decomposition which has
A = U D U'
The difference lies in having both left and right singular vectors, U and V.
Practically speaking, with the SVD, we can consider A (or any other
appropriately sized matrix) to be either a pile of row vectors or a bunch of
column vectors, and we can project either into the reduced-dimensional
representation using U' on the left or V on the right. In Mahout, the SVD
only gives us one of U or V (I forget which). Most commonly, I consider A
to be row vectors. This means that I can use the fact that V' V = I to do
this:
A V = U D
Now, if I wanted to do this to some new matrix B, I would need to keep V
around in order to do the multiplication. But since I just did the SVD of A,
all I need is U and D, which is what I think the Mahout eigen-solver gives
us. As you say, multiplying by D is trivial because it is diagonal.
Does that make sense?
This is similar to what you said, but differs in some details.
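To make the identity concrete, here is a plain-Python sketch (hand-rolled
matrix helpers for illustration, not Mahout's DistributedRowMatrix): it
builds A = U D V' from hand-picked orthonormal U and V, then checks that
projecting the rows of A with V reproduces U D.

```python
import math

def matmul(x, y):
    # naive multiply of row-major matrices
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))]
            for i in range(len(x))]

def transpose(x):
    return [list(col) for col in zip(*x)]

# Hand-picked factors: U and V are 2x2 rotations (so orthonormal), D diagonal.
c, s = math.cos(0.3), math.sin(0.3)
U = [[c, -s], [s, c]]
c2, s2 = math.cos(1.1), math.sin(1.1)
V = [[c2, -s2], [s2, c2]]
D = [[3.0, 0.0], [0.0, 0.5]]

A = matmul(matmul(U, D), transpose(V))   # A = U D V'
AV = matmul(A, V)                        # project the rows of A
UD = matmul(U, D)                        # what the identity predicts

# V' V = I, so A V = U D V' V = U D (up to rounding)
assert all(abs(AV[i][j] - UD[i][j]) < 1e-12
           for i in range(2) for j in range(2))
```

Projecting a new matrix B is the same `matmul(B, V)` call, which is why V
has to be kept around in that case.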
Note also that it is common to split D into two factors:
A = U sqrt(D) sqrt(D) V'
and then use this as the reduced representation of A:
A V sqrt(D)^-1
The multiplication by the inverse of sqrt(D) is trivial since D is diagonal.
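Since D is diagonal, right-multiplying by D (or by sqrt(D)^-1) never needs a
full matrix multiply; it is just a per-column scaling. A minimal sketch (the
helper name scale_columns is made up for illustration):

```python
import math

def scale_columns(rows, diag):
    # right-multiplying by a diagonal matrix just scales column j by diag[j]
    return [[value * diag[j] for j, value in enumerate(row)] for row in rows]

# e.g. dividing out sqrt(D) after the projection: multiply by sqrt(D)^-1
projected = [[1.0, 2.0], [3.0, 4.0]]     # stand-in for A V
d = [9.0, 4.0]                           # diagonal entries of D
inv_sqrt_d = [1.0 / math.sqrt(x) for x in d]

result = scale_columns(projected, inv_sqrt_d)
# each row costs O(columns), versus O(columns^2) for a dense multiply
```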
On Sat, Sep 11, 2010 at 3:35 PM, Jeff Eastman <[email protected]> wrote:
> Ted,
>
> Is this because, for any matrix A, its eigenvector matrix P and its
> diagonal eigenvalue matrix D, that A P = P D? And P D does not require a
> full matrix multiplication since it is diagonal? Could you please elaborate?
>
> Jeff
>
>
>
> On 9/11/10 2:50 PM, Ted Dunning wrote:
>
>> Should be close. The matrixMult step may be redundant if you want to
>> cluster the same data that you decomposed. That would make the second
>> transpose unnecessary as well.
>>
>> On Sat, Sep 11, 2010 at 2:43 PM, Grant Ingersoll <[email protected]> wrote:
>>
>>
>>> To put this in bin/mahout speak, this would look like, munging some names
>>> and taking liberties with the actual arguments to be passed in:
>>>
>>> bin/mahout svd (original -> svdOut)
>>> bin/mahout cleansvd ...
>>> bin/mahout transpose svdOut -> svdT
>>> bin/mahout transpose original -> originalT
>>> bin/mahout matrixmult originalT svdT -> newMatrix
>>> bin/mahout kmeans newMatrix
>>>
>>> Is that about right?
>>>
>>>
>>> On Sep 3, 2010, at 11:19 AM, Jeff Eastman wrote:
>>>
>>>> Ok, the transposed computation seems to work and the cast exception was
>>>> caused by my unit test writing LongWritable keys to the testdata file. The
>>>> following test produces a comparable answer to the non-distributed case. I
>>>> still want to rename the method to transposeTimes for clarity. And better,
>>>> implement timesTranspose to make this particular computation more
>>>> efficient:
>>>
>>>> public void testKmeansDSVD() throws Exception {
>>>>   DistanceMeasure measure = new EuclideanDistanceMeasure();
>>>>   Path output = getTestTempDirPath("output");
>>>>   Path tmp = getTestTempDirPath("tmp");
>>>>   Path eigenvectors = new Path(output, "eigenvectors");
>>>>   int desiredRank = 13;
>>>>   DistributedLanczosSolver solver = new DistributedLanczosSolver();
>>>>   Configuration config = new Configuration();
>>>>   solver.setConf(config);
>>>>   Path testData = getTestTempDirPath("testdata");
>>>>   int sampleDimension = sampleData.get(0).get().size();
>>>>   solver.run(testData, tmp, eigenvectors, sampleData.size(),
>>>>       sampleDimension, false, desiredRank);
>>>>
>>>>   // now multiply the testdata matrix and the eigenvector matrix
>>>>   DistributedRowMatrix svdT = new DistributedRowMatrix(eigenvectors,
>>>>       tmp, desiredRank - 1, sampleDimension);
>>>>   JobConf conf = new JobConf(config);
>>>>   svdT.configure(conf);
>>>>   DistributedRowMatrix a = new DistributedRowMatrix(testData, tmp,
>>>>       sampleData.size(), sampleDimension);
>>>>   a.configure(conf);
>>>>   DistributedRowMatrix sData = a.transpose().times(svdT.transpose());
>>>>   sData.configure(conf);
>>>>
>>>>   // now run the Canopy job to prime kMeans canopies
>>>>   CanopyDriver.runJob(sData.getRowPath(), output, measure, 8, 4,
>>>>       false, false);
>>>>
>>>>   // now run the KMeans job
>>>>   KMeansDriver.runJob(sData.getRowPath(), new Path(output, "clusters-0"),
>>>>       output, measure, 0.001, 10, 1, true, false);
>>>>
>>>>   // run ClusterDumper
>>>>   ClusterDumper clusterDumper = new ClusterDumper(new Path(output,
>>>>       "clusters-2"), new Path(output, "clusteredPoints"));
>>>>   clusterDumper.printClusters(termDictionary);
>>>> }
>>>>
>>>> On 9/3/10 7:54 AM, Jeff Eastman wrote:
>>>>
>>>>> Looking at the single unit test of DMR.times() it seems to be
>>>>> implementing Matrix expected = inputA.transpose().times(inputB), and not
>>>>> inputA.times(inputB.transpose()), so the bounds checking is correct as
>>>>> implemented. But the method still has the wrong name and AFAICT is not
>>>>> useful for performing this particular computation. Should I use this
>>>>> instead?
>>>>>
>>>>> DistributedRowMatrix sData =
>>>>>     a.transpose().t[ransposeT]imes(svdT.transpose())
>>>
>>>>> ugh! And it still fails with:
>>>>>
>>>>> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot
>>>>> be cast to org.apache.hadoop.io.IntWritable
>>>>>   at org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:1)
>>>>>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>>>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>>>>>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>>>>>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct
>>> 7-8
>>>
>>>
>>>
>