I really should have chimed in in this thread earlier... sorry folks.

  Regarding the relation between eigenvector decomposition and SVD:  The
Lanczos algorithm is specifically for eigenvector decomposition (and hence
works on symmetric square matrices), and operates by looking at the set of
vectors { v, Av, A(Av), A(A(Av)), ... } and doing some clever recombination
on them.  To adapt this to do SVD, you just substitute A = C' * C, for any
(non-square) matrix C.  You get a set of eigenvectors of C' * C, which in
turn are the right-singular vectors of C (what is called V, if C = U D V').
 In particular, if you are representing C as rows of vectors of
dimensionality numColumns, then the singular vectors you get out are also of
dimensionality numColumns (i.e. they live in the same vector space as your
"documents", because they are mixtures of them).
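
  To make that concrete, here is a hypothetical, self-contained sketch
(plain Java, no Mahout, all names invented), with simple power iteration
standing in for Lanczos: the eigenvector it finds for C' * C satisfies
C'C v = sigma^2 v, i.e. it is a right-singular vector of C.

```java
// Hypothetical illustration (not Mahout code): power iteration stands in
// for Lanczos, and shows that an eigenvector of C' * C is a right-singular
// vector of C, with eigenvalue sigma^2.
public class GramianEigen {

  // y = M x for a dense square matrix M.
  static double[] apply(double[][] m, double[] x) {
    double[] y = new double[m.length];
    for (int i = 0; i < m.length; i++) {
      for (int j = 0; j < x.length; j++) {
        y[i] += m[i][j] * x[j];
      }
    }
    return y;
  }

  // The Gramian C' * C, formed explicitly here for clarity; Lanczos would
  // only ever apply it to a vector.
  static double[][] gramian(double[][] c) {
    int n = c[0].length;
    double[][] g = new double[n][n];
    for (double[] row : c) {
      for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
          g[i][j] += row[i] * row[j];
        }
      }
    }
    return g;
  }

  static double[] normalize(double[] v) {
    double norm = 0;
    for (double x : v) {
      norm += x * x;
    }
    norm = Math.sqrt(norm);
    double[] u = new double[v.length];
    for (int i = 0; i < v.length; i++) {
      u[i] = v[i] / norm;
    }
    return u;
  }

  // Dominant eigenvector of a symmetric matrix, by power iteration.
  static double[] topEigenvector(double[][] m) {
    double[] v = new double[m.length];
    v[0] = 1;
    for (int iter = 0; iter < 500; iter++) {
      v = normalize(apply(m, v));
    }
    return v;
  }

  public static void main(String[] args) {
    double[][] c = {{1, 2}, {3, 4}, {5, 6}};  // non-square: 3 "documents" x 2 terms
    double[][] g = gramian(c);                // 2 x 2, symmetric
    double[] v = topEigenvector(g);           // top right-singular vector of C
    double[] gv = apply(g, v);
    double sigmaSq = gv[0] / v[0];            // eigenvalue estimate
    for (int i = 0; i < v.length; i++) {
      if (Math.abs(gv[i] - sigmaSq * v[i]) > 1e-6) {
        throw new AssertionError("not an eigenvector of C'C");
      }
    }
    System.out.printf("top singular value of C = %.4f%n", Math.sqrt(sigmaSq));
  }
}
```

The same v would come out of Lanczos; the toy matrix is just small enough to
check by hand.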

  Regarding DistributedMatrix.times(DistributedMatrix other), yes, it should
be called transposeTimes(), but unfortunately, timesTranspose is *not* that
efficient, in terms of Map-Reduce passes.  If you can come up with a way to
do a single MR pass to compute timesTranspose(), or simply times(), by all
means, add it!  But the only matrix multiplication I could do on two large
sparse matrices in one pass over the data was what is currently in there.
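
  For anyone tempted to try: the reason the current times() (really A' * B)
fits in a single pass is that row i of A and row i of B share a key, and A'B
is just the sum of the per-row outer products.  Here is a hypothetical
plain-Java sketch of that idea (invented names, no Hadoop):

```java
// Hypothetical sketch (plain Java, no Hadoop): A' * B computed in one pass
// over co-keyed rows.  Row i of A and row i of B contribute the outer
// product a_i b_i', and A'B is the sum of those contributions -- which is
// why one map pass plus a summing reduce suffices.
public class OnePassTransposeTimes {

  static double[][] transposeTimes(double[][] a, double[][] b) {
    double[][] result = new double[a[0].length][b[0].length];
    for (int row = 0; row < a.length; row++) {    // the single "map" pass
      for (int i = 0; i < a[row].length; i++) {
        for (int j = 0; j < b[row].length; j++) {
          result[i][j] += a[row][i] * b[row][j];  // outer(a_row, b_row)
        }
      }
    }
    return result;
  }

  public static void main(String[] args) {
    double[][] a = {{1, 2}, {0, 1}, {3, 0}};           // 3 x 2
    double[][] b = {{1, 0, 2}, {1, 1, 0}, {0, 2, 1}};  // 3 x 3
    double[][] p = transposeTimes(a, b);               // 2 x 3
    // Spot-check against A' * B by hand: row 0 of A'B is
    // 1*[1,0,2] + 0*[1,1,0] + 3*[0,2,1] = [1, 6, 5].
    if (p[0][0] != 1 || p[0][1] != 6 || p[0][2] != 5) {
      throw new AssertionError("mismatch");
    }
    System.out.println("A'B row 0 = [" + p[0][0] + ", " + p[0][1] + ", " + p[0][2] + "]");
  }
}
```

timesTranspose() has no such per-key decomposition, which is why it costs
extra passes.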

  Mahout DistributedRowMatrix does *not* currently have the very easy
"multiply this matrix by a diagonal matrix", or more generally "multiply by
this matrix which is so small as to be able to fit in memory in all of the
mappers".  That would be a very helpful addition.
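
  If someone picks this up: the reason it's easy is that each output row
depends only on the corresponding input row, so it would be a single
map-only pass with the small matrix loaded into every mapper.  A
hypothetical per-row kernel (plain Java, invented names) might look like:

```java
// Hypothetical per-row kernel for "times a small in-memory matrix": each
// output row depends on exactly one input row, so distributed, this would
// be a map-only job with the small matrix broadcast to every mapper.
public class TimesSmallMatrix {

  // Diagonal case: column j is just scaled by d[j].
  static double[] timesDiagonal(double[] row, double[] d) {
    double[] out = new double[row.length];
    for (int j = 0; j < row.length; j++) {
      out[j] = row[j] * d[j];
    }
    return out;
  }

  // General case: row (1 x n) times small (n x k) -> (1 x k).
  static double[] timesSmall(double[] row, double[][] small) {
    double[] out = new double[small[0].length];
    for (int j = 0; j < row.length; j++) {
      for (int k = 0; k < small[0].length; k++) {
        out[k] += row[j] * small[j][k];
      }
    }
    return out;
  }

  public static void main(String[] args) {
    double[] scaled = timesDiagonal(new double[] {1, 2, 3},
                                    new double[] {2, 0.5, 1});
    System.out.println(scaled[0] + ", " + scaled[1] + ", " + scaled[2]);
    // prints 2.0, 1.0, 3.0
  }
}
```

No shuffle, no reduce: this is exactly the "multiply by D or D^-1" step
mentioned further down.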

  Regarding correctness of the matrix multiplication: isn't there, right
now, a unit test that compares DistributedRowMatrix.times() to
DenseMatrix.transpose().times() and checks that the results are the same
for small matrices?  It should be in TestDistributedRowMatrix or
something...

  As for outputting eigenvalues (or singular values):
DistributedLanczosSolver sometimes produces "extra" eigenvectors, whose
eigenvalues aren't valid, and it also scales all of the eigenvalues down by
the max eigenvalue (to avoid floating point overflow), so the step which
spits out the nice, correctly scaled (and non-spurious) eigenvector/value
pairs is the one done by the "cleaneigens" shell script step (cf.
EigenVerificationJob).  This is when EigenVector instances are created.
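
  Roughly speaking (this is a paraphrase of the idea, not
EigenVerificationJob's actual code): a candidate v survives cleaning if Av
is nearly parallel to v, and its correctly scaled eigenvalue can be
recovered as the Rayleigh quotient, independent of how v was scaled:

```java
// Paraphrase of eigen-verification (not EigenVerificationJob's code): a
// candidate v is a genuine eigenvector if A v is (nearly) parallel to v,
// and its properly scaled eigenvalue is the Rayleigh quotient v'Av / v'v.
public class EigenCheck {

  static double[] apply(double[][] a, double[] v) {
    double[] y = new double[a.length];
    for (int i = 0; i < a.length; i++) {
      for (int j = 0; j < v.length; j++) {
        y[i] += a[i][j] * v[j];
      }
    }
    return y;
  }

  static double dot(double[] x, double[] y) {
    double s = 0;
    for (int i = 0; i < x.length; i++) {
      s += x[i] * y[i];
    }
    return s;
  }

  // cos of the angle between A v and v; 1.0 means v is exactly an eigenvector,
  // so spurious vectors show up as values noticeably below 1.
  static double cosAngle(double[][] a, double[] v) {
    double[] av = apply(a, v);
    return dot(av, v) / Math.sqrt(dot(av, av) * dot(v, v));
  }

  // The correctly scaled eigenvalue, regardless of any down-scaling of v.
  static double rayleigh(double[][] a, double[] v) {
    return dot(apply(a, v), v) / dot(v, v);
  }

  public static void main(String[] args) {
    double[][] a = {{2, 1}, {1, 2}};
    double[] good = {1, 1};  // true eigenvector, eigenvalue 3
    double[] bad = {1, 0};   // not an eigenvector
    System.out.println(cosAngle(a, good));  // 1.0
    System.out.println(rayleigh(a, good));  // 3.0
    System.out.println(cosAngle(a, bad));   // ~0.894, rejected
  }
}
```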

  It would make sense for EigenVerificationJob to be folded into
DistributedLanczosSolver, to reduce confusion.  Maybe add a solver flag on
whether you care about the "cleaning" step.

  One more note:  if you really want to get the projection of your original
matrix's rows onto the SVD basis (a common desire for dimensionality
reduction), then what you want is not V, but U': U is a set of k vectors,
each of which has dimensionality numRows (i.e. "numDocuments" /
"numUsers"), but what you want is numRows vectors, each of dimension k
(these numRows rows have had their dimensionality reduced from sparse
numColumns size down to dense k), which is U'.

  To get U', instead of V, just do DistributedRowMatrix.transpose() on your
*original* matrix before feeding it to the DistributedLanczosSolver.  You'll
get U, and to get U', just take one more transpose of the
DistributedRowMatrix represented by the k-eigenvectors (possibly you want to
do the "multiply by D^-1 or D" step here too):

  C = U D V'  --> C' C = V D U' U D V' = V D^2 V', and
  C' = V D U' --> C C' = U D V' V D U' = U D^2 U'
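
  This is easy to sanity-check numerically: the nonzero eigenvalues of C'C
and CC' are the same sigma^2 values, so solving on the transposed input
recovers the same spectrum, just with the other family of singular vectors.
A hypothetical plain-Java check (invented names, power iteration in place
of Lanczos):

```java
// Hypothetical sanity check (plain Java): C'C and CC' share their nonzero
// eigenvalues (the squared singular values of C), which is why feeding the
// solver C' yields the U-side vectors with the same spectrum.
public class TransposeTrick {

  static double[][] transpose(double[][] m) {
    double[][] t = new double[m[0].length][m.length];
    for (int i = 0; i < m.length; i++) {
      for (int j = 0; j < m[0].length; j++) {
        t[j][i] = m[i][j];
      }
    }
    return t;
  }

  // M' * M, formed densely for this toy example.
  static double[][] gramian(double[][] m) {
    int n = m[0].length;
    double[][] g = new double[n][n];
    for (double[] row : m) {
      for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
          g[i][j] += row[i] * row[j];
        }
      }
    }
    return g;
  }

  // Top eigenvalue of a symmetric matrix, by power iteration plus a
  // Rayleigh quotient on the converged vector.
  static double topEigenvalue(double[][] m) {
    double[] v = new double[m.length];
    v[0] = 1;
    for (int iter = 0; iter < 500; iter++) {
      double[] w = new double[m.length];
      double norm = 0;
      for (int i = 0; i < m.length; i++) {
        for (int j = 0; j < m.length; j++) {
          w[i] += m[i][j] * v[j];
        }
        norm += w[i] * w[i];
      }
      norm = Math.sqrt(norm);
      for (int i = 0; i < m.length; i++) {
        v[i] = w[i] / norm;
      }
    }
    double num = 0;
    for (int i = 0; i < m.length; i++) {
      double mi = 0;
      for (int j = 0; j < m.length; j++) {
        mi += m[i][j] * v[j];
      }
      num += mi * v[i];
    }
    return num;  // v is unit-norm, so this is v' M v
  }

  public static void main(String[] args) {
    double[][] c = {{1, 2}, {3, 4}, {5, 6}};
    double viaCtC = topEigenvalue(gramian(c));             // C' C, 2 x 2
    double viaCCt = topEigenvalue(gramian(transpose(c)));  // C C', 3 x 3
    if (Math.abs(viaCtC - viaCCt) > 1e-6) {
      throw new AssertionError("spectra differ");
    }
    System.out.printf("sigma_1^2 = %.4f from either side%n", viaCtC);
  }
}
```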

  Overall, however, this only saves you one map reduce pass, because the
original way of getting U' was to: take SVD (k MR passes), then transpose
both the output (1 MR pass) and input (1 MR pass), and then multiply them
together (1 MR pass) = k+3 MR passes.  The new way is: take transpose (1 MR
pass), then take SVD (k MR passes), then transpose the output (1 MR pass) =
k + 2 MR passes.

  Of course, it all depends on what you need.  I've often wanted to hang
onto the V-vectors, because they are useful for "folding-in" new data into a
larger approximate SVD, and they already have the form of "pseudo-documents"
and can give you a feel for how you've clustered your input features.

  -jake

On Sat, Sep 11, 2010 at 6:18 PM, Jeff Eastman <[email protected]> wrote:

>  +1 Ok, this makes so much sense now. It had been bothering me that U D
> would not have the same number of rows as A in the non-square case - our
> typical. It's interesting that the
> DistributedLanczosSolver.serializeOutput() method does not actually
> serialize the eigenvalues. I've fixed that locally using NamedVector but
> something like WeightedVectorWritable (not itself a Vector) would be better.
> There's also EigenVector objects which might be used here.
>
>
>
> On 9/11/10 3:57 PM, Ted Dunning wrote:
>
>> Well, firstly the code is doing a singular value decomposition.  This is
>> similar to an eigenvector decomposition except that there is no assumption
>> that A is square.  With the SVD, we have
>>
>>      A = U D V'
>>
>> This is in contrast to the eigenvector decomposition which has
>>
>>     A = U D U'
>>
>> The difference lies in having both left and right singular vectors, U and
>> V.
>>
>> Practically speaking, with the SVD decomposition, we can consider A or any
>> other appropriate matrix to be either a pile of row vectors or a bunch of
>> column vectors and we can project either into the reduced dimensional
>> representation using U' on the left or V on the right.  In Mahout, the SVD
>> only gives us one of U or V (I forget which).  Most commonly, I consider A
>> to be row vectors.  This means that I can use the fact that V' V = I to do
>> this:
>>
>>      A V = U D
>>
>> Now, if I wanted to do this to some new matrix B, I would need to keep V
>> around in order to do the multiplication.  But since I just did the SVD of
>> A, all I need is U and D, which is what I think the Mahout eigen-solver
>> gives us.  As you say, multiplying by D is trivial because it is diagonal.
>>
>> Does that make sense?
>>
>> This is similar to what you said, but differs in some details.
>>
>> Note also that it is common to split D into two factors:
>>
>>     A = U sqrt(D) sqrt(D) V'
>>
>> and then use this for the decomposition of A
>>
>>     A V sqrt(D)^-1
>>
>> The multiplication by the inverse of sqrt(D) is trivial since D is
>> diagonal.
>>
>>
>> On Sat, Sep 11, 2010 at 3:35 PM, Jeff Eastman <[email protected]> wrote:
>>
>>> Ted,
>>>
>>> Is this because, for any matrix A, its eigenvector matrix P and its
>>> diagonal eigenvalue matrix D, that A P = P D? And P D does not require a
>>> full matrix multiplication since it is diagonal? Could you please
>>> elaborate?
>>>
>>> Jeff
>>>
>>>
>>>
>>> On 9/11/10 2:50 PM, Ted Dunning wrote:
>>>
>>>> Should be close.  The matrixMult step may be redundant if you want to
>>>> cluster the same data that you decomposed.  That would make the second
>>>> transpose unnecessary as well.
>>>>
>>>> On Sat, Sep 11, 2010 at 2:43 PM, Grant Ingersoll <[email protected]> wrote:
>>>>
>>>>> To put this in bin/mahout speak, this would look like, munging some
>>>>> names and taking liberties with the actual argument to be passed in:
>>>>>
>>>>> bin/mahout svd (original ->   svdOut)
>>>>> bin/mahout cleansvd ...
>>>>> bin/mahout transpose svdOut ->   svdT
>>>>> bin/mahout transpose original ->   originalT
>>>>> bin/mahout matrixmult originalT svdT ->   newMatrix
>>>>> bin/mahout kmeans newMatrix
>>>>>
>>>>> Is that about right?
>>>>>
>>>>>
>>>>> On Sep 3, 2010, at 11:19 AM, Jeff Eastman wrote:
>>>>>
>>>>>> Ok, the transposed computation seems to work and the cast exception was
>>>>>> caused by my unit test writing LongWritable keys to the testdata file.
>>>>>> The following test produces a comparable answer to the non-distributed
>>>>>> case.  I still want to rename the method to transposeTimes for clarity.
>>>>>> And better, implement timesTranspose to make this particular computation
>>>>>> more efficient:
>>>>>
>>>>>>   public void testKmeansDSVD() throws Exception {
>>>>>>     DistanceMeasure measure = new EuclideanDistanceMeasure();
>>>>>>     Path output = getTestTempDirPath("output");
>>>>>>     Path tmp = getTestTempDirPath("tmp");
>>>>>>     Path eigenvectors = new Path(output, "eigenvectors");
>>>>>>     int desiredRank = 13;
>>>>>>     DistributedLanczosSolver solver = new DistributedLanczosSolver();
>>>>>>     Configuration config = new Configuration();
>>>>>>     solver.setConf(config);
>>>>>>     Path testData = getTestTempDirPath("testdata");
>>>>>>     int sampleDimension = sampleData.get(0).get().size();
>>>>>>     solver.run(testData, tmp, eigenvectors, sampleData.size(),
>>>>>>         sampleDimension, false, desiredRank);
>>>>>>
>>>>>>     // now multiply the testdata matrix and the eigenvector matrix
>>>>>>     DistributedRowMatrix svdT = new DistributedRowMatrix(eigenvectors,
>>>>>>         tmp, desiredRank - 1, sampleDimension);
>>>>>>     JobConf conf = new JobConf(config);
>>>>>>     svdT.configure(conf);
>>>>>>     DistributedRowMatrix a = new DistributedRowMatrix(testData, tmp,
>>>>>>         sampleData.size(), sampleDimension);
>>>>>>     a.configure(conf);
>>>>>>     DistributedRowMatrix sData = a.transpose().times(svdT.transpose());
>>>>>>     sData.configure(conf);
>>>>>>
>>>>>>     // now run the Canopy job to prime kMeans canopies
>>>>>>     CanopyDriver.runJob(sData.getRowPath(), output, measure, 8, 4,
>>>>>>         false, false);
>>>>>>
>>>>>>     // now run the KMeans job
>>>>>>     KMeansDriver.runJob(sData.getRowPath(), new Path(output,
>>>>>>         "clusters-0"), output, measure, 0.001, 10, 1, true, false);
>>>>>>
>>>>>>     // run ClusterDumper
>>>>>>     ClusterDumper clusterDumper = new ClusterDumper(new Path(output,
>>>>>>         "clusters-2"), new Path(output, "clusteredPoints"));
>>>>>>     clusterDumper.printClusters(termDictionary);
>>>>>>   }
>>>>>> On 9/3/10 7:54 AM, Jeff Eastman wrote:
>>>>>>
>>>>>>> Looking at the single unit test of DMR.times() it seems to be
>>>>>>> implementing Matrix expected = inputA.transpose().times(inputB), and
>>>>>>> not inputA.times(inputB.transpose()), so the bounds checking is
>>>>>>> correct as implemented. But the method still has the wrong name and
>>>>>>> AFAICT is not useful for performing this particular computation.
>>>>>>> Should I use this instead?
>>>>>>>
>>>>>>>   DistributedRowMatrix sData =
>>>>>>>       a.transpose().t[ransposeT]imes(svdT.transpose())
>>>>>>>
>>>>>>> ugh! And it still fails with:
>>>>>>>
>>>>>>> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot
>>>>>>> be cast to org.apache.hadoop.io.IntWritable
>>>>>>>     at org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:1)
>>>>>>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>>>>>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>>>>>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>>>>>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>>>>>>
>>>>> --------------------------
>>>>> Grant Ingersoll
>>>>> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct
>>>>> 7-8
>>>>>
>>>>>
>>>>>
>>>>>
>
