On Sep 12, 2010, at 3:12 PM, Jeff Eastman wrote:
> Wow, thanks Jake, that is really definitive. There's a lot of gears whirring
> silently under the hood here. So Grant's command line script would need to
> add a "mahout cleaneigens <...>" line
I believe Jake is referring to the cleansvd command, which is in my list.
> to really get the valid eigenvectors (or use the second approach). And too,
> that step will yield the eigenvalues so that either decomposition approach
> may be used. The problem I'm seeing with doing this programmatically is there
> is no ready Java method (e.g. job() called by run() in the clustering stuff)
> which I could use.
>
> +1 on folding EigenVerificationJob into DistributedLanczosSolver. Or, at
> least implement a job() method on EVJ.
>
> +1 on renaming DistributedMatrix.times() to transposeTimes() to avoid
> confusion.
>
> +1 on adding DistributedMatrix.timesDiagonal(Matrix) and perhaps also
> timesDiagonal(Vector)? Perhaps after the 20.2 retrofit?
>
> On the multiplication correctness, you are right: the unit test compares the
> results of a.transpose().times(b).
>
> Thanks for the post,
> Jeff
>
>
> On 9/11/10 8:21 PM, Jake Mannix wrote:
>> I really should have chimed in in this thread earlier... sorry folks.
>>
>> Regarding the relation between eigenvector decomposition and SVD: The
>> Lanczos algorithm is specifically for eigenvector decomposition (and hence
>> works on symmetric square matrices), and operates by looking at the set of
>> vectors { v, Av, A(Av), A(A(Av)), ... } and doing some clever recombination
>> on them. To adapt this to do SVD, you just substitute A = C' * C, for any
>> (non-square) matrix C. You get a set of eigenvectors of C' * C, which in
>> turn are the right-singular vectors of C (what is called V, if C = U D V').
>> In particular, if you are representing C as rows of vectors of
>> dimensionality numColumns, then the singular vectors you get out are also of
>> dimensionality numColumns (i.e. they live in the same vector space as your
>> "documents", because they are mixtures of them).
>>
>> Regarding DistributedMatrix.times(DistributedMatrix other), yes, it should
>> be called transposeTimes(), but unfortunately, timesTranspose is *not* that
>> efficient, in terms of Map-Reduce passes. If you can come up with a way to
>> do a single MR pass to compute timesTranspose(), or simply times(), by all
>> means, add it! But the only matrix multiplication I could do on two large
>> sparse matrices in one pass over the data was what is currently in there.
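Why transposeTimes fits in one pass can be sketched in plain Java (illustrative names, not the actual DistributedRowMatrix code): A' * B is the sum over row indices of the outer products a_i b_i', so a single pass over co-partitioned rows suffices, with the summation playing the role of the reduce.

```java
public class TransposeTimesToy {
  // A' * B as a sum of outer products of corresponding rows: each row pair
  // (a_i, b_i) contributes a_i b_i'. This mirrors a single MR pass where the
  // map sees row i of both matrices and the reduce sums the contributions.
  public static double[][] transposeTimes(double[][] a, double[][] b) {
    int n = a[0].length, m = b[0].length;
    double[][] out = new double[n][m];
    for (int row = 0; row < a.length; row++)   // one "map" call per row index
      for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
          out[i][j] += a[row][i] * b[row][j];  // partial outer product, summed
    return out;
  }

  public static void main(String[] args) {
    double[][] a = {{1, 2}, {3, 4}};
    double[][] b = {{5, 6}, {7, 8}};
    double[][] p = transposeTimes(a, b);       // = A' B
    assert p[0][0] == 26 && p[0][1] == 30 && p[1][0] == 38 && p[1][1] == 44;
  }
}
```

By contrast, timesTranspose() needs dot products of every row of A with every row of B, which is not a single co-partitioned pass over the rows.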
>>
>> Mahout DistributedRowMatrix does *not* currently have the very easy
>> "multiply this matrix by a diagonal matrix", or more generally "multiply by
>> this matrix which is so small as to be able to fit in memory in all of the
>> mappers". That would be a very helpful addition.
>>
>> Regarding correctness of the matrix multiplication: isn't there, right
>> now, a unit test comparing DistributedRowMatrix.times() to
>> DenseMatrix.transpose().times(), checking that the results are the same
>> for small matrices? It should be in TestDistributedRowMatrix or
>> something...
>>
>> As for outputting eigenvalues (or singular values),
>> DistributedLanczosSolver sometimes produces "extra" eigenvectors, whose
>> eigenvalues aren't valid, and also scales all of the eigenvalues down by the
>> max eigenvalue (to avoid floating point overflow), and so the step which
>> spits out the nice correctly scaled (and non-spurious) eigenvector/value
>> pairs is that done by the "cleaneigens" shell script step (c.f.
>> EigenVerificationJob). This is when EigenVector instances are created.
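A toy version of that verification step (plain Java, illustrative names; EigenVerificationJob's actual test differs in detail): estimate the eigenvalue with the Rayleigh quotient, then keep only the vectors for which A v is nearly parallel to v, discarding the spurious ones.

```java
public class CleanEigensToy {
  public static double[] times(double[][] m, double[] x) {
    double[] y = new double[m.length];
    for (int i = 0; i < m.length; i++)
      for (int j = 0; j < x.length; j++)
        y[i] += m[i][j] * x[j];
    return y;
  }

  public static double dot(double[] x, double[] y) {
    double s = 0;
    for (int i = 0; i < x.length; i++) s += x[i] * y[i];
    return s;
  }

  // Estimate lambda from v via the Rayleigh quotient, then accept v only if
  // the residual A v - lambda v is small relative to ||A v||.
  public static boolean looksLikeEigenvector(double[][] a, double[] v, double tol) {
    double[] av = times(a, v);
    double lambda = dot(av, v) / dot(v, v);
    double err = 0;
    for (int i = 0; i < v.length; i++) {
      double d = av[i] - lambda * v[i];
      err += d * d;
    }
    return Math.sqrt(err) <= tol * Math.sqrt(dot(av, av));
  }

  public static void main(String[] args) {
    double[][] a = {{2, 1}, {1, 2}};  // symmetric; eigenvalues 3 and 1
    assert looksLikeEigenvector(a, new double[]{1, 1}, 1e-9);   // genuine
    assert !looksLikeEigenvector(a, new double[]{1, 0}, 1e-2);  // spurious
  }
}
```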
>>
>> It would make sense for EigenVerificationJob to be folded into
>> DistributedLanczosSolver, to reduce confusion. Maybe add a solver flag on
>> whether you care about the "cleaning" step.
>>
>> One more note: if you really want to get the projection of your original
>> matrix's rows onto the SVD basis (a common desire for dimensionality
>> reduction), then what you want is not V, but U' : U is a set of k vectors,
>> each of which has numRows (i.e. "numDocuments" / "numUsers")
>> dimensionality, but what you want is numRows vectors, each of dimension k
>> (these numRows rows have had their dimensionality reduced from sparse
>> numColumns size down to dense k), which is U'.
>>
>> To get U', instead of V, just do DistributedRowMatrix.transpose() on your
>> *original* matrix before feeding it to the DistributedLanczosSolver. You'll
>> get U, and to get U', just take one more transpose of the
>> DistributedRowMatrix represented by the k-eigenvectors (possibly you want to
>> do the "multiply by D^-1 or D" step here too):
>>
>> C = U D V' --> C' C = V D U' U D V' = V D^2 V', and
>> C' = V D U' --> C C' = U D V' V D U' = U D^2 U'
>>
>> Overall, however, this only saves you one map reduce pass, because the
>> original way of getting U' was to: take SVD (k MR passes), then transpose
>> both the output (1 MR pass) and input (1 MR pass), and then multiply them
>> together (1 MR pass) = k+3 MR passes. The new way is: take transpose (1 MR
>> pass), then take SVD (k MR passes), then transpose the output (1 MR pass) =
>> k + 2 MR passes.
>>
>> Of course, it all depends on what you need. I've often wanted to hang
>> onto the V-vectors, because they are useful for "folding-in" new data into a
>> larger approximate SVD, and they already have the form of "pseudo-documents"
>> and can give you a feel for how you've clustered your input features.
>>
>> -jake
>>
>> On Sat, Sep 11, 2010 at 6:18 PM, Jeff Eastman <[email protected]> wrote:
>>
>>> +1 Ok, this makes so much sense now. It had been bothering me that U D
>>> would not have the same number of rows as A in the non-square case (our
>>> typical one). It's interesting that the
>>> DistributedLanczosSolver.serializeOutput() method does not actually
>>> serialize the eigenvalues. I've fixed that locally using NamedVector but
>>> something like WeightedVectorWritable (not itself a Vector) would be better.
>>> There are also EigenVector objects which might be used here.
>>>
>>>
>>>
>>> On 9/11/10 3:57 PM, Ted Dunning wrote:
>>>
>>>> Well, firstly the code is doing a singular value decomposition. This is
>>>> similar to an eigenvector decomposition except that there is no assumption
>>>> that A is square. With the SVD, we have
>>>>
>>>> A = U D V'
>>>>
>>>> This is in contrast to the eigenvector decomposition which has
>>>>
>>>> A = U D U'
>>>>
>>>> The difference lies in having both left and right singular vectors, U and
>>>> V.
>>>>
>>>> Practically speaking, with the SVD decomposition, we can consider A or any
>>>> other appropriate matrix to be either a pile of row vectors or a bunch of
>>>> column vectors and we can project either into the reduced dimensional
>>>> representation using U' on the left or V on the right. In Mahout, the SVD
>>>> only gives us one of U or V (I forget which). Most commonly, I consider A
>>>> to be row vectors. This means that I can use the fact that V' V = I to do
>>>> this:
>>>>
>>>> A V = U D
>>>>
>>>> Now, if I wanted to do this to some new matrix B, I would need to keep V
>>>> around in order to do the multiplication. But since I just did the SVD of
>>>> A, all I need is U and D, which is what I think the Mahout eigen-solver
>>>> gives us. As you say, multiplying by D is trivial because it is diagonal.
>>>>
>>>> Does that make sense?
>>>>
>>>> This is similar to what you said, but differs in some details.
>>>>
>>>> Note also that it is common to split D into two factors:
>>>>
>>>> A = U sqrt(D) sqrt(D) V'
>>>>
>>>> and then use this for the decomposition of A
>>>>
>>>> A V sqrt(D)^-1
>>>>
>>>> The multiplication by the inverse of sqrt(D) is trivial since D is
>>>> diagonal.
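Ted's identities are easy to sanity-check numerically in plain Java (illustrative class name): build A = U D V' from orthonormal U and V and diagonal D, and confirm that A V = U D because V' V = I.

```java
public class SvdIdentityToy {
  public static double[][] mul(double[][] x, double[][] y) {
    double[][] z = new double[x.length][y[0].length];
    for (int i = 0; i < x.length; i++)
      for (int k = 0; k < y.length; k++)
        for (int j = 0; j < y[0].length; j++)
          z[i][j] += x[i][k] * y[k][j];
    return z;
  }

  public static double[][] transpose(double[][] x) {
    double[][] t = new double[x[0].length][x.length];
    for (int i = 0; i < x.length; i++)
      for (int j = 0; j < x[0].length; j++)
        t[j][i] = x[i][j];
    return t;
  }

  // Any 2x2 rotation matrix is orthonormal, so it serves as a toy U or V.
  public static double[][] rotation(double theta) {
    return new double[][]{{Math.cos(theta), -Math.sin(theta)},
                          {Math.sin(theta),  Math.cos(theta)}};
  }

  public static void main(String[] args) {
    double[][] u = rotation(0.3), v = rotation(1.1);
    double[][] d = {{3, 0}, {0, 1}};
    double[][] a = mul(mul(u, d), transpose(v)); // A = U D V'
    double[][] av = mul(a, v);                   // A V = U D V' V = U D
    double[][] ud = mul(u, d);
    for (int i = 0; i < 2; i++)
      for (int j = 0; j < 2; j++)
        assert Math.abs(av[i][j] - ud[i][j]) < 1e-12;
  }
}
```

Multiplying A V on the right by sqrt(D)^-1 (a diagonal scaling of the columns) then gives U sqrt(D), the split Ted mentions.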
>>>>
>>>>
>>>> On Sat, Sep 11, 2010 at 3:35 PM, Jeff Eastman<[email protected]
>>>>> wrote:
>>>> Ted,
>>>>> Is this because, for any matrix A, its eigenvector matrix P and its
>>>>> diagonal eigenvalue matrix D, that A P = P D? And P D does not require a
>>>>> full matrix multiplication since it is diagonal? Could you please
>>>>> elaborate?
>>>>>
>>>>> Jeff
>>>>>
>>>>>
>>>>>
>>>>> On 9/11/10 2:50 PM, Ted Dunning wrote:
>>>>>
>>>>>> Should be close. The matrixMult step may be redundant if you want to
>>>>>> cluster the same data that you decomposed. That would make the second
>>>>>> transpose unnecessary as well.
>>>>>>
>>>>>> On Sat, Sep 11, 2010 at 2:43 PM, Grant Ingersoll<[email protected]
>>>>>>
>>>>>>> wrote:
>>>>>>>
>>>>>> To put this in bin/mahout speak, this would look like, munging some
>>>>>> names
>>>>>>
>>>>>>> and taking liberties with the actual argument to be passed in:
>>>>>>>
>>>>>>> bin/mahout svd (original -> svdOut)
>>>>>>> bin/mahout cleansvd ...
>>>>>>> bin/mahout transpose svdOut -> svdT
>>>>>>> bin/mahout transpose original -> originalT
>>>>>>> bin/mahout matrixmult originalT svdT -> newMatrix
>>>>>>> bin/mahout kmeans newMatrix
>>>>>>>
>>>>>>> Is that about right?
>>>>>>>
>>>>>>>
>>>>>>> On Sep 3, 2010, at 11:19 AM, Jeff Eastman wrote:
>>>>>>>
>>>>>>> Ok, the transposed computation seems to work and the cast exception
>>>>>>> was caused by my unit test writing LongWritable keys to the testdata
>>>>>>> file. The following test produces a comparable answer to the
>>>>>>> non-distributed case. I still want to rename the method to
>>>>>>> transposeTimes for clarity. And better, implement timesTranspose to
>>>>>>> make this particular computation more efficient:
>>>>>>>> public void testKmeansDSVD() throws Exception {
>>>>>>>>   DistanceMeasure measure = new EuclideanDistanceMeasure();
>>>>>>>>   Path output = getTestTempDirPath("output");
>>>>>>>>   Path tmp = getTestTempDirPath("tmp");
>>>>>>>>   Path eigenvectors = new Path(output, "eigenvectors");
>>>>>>>>   int desiredRank = 13;
>>>>>>>>   DistributedLanczosSolver solver = new DistributedLanczosSolver();
>>>>>>>>   Configuration config = new Configuration();
>>>>>>>>   solver.setConf(config);
>>>>>>>>   Path testData = getTestTempDirPath("testdata");
>>>>>>>>   int sampleDimension = sampleData.get(0).get().size();
>>>>>>>>   solver.run(testData, tmp, eigenvectors, sampleData.size(),
>>>>>>>>       sampleDimension, false, desiredRank);
>>>>>>>>   // now multiply the testdata matrix and the eigenvector matrix
>>>>>>>>   DistributedRowMatrix svdT = new DistributedRowMatrix(eigenvectors,
>>>>>>>>       tmp, desiredRank - 1, sampleDimension);
>>>>>>>>   JobConf conf = new JobConf(config);
>>>>>>>>   svdT.configure(conf);
>>>>>>>>   DistributedRowMatrix a = new DistributedRowMatrix(testData, tmp,
>>>>>>>>       sampleData.size(), sampleDimension);
>>>>>>>>   a.configure(conf);
>>>>>>>>   DistributedRowMatrix sData = a.transpose().times(svdT.transpose());
>>>>>>>>   sData.configure(conf);
>>>>>>>>
>>>>>>>>   // now run the Canopy job to prime kMeans canopies
>>>>>>>>   CanopyDriver.runJob(sData.getRowPath(), output, measure, 8, 4, false,
>>>>>>>>       false);
>>>>>>>>   // now run the KMeans job
>>>>>>>>   KMeansDriver.runJob(sData.getRowPath(), new Path(output,
>>>>>>>>       "clusters-0"), output, measure, 0.001, 10, 1, true, false);
>>>>>>>>   // run ClusterDumper
>>>>>>>>   ClusterDumper clusterDumper = new ClusterDumper(new Path(output,
>>>>>>>>       "clusters-2"), new Path(output, "clusteredPoints"));
>>>>>>>>   clusterDumper.printClusters(termDictionary);
>>>>>>>> }
>>>>>>>>
>>>>>>>> On 9/3/10 7:54 AM, Jeff Eastman wrote:
>>>>>>>>
>>>>>>>>> Looking at the single unit test of DMR.times() it seems to be
>>>>>>>>> implementing Matrix expected = inputA.transpose().times(inputB), and
>>>>>>>>> not inputA.times(inputB.transpose()), so the bounds checking is
>>>>>>>>> correct as implemented. But the method still has the wrong name and
>>>>>>>>> AFAICT is not useful for performing this particular computation.
>>>>>>>>> Should I use this instead?
>>>>>>>>>
>>>>>>>>> DistributedRowMatrix sData =
>>>>>>>>>     a.transpose().t[ransposeT]imes(svdT.transpose())
>>>>>>>>>
>>>>>>>>> ugh! And it still fails with:
>>>>>>>>> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable
>>>>>>>>> cannot be cast to org.apache.hadoop.io.IntWritable
>>>>>>>>>   at org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:1)
>>>>>>>>>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>>>>>>>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>>>>>>>>>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>>>>>>>>>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>>>>>>> --------------------------
>>>>>>> Grant Ingersoll
>>>>>>> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct
>>>>>>> 7-8
>
--------------------------
Grant Ingersoll
http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8