Wow, thanks Jake, that is really definitive. There are a lot of gears whirring silently under the hood here. So Grant's command-line script would need to add a "mahout cleaneigens <...>" line to get the valid eigenvectors (or use the second approach). Also, that step will yield the eigenvalues, so either decomposition approach may be used. The problem I'm seeing with doing this programmatically is that there is no ready Java method (e.g. a job() called by run(), as in the clustering code) which I could use.

+1 on folding EigenVerificationJob into DistributedLanczosSolver. Or, at least implement a job() method on EVJ.

+1 on renaming DistributedMatrix.times() to transposeTimes() to avoid confusion.

+1 on adding DistributedMatrix.timesDiagonal(Matrix) and perhaps also timesDiagonal(Vector)? Perhaps after the 20.2 retrofit?

On the multiplication correctness, you are right: the unit test compares the results of a.transpose().times(b).

Thanks for the post,
Jeff


On 9/11/10 8:21 PM, Jake Mannix wrote:
I really should have chimed in in this thread earlier... sorry folks.

   Regarding the relation between eigenvector decomposition and SVD:  The
Lanczos algorithm is specifically for eigenvector decomposition (and hence
works on symmetric square matrices), and operates by looking at the set of
vectors { v, Av, A(Av), A(A(Av))), ... } and doing some clever recombination
on them.  To adapt this to do SVD, you just substitute A = C' * C, for any
(non-square) matrix C.  You get a set of eigenvectors of C' * C, which in
turn are the right-singular vectors of C (what is called V, if C = U D V').
  In particular, if you are representing C as rows of vectors of
dimensionality numColumns, then the singular vectors you get out are also of
dimensionality numColumns (i.e. they live in the same vector space as your
"documents", because they are mixtures of them).
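To make that concrete with numbers, here is a toy plain-Java check (not Mahout API; the matrix values are mine): for a 3x2 matrix C, the 2x2 matrix C' * C is symmetric, and its eigenvectors are the right-singular vectors of C.

```java
// Toy numeric check (plain Java, no Mahout): eigenvectors of C' * C are the
// right-singular vectors of C. Here C is 3x2, so C' * C is 2x2.
public class GramianDemo {
    public static double[][] gramian(double[][] c) { // returns C' * C
        int n = c[0].length;
        double[][] g = new double[n][n];
        for (double[] row : c)            // sum of outer products row' * row
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    g[i][j] += row[i] * row[j];
        return g;
    }

    public static void main(String[] args) {
        double[][] g = gramian(new double[][]{{1, 1}, {0, 1}, {1, 0}});
        // g == [[2, 1], [1, 2]]; v = (1, 1) satisfies g v = 3 v, so v
        // (normalized) is a right-singular vector of C, singular value sqrt(3).
        System.out.println(g[0][0] + " " + g[0][1]); // 2.0 1.0
    }
}
```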

   Regarding DistributedMatrix.times(DistributedMatrix other), yes, it should
be called transposeTimes(), but unfortunately, timesTranspose is *not* that
efficient, in terms of Map-Reduce passes.  If you can come up with a way to
do a single MR pass to compute timesTranspose(), or simply times(), by all
means, add it!  But the only matrix multiplication I could do on two large
sparse matrices in one pass over the data was what is currently in there.
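For the curious, the reason a one-pass A' * B is possible at all: A' * B is a sum of rank-one outer products over paired rows, so each map input needs only one row of each matrix at a time. A plain-Java sketch of that structure (my sketch, not the actual Mahout mapper code):

```java
// Sketch (plain Java, not the Mahout mapper): A' * B is the sum over row
// index i of the outer products a_i * b_i', so a single streaming pass over
// paired rows (rowA_i, rowB_i) suffices -- the structure the one-pass job uses.
public class TransposeTimesDemo {
    public static double[][] transposeTimes(double[][] a, double[][] b) {
        double[][] out = new double[a[0].length][b[0].length];
        for (int i = 0; i < a.length; i++)          // one pass over rows
            for (int j = 0; j < a[0].length; j++)
                for (int k = 0; k < b[0].length; k++)
                    out[j][k] += a[i][j] * b[i][k]; // outer product a_i b_i'
        return out;
    }

    public static void main(String[] args) {
        double[][] p = transposeTimes(new double[][]{{1, 2}, {3, 4}},
                                      new double[][]{{5, 6}, {7, 8}});
        System.out.println(p[0][0] + " " + p[1][1]); // 26.0 44.0
    }
}
```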

   Mahout DistributedRowMatrix does *not* currently have the very easy
"multiply this matrix by a diagonal matrix", or more generally "multiply by
this matrix which is so small as to be able to fit in memory in all of the
mappers".  That would be a very helpful addition.
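As a sketch of why such a job would be cheap (class and method names here are invented for illustration, not existing Mahout API): multiplying by a diagonal matrix on the right just rescales each column, which is purely per-row work and hence map-only.

```java
// Hypothetical timesDiagonal (name invented, not a current Mahout API):
// (A * D)_{ij} = A_{ij} * d_j, so each mapper can rescale its row alone.
public class TimesDiagonalDemo {
    public static double[] scaleRow(double[] row, double[] diag) {
        double[] out = new double[row.length];
        for (int j = 0; j < row.length; j++)
            out[j] = row[j] * diag[j]; // column j scaled by diagonal entry d_j
        return out;
    }

    public static void main(String[] args) {
        double[] r = scaleRow(new double[]{1, 2, 3}, new double[]{2, 0, 1});
        System.out.println(r[0] + " " + r[1] + " " + r[2]); // 2.0 0.0 3.0
    }
}
```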

   Regarding correctness of the matrix multiplication: isn't there, right
now, a unit test comparing DistributedRowMatrix.times() to
DenseMatrix.transpose().times(), checking that the results are the same
for small matrices?  It should be in TestDistributedRowMatrix or
something...

   As for outputting eigenvalues (or singular values),
DistributedLanczosSolver sometimes produces "extra" eigenvectors, whose
eigenvalues aren't valid, and also scales all of the eigenvalues down by the
max eigenvalue (to avoid floating-point overflow), so the step which
spits out the nice, correctly scaled (and non-spurious) eigenvector/value
pairs is the one done by the "cleaneigens" shell script step (cf.
EigenVerificationJob).  This is when EigenVector instances are created.
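The core numerical test such a cleaning step can apply is a residual check; this is a hedged plain-Java sketch of the idea, not EigenVerificationJob's actual code: accept (v, lambda) only when ||A v - lambda v|| is small relative to ||v||.

```java
// Sketch of a residual check for "cleaning" eigenpairs (my illustration,
// not EigenVerificationJob's code): (v, lambda) is accepted as an eigenpair
// of the symmetric matrix a when ||A v - lambda v|| / ||v|| < tol.
public class EigenCheckDemo {
    public static boolean isEigenpair(double[][] a, double[] v,
                                      double lambda, double tol) {
        double err = 0, norm = 0;
        for (int i = 0; i < v.length; i++) {
            double avi = 0;
            for (int j = 0; j < v.length; j++) avi += a[i][j] * v[j];
            double d = avi - lambda * v[i]; // i-th entry of A v - lambda v
            err += d * d;
            norm += v[i] * v[i];
        }
        return Math.sqrt(err / norm) < tol;
    }

    public static void main(String[] args) {
        double[][] a = {{2, 1}, {1, 2}};
        // (1, 1) has eigenvalue 3; a wrong (e.g. unscaled) lambda is rejected.
        System.out.println(isEigenpair(a, new double[]{1, 1}, 3, 1e-9)); // true
        System.out.println(isEigenpair(a, new double[]{1, 1}, 2, 1e-9)); // false
    }
}
```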

   It would make sense for EigenVerificationJob to be folded into
DistributedLanczosSolver, to reduce confusion.  Maybe add a solver flag on
whether you care about the "cleaning" step.

   One more note:  if you really want to get the projection of your original
matrix's rows onto the SVD basis (a common desire for dimensionality
reduction), then what you want is not V, but U': U is a set of k vectors,
each of which has numRows (i.e. "numDocuments" / "numUsers")
dimensionality, but what you want is numRows vectors, each of dimension k
(these numRows rows have had their dimensionality reduced from sparse
numColumns size down to dense k), which is U'.

   To get U', instead of V, just do DistributedRowMatrix.transpose() on your
*original* matrix before feeding it to the DistributedLanczosSolver.  You'll
get U, and to get U', just take one more transpose of the
DistributedRowMatrix represented by the k-eigenvectors (possibly you want to
do the "multiply by D^-1 or D" step here too):

   C = U D V'  -->  C' C = V D U' U D V' = V D^2 V', and
   C' = V D U' -->  C C' = U D V' V D U' = U D^2 U'

   Overall, however, this only saves you one map reduce pass, because the
original way of getting U' was to: take SVD (k MR passes), then transpose
both the output (1 MR pass) and input (1 MR pass), and then multiply them
together (1 MR pass) = k+3 MR passes.  The new way is: take transpose (1 MR
pass), then take SVD (k MR passes), then transpose the output (1 MR pass) =
k + 2 MR passes.

   Of course, it all depends on what you need.  I've often wanted to hang
onto the V-vectors, because they are useful for "folding-in" new data into a
larger approximate SVD, and they already have the form of "pseudo-documents"
and can give you a feel for how you've clustered your input features.

   -jake

On Sat, Sep 11, 2010 at 6:18 PM, Jeff Eastman <[email protected]> wrote:

  +1 Ok, this makes so much sense now. It had been bothering me that U D
would not have the same number of rows as A in the non-square case (our
typical one). It's interesting that the
DistributedLanczosSolver.serializeOutput() method does not actually
serialize the eigenvalues. I've fixed that locally using NamedVector, but
something like WeightedVectorWritable (not itself a Vector) would be better.
There are also EigenVector objects which might be used here.



On 9/11/10 3:57 PM, Ted Dunning wrote:

Well, firstly the code is doing a singular value decomposition.  This is
similar to an eigenvector decomposition except that there is no assumption
that A is square.  With the SVD, we have

      A = U D V'

This is in contrast to the eigenvector decomposition which has

     A = U D U'

The difference lies in having both left and right singular vectors, U and
V.

Practically speaking, with the SVD decomposition, we can consider A or any
other appropriate matrix to be either a pile of row vectors or a bunch of
column vectors and we can project either into the reduced dimensional
representation using U' on the left or V on the right.  In Mahout, the SVD
only gives us one of U or V (I forget which).  Most commonly, I consider A
to be row vectors.  This means that I can use the fact that V' V = I to do
this:

      A V = U D

Now, if I wanted to do this to some new matrix B, I would need to keep V
around in order to do the multiplication.  But since I just did the SVD of
A, all I need is U and D, which is what I think the Mahout eigen-solver
gives us.  As you say, multiplying by D is trivial because it is diagonal.

Does that make sense?

This is similar to what you said, but differs in some details.

Note also that it is common to split D into two factors:

     A = U sqrt(D) sqrt(D) V'

and then use this for the reduced-dimensional projection of A:

     A V sqrt(D)^-1 = U sqrt(D)

The multiplication by the inverse of sqrt(D) is trivial since D is
diagonal.
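The A V = U D identity is easy to sanity-check numerically on a hand-picked 2x2 SVD (plain Java, factor values chosen by me; the sqrt(D) variant checks out the same way):

```java
// Check A * V = U * D for a hand-picked SVD:
// A = [[0,2],[3,0]] factors as U = [[0,1],[1,0]], D = diag(3,2), V = I.
public class ProjectionDemo {
    public static double[][] mult(double[][] x, double[][] y) {
        double[][] out = new double[x.length][y[0].length];
        for (int i = 0; i < x.length; i++)
            for (int k = 0; k < y.length; k++)
                for (int j = 0; j < y[0].length; j++)
                    out[i][j] += x[i][k] * y[k][j];
        return out;
    }

    public static void main(String[] args) {
        double[][] a = {{0, 2}, {3, 0}};
        double[][] u = {{0, 1}, {1, 0}};
        double[][] d = {{3, 0}, {0, 2}};
        double[][] v = {{1, 0}, {0, 1}};
        double[][] av = mult(a, v); // projection of A's rows onto the SVD basis
        double[][] ud = mult(u, d); // equals A * V because V' V = I
        System.out.println(java.util.Arrays.deepEquals(av, ud)); // true
    }
}
```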


On Sat, Sep 11, 2010 at 3:35 PM, Jeff Eastman <[email protected]> wrote:

   Ted,
Is this because, for any matrix A with eigenvector matrix P and diagonal
eigenvalue matrix D, we have A P = P D? And because P D does not require a
full matrix multiplication since D is diagonal? Could you please
elaborate?

Jeff



On 9/11/10 2:50 PM, Ted Dunning wrote:

  Should be close.  The matrixMult step may be redundant if you want to
cluster the same data that you decomposed.  That would make the second
transpose unnecessary as well.

On Sat, Sep 11, 2010 at 2:43 PM, Grant Ingersoll <[email protected]> wrote:

  To put this in bin/mahout speak, this would look like the following,
munging some names and taking liberties with the actual arguments to be
passed in:

bin/mahout svd (original -> svdOut)
bin/mahout cleansvd ...
bin/mahout transpose svdOut -> svdT
bin/mahout transpose original -> originalT
bin/mahout matrixmult originalT svdT -> newMatrix
bin/mahout kmeans newMatrix

Is that about right?


On Sep 3, 2010, at 11:19 AM, Jeff Eastman wrote:

  Ok, the transposed computation seems to work, and the cast exception was
caused by my unit test writing LongWritable keys to the testdata file. The
following test produces a comparable answer to the non-distributed case. I
still want to rename the method to transposeTimes for clarity and, better,
implement timesTranspose to make this particular computation more
efficient:
  public void testKmeansDSVD() throws Exception {
    DistanceMeasure measure = new EuclideanDistanceMeasure();
    Path output = getTestTempDirPath("output");
    Path tmp = getTestTempDirPath("tmp");
    Path eigenvectors = new Path(output, "eigenvectors");
    int desiredRank = 13;
    DistributedLanczosSolver solver = new DistributedLanczosSolver();
    Configuration config = new Configuration();
    solver.setConf(config);
    Path testData = getTestTempDirPath("testdata");
    int sampleDimension = sampleData.get(0).get().size();
    solver.run(testData, tmp, eigenvectors, sampleData.size(),
        sampleDimension, false, desiredRank);

    // now multiply the testdata matrix and the eigenvector matrix
    DistributedRowMatrix svdT = new DistributedRowMatrix(eigenvectors,
        tmp, desiredRank - 1, sampleDimension);
    JobConf conf = new JobConf(config);
    svdT.configure(conf);
    DistributedRowMatrix a = new DistributedRowMatrix(testData, tmp,
        sampleData.size(), sampleDimension);
    a.configure(conf);
    DistributedRowMatrix sData = a.transpose().times(svdT.transpose());
    sData.configure(conf);

    // now run the Canopy job to prime kMeans canopies
    CanopyDriver.runJob(sData.getRowPath(), output, measure, 8, 4, false,
        false);
    // now run the KMeans job
    KMeansDriver.runJob(sData.getRowPath(), new Path(output, "clusters-0"),
        output, measure, 0.001, 10, 1, true, false);
    // run ClusterDumper
    ClusterDumper clusterDumper = new ClusterDumper(
        new Path(output, "clusters-2"), new Path(output, "clusteredPoints"));
    clusterDumper.printClusters(termDictionary);
  }

On 9/3/10 7:54 AM, Jeff Eastman wrote:

  Looking at the single unit test of DRM.times(), it seems to be
implementing Matrix expected = inputA.transpose().times(inputB), and not
inputA.times(inputB.transpose()), so the bounds checking is correct as
implemented. But the method still has the wrong name and AFAICT is not
useful for performing this particular computation. Should I use this
instead?

  DistributedRowMatrix sData =
      a.transpose().t[ransposeT]imes(svdT.transpose())

ugh! And it still fails with:

java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot
be cast to org.apache.hadoop.io.IntWritable
    at org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:1)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
--------------------------
Grant Ingersoll
http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8




