On Sep 12, 2010, at 3:12 PM, Jeff Eastman wrote:
> Wow, thanks Jake, that is really definitive. There's a lot of gears whirring
> silently under the hood here. So Grant's command line script would need to
> add a "mahout cleaneigens <...>" line
I believe Jake is referring to the cleansvd command, which is in my list.
> to really get the valid eigenvectors (or use the second approach). And too,
> that step will yield the eigenvalues so that either decomposition approach
> may be used. The problem I'm seeing with doing this programmatically is there
> is no ready Java method (e.g. job() called by run() in the clustering stuff)
> which I could use.
>
> +1 on folding EigenVerificationJob into DistributedLanczosSolver. Or, at
> least implement a job() method on EVJ.
>
> +1 on renaming DistributedMatrix.times() to transposeTimes() to avoid
> confusion.
>
> +1 on adding DistributedMatrix.timesDiagonal(Matrix) and perhaps also
> timesDiagonal(Vector)? Perhaps after the 20.2 retrofit?
>
> On the multiplication correctness, you are right: the unit test compares the
> results of a.transpose().times(b).
>
> Thanks for the post,
> Jeff
>
>
> On 9/11/10 8:21 PM, Jake Mannix wrote:
>> I really should have chimed in in this thread earlier... sorry folks.
>>
>> Regarding the relation between eigenvector decomposition and SVD: The
>> Lanczos algorithm is specifically for eigenvector decomposition (and hence
>> works on symmetric square matrices), and operates by looking at the set of
>> vectors { v, Av, A(Av), A(A(Av)), ... } and doing some clever recombination
>> on them. To adapt this to do SVD, you just substitute A = C' * C, for any
>> (non-square) matrix C. You get a set of eigenvectors of C' * C, which in
>> turn are the right-singular vectors of C (what is called V, if C = U D V').
>> In particular, if you are representing C as rows of vectors of
>> dimensionality numColumns, then the singular vectors you get out are also of
>> dimensionality numColumns (i.e. they live in the same vector space as your
>> "documents", because they are mixtures of them).
>>
>> Regarding DistributedMatrix.times(DistributedMatrix other), yes, it should
>> be called transposeTimes(), but unfortunately, timesTranspose is *not* that
>> efficient, in terms of Map-Reduce passes. If you can come up with a way to
>> do a single MR pass to compute timesTranspose(), or simply times(), by all
>> means, add it! But the only matrix multiplication I could do on two large
>> sparse matrices in one pass over the data was what is currently in there.
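Why transposeTimes fits in one pass can be sketched in plain Java (illustrative names, not the actual DistributedRowMatrix code): A' * B is the sum over row indices of the outer products a_i b_i', so a single pass over co-partitioned rows suffices, with the summation playing the role of the reduce.

```java
public class TransposeTimesToy {
  // A' * B as a sum of outer products of corresponding rows: each row pair
  // (a_i, b_i) contributes a_i b_i'. This mirrors a single MR pass where the
  // map sees row i of both matrices and the reduce sums the contributions.
  public static double[][] transposeTimes(double[][] a, double[][] b) {
    int n = a[0].length, m = b[0].length;
    double[][] out = new double[n][m];
    for (int row = 0; row < a.length; row++)   // one "map" call per row index
      for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
          out[i][j] += a[row][i] * b[row][j];  // partial outer product, summed
    return out;
  }

  public static void main(String[] args) {
    double[][] a = {{1, 2}, {3, 4}};
    double[][] b = {{5, 6}, {7, 8}};
    double[][] p = transposeTimes(a, b);       // = A' B
    assert p[0][0] == 26 && p[0][1] == 30 && p[1][0] == 38 && p[1][1] == 44;
  }
}
```

By contrast, timesTranspose() needs dot products of every row of A with every row of B, which is not a single co-partitioned pass over the rows.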
>>
>> Mahout DistributedRowMatrix does *not* currently have the very easy
>> "multiply this matrix by a diagonal matrix", or more generally "multiply by
>> this matrix which is so small as to be able to fit in memory in all of the
>> mappers". That would be a very helpful addition.
>>
>> Regarding correctness of the matrix multiplication: isn't there, right
>> now, a unit test comparing DistributedRowMatrix.times() to
>> DenseMatrix.transpose().times(), checking that the results are the same
>> for small matrices? It should be in TestDistributedRowMatrix or
>> something...
>>
>> As for outputting eigenvalues (or singular values),
>> DistributedLanczosSolver sometimes produces "extra" eigenvectors, whose
>> eigenvalues aren't valid, and also scales all of the eigenvalues down by the
>> max eigenvalue (to avoid floating point overflow), and so the step which
>> spits out the nice correctly scaled (and non-spurious) eigenvector/value
>> pairs is that done by the "cleaneigens" shell script step (c.f.
>> EigenVerificationJob). This is when EigenVector instances are created.
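A toy version of that verification step (plain Java, illustrative names; EigenVerificationJob's actual test differs in detail): estimate the eigenvalue with the Rayleigh quotient, then keep only the vectors for which A v is nearly parallel to v, discarding the spurious ones.

```java
public class CleanEigensToy {
  public static double[] times(double[][] m, double[] x) {
    double[] y = new double[m.length];
    for (int i = 0; i < m.length; i++)
      for (int j = 0; j < x.length; j++)
        y[i] += m[i][j] * x[j];
    return y;
  }

  public static double dot(double[] x, double[] y) {
    double s = 0;
    for (int i = 0; i < x.length; i++) s += x[i] * y[i];
    return s;
  }

  // Estimate lambda from v via the Rayleigh quotient, then accept v only if
  // the residual A v - lambda v is small relative to ||A v||.
  public static boolean looksLikeEigenvector(double[][] a, double[] v, double tol) {
    double[] av = times(a, v);
    double lambda = dot(av, v) / dot(v, v);
    double err = 0;
    for (int i = 0; i < v.length; i++) {
      double d = av[i] - lambda * v[i];
      err += d * d;
    }
    return Math.sqrt(err) <= tol * Math.sqrt(dot(av, av));
  }

  public static void main(String[] args) {
    double[][] a = {{2, 1}, {1, 2}};  // symmetric; eigenvalues 3 and 1
    assert looksLikeEigenvector(a, new double[]{1, 1}, 1e-9);   // genuine
    assert !looksLikeEigenvector(a, new double[]{1, 0}, 1e-2);  // spurious
  }
}
```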
>>
>> It would make sense for EigenVerificationJob to be folded into
>> DistributedLanczosSolver, to reduce confusion. Maybe add a solver flag on
>> whether you care about the "cleaning" step.
>>
>> One more note: if you really want to get the projection of your original
>> matrix's rows onto the SVD basis (a common desire for dimensionality
>> reduction), then what you want is not V, but U' : U is a set of k vectors,
>> each of which has numRows (i.e. "numDocuments" / "numUsers")
>> dimensionality, but what you want is numRows vectors, each of dimension k
>> (these numRows rows have had their dimensionality reduced from sparse
>> numColumns size down to dense k), which is U'.
>>
>> To get U', instead of V, just do DistributedRowMatrix.transpose() on your
>> *original* matrix before feeding it to the DistributedLanczosSolver. You'll
>> get U, and to get U', just take one more transpose of the
>> DistributedRowMatrix represented by the k-eigenvectors (possibly you want to
>> do the "multiply by D^-1 or D" step here too):
>>
>> C = U D V' --> C' C = V D U' U D V' = V D^2 V', and
>> C' = V D U' --> C C' = U D V' V D U' = U D^2 U'
>>
>> Overall, however, this only saves you one map reduce pass, because the
>> original way of getting U' was to: take SVD (k MR passes), then transpose
>> both the output (1 MR pass) and input (1 MR pass), and then multiply them
>> together (1 MR pass) = k+3 MR passes. The new way is: take transpose (1 MR
>> pass), then take SVD (k MR passes), then transpose the output (1 MR pass) =
>> k + 2 MR passes.
>>
>> Of course, it all depends on what you need. I've often wanted to hang
>> onto the V-vectors, because they are useful for "folding-in" new data into a
>> larger approximate SVD, and they already have the form of "pseudo-documents"
>> and can give you a feel for how you've clustered your input features.
>>
>> -jake
>>
>> On Sat, Sep 11, 2010 at 6:18 PM, Jeff Eastman <[email protected]> wrote:
>>
>>> +1 Ok, this makes so much sense now. It had been bothering me that U D
>>> would not have the same number of rows as A in the non-square case (our
>>> typical one). It's interesting that the
>>> DistributedLanczosSolver.serializeOutput() method does not actually
>>> serialize the eigenvalues. I've fixed that locally using NamedVector but
>>> something like WeightedVectorWritable (not itself a Vector) would be better.
>>> There are also EigenVector objects which might be used here.
>>>
>>>
>>>
>>> On 9/11/10 3:57 PM, Ted Dunning wrote:
>>>
>>>> Well, firstly the code is doing a singular value decomposition. This is
>>>> similar to an eigenvector decomposition except that there is no assumption
>>>> that A is square. With the SVD, we have
>>>>
>>>> A = U D V'
>>>>
>>>> This is in contrast to the eigenvector decomposition which has
>>>>
>>>> A = U D U'
>>>>
>>>> The difference lies in having both left and right singular vectors, U and
>>>> V.
>>>>
>>>> Practically speaking, with the SVD decomposition, we can consider A or any
>>>> other appropriate matrix to be either a pile of row vectors or a bunch of
>>>> column vectors and we can project either into the reduced dimensional
>>>> representation using U' on the left or V on the right. In Mahout, the SVD
>>>> only gives us one of U or V (I forget which). Most commonly, I consider A
>>>> to be row vectors. This means that I can use the fact that V' V = I to do
>>>> this:
>>>>
>>>> A V = U D
>>>>
>>>> Now, if I wanted to do this to some new matrix B, I would need to keep V
>>>> around in order to do the multiplication. But since I just did the SVD of
>>>> A, all I need is U and D, which is what I think the Mahout eigen-solver
>>>> gives us. As you say, multiplying by D is trivial because it is diagonal.
>>>>
>>>> Does that make sense?
>>>>
>>>> This is similar to what you said, but differs in some details.
>>>>
>>>> Note also that it is common to split D into two factors:
>>>>
>>>> A = U sqrt(D) sqrt(D) V'
>>>>
>>>> and then use this for the decomposition of A
>>>>
>>>> A V sqrt(D)^-1
>>>>
>>>> The multiplication by the inverse of sqrt(D) is trivial since D is
>>>> diagonal.
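Ted's identities are easy to sanity-check numerically in plain Java (illustrative class name): build A = U D V' from orthonormal U and V and diagonal D, and confirm that A V = U D because V' V = I.

```java
public class SvdIdentityToy {
  public static double[][] mul(double[][] x, double[][] y) {
    double[][] z = new double[x.length][y[0].length];
    for (int i = 0; i < x.length; i++)
      for (int k = 0; k < y.length; k++)
        for (int j = 0; j < y[0].length; j++)
          z[i][j] += x[i][k] * y[k][j];
    return z;
  }

  public static double[][] transpose(double[][] x) {
    double[][] t = new double[x[0].length][x.length];
    for (int i = 0; i < x.length; i++)
      for (int j = 0; j < x[0].length; j++)
        t[j][i] = x[i][j];
    return t;
  }

  // Any 2x2 rotation matrix is orthonormal, so it serves as a toy U or V.
  public static double[][] rotation(double theta) {
    return new double[][]{{Math.cos(theta), -Math.sin(theta)},
                          {Math.sin(theta),  Math.cos(theta)}};
  }

  public static void main(String[] args) {
    double[][] u = rotation(0.3), v = rotation(1.1);
    double[][] d = {{3, 0}, {0, 1}};
    double[][] a = mul(mul(u, d), transpose(v)); // A = U D V'
    double[][] av = mul(a, v);                   // A V = U D V' V = U D
    double[][] ud = mul(u, d);
    for (int i = 0; i < 2; i++)
      for (int j = 0; j < 2; j++)
        assert Math.abs(av[i][j] - ud[i][j]) < 1e-12;
  }
}
```

Multiplying A V on the right by sqrt(D)^-1 (a diagonal scaling of the columns) then gives U sqrt(D), the split Ted mentions.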
>>>>
>>>>
>>>> On Sat, Sep 11, 2010 at 3:35 PM, Jeff Eastman<[email protected]
>>>>> wrote:
>>>> Ted,
>>>>> Is this because, for any matrix A, its eigenvector matrix P and its
>>>>> diagonal eigenvalue matrix D, that A P = P D? And P D does not require a
>>>>> full matrix multiplication since it is diagonal? Could you please
>>>>> elaborate?
>>>>>
>>>>> Jeff
>>>>>
>>>>>
>>>>>
>>>>> On 9/11/10 2:50 PM, Ted Dunning wrote:
>>>>>
>>>>>> Should be close. The matrixMult step may be redundant if you want to
>>>>>> cluster the same data that you decomposed. That would make the second
>>>>>> transpose unnecessary as well.
>>>>>>
>>>>>> On Sat, Sep 11, 2010 at 2:43 PM, Grant Ingersoll<[email protected]
>>>>>>
>>>>>>> wrote:
>>>>>>>
>>>>>> To put this in bin/mahout speak, this would look like, munging some
>>>>>> names
>>>>>>
>>>>>>> and taking liberties with the actual argument to be passed in:
>>>>>>>
>>>>>>> bin/mahout svd (original -> svdOut)
>>>>>>> bin/mahout cleansvd ...
>>>>>>> bin/mahout transpose svdOut -> svdT
>>>>>>> bin/mahout transpose original -> originalT
>>>>>>> bin/mahout matrixmult originalT svdT -> newMatrix
>>>>>>> bin/mahout kmeans newMatrix
>>>>>>>
>>>>>>> Is that about right?
>>>>>>>
>>>>>>>
>>>>>>> On Sep 3, 2010, at 11:19 AM, Jeff Eastman wrote:
>>>>>>>
>>>>>>> Ok, the transposed computation seems to work and the cast exception
>>>>>>> was caused by my unit test writing LongWritable keys to the testdata
>>>>>>> file. The following test produces a comparable answer to the
>>>>>>> non-distributed case. I still want to rename the method to
>>>>>>> transposeTimes for clarity. And better, implement timesTranspose to
>>>>>>> make this particular computation more efficient:
>>>>>>>> public void testKmeansDSVD() throws Exception {
>>>>>>>>   DistanceMeasure measure = new EuclideanDistanceMeasure();
>>>>>>>>   Path output = getTestTempDirPath("output");
>>>>>>>>   Path tmp = getTestTempDirPath("tmp");
>>>>>>>>   Path eigenvectors = new Path(output, "eigenvectors");
>>>>>>>>   int desiredRank = 13;
>>>>>>>>   DistributedLanczosSolver solver = new DistributedLanczosSolver();
>>>>>>>>   Configuration config = new Configuration();
>>>>>>>>   solver.setConf(config);
>>>>>>>>   Path testData = getTestTempDirPath("testdata");
>>>>>>>>   int sampleDimension = sampleData.get(0).get().size();
>>>>>>>>   solver.run(testData, tmp, eigenvectors, sampleData.size(),
>>>>>>>>       sampleDimension, false, desiredRank);
>>>>>>>>   // now multiply the testdata matrix and the eigenvector matrix
>>>>>>>>   DistributedRowMatrix svdT = new DistributedRowMatrix(eigenvectors,
>>>>>>>>       tmp, desiredRank - 1, sampleDimension);
>>>>>>>>   JobConf conf = new JobConf(config);
>>>>>>>>   svdT.configure(conf);
>>>>>>>>   DistributedRowMatrix a = new DistributedRowMatrix(testData, tmp,
>>>>>>>>       sampleData.size(), sampleDimension);
>>>>>>>>   a.configure(conf);
>>>>>>>>   DistributedRowMatrix sData = a.transpose().times(svdT.transpose());
>>>>>>>>   sData.configure(conf);
>>>>>>>>
>>>>>>>>   // now run the Canopy job to prime kMeans canopies
>>>>>>>>   CanopyDriver.runJob(sData.getRowPath(), output, measure, 8, 4, false,
>>>>>>>>       false);
>>>>>>>>   // now run the KMeans job
>>>>>>>>   KMeansDriver.runJob(sData.getRowPath(), new Path(output,
>>>>>>>>       "clusters-0"), output, measure, 0.001, 10, 1, true, false);
>>>>>>>>   // run ClusterDumper
>>>>>>>>   ClusterDumper clusterDumper = new ClusterDumper(new Path(output,
>>>>>>>>       "clusters-2"), new Path(output, "clusteredPoints"));
>>>>>>>>   clusterDumper.printClusters(termDictionary);
>>>>>>>> }
>>>>>>>>
>>>>>>>> On 9/3/10 7:54 AM, Jeff Eastman wrote:
>>>>>>>>
>>>>>>>>> Looking at the single unit test of DMR.times() it seems to be
>>>>>>>>> implementing Matrix expected = inputA.transpose().times(inputB), and
>>>>>>>>> not inputA.times(inputB.transpose()), so the bounds checking is
>>>>>>>>> correct as implemented. But the method still has the wrong name and
>>>>>>>>> AFAICT is not useful for performing this particular computation.
>>>>>>>>> Should I use this instead?
>>>>>>>>>
>>>>>>>>> DistributedRowMatrix sData =
>>>>>>>>>     a.transpose().t[ransposeT]imes(svdT.transpose())
>>>>>>>>>
>>>>>>>>> ugh! And it still fails with:
>>>>>>>>> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable
>>>>>>>>> cannot be cast to org.apache.hadoop.io.IntWritable
>>>>>>>>>   at org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:1)
>>>>>>>>>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>>>>>>>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>>>>>>>>>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>>>>>>>>>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>>>>>>> --------------------------
>>>>>>> Grant Ingersoll
>>>>>>> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct
>>>>>>> 7-8
>
--------------------------
Grant Ingersoll
http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8