Hopefully answering my own question here, but ending up with another.
The svd matrix I'd built from the eigenvectors was the wrong shape as
I had constructed it. Taking Jake's "column space" literally and
building a matrix where each of the columns is one of the
eigenvectors does give a matrix of the correct shape. The math works
with DenseMatrix, producing a new data matrix which is 15x7: a
significant dimensionality reduction from 15x39.
In this example, with 15 samples having 39 terms and 7 eigenvectors:
A = [15x39]
P = [39x7]
A P = [15x7]
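For concreteness, that shape arithmetic can be sketched in plain Java with bare 2D arrays (the class and helper names below are mine for illustration, not Mahout's DenseMatrix API, though DenseMatrix.times() enforces the same conformability rule):

```java
// Sketch of the projection step: A [15x39] times P [39x7] yields the
// reduced data matrix [15x7]. The multiply is legal exactly because
// the 39 columns of A match the 39 rows of P.
public class ProjectionShapes {
    static double[][] times(double[][] a, double[][] b) {
        int aRows = a.length, aCols = a[0].length;
        int bRows = b.length, bCols = b[0].length;
        if (aCols != bRows) {
            throw new IllegalArgumentException(
                "cardinality mismatch: " + aCols + " != " + bRows);
        }
        double[][] c = new double[aRows][bCols];
        for (int i = 0; i < aRows; i++) {
            for (int j = 0; j < bCols; j++) {
                for (int k = 0; k < aCols; k++) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        double[][] a = new double[15][39]; // data: 15 samples x 39 terms
        double[][] p = new double[39][7];  // eigenvectors as columns
        double[][] ap = times(a, p);       // reduced data
        System.out.println(ap.length + "x" + ap[0].length); // prints 15x7
    }
}
```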
Running Canopy and then KMeans against AP produces 5 clusters. Their
goodness is a bit hard for me to ascertain right now, but they do
have the reduced number of terms.
To do this with DistributedRowMatrix, I think I need to use
svd.transpose() instead to get my original svd matrix into the correct
shape for multiplication.
For the same example:
data = [15x39]
svd = [7x39] as I've constructed it
svd.transpose = [39x7] which is the correct shape
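A quick sketch of that transpose step in plain Java (illustrative names again, not the DistributedRowMatrix API):

```java
// Transposing the row-wise eigenvector matrix [7x39] gives the
// [39x7] shape needed on the right-hand side of the multiply.
public class TransposeShape {
    static double[][] transpose(double[][] m) {
        double[][] t = new double[m[0].length][m.length];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[0].length; j++) {
                t[j][i] = m[i][j];
            }
        }
        return t;
    }

    public static void main(String[] args) {
        double[][] svd = new double[7][39]; // eigenvectors as rows
        double[][] p = transpose(svd);      // now 39x7
        System.out.println(p.length + "x" + p[0].length); // prints 39x7
    }
}
```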
When I run this, however, it fails with a CardinalityException, and
I'm perplexed by this line from DistributedRowMatrix.times():

    if (numRows != other.numRows()) {
      throw new CardinalityException(numRows, other.numRows());
    }
... which is inconsistent with AbstractMatrix.times() [and, I think,
also incorrect]:

    int[] c = size();
    int[] o = other.size();
    if (c[COL] != o[ROW]) {
      throw new CardinalityException(c[COL], o[ROW]);
    }
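Plugging the shapes from the example into both checks shows the discrepancy numerically (a plain-Java sketch; the helper names are mine, not Mahout's):

```java
// data = [15x39], svd.transpose() = [39x7]: a legal multiplication.
public class CardinalityChecks {
    // The test as written in DistributedRowMatrix.times():
    static boolean drmCheckThrows(int numRows, int otherNumRows) {
        return numRows != otherNumRows;
    }

    // The conventional test from AbstractMatrix.times():
    static boolean abstractCheckThrows(int numCols, int otherNumRows) {
        return numCols != otherNumRows;
    }

    public static void main(String[] args) {
        // DRM check compares 15 to 39, so it throws on a legal multiply.
        System.out.println(drmCheckThrows(15, 39));      // prints true
        // AbstractMatrix check compares 39 to 39, so it proceeds.
        System.out.println(abstractCheckThrows(39, 39)); // prints false
    }
}
```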
If I change numRows to numCols in DRM.times(), it gets past the
cardinality test but blows up with an "org.apache.hadoop.io.LongWritable
cannot be cast to org.apache.hadoop.io.IntWritable" error somewhere in
the bowels of times(). I need to keep debugging to localize that.
So the upshot is I can make it work using DenseMatrix but not with
DistributedRowMatrix.
On 9/2/10 1:45 PM, Jeff Eastman wrote:
(cross-posting to dev)
Hi Jake,
I'm on thin ice here, but a few more words on the math details would
help me sort this out. I've run the DistributedLanczosSolver on the
small testdata set in TestClusterDumper:
    Path output = getTestTempDirPath("output");
    Path tmp = getTestTempDirPath("tmp");
    Configuration config = new Configuration();
    Path eigenvectors = new Path(output, "eigenvectors");
    config.set("mapred.output.dir", eigenvectors.toString());
    DistributedLanczosSolver solver = new DistributedLanczosSolver();
    solver.setConf(config);
    Path testData = getTestTempDirPath("testdata");
    solver.run(testData, tmp, sampleData.size(),
        sampleData.get(0).get().size(), false, 8);
This produces 7 (not 8?) vectors in the eigenvectors file. If I then
build DistributedRowMatrices out of these I get matrices that are
ill-shaped to multiply directly. Clearly a literal translation of your
text is incorrect:
    // now multiply the testdata matrix and the eigenvector matrix
    DistributedRowMatrix svd =
        new DistributedRowMatrix(eigenvectors, tmp, 8, 38);
    DistributedRowMatrix data =
        new DistributedRowMatrix(testData, tmp, 15, 38);
    DistributedRowMatrix sData = data.times(svd);
    // now run the Canopy job to prime kMeans canopies
    CanopyDriver.runJob(svd.getRowPath(), output, measure, 8, 4,
        false, false);
Reading up on eigendecomposition, it looks like (DATA ~= SVD D SVD')
would be more like it. But the solver only outputs the eigenvectors
and ignores the eigenvalues, so I cannot construct D. Can you point
me back towards the right path? It has been so long since my grad
school advanced matrices course.
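For what it's worth, here is the decomposition I mean, checked on a hand-picked symmetric 2x2 matrix (plain Java, purely a sanity check of the algebra, not Mahout code):

```java
// Sanity check that A = V * D * V' for a symmetric matrix, with the
// eigenvectors as the *columns* of V. Example: A = [[2,1],[1,2]] has
// eigenvalues 3 and 1 with eigenvectors (1,1)/sqrt(2) and (1,-1)/sqrt(2).
public class EigenCheck {
    static double[][] times(double[][] a, double[][] b) {
        double[][] c = new double[a.length][b[0].length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < b[0].length; j++)
                for (int k = 0; k < b.length; k++)
                    c[i][j] += a[i][k] * b[k][j];
        return c;
    }

    static double[][] reconstruct() {
        double s = 1.0 / Math.sqrt(2.0);
        double[][] v  = {{s, s}, {s, -s}}; // eigenvectors as columns
        double[][] d  = {{3, 0}, {0, 1}};  // eigenvalues on the diagonal
        double[][] vt = {{s, s}, {s, -s}}; // V' (V happens to equal V' here)
        return times(times(v, d), vt);
    }

    public static void main(String[] args) {
        double[][] a = reconstruct(); // recovers [[2,1],[1,2]] up to rounding
        System.out.println(a[0][0] + " " + a[0][1]);
    }
}
```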
Isn't this related to spectral clustering?
On 9/2/10 10:50 AM, Jake Mannix wrote:
Derek,
The step Jeff's referring to is that the SVD output is a set of
vectors in the "column space" of your original set of rows (your
input matrix). If you want to cluster your original data, projected
onto this new SVD basis, you need to matrix multiply your SVD matrix
by your original data. Depending on how big your data is (number of
rows and columns and rank of the reduction), you can do this in
either one or two map-reduce passes.

If you need more detail, I can spell that out a little more
directly. It should actually be not just explained in words, but
coded into the examples, now that I think of it... need. more.
hours. in. day....
-jake