(cross-posting to dev)
Hi Jake,
I'm on thin ice here, but a few more words on the math details would
help me sort this out. I've run the DistributedLanczosSolver on the
small test data set in TestClusterDumper:
Path output = getTestTempDirPath("output");
Path tmp = getTestTempDirPath("tmp");
Configuration config = new Configuration();
Path eigenvectors = new Path(output, "eigenvectors");
config.set("mapred.output.dir", eigenvectors.toString());
DistributedLanczosSolver solver = new DistributedLanczosSolver();
solver.setConf(config);
Path testData = getTestTempDirPath("testdata");
solver.run(testData, tmp, sampleData.size(),
    sampleData.get(0).get().size(), false, 8);
This produces 7 (not 8?) vectors in the eigenvectors file. If I then
build DistributedRowMatrices out of these, I get matrices that are
ill-shaped to multiply directly. Clearly a literal translation of your
text is incorrect:
// now multiply the testdata matrix and the eigenvector matrix
DistributedRowMatrix svd = new DistributedRowMatrix(eigenvectors,
    tmp, 8, 38);
DistributedRowMatrix data = new DistributedRowMatrix(testData, tmp,
    15, 38);
DistributedRowMatrix sData = data.times(svd);
// now run the Canopy job to prime kMeans canopies
CanopyDriver.runJob(svd.getRowPath(), output, measure, 8, 4, false,
false);
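[Editor's note: to make the shape mismatch concrete, here is a sketch in plain dimensions, with no Mahout calls. A product A x B needs A's column count to equal B's row count, which the literal translation above violates; transposing the eigenvector matrix makes it conform.]

```java
// Shape check for the multiply above (dimensions only, illustrative).
public class ShapeCheck {
    // standard rule: A (aRows x aCols) times B (bRows x bCols) needs aCols == bRows
    static boolean canMultiply(int aCols, int bRows) {
        return aCols == bRows;
    }

    public static void main(String[] args) {
        // data is 15 x 38, the eigenvector matrix is 8 x 38
        System.out.println(canMultiply(38, 8));   // false: 38 != 8, ill-shaped
        // transposing the eigenvector matrix to 38 x 8 makes it conform,
        // giving a 15 x 8 projected result
        System.out.println(canMultiply(38, 38));  // true
    }
}
```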
Reading up on eigendecomposition, it looks like (DATA ~= SVD D SVD')
would be more like it. But the solver only outputs the eigenvectors and
ignores the eigenvalues, so I cannot construct D. Can you point me back
toward the right path? It has been so long since my grad school
advanced matrices course.
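[Editor's note: one clarifying step, using standard SVD facts in my own notation rather than anything from the thread. The eigenvectors the solver emits correspond to eigenvectors of the square Gram matrix, and the projection described in the reply below needs only those eigenvectors, not the eigenvalue matrix D.]

```latex
% Eigendecomposition applies to the square Gram matrix, not to A itself
% (A is m x n data, V is n x k, k = rank of the reduction):
A^{\top} A \approx V D V^{\top}
% The corresponding SVD of the data is
A \approx U \Sigma V^{\top}, \qquad D = \Sigma^{2}
% Projecting the rows of A onto the new basis needs only V, not D:
A_{\mathrm{proj}} = A V \in \mathbb{R}^{m \times k}
```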
Isn't this related to spectral clustering?
On 9/2/10 10:50 AM, Jake Mannix wrote:
Derek,
The step Jeff's referring to is that the SVD output is a set of vectors in
the "column space" of your original set of rows (your input matrix). If you
want to cluster your original data, projected onto this new SVD basis, you
need to matrix multiply your SVD matrix by your original data. Depending on
how big your data is (number of rows and columns and rank of the reduction),
you can do this in either one or two map-reduce passes.
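[Editor's note: a minimal plain-Java sketch of the multiply Jake describes, with made-up numbers and no Mahout dependency. Each data row is dotted against each eigenvector row, giving a rows x rank projected matrix; Mahout distributes the same computation across map-reduce passes.]

```java
// Sketch: project each data row onto a set of basis (eigen)vectors.
public class ProjectOntoBasis {
    // data: m x n; basis: k x n (each row is an eigenvector of length n)
    // result: m x k, where result[i][j] = dot(data[i], basis[j])
    static double[][] project(double[][] data, double[][] basis) {
        int m = data.length, k = basis.length, n = data[0].length;
        double[][] out = new double[m][k];
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < k; j++) {
                double dot = 0.0;
                for (int c = 0; c < n; c++) {
                    dot += data[i][c] * basis[j][c];
                }
                out[i][j] = dot;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] data  = {{1, 0, 2}, {0, 3, 1}};  // 2 x 3 "data matrix"
        double[][] basis = {{1, 0, 0}, {0, 1, 0}};  // 2 x 3, rank-2 basis
        double[][] proj  = project(data, basis);    // 2 x 2 projected data
        System.out.println(proj[0][0] + " " + proj[0][1]); // 1.0 0.0
        System.out.println(proj[1][0] + " " + proj[1][1]); // 0.0 3.0
    }
}
```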
If you need more detail, I can spell that out a little more directly. It
should actually not just be explained in words but coded into the
examples, now that I think of it... need. more. hours. in. day....
-jake