Hopefully answering my own question here, but ending up with another.
The svd matrix I'd built from the eigenvectors was the wrong shape as
I had constructed it. Taking Jake's "column space" literally and
building a matrix where each of the columns is one of the
eigenvectors does give a matrix of the correct shape. The math works
with DenseMatrix, producing a new data matrix which is 15x7: a
significant dimensionality reduction from 15x39.
In this example, with 15 samples having 39 terms and 7 eigenvectors:
A = [15x39]
P = [39x7]
A P = [15x7]
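For concreteness, that shape arithmetic can be sketched in plain Java with bare 2D arrays (the class and helper names below are mine for illustration, not Mahout's DenseMatrix API, though DenseMatrix.times() enforces the same conformability rule):

```java
// Sketch of the projection step: A [15x39] times P [39x7] yields the
// reduced data matrix [15x7]. The multiply is legal exactly because
// the 39 columns of A match the 39 rows of P.
public class ProjectionShapes {
    static double[][] times(double[][] a, double[][] b) {
        int aRows = a.length, aCols = a[0].length;
        int bRows = b.length, bCols = b[0].length;
        if (aCols != bRows) {
            throw new IllegalArgumentException(
                "cardinality mismatch: " + aCols + " != " + bRows);
        }
        double[][] c = new double[aRows][bCols];
        for (int i = 0; i < aRows; i++) {
            for (int j = 0; j < bCols; j++) {
                for (int k = 0; k < aCols; k++) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        double[][] a = new double[15][39]; // data: 15 samples x 39 terms
        double[][] p = new double[39][7];  // eigenvectors as columns
        double[][] ap = times(a, p);       // reduced data
        System.out.println(ap.length + "x" + ap[0].length); // prints 15x7
    }
}
```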
Running Canopy and then KMeans against AP produces 5 clusters. Their
goodness is a bit hard for me to ascertain right now, but they do
have the reduced number of terms.
To do this with DistributedRowMatrix, I think I need to use
svd.transpose() instead to get my original svd matrix into the correct
shape for multiplication.
For the same example:
data = [15x39]
svd = [7x39] as I've constructed it
svd.transpose = [39x7] which is the correct shape
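A quick sketch of that transpose step in plain Java (illustrative names again, not the DistributedRowMatrix API):

```java
// Transposing the row-wise eigenvector matrix [7x39] gives the
// [39x7] shape needed on the right-hand side of the multiply.
public class TransposeShape {
    static double[][] transpose(double[][] m) {
        double[][] t = new double[m[0].length][m.length];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[0].length; j++) {
                t[j][i] = m[i][j];
            }
        }
        return t;
    }

    public static void main(String[] args) {
        double[][] svd = new double[7][39]; // eigenvectors as rows
        double[][] p = transpose(svd);      // now 39x7
        System.out.println(p.length + "x" + p[0].length); // prints 39x7
    }
}
```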
When I run this, however, it fails with a CardinalityException, and
I'm perplexed by this line from DistributedRowMatrix.times():

    if (numRows != other.numRows()) {
      throw new CardinalityException(numRows, other.numRows());
    }
... which is inconsistent with AbstractMatrix.times() [and, I think,
also incorrect]:

    int[] c = size();
    int[] o = other.size();
    if (c[COL] != o[ROW]) {
      throw new CardinalityException(c[COL], o[ROW]);
    }
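Plugging the shapes from the example into both checks shows the discrepancy numerically (a plain-Java sketch; the helper names are mine, not Mahout's):

```java
// data = [15x39], svd.transpose() = [39x7]: a legal multiplication.
public class CardinalityChecks {
    // The test as written in DistributedRowMatrix.times():
    static boolean drmCheckThrows(int numRows, int otherNumRows) {
        return numRows != otherNumRows;
    }

    // The conventional test from AbstractMatrix.times():
    static boolean abstractCheckThrows(int numCols, int otherNumRows) {
        return numCols != otherNumRows;
    }

    public static void main(String[] args) {
        // DRM check compares 15 to 39, so it throws on a legal multiply.
        System.out.println(drmCheckThrows(15, 39));      // prints true
        // AbstractMatrix check compares 39 to 39, so it proceeds.
        System.out.println(abstractCheckThrows(39, 39)); // prints false
    }
}
```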
If I change numRows to numCols in DRM.times(), it gets past the
cardinality test but blows up with an "org.apache.hadoop.io.LongWritable
cannot be cast to org.apache.hadoop.io.IntWritable" error somewhere in
the bowels of times(). I need to keep debugging to localize that.
So the upshot is I can make it work using DenseMatrix but not with
DistributedRowMatrix.
On 9/2/10 1:45 PM, Jeff Eastman wrote:
(cross-posting to dev)
Hi Jake,
I'm on thin ice here, but a few more words on the math details would
help me sort this out. I've run the DistributedLanczosSolver on the
small testdata set in TestClusterDumper:
    Path output = getTestTempDirPath("output");
    Path tmp = getTestTempDirPath("tmp");
    Configuration config = new Configuration();
    Path eigenvectors = new Path(output, "eigenvectors");
    config.set("mapred.output.dir", eigenvectors.toString());
    DistributedLanczosSolver solver = new DistributedLanczosSolver();
    solver.setConf(config);
    Path testData = getTestTempDirPath("testdata");
    solver.run(testData, tmp, sampleData.size(),
        sampleData.get(0).get().size(), false, 8);
This produces 7 (not 8?) vectors in the eigenvectors file. If I then
build DistributedRowMatrices out of these I get matrices that are
ill-shaped to multiply directly. Clearly a literal translation of your
text is incorrect:
    // now multiply the testdata matrix and the eigenvector matrix
    DistributedRowMatrix svd =
        new DistributedRowMatrix(eigenvectors, tmp, 8, 38);
    DistributedRowMatrix data =
        new DistributedRowMatrix(testData, tmp, 15, 38);
    DistributedRowMatrix sData = data.times(svd);
    // now run the Canopy job to prime kMeans canopies
    CanopyDriver.runJob(svd.getRowPath(), output, measure, 8, 4,
        false, false);
Reading up on eigendecomposition, it looks like (DATA ~= SVD D SVD')
would be more like it. But the solver only outputs the eigenvectors
and ignores the eigenvalues, so I cannot construct D. Can you point
me back towards the right path? It has been so long since my grad
school advanced matrices course.
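For what it's worth, here is the decomposition I mean, checked on a hand-picked symmetric 2x2 matrix (plain Java, purely a sanity check of the algebra, not Mahout code):

```java
// Sanity check that A = V * D * V' for a symmetric matrix, with the
// eigenvectors as the *columns* of V. Example: A = [[2,1],[1,2]] has
// eigenvalues 3 and 1 with eigenvectors (1,1)/sqrt(2) and (1,-1)/sqrt(2).
public class EigenCheck {
    static double[][] times(double[][] a, double[][] b) {
        double[][] c = new double[a.length][b[0].length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < b[0].length; j++)
                for (int k = 0; k < b.length; k++)
                    c[i][j] += a[i][k] * b[k][j];
        return c;
    }

    static double[][] reconstruct() {
        double s = 1.0 / Math.sqrt(2.0);
        double[][] v  = {{s, s}, {s, -s}}; // eigenvectors as columns
        double[][] d  = {{3, 0}, {0, 1}};  // eigenvalues on the diagonal
        double[][] vt = {{s, s}, {s, -s}}; // V' (V happens to equal V' here)
        return times(times(v, d), vt);
    }

    public static void main(String[] args) {
        double[][] a = reconstruct(); // recovers [[2,1],[1,2]] up to rounding
        System.out.println(a[0][0] + " " + a[0][1]);
    }
}
```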
Isn't this related to spectral clustering?
On 9/2/10 10:50 AM, Jake Mannix wrote:
Derek,
The step Jeff's referring to is that the SVD output is a set of
vectors in the "column space" of your original set of rows (your
input matrix). If you want to cluster your original data, projected
onto this new SVD basis, you need to matrix multiply your SVD matrix
by your original data. Depending on how big your data is (number of
rows and columns and rank of the reduction), you can do this in
either one or two map-reduce passes.

If you need more detail, I can spell that out a little more
directly. It should actually be not just explained in words, but
coded into the examples, now that I think of it... need. more.
hours. in. day....
-jake