Looking at the single unit test of DMR.times(), it seems to be implementing Matrix expected = inputA.transpose().times(inputB), and not inputA.times(inputB.transpose()), so the bounds checking is correct as implemented. But the method still has the wrong name and, AFAICT, is not useful for performing this particular computation. Should I use this instead?
DistributedRowMatrix sData = a.transpose().t[ransposeT]imes(svdT.transpose())
ugh! And it still fails with:
java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable
    at org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:1)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
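Setting the Writable issue aside, the shape bookkeeping behind the suggested a.transpose().t[ransposeT]imes(svdT.transpose()) call can be sketched in plain Java (illustrative only, not the Mahout implementation; TransposeTimesShapes and its helpers are made-up names). If times(B) really has A' * B semantics, then a.transpose().times(svdT.transpose()) works out to (A')' * svdT' = A * svdT', a [15x39] by [39x12] product giving the desired [15x12] projection:

```java
public class TransposeTimesShapes {

    // A' * B: requires rows(a) == rows(b); result is [cols(a) x cols(b)]
    static double[][] transposeTimes(double[][] a, double[][] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                "CardinalityException: " + a.length + " != " + b.length);
        }
        double[][] c = new double[a[0].length][b[0].length];
        for (int k = 0; k < a.length; k++)
            for (int i = 0; i < a[0].length; i++)
                for (int j = 0; j < b[0].length; j++)
                    c[i][j] += a[k][i] * b[k][j];
        return c;
    }

    static double[][] transpose(double[][] m) {
        double[][] t = new double[m[0].length][m.length];
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < m[0].length; j++)
                t[j][i] = m[i][j];
        return t;
    }

    public static void main(String[] args) {
        double[][] a = new double[15][39];    // data matrix A
        double[][] svdT = new double[12][39]; // eigenvectors as rows
        // a.transpose().times(svdT.transpose()), with times == transposeTimes:
        double[][] sData = transposeTimes(transpose(a), transpose(svdT));
        System.out.println(sData.length + "x" + sData[0].length); // 15x12
    }
}
```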
On 9/3/10 7:15 AM, Jeff Eastman wrote:
Ah, ok, that would explain some of the cardinality checking differences. It's kind of confusing to have the times() method doing something different from Matrix in the distributed case. Maybe renaming it to transposeTimes() would be clearer? At least adding some comments for the unwary?
Here's my revised test using DistributedRowMatrix for the computations. It fails in DMR.times() with a CardinalityException. Looking at the source of the exception (again), it is comparing numRows != other.numRows(). Since matrix A is [15x39] and svdT is [12x39], shouldn't A.t[ransposeT]imes(svdT) be comparing the numCols instead? What exactly is DMR.times() doing?
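The cardinality check as DMR.times() apparently implements it can be mimicked in plain Java (a sketch under the assumption that times() has A' * B semantics; CardinalityCheck is a made-up name). With A' * B, the row counts must match, so a [15x39] A against a [12x39] svdT fails with 15 != 12, while the projection the test wants, A * svdT' = transposeTimes(A', svdT'), lines up:

```java
public class CardinalityCheck {

    // A' * B is [aCols x aRows] * [bRows x bCols]: defined only when aRows == bRows
    static void checkTransposeTimes(int aRows, int aCols, int bRows, int bCols) {
        if (aRows != bRows) {
            throw new IllegalArgumentException(
                "CardinalityException: " + aRows + " != " + bRows);
        }
        System.out.println("result is [" + aCols + "x" + bCols + "]");
    }

    public static void main(String[] args) {
        try {
            checkTransposeTimes(15, 39, 12, 39); // A.times(svdT) as in the test
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // CardinalityException: 15 != 12
        }
        // the projection actually needs A * svdT', i.e. transposeTimes(A', svdT'),
        // whose row counts (39 and 39) do match:
        checkTransposeTimes(39, 15, 39, 12); // result is [15x12]
    }
}
```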
@Test
public void testKmeansDSVD() throws Exception {
  DistanceMeasure measure = new EuclideanDistanceMeasure();
  Path output = getTestTempDirPath("output");
  Path tmp = getTestTempDirPath("tmp");
  Path eigenvectors = new Path(output, "eigenvectors");
  int desiredRank = 13;
  DistributedLanczosSolver solver = new DistributedLanczosSolver();
  Configuration config = new Configuration();
  solver.setConf(config);
  Path testData = getTestTempDirPath("testdata");
  int sampleDimension = sampleData.get(0).get().size();
  solver.run(testData, tmp, eigenvectors, sampleData.size(),
      sampleDimension, false, desiredRank);
  // now multiply the testdata matrix and the eigenvector matrix
  DistributedRowMatrix svdT =
      new DistributedRowMatrix(eigenvectors, tmp, desiredRank - 1, sampleDimension);
  JobConf conf = new JobConf(config);
  svdT.configure(conf);
  DistributedRowMatrix a =
      new DistributedRowMatrix(testData, tmp, sampleData.size(), sampleDimension);
  a.configure(conf);
  // DMR.times() is really transposeTimes()? Then this should work.
  DistributedRowMatrix sData = a.times(svdT);
  sData.configure(conf);
  // now run the Canopy job to prime kMeans canopies
  CanopyDriver.runJob(sData.getRowPath(), output, measure, 8, 4, false, false);
  // now run the KMeans job
  KMeansDriver.runJob(sData.getRowPath(), new Path(output, "clusters-0"),
      output, measure, 0.001, 10, 1, true, false);
  // run ClusterDumper
  ClusterDumper clusterDumper = new ClusterDumper(new Path(output, "clusters-2"),
      new Path(output, "clusteredPoints"));
  clusterDumper.printClusters(termDictionary);
}
On 9/2/10 10:10 PM, Ted Dunning wrote:
I think that the solver actually does an SVD, but most of what you say follows.
There is one strangeness, I think, in that DistributedRowMatrix.times is doing a transposeTimes operation, not the normal times.
Jake should comment.
On Thu, Sep 2, 2010 at 8:28 PM, Jeff Eastman <[email protected]> wrote:
On 9/2/10 7:41 PM, Jeff Eastman wrote:
Hopefully answering my own question here but ending up with another. The svd matrix I'd built from the eigenvectors is the wrong shape as I built it. Taking Jake's "column space" literally and building a matrix where each of the columns is one of the eigenvectors does give a matrix of the correct shape. The math works with DenseMatrix, producing a new data matrix which is 15x7; a significant dimensionality reduction from 15x39.
In this example, with 15 samples having 39 terms and 7 eigenvectors:
A = [15x39]
P = [39x7]
A P = [15x7]
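Those shapes can be verified with a throwaway plain-Java multiply (a sketch with zero-filled matrices, not the DenseMatrix code; ProjectionShapes is a made-up name):

```java
public class ProjectionShapes {

    // ordinary matrix product A * P: requires cols(a) == rows(p)
    static double[][] times(double[][] a, double[][] p) {
        if (a[0].length != p.length) {
            throw new IllegalArgumentException(a[0].length + " != " + p.length);
        }
        double[][] c = new double[a.length][p[0].length];
        for (int i = 0; i < a.length; i++)
            for (int k = 0; k < p.length; k++)
                for (int j = 0; j < p[0].length; j++)
                    c[i][j] += a[i][k] * p[k][j];
        return c;
    }

    public static void main(String[] args) {
        double[][] a = new double[15][39]; // 15 samples, 39 terms
        double[][] p = new double[39][7];  // 7 eigenvectors as columns
        double[][] ap = times(a, p);
        System.out.println(ap.length + "x" + ap[0].length); // 15x7
    }
}
```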
<snip>
Representing the eigen decomposition math in the above notation, A P is the projection of the data set onto the eigenvector basis:
If:
A = original data matrix
P = eigenvector column matrix
D = eigenvalue diagonal matrix
Then:
A P = P D => A = P D P'
Since we have A, and P is already calculated by DistributedLanczosSolver, it is easy to compute A P and we don't need the eigenvalues at all. This is good because the DLS does not output them. Is this why it doesn't bother?
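For a square symmetric A with orthonormal eigenvector columns (the case where the A P = P D identity as written applies), the identity can be checked numerically with a tiny 2x2 example (plain Java with hand-picked illustrative values, not Mahout code; EigenIdentity is a made-up name):

```java
public class EigenIdentity {

    // ordinary matrix product
    static double[][] times(double[][] x, double[][] y) {
        double[][] c = new double[x.length][y[0].length];
        for (int i = 0; i < x.length; i++)
            for (int k = 0; k < y.length; k++)
                for (int j = 0; j < y[0].length; j++)
                    c[i][j] += x[i][k] * y[k][j];
        return c;
    }

    public static void main(String[] args) {
        double s = Math.sqrt(0.5);
        double[][] a = {{2, 1}, {1, 2}};  // symmetric matrix, eigenvalues 3 and 1
        double[][] p = {{s, s}, {s, -s}}; // orthonormal eigenvector columns
        double[][] d = {{3, 0}, {0, 1}};  // eigenvalue diagonal matrix
        double[][] ap = times(a, p);
        double[][] pd = times(p, d);
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++)
                if (Math.abs(ap[i][j] - pd[i][j]) > 1e-12)
                    throw new AssertionError("A P != P D at " + i + "," + j);
        System.out.println("A P == P D holds");
    }
}
```

Since A = P D P' follows only by multiplying both sides by P', the projection A P needs just A and P, which is consistent with the observation that the eigenvalues are never required.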