Looking at the single unit test of DMR.times(), it seems to be implementing Matrix expected = inputA.transpose().times(inputB), and not inputA.times(inputB.transpose()), so the bounds checking is correct as implemented. But the method still has the wrong name and, AFAICT, is not useful for performing this particular computation. Should I use this instead?

DistributedRowMatrix sData = a.transpose().t[ransposeT]imes(svdT.transpose())

ugh! And it still fails with:

java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable
    at org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:1)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
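
For what it's worth, here is a plain dense sketch (my own code, NOT the Mahout implementation) of the shape algebra of a times() that is really A' * B. It shows why the direct call on A=[15x39] against svdT=[12x39] trips the numRows check, and why the double-transpose workaround yields the desired [15x12]:

```java
// Dense sketch only -- not Mahout code, just the shape algebra of A' * B.
public class TransposeTimesSketch {

    // Computes A' * B. Requires rows(A) == rows(B), which is exactly the
    // numRows != other.numRows() comparison that throws CardinalityException.
    static double[][] transposeTimes(double[][] a, double[][] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                "numRows mismatch: " + a.length + " vs " + b.length);
        }
        double[][] out = new double[a[0].length][b[0].length];
        for (int k = 0; k < a.length; k++) {
            for (int i = 0; i < out.length; i++) {
                for (int j = 0; j < out[0].length; j++) {
                    out[i][j] += a[k][i] * b[k][j];
                }
            }
        }
        return out;
    }

    static double[][] transpose(double[][] a) {
        double[][] t = new double[a[0].length][a.length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[0].length; j++) {
                t[j][i] = a[i][j];
            }
        }
        return t;
    }

    public static void main(String[] args) {
        double[][] a = new double[15][39];    // data matrix A
        double[][] svdT = new double[12][39]; // eigenvector rows from the solver

        // Direct call: 15 rows vs 12 rows -> rejected, as in the failing test.
        try {
            transposeTimes(a, svdT);
            throw new AssertionError("expected a cardinality failure");
        } catch (IllegalArgumentException expected) {
        }

        // Double-transpose workaround: (A')' * (svdT') = A * svdT' = [15x12].
        double[][] sData = transposeTimes(transpose(a), transpose(svdT));
        if (sData.length != 15 || sData[0].length != 12) {
            throw new AssertionError("unexpected shape");
        }
    }
}
```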

On 9/3/10 7:15 AM, Jeff Eastman wrote:
Ah, ok, that would explain some of the cardinality checking differences. It's kind of confusing to have the times() method doing something different from Matrix in the distributed case. Maybe renaming it to transposeTimes would be clearer? At least adding some comments for the unwary?

Here's my revised test using DistributedRowMatrix for the computations. It fails in DMR.times() with a CardinalityException. Looking at the source of the exception (again), it is comparing numRows != other.numRows(). Since matrix A is [15x39] and svdT is [12x39], shouldn't A.t[ransposeT]imes(svdT) be comparing the numCols instead? What exactly is DMR.times() doing?

  @Test
  public void testKmeansDSVD() throws Exception {
    DistanceMeasure measure = new EuclideanDistanceMeasure();
    Path output = getTestTempDirPath("output");
    Path tmp = getTestTempDirPath("tmp");
    Path eigenvectors = new Path(output, "eigenvectors");
    int desiredRank = 13;
    DistributedLanczosSolver solver = new DistributedLanczosSolver();
    Configuration config = new Configuration();
    solver.setConf(config);
    Path testData = getTestTempDirPath("testdata");
    int sampleDimension = sampleData.get(0).get().size();
    solver.run(testData, tmp, eigenvectors, sampleData.size(), sampleDimension, false, desiredRank);

    // now multiply the testdata matrix and the eigenvector matrix
    DistributedRowMatrix svdT = new DistributedRowMatrix(eigenvectors, tmp, desiredRank - 1, sampleDimension);
    JobConf conf = new JobConf(config);
    svdT.configure(conf);
    DistributedRowMatrix a = new DistributedRowMatrix(testData, tmp, sampleData.size(), sampleDimension);
    a.configure(conf);
    // DMR.times() is really transposeTimes()? Then this should work.
    DistributedRowMatrix sData = a.times(svdT);
    sData.configure(conf);

    // now run the Canopy job to prime kMeans canopies
    CanopyDriver.runJob(sData.getRowPath(), output, measure, 8, 4, false, false);
    // now run the KMeans job
    KMeansDriver.runJob(sData.getRowPath(), new Path(output, "clusters-0"), output, measure, 0.001, 10, 1, true, false);
    // run ClusterDumper
    ClusterDumper clusterDumper = new ClusterDumper(new Path(output, "clusters-2"), new Path(output, "clusteredPoints"));
    clusterDumper.printClusters(termDictionary);
  }



On 9/2/10 10:10 PM, Ted Dunning wrote:
I think that the solver actually does an SVD, but most of what you say
follows.

There is one strangeness, I think, in that the DistributedRowMatrix.times is
doing a transposeTimes operation, not the normal times.

Jake should comment.

On Thu, Sep 2, 2010 at 8:28 PM, Jeff Eastman <[email protected]> wrote:

  On 9/2/10 7:41 PM, Jeff Eastman wrote:

Hopefully answering my own question here but ending up with another. The svd matrix I'd built from the eigenvectors is the wrong shape as I built it. Taking Jake's "column space" literally and building a matrix where each of the columns is one of the eigenvectors does give a matrix of the correct shape. The math works with DenseMatrix, producing a new data matrix which is
15x7; a significant dimensionality reduction from 15x39.

In this example, with 15 samples having 39 terms and 7 eigenvectors:
    A = [15x39]
    P = [39x7]
    A P = [15x7]
<snip>
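
The shape arithmetic quoted above can be sanity-checked with a trivial dense multiply (plain Java of my own, not Mahout code):

```java
// Plain dense multiply to check the projection shapes quoted above:
// A = [15x39], P = [39x7], A P = [15x7]. Sketch only, not Mahout code.
public class ProjectionShapeCheck {

    // Conventional product: requires numCols(A) == numRows(P).
    static double[][] times(double[][] a, double[][] p) {
        if (a[0].length != p.length) {
            throw new IllegalArgumentException("numCols(A) must equal numRows(P)");
        }
        double[][] out = new double[a.length][p[0].length];
        for (int i = 0; i < a.length; i++) {
            for (int k = 0; k < p.length; k++) {
                for (int j = 0; j < p[0].length; j++) {
                    out[i][j] += a[i][k] * p[k][j];
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] a = new double[15][39]; // 15 samples with 39 terms
        double[][] p = new double[39][7];  // 7 eigenvectors as columns
        double[][] ap = times(a, p);       // the reduced data set
        if (ap.length != 15 || ap[0].length != 7) {
            throw new AssertionError("expected [15x7]");
        }
    }
}
```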

Representing the eigen decomposition math in the above notation, A P is the
projection of the data set onto the eigenvector basis:

If:
A = original data matrix
P = eigenvector column matrix
D = eigenvalue diagonal matrix

Then:
A P = P D =>  A = P D P'

Since we have A, and P is already calculated by DistributedLanczosSolver, it is easy to compute A P and we don't need the eigenvalues at all. This is good because the DLS does not output them. Is this why it doesn't bother?
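
To make the identity concrete, here is a tiny dense check (the 2x2 example matrix is mine, not from the thread): for a symmetric A with unit eigenvector columns P and eigenvalue diagonal D, A P equals P D, so the projection A P can be formed from A and P alone, without the eigenvalues:

```java
// Tiny numeric check of A P = P D for a 2x2 symmetric A.
// Example matrix chosen for illustration; not from the thread.
public class EigenIdentityCheck {

    static double[][] times(double[][] x, double[][] y) {
        double[][] out = new double[x.length][y[0].length];
        for (int i = 0; i < x.length; i++) {
            for (int k = 0; k < y.length; k++) {
                for (int j = 0; j < y[0].length; j++) {
                    out[i][j] += x[i][k] * y[k][j];
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double s = Math.sqrt(0.5);
        double[][] a = {{2, 1}, {1, 2}};  // symmetric matrix A
        double[][] p = {{s, s}, {s, -s}}; // unit eigenvector columns
        double[][] d = {{3, 0}, {0, 1}};  // matching eigenvalue diagonal

        double[][] ap = times(a, p);      // needs only A and P
        double[][] pd = times(p, d);      // needs the eigenvalues
        for (int i = 0; i < 2; i++) {
            for (int j = 0; j < 2; j++) {
                if (Math.abs(ap[i][j] - pd[i][j]) > 1e-12) {
                    throw new AssertionError("A P != P D at " + i + "," + j);
                }
            }
        }
    }
}
```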



