Re: Using SVD with Canopy/KMeans

Jeff Eastman Mon, 20 Sep 2010 07:22:54 -0700

 Hi Derek,

I think this is caused by the fact that the SVD output seems to emit only desiredRank-1 eigenvectors in the rawEigenvectors directory. When that is transposed, it would yield a p matrix with zero entries in the last column, as you have observed. The code that's doing this is in DistributedLanczosSolver.serializeOutput() and the line responsible is:


    for (int i = 0; i < eigenVectors.numRows() - 1; i++) {

I thought that curious but don't understand Lanczos well enough yet to be too critical. Perhaps you could try removing the -1 and see if it improves your results.
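
To illustrate the effect of that loop bound in isolation, here is a minimal sketch in plain Java. It is not Mahout code: the class LoopBoundSketch, its helper methods, and the matrix values are all made up for illustration, and it assumes the skipped row is simply left at zero. It copies rows with and without the -1 and then checks the last column of the transposed result.

```java
import java.util.Arrays;

// Hypothetical stand-in for the serialize-then-transpose steps; not Mahout's API.
public class LoopBoundSketch {

    // Copies the first `limit` rows of src into a matrix of the same shape.
    // Rows that are not copied keep Java's default value of 0.0.
    static double[][] copyRows(double[][] src, int limit) {
        double[][] out = new double[src.length][src[0].length];
        for (int i = 0; i < limit; i++) {
            out[i] = Arrays.copyOf(src[i], src[i].length);
        }
        return out;
    }

    // Plain dense transpose: rows become columns.
    static double[][] transpose(double[][] m) {
        double[][] t = new double[m[0].length][m.length];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[0].length; j++) {
                t[j][i] = m[i][j];
            }
        }
        return t;
    }

    // True if every entry in the given column is exactly 0.0.
    static boolean columnIsZero(double[][] m, int col) {
        for (double[] row : m) {
            if (row[col] != 0.0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Three made-up "eigenvectors", one per row.
        double[][] eigenVectors = { {0.5, -0.5}, {0.1, 0.2}, {0.3, 0.4} };
        int last = eigenVectors.length - 1;

        // Bound with the -1: the last row is never copied.
        double[][] pShort = transpose(copyRows(eigenVectors, eigenVectors.length - 1));
        // Bound without the -1: every row is copied.
        double[][] pFull = transpose(copyRows(eigenVectors, eigenVectors.length));

        System.out.println("last column zero with -1:    " + columnIsZero(pShort, last));
        System.out.println("last column zero without -1: " + columnIsZero(pFull, last));
    }
}
```

With the -1 in place the last column of the transposed matrix is all zeros; without it the column carries the values of the final row. Whether the real solver behaves the same way once the -1 is removed would still need to be checked against your data.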



On 9/18/10 9:58 AM, Derek O'Callaghan wrote:

Hi Jeff,

I've been trying out the latest version of the svd code in TestClusterDumper 
this week (actually I'm using my modified version of it as I mentioned in my 
original post at the start of the thread, with your latest changes). I suspect 
there's a problem with the EigenVerificationJob called from the svd solver. 
Looking at TestClusterDumper.testKmeansSVD(), using:

solver.run(testData, output, tmp, sampleData.size(), sampleDimension, false, 
desiredRank, 0.5, 0.0, true);

The generated 'p' matrix (read from the clean eigenvectors file) will always 
have the value 0 for the (desiredRank - 1) column in each row. E.g., here's the 
first row:

[-0.02236546375417089, 0.0051677900486854144, -0.00498439866649932, 
0.0018666209551644673, 0.4313115409222268, 7.672659010256923E-4, 
-2.295620562705387E-4, -0.0012505553313125165, 9.679192928269636E-5, 
-4.529759471821197E-4, 0.01162786445974299, 2.1573486863433563E-4, 
-0.0025483366872868546, 0.0]

This then means that the sData matrix will have 0s in this column following 
multiplication. However, when I change testKmeansSVD() to run the solver 
without the clean step, and load the raw eigenvectors into 'p', i.e.
solver.run(testData, output, tmp, sampleData.size(), sampleDimension, false, 
desiredRank);

'p' now has values other than 0 in the last column, e.g. here's the first row:

[-0.02236546375417089, 0.0051677900486854144, -0.00498439866649932, 
0.0018666209551644673, 0.4313115409222268, 7.672659010256923E-4, 
-2.295620562705387E-4, -0.0012505553313125165, 9.679192928269636E-5, 
-4.529759471821197E-4, 0.01162786445974299, 2.1573486863433563E-4, 
-0.0025483366872868546, -0.04870849090364153]

I'm guessing there's a problem with the clean step here, or is this normal 
behaviour?

FYI I noticed the problem when running the solver + clean on my own data, and 
then running the Dirichlet clusterer on the reduced data. I found that after a 
couple of iterations, things started to go wrong with Dirichlet as the 
following code in UncommonDistribution.rMultinom() was being called:

     // can't happen except for round-off error so we don't care what we return here
     return 0;

I suspect this might be associated with the fact that the last column in my 
reduced data matrix is 0, although I haven't confirmed it yet.
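
For context, here is a rough sketch of how a fallthrough branch like that can be reached in a cumulative-sum multinomial sampler. This is not the code of UncommonDistribution.rMultinom(); the class MultinomialFallthroughSketch, the sample() method, and the FALLTHROUGH sentinel are hypothetical, and the link to a zero column in the reduced data is an unconfirmed guess. The sketch only shows that degenerate weights (all zero, or NaN) skip every comparison in the loop.

```java
// Hypothetical sampler for illustration only; not Mahout's implementation.
public class MultinomialFallthroughSketch {

    // Sentinel so a caller can tell that the fallthrough branch was reached.
    static final int FALLTHROUGH = -1;

    // Picks an index in proportion to weights, given a uniform draw u in [0, 1).
    static int sample(double[] weights, double u) {
        double total = 0.0;
        for (double w : weights) {
            total += w;
        }
        double remaining = u * total;
        for (int i = 0; i < weights.length; i++) {
            remaining -= weights[i];
            // Any comparison involving NaN is false, so NaN weights never return here.
            if (remaining < 0.0) {
                return i;
            }
        }
        // Reached when the weights are all zero or contain NaN.
        return FALLTHROUGH;
    }

    public static void main(String[] args) {
        System.out.println(sample(new double[] {0.2, 0.3, 0.5}, 0.4));
        System.out.println(sample(new double[] {0.0, 0.0, 0.0}, 0.4));
        System.out.println(sample(new double[] {Double.NaN, 0.5}, 0.4));
    }
}
```

If something similar is going on in Dirichlet, logging the probability vector passed to rMultinom() on the iteration where things go wrong should show whether it contains zeros or NaN values.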

Thanks,

Derek
