Just rereading what I wrote, I think the question at the end should be:
Is there an issue given that one eigenvector is consistently generated which doesn't satisfy "Math.abs(1 - entry.getValue().getCosAngle()) < maxError" in EigenVerificationJob.pruneEigens() during the clean step, leaving you with desiredRank-2 eigenvectors?
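To make the check concrete, here is a minimal sketch of the pruning condition as I read it. The Candidate class, the prune method and the sample cosine values are made up for illustration and are not Mahout's actual code. I'm also assuming that cosAngle is the cosine of the angle between a candidate vector and its image under the matrix, and that the 0.5 passed to solver.run() is maxError.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PruneSketch {

    // Hypothetical stand-in for a verified eigenvector's metadata (not Mahout's class).
    public static final class Candidate {
        public final int index;
        public final double cosAngle;

        public Candidate(int index, double cosAngle) {
            this.index = index;
            this.cosAngle = cosAngle;
        }
    }

    // Keep only candidates whose cosAngle is within maxError of 1,
    // i.e. the same comparison quoted from pruneEigens().
    public static List<Candidate> prune(List<Candidate> candidates, double maxError) {
        List<Candidate> kept = new ArrayList<>();
        for (Candidate c : candidates) {
            if (Math.abs(1 - c.cosAngle) < maxError) {
                kept.add(c);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Illustrative values only: the last candidate is far from being an eigenvector.
        List<Candidate> candidates = Arrays.asList(
                new Candidate(0, 0.99),
                new Candidate(1, 0.80),
                new Candidate(2, 0.30));
        List<Candidate> kept = prune(candidates, 0.5);
        System.out.println("kept " + kept.size() + " of " + candidates.size());
    }
}
```

With these made-up values, the third candidate fails the test and is dropped, which mirrors the one vector that never makes it into the prunedEigenMeta list.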
Thanks,
Derek

On 20/09/10 18:41, Derek O'Callaghan wrote:
Hi Jeff, Jake,

I think there may still be an issue with the clean step, even allowing for desiredRank-1 eigenvectors being created. If we run Jeff's TestClusterDumper.testKmeansSVD() as is:

- desiredRank = 15
- desiredRank - 1 (14) raw eigenvectors are created
- After the clean step, the last (14th) clean eigenvector always contains 0s. This means that we're left with 13 eigenvectors containing non-zero values.

If I change the following:

    solver.run(testData, output, tmp, sampleData.size(), sampleDimension, false, desiredRank, 0.5, 0.0, true);
    Path cleanEigenvectors = new Path(output, EigenVerificationJob.CLEAN_EIGENVECTORS);

to:

    solver.run(testData, output, tmp, sampleData.size(), sampleDimension, false, desiredRank);
    Path cleanEigenvectors = new Path(output, DistributedLanczosSolver.RAW_EIGENVECTORS);

with desiredRank = 14, I get 14 eigenvectors with non-zero values. This is what I expect, allowing for the fact that desiredRank - 1 eigenvectors are returned.

It seems that after generating the raw, and then the clean eigenvectors, I consistently get desiredRank-2 eigenvectors with non-zero values, with the desiredRank-1 eigenvector having all zero values after the clean step. E.g. if I change desiredRank to 10, I'll get 8 eigenvectors with non-zero values after running the raw+clean, with eigenvector 9 containing all zeros, whereas I get 9 non-zero eigenvectors if I just run raw on its own. If you inspect 'p' at the "Matrix sData = a.times(p);" line, you'll see this.

I get this result both with Jeff's test data and my own. Stepping through the code, it seems to always find one vector in the for loop of EigenVerificationJob.pruneEigens() for which "Math.abs(1 - entry.getValue().getCosAngle()) < maxError" is false, and so it isn't added to the prunedEigenMeta list.

Is this expected behaviour? Or, is there an issue given that one eigenvector always contains 0s after the clean, leaving you with desiredRank-2 eigenvectors?
Apologies if there's no issue here, and it's just a lack of understanding on my part.

Thanks,
Derek

On 20/09/10 16:21, Jake Mannix wrote:

That last "eigenvector" is, for reasons not entirely clear even to me, *not* an eigenvector, as the output of EigenVerificationJob will show you if you remove that "-1".

The most sensible patch is to take the user's "desiredRank" and add one to it, and leave the code otherwise unchanged.

  -jake

On Mon, Sep 20, 2010 at 7:22 AM, Jeff Eastman wrote:

Hi Derek,

I think this is caused by the fact that the SVD output seems to emit only desiredRank-1 eigenvectors in the rawEigenvectors directory. When that is transposed it would yield a p matrix with zero entries in the last column that you have observed. The code that's doing this is in DistributedLanczosSolver.serializeOutput() and the line responsible is:

    for (int i = 0; i < eigenVectors.numRows() - 1; i++) {

I thought that curious but don't understand Lanczos well enough yet to be too critical. Perhaps you could try removing the -1 and see if it improves your results.

On 9/18/10 9:58 AM, Derek O'Callaghan wrote:

Hi Jeff,

I've been trying out the latest version of the svd code in TestClusterDumper this week (actually I'm using my modified version of it as I mentioned in my original post at the start of the thread, with your latest changes). I suspect there's a problem with the EigenVerificationJob called from the svd solver. Looking at TestClusterDumper.testKmeansSVD(), using:

    solver.run(testData, output, tmp, sampleData.size(), sampleDimension, false, desiredRank, 0.5, 0.0, true);

The generated 'p' matrix (read from the clean eigenvectors file) will always have the value 0 for the (desiredRank - 1) column in each row. E.g., here's the first row:

    [-0.02236546375417089, 0.0051677900486854144, -0.00498439866649932, 0.0018666209551644673, 0.4313115409222268, 7.672659010256923E-4, -2.295620562705387E-4, -0.0012505553313125165, 9.679192928269636E-5, -4.529759471821197E-4, 0.01162786445974299, 2.1573486863433563E-4, -0.0025483366872868546, 0.0]

This then means that the sData matrix will have 0s in this column following multiplication. However, when I change testKmeansSVD() to run the solver without the clean step, and load the raw eigenvectors into 'p', i.e.

    solver.run(testData, output, tmp, sampleData.size(), sampleDimension, false, desiredRank);

'p' now has values other than 0 in the last column, e.g. here's the first row:

    [-0.02236546375417089, 0.0051677900486854144, -0.00498439866649932, 0.0018666209551644673, 0.4313115409222268, 7.672659010256923E-4, -2.295620562705387E-4, -0.0012505553313125165, 9.679192928269636E-5, -4.529759471821197E-4, 0.01162786445974299, 2.1573486863433563E-4, -0.0025483366872868546, -0.04870849090364153]

I'm guessing there's a problem with the clean step here, or is this normal behaviour?

FYI I noticed the problem when running the solver + clean on my own data, and then running the Dirichlet clusterer on the reduced data. I found that after a couple of iterations, things started to go wrong with Dirichlet as the following code in UncommonDistribution.rMultinom() was being called:

    // can't happen except for round-off error so we don't care what we return here
    return 0;

I suspect this might be associated with the fact that the last column in my reduced data matrix is 0, although I haven't confirmed it yet.

Thanks,
Derek
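The effect on sData of an all-zero column in 'p' can be reproduced with plain arrays. This is only a sketch: the times helper and the tiny 2x2 matrices are made up for illustration and stand in for Mahout's Matrix.times(); nothing here is taken from the test data in this thread.

```java
import java.util.Arrays;

public class ZeroColumnSketch {

    // Plain dense matrix product a (n x k) times p (k x m).
    public static double[][] times(double[][] a, double[][] p) {
        int n = a.length;
        int k = p.length;
        int m = p[0].length;
        double[][] out = new double[n][m];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < m; j++) {
                for (int l = 0; l < k; l++) {
                    out[i][j] += a[i][l] * p[l][j];
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Illustrative data: the last column of p is all zeros,
        // as it is when the last clean eigenvector is empty.
        double[][] a = {{1, 2}, {3, 4}};
        double[][] p = {{0.5, 0.0}, {0.25, 0.0}};
        double[][] sData = times(a, p);
        for (double[] row : sData) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Every entry in the last column of the product is a sum of terms multiplied by 0, so that column of the reduced data carries no information regardless of what 'a' contains.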
