Hi Jake,

That sounds good, I might get a bit more time tomorrow to look at it further myself, I'll let you know if I find anything.

Thanks,
Derek

On 20/09/10 18:48, Jake Mannix wrote:

Hey Derek,

I'll have to look into it in more detail. So far, I've gone with an "if the eigenvectors pass the EigenVerifier's test, I'm down with them" style of black-box thinking (I mean, who am I to say what they should be like: if they satisfy A.v = a * v to a high degree of accuracy, they're an eigenvector!), but this does look pretty fishy.

-jake

On Mon, Sep 20, 2010 at 10:41 AM, Derek O'Callaghan <[email protected]> wrote:

Hi Jeff, Jake,

I think there may still be an issue with the clean step, even allowing for desiredRank-1 eigenvectors being created. If we run Jeff's TestClusterDumper.testKmeansSVD() as is:

- desiredRank = 15
- desiredRank - 1 (14) raw eigenvectors are created
- After the clean step, the last (14th) clean eigenvector always contains 0s. This means that we're left with 13 eigenvectors containing non-zero values.

If I change the following:

    solver.run(testData, output, tmp, sampleData.size(), sampleDimension, false, desiredRank, 0.5, 0.0, true);
    Path cleanEigenvectors = new Path(output, EigenVerificationJob.CLEAN_EIGENVECTORS);

to:

    solver.run(testData, output, tmp, sampleData.size(), sampleDimension, false, desiredRank);
    Path cleanEigenvectors = new Path(output, DistributedLanczosSolver.RAW_EIGENVECTORS);

with desiredRank = 14, I get 14 eigenvectors with non-zero values. This is what I expect, allowing for the fact that desiredRank - 1 eigenvectors are returned.

It seems that after generating the raw, and then the clean eigenvectors, I consistently get desiredRank-2 eigenvectors with non-zero values, with the desiredRank-1 eigenvector having all zero values after the clean step. E.g. if I change desiredRank to 10, I'll get 8 eigenvectors with non-zero values after running the raw+clean, with eigenvector 9 containing all zeros, whereas I get 9 non-zero eigenvectors if I just run raw on its own. If you inspect 'p' at the "Matrix sData = a.times(p);" line, you'll see this. I get this result both with Jeff's test data and my own.

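An aside on Jake's criterion above, that an eigenvector should satisfy A.v = a * v to a high degree of accuracy: one common way to measure this is the cosine of the angle between v and A.v, which is 1 when the two are parallel. The following is a minimal sketch of that idea in plain Java. The class EigenCheckSketch and its methods are hypothetical names, plain arrays stand in for Mahout's matrix and vector types, and none of it is taken from EigenVerificationJob; how Mahout actually computes getCosAngle() is not shown in this thread.

```java
// Illustrative sketch only: hypothetical names, plain arrays instead of Mahout types.
final class EigenCheckSketch {

    /** Returns A * v for a dense row-major matrix. */
    static double[] times(double[][] a, double[] v) {
        double[] out = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            double sum = 0.0;
            for (int j = 0; j < v.length; j++) {
                sum += a[i][j] * v[j];
            }
            out[i] = sum;
        }
        return out;
    }

    /** Dot product of two vectors of equal length. */
    static double dot(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += x[i] * y[i];
        }
        return sum;
    }

    /** Cosine of the angle between v and A*v; returns 0 if either has zero norm. */
    static double cosAngle(double[][] a, double[] v) {
        double[] av = times(a, v);
        double denom = Math.sqrt(dot(v, v) * dot(av, av));
        return denom == 0.0 ? 0.0 : dot(v, av) / denom;
    }

    /** Accepts v as an approximate eigenvector when the cosine is within maxError of 1. */
    static boolean passes(double[][] a, double[] v, double maxError) {
        return Math.abs(1 - cosAngle(a, v)) < maxError;
    }
}
```

In this sketch an all-zero vector gets a cosine of 0 and so can never pass, and a vector whose eigenvalue is negative gets a cosine of -1 and also fails. Whether either case is relevant to the behaviour reported here is not established in the thread.
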
Stepping through the code, it seems to always find one vector in the for loop of EigenVerificationJob.pruneEigens() for which "Math.abs(1 - entry.getValue().getCosAngle()) < maxError" is false, and so it isn't added to the prunedEigenMeta list.

Is this expected behaviour? Or is there an issue, given that one eigenvector always contains 0s after the clean, leaving you with desiredRank-2 eigenvectors? Apologies if there's no issue here, and it's just a lack of understanding on my part.

Thanks,
Derek

On 20/09/10 16:21, Jake Mannix wrote:

That last "eigenvector" is, for reasons not entirely clear even to me, *not* an eigenvector, as the output of EigenVerificationJob will show you if you remove that "-1". The most sensible patch is to take the user's "desiredRank" and add one to it, and leave the code otherwise unchanged.

-jake

On Mon, Sep 20, 2010 at 7:22 AM, Jeff Eastman <[email protected]> wrote:

Hi Derek,

I think this is caused by the fact that the SVD output seems to emit only desiredRank-1 eigenvectors in the rawEigenvectors directory. When that is transposed it would yield a p matrix with zero entries in the last column, as you have observed. The code that's doing this is in DistributedLanczosSolver.serializeOutput() and the line responsible is:

    for (int i = 0; i < eigenVectors.numRows() - 1; i++) {

I thought that curious but don't understand Lanczos well enough yet to be too critical. Perhaps you could try removing the -1 and see if it improves your results.

On 9/18/10 9:58 AM, Derek O'Callaghan wrote:

Hi Jeff,

I've been trying out the latest version of the svd code in TestClusterDumper this week (actually I'm using my modified version of it, as I mentioned in my original post at the start of the thread, with your latest changes). I suspect there's a problem with the EigenVerificationJob called from the svd solver.

Looking at TestClusterDumper.testKmeansSVD(), using:

    solver.run(testData, output, tmp, sampleData.size(), sampleDimension, false, desiredRank, 0.5, 0.0, true);

The generated 'p' matrix (read from the clean eigenvectors file) will always have the value 0 for the (desiredRank - 1) column in each row. E.g., here's the first row:

    [-0.02236546375417089, 0.0051677900486854144, -0.00498439866649932, 0.0018666209551644673, 0.4313115409222268, 7.672659010256923E-4, -2.295620562705387E-4, -0.0012505553313125165, 9.679192928269636E-5, -4.529759471821197E-4, 0.01162786445974299, 2.1573486863433563E-4, -0.0025483366872868546, 0.0]

This then means that the sData matrix will have 0s in this column following multiplication.

However, when I change testKmeansSVD() to run the solver without the clean step, and load the raw eigenvectors into 'p', i.e.:

    solver.run(testData, output, tmp, sampleData.size(), sampleDimension, false, desiredRank);

'p' now has values other than 0 in the last column, e.g. here's the first row:

    [-0.02236546375417089, 0.0051677900486854144, -0.00498439866649932, 0.0018666209551644673, 0.4313115409222268, 7.672659010256923E-4, -2.295620562705387E-4, -0.0012505553313125165, 9.679192928269636E-5, -4.529759471821197E-4, 0.01162786445974299, 2.1573486863433563E-4, -0.0025483366872868546, -0.04870849090364153]

I'm guessing there's a problem with the clean step here, or is this normal behaviour?

FYI I noticed the problem when running the solver + clean on my own data, and then running the Dirichlet clusterer on the reduced data. I found that after a couple of iterations, things started to go wrong with Dirichlet, as the following code in UncommonDistribution.rMultinom() was being called:

    // can't happen except for round-off error so we don't care what we return here
    return 0;

I suspect this might be associated with the fact that the last column in my reduced data matrix is 0, although I haven't confirmed it yet.

Thanks,
Derek
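
A quick way to confirm the symptom described above is to scan the loaded 'p' matrix for columns that are entirely zero before multiplying. A minimal sketch, assuming 'p' has first been copied into a double[][] (the ZeroColumnCheck class and its zeroColumns method are hypothetical helpers, not part of Mahout):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper only: hypothetical names, operates on a plain 2D array.
final class ZeroColumnCheck {

    /** Returns the indices of columns whose entries are all within tol of zero. */
    static List<Integer> zeroColumns(double[][] m, double tol) {
        List<Integer> result = new ArrayList<>();
        if (m.length == 0) {
            return result;
        }
        int cols = m[0].length;
        for (int c = 0; c < cols; c++) {
            boolean allZero = true;
            for (double[] row : m) {
                if (Math.abs(row[c]) > tol) {
                    allZero = false;
                    break;
                }
            }
            if (allZero) {
                result.add(c);
            }
        }
        return result;
    }
}
```

Running this on 'p' after the raw step and again after the clean step would show directly whether the clean step is what introduces the zero column.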
