Hey Derek, I'll have to look into it in more detail. So far, I've gone with an "if the eigenvectors pass the EigenVerifier's test, I'm down with them" style of black-box thinking (I mean, who am I to say what they should be like: if they satisfy A.v = a * v to a high degree of accuracy, they're eigenvectors!), but this does look pretty fishy.
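To make that black-box check concrete, here is a minimal plain-Java sketch of the idea: compute the cosine of the angle between v and A.v, and accept v when it is within maxError of 1, which is the same shape of condition quoted below from pruneEigens(). This is not the actual EigenVerifier code; the class name, the method names, and the dense double[][] representation are all assumptions made for illustration.

```java
public class EigenCheckSketch {

    // Multiply a dense matrix by a vector (hypothetical helper, not a Mahout API).
    public static double[] times(double[][] a, double[] v) {
        double[] out = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            double sum = 0.0;
            for (int j = 0; j < v.length; j++) {
                sum += a[i][j] * v[j];
            }
            out[i] = sum;
        }
        return out;
    }

    // Plain dot product of two equal-length vectors.
    public static double dot(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += x[i] * y[i];
        }
        return sum;
    }

    // Cosine of the angle between v and A.v. Returns 0 when either vector
    // has zero norm, so an all-zero "eigenvector" can never pass the check.
    public static double cosAngle(double[][] a, double[] v) {
        double[] av = times(a, v);
        double norms = Math.sqrt(dot(v, v)) * Math.sqrt(dot(av, av));
        return norms == 0.0 ? 0.0 : dot(v, av) / norms;
    }

    // Accept v if A.v is (nearly) parallel to v. Note that a vector with a
    // negative eigenvalue gives a cosine of -1 and is rejected by this test.
    public static boolean passes(double[][] a, double[] v, double maxError) {
        return Math.abs(1 - cosAngle(a, v)) < maxError;
    }

    public static void main(String[] args) {
        double[][] a = {{2.0, 0.0}, {0.0, 3.0}};
        System.out.println(passes(a, new double[] {1.0, 0.0}, 1e-6)); // true eigenvector
        System.out.println(passes(a, new double[] {1.0, 1.0}, 1e-6)); // not an eigenvector
    }
}
```

With a loose maxError, a vector that is only roughly parallel to A.v still gets through, so the threshold matters as much as the test itself.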

-jake

On Mon, Sep 20, 2010 at 10:41 AM, Derek O'Callaghan <[email protected]> wrote:

> Hi Jeff, Jake,
>
> I think there may still be an issue with the clean step, even allowing for
> desiredRank-1 eigenvectors being created. If we run Jeff's
> TestClusterDumper.testKmeansSVD() as is:
>
> - desiredRank = 15
> - desiredRank-1 (14) raw eigenvectors are created
> - After the clean step, the last (14th) clean eigenvector always contains
>   0s. This means that we're left with 13 eigenvectors containing non-zero
>   values.
>
> If I change the following:
>
>     solver.run(testData, output, tmp, sampleData.size(), sampleDimension,
>         false, desiredRank, 0.5, 0.0, true);
>     Path cleanEigenvectors = new Path(output,
>         EigenVerificationJob.CLEAN_EIGENVECTORS);
>
> to:
>
>     solver.run(testData, output, tmp, sampleData.size(), sampleDimension,
>         false, desiredRank);
>     Path cleanEigenvectors = new Path(output,
>         DistributedLanczosSolver.RAW_EIGENVECTORS);
>
> with desiredRank = 14, I get 14 eigenvectors with non-zero values. This is
> what I expect, allowing for the fact that desiredRank-1 eigenvectors are
> returned.
>
> It seems that after generating the raw, and then the clean eigenvectors, I
> consistently get desiredRank-2 eigenvectors with non-zero values, with the
> desiredRank-1 eigenvector having all zero values after the clean step. E.g.
> if I change desiredRank to 10, I'll get 8 eigenvectors with non-zero values
> after running the raw+clean, with eigenvector 9 containing all zeros,
> whereas I get 9 non-zero eigenvectors if I just run raw on its own. If you
> inspect 'p' at the "Matrix sData = a.times(p);" line, you'll see this.
>
> I get this result both with Jeff's test data and my own. Stepping through
> the code, it seems to always find one vector in the for loop of
> EigenVerificationJob.pruneEigens() for which "Math.abs(1 -
> entry.getValue().getCosAngle()) < maxError" is false, and so it isn't added
> to the prunedEigenMeta list.
>
> Is this expected behaviour? Or, is there an issue given that one
> eigenvector always contains 0s after the clean, leaving you with
> desiredRank-2 eigenvectors? Apologies if there's no issue here, and it's
> just a lack of understanding on my part.
>
> Thanks,
>
> Derek
>
> On 20/09/10 16:21, Jake Mannix wrote:
>
>> That last "eigenvector" is, for reasons not entirely clear even to me,
>> *not* an eigenvector, as the output of EigenVerificationJob will show you
>> if you remove that "-1".
>>
>> The most sensible patch is to take the user's "desiredRank" and add one to
>> it, and leave the code otherwise unchanged.
>>
>> -jake
>>
>> On Mon, Sep 20, 2010 at 7:22 AM, Jeff Eastman <[email protected]> wrote:
>>
>>> Hi Derek,
>>>
>>> I think this is caused by the fact that the SVD output seems to emit only
>>> desiredRank-1 eigenvectors in the rawEigenvectors directory. When that is
>>> transposed it would yield a p matrix with zero entries in the last column
>>> that you have observed. The code that's doing this is in
>>> DistributedLanczosSolver.serializeOutput() and the line responsible is:
>>>
>>>     for (int i = 0; i < eigenVectors.numRows() - 1; i++) {
>>>
>>> I thought that curious but don't understand Lanczos well enough yet to be
>>> too critical. Perhaps you could try removing the -1 and see if it
>>> improves your results.
>>>
>>> On 9/18/10 9:58 AM, Derek O'Callaghan wrote:
>>>
>>>> Hi Jeff,
>>>>
>>>> I've been trying out the latest version of the svd code in
>>>> TestClusterDumper this week (actually I'm using my modified version of
>>>> it as I mentioned in my original post at the start of the thread, with
>>>> your latest changes). I suspect there's a problem with the
>>>> EigenVerificationJob called from the svd solver. Looking at
>>>> TestClusterDumper.testKmeansSVD(), using:
>>>>
>>>>     solver.run(testData, output, tmp, sampleData.size(), sampleDimension,
>>>>         false, desiredRank, 0.5, 0.0, true);
>>>>
>>>> The generated 'p' matrix (read from the clean eigenvectors file) will
>>>> always have the value 0 for the (desiredRank - 1) column in each row.
>>>> E.g., here's the first row:
>>>>
>>>> [-0.02236546375417089, 0.0051677900486854144, -0.00498439866649932,
>>>> 0.0018666209551644673, 0.4313115409222268, 7.672659010256923E-4,
>>>> -2.295620562705387E-4, -0.0012505553313125165, 9.679192928269636E-5,
>>>> -4.529759471821197E-4, 0.01162786445974299, 2.1573486863433563E-4,
>>>> -0.0025483366872868546, 0.0]
>>>>
>>>> This then means that the sData matrix will have 0s in this column
>>>> following multiplication. However, when I change testKmeansSVD() to run
>>>> the solver without the clean step, and load the raw eigenvectors into
>>>> 'p', i.e.
>>>>
>>>>     solver.run(testData, output, tmp, sampleData.size(), sampleDimension,
>>>>         false, desiredRank);
>>>>
>>>> 'p' now has values other than 0 in the last column, e.g. here's the
>>>> first row:
>>>>
>>>> [-0.02236546375417089, 0.0051677900486854144, -0.00498439866649932,
>>>> 0.0018666209551644673, 0.4313115409222268, 7.672659010256923E-4,
>>>> -2.295620562705387E-4, -0.0012505553313125165, 9.679192928269636E-5,
>>>> -4.529759471821197E-4, 0.01162786445974299, 2.1573486863433563E-4,
>>>> -0.0025483366872868546, -0.04870849090364153]
>>>>
>>>> I'm guessing there's a problem with the clean step here, or is this
>>>> normal behaviour?
>>>>
>>>> FYI I noticed the problem when running the solver + clean on my own
>>>> data, and then running the Dirichlet clusterer on the reduced data. I
>>>> found that after a couple of iterations, things started to go wrong with
>>>> Dirichlet as the following code in UncommonDistribution.rMultinom() was
>>>> being called:
>>>>
>>>>     // can't happen except for round-off error so we don't care what we
>>>>     // return here
>>>>     return 0;
>>>>
>>>> I suspect this might be associated with the fact that the last column in
>>>> my reduced data matrix is 0, although I haven't confirmed it yet.
>>>>
>>>> Thanks,
>>>>
>>>> Derek
