Re: Using SVD with Canopy/KMeans

Derek O'Callaghan Mon, 20 Sep 2010 10:56:24 -0700

Hi Jake,

That sounds good. I might get a bit more time tomorrow to look at it further myself; I'll let you know if I find anything.


Thanks,

Derek

On 20/09/10 18:48, Jake Mannix wrote:

Hey Derek,

   I'll have to look into it in more detail.  So far, I've gone with an "if the
eigenvectors pass the EigenVerifier's test, I'm down with them" style of
black-box thinking (I mean, who am I to say what they should be like: if
they satisfy A.v = a * v to a high degree of accuracy, they're
eigenvectors!), but this does look pretty fishy.

   -jake

On Mon, Sep 20, 2010 at 10:41 AM, Derek O'Callaghan wrote:

Hi Jeff, Jake,

I think there may still be an issue with the clean step, even allowing for
desiredRank-1 eigenvectors being created. If we run Jeff's
TestClusterDumper.testKmeansSVD() as is:

- desiredRank = 15
- desiredRank - 1 (14) raw eigenvectors are created
- After the clean step, the last (14th) clean eigenvector always contains
0s. This means that we're left with 13 eigenvectors containing non-zero
values.

If I change the following:


solver.run(testData, output, tmp, sampleData.size(), sampleDimension,
false, desiredRank, 0.5, 0.0, true);
Path cleanEigenvectors = new Path(output,
EigenVerificationJob.CLEAN_EIGENVECTORS);

to:

solver.run(testData, output, tmp, sampleData.size(), sampleDimension,
false, desiredRank);
Path cleanEigenvectors = new Path(output,
DistributedLanczosSolver.RAW_EIGENVECTORS);

with desiredRank = 14, I get 14 eigenvectors with non-zero values. This is
what I expect, allowing for the fact that desiredRank - 1 eigenvectors are
returned.

It seems that after generating the raw, and then the clean eigenvectors, I
consistently get desiredRank-2 eigenvectors with non-zero values, with the
desiredRank-1 eigenvector having all zero values after the clean step. E.g.
if I change desiredRank to 10, I'll get 8 eigenvectors with non-zero values
after running the raw+clean, with eigenvector 9 containing all zeros,
whereas I get 9 non-zero eigenvectors if I just run raw on its own. If you
inspect 'p' at the "Matrix sData = a.times(p);" line, you'll see this.

I get this result both with Jeff's test data and my own. Stepping through
the code, it seems to always find one vector in the for loop of
EigenVerificationJob.pruneEigens() for which "Math.abs(1 -
entry.getValue().getCosAngle()) < maxError" is false, and so it isn't added
to the prunedEigenMeta list.
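
The pruning test described above can be sketched in plain Java, without any Mahout classes. This is only an illustration: the class and method names (PruneSketch, cosAngle, prune) are hypothetical, and it assumes the cosine is taken between A*v and v, which may not match exactly how EigenVerificationJob computes getCosAngle().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; this is not Mahout's EigenVerificationJob.
class PruneSketch {

  // Multiply a dense matrix by a vector.
  static double[] times(double[][] a, double[] v) {
    double[] out = new double[a.length];
    for (int i = 0; i < a.length; i++) {
      double sum = 0.0;
      for (int j = 0; j < v.length; j++) {
        sum += a[i][j] * v[j];
      }
      out[i] = sum;
    }
    return out;
  }

  static double dot(double[] x, double[] y) {
    double sum = 0.0;
    for (int i = 0; i < x.length; i++) {
      sum += x[i] * y[i];
    }
    return sum;
  }

  // Cosine of the angle between A*v and v (assumed definition).
  // It is 1.0 when v is an eigenvector with a positive eigenvalue.
  static double cosAngle(double[][] a, double[] v) {
    double[] av = times(a, v);
    double denom = Math.sqrt(dot(av, av)) * Math.sqrt(dot(v, v));
    return denom == 0.0 ? 0.0 : dot(av, v) / denom;
  }

  // Indices of the candidates that satisfy Math.abs(1 - cosAngle) < maxError.
  static List<Integer> prune(double[][] a, double[][] candidates, double maxError) {
    List<Integer> kept = new ArrayList<>();
    for (int i = 0; i < candidates.length; i++) {
      if (Math.abs(1 - cosAngle(a, candidates[i])) < maxError) {
        kept.add(i);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    double[][] a = {{2.0, 0.0}, {0.0, 3.0}};
    // Rows 0 and 2 are eigenvectors of a; row 1 is not.
    double[][] candidates = {{1.0, 0.0}, {1.0, 1.0}, {0.0, 1.0}};
    System.out.println(prune(a, candidates, 0.01));
  }
}
```

With this toy matrix, the middle candidate has a cosine of roughly 0.98 and is dropped at maxError = 0.01, while it would survive a looser threshold such as 0.05. A candidate that fails the test is simply left out of the kept list, which is the same kind of omission as the one seen in prunedEigenMeta.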

Is this expected behaviour? Or, is there an issue given that one
eigenvector always contains 0s after the clean, leaving you with
desiredRank-2 eigenvectors? Apologies if there's no issue here, and it's
just a lack of understanding on my part.

Thanks,

Derek


On 20/09/10 16:21, Jake Mannix wrote:

That last "eigenvector" is, for reasons not entirely clear even to me, *not*
an eigenvector, as the output of EigenVerificationJob will show you if you
remove that "-1".

The most sensible patch is to take the user's "desiredRank" and add one to
it, and leave the code otherwise unchanged.
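
The arithmetic behind that patch can be sketched as follows. The helper names (RankSketch, emittedCount, paddedRank) are hypothetical and not part of Mahout, and the sketch assumes the solver's eigenvector matrix has one row per requested rank, which is consistent with the 15-requested, 14-written counts reported in this thread but is not confirmed here.

```java
// Hypothetical illustration only; these helpers are not part of Mahout.
class RankSketch {

  // Counts the rows written by a loop with the bound quoted in this thread:
  //   for (int i = 0; i < eigenVectors.numRows() - 1; i++)
  static int emittedCount(int numRows) {
    int count = 0;
    for (int i = 0; i < numRows - 1; i++) {
      count++;
    }
    return count;
  }

  // The proposed patch: request one more than the user asked for.
  static int paddedRank(int desiredRank) {
    return desiredRank + 1;
  }

  public static void main(String[] args) {
    int desiredRank = 15;
    // Assumption: the eigenvector matrix has one row per requested rank.
    System.out.println("unpatched: " + emittedCount(desiredRank));
    System.out.println("patched:   " + emittedCount(paddedRank(desiredRank)));
  }
}
```

Under that assumption, the unpatched call writes 14 vectors for desiredRank = 15, and the padded call writes 15, so the user gets the number asked for while the suspect last row is still skipped.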

   -jake

On Mon, Sep 20, 2010 at 7:22 AM, Jeff Eastman wrote:

Hi Derek,

I think this is caused by the fact that the SVD output seems to emit only
desiredRank-1 eigenvectors in the rawEigenvectors directory. When that is
transposed it would yield a p matrix with zero entries in the last column
that you have observed. The code that's doing this is in
DistributedLanczosSolver.serializeOutput() and the line responsible is:

    for (int i = 0; i < eigenVectors.numRows() - 1; i++) {

I thought that curious but don't understand Lanczos well enough yet to be
too critical. Perhaps you could try removing the -1 and see if it improves
your results.



On 9/18/10 9:58 AM, Derek O'Callaghan wrote:

Hi Jeff,

I've been trying out the latest version of the svd code in
TestClusterDumper this week (actually I'm using my modified version of it, as
I mentioned in my original post at the start of the thread, with your latest
changes). I suspect there's a problem with the EigenVerificationJob called
from the svd solver. Looking at TestClusterDumper.testKmeansSVD(), using:

solver.run(testData, output, tmp, sampleData.size(), sampleDimension,
false, desiredRank, 0.5, 0.0, true);

The generated 'p' matrix (read from the clean eigenvectors file) will
always have the value 0 for the (desiredRank - 1) column in each row. E.g.,
here's the first row:

[-0.02236546375417089, 0.0051677900486854144, -0.00498439866649932,
0.0018666209551644673, 0.4313115409222268, 7.672659010256923E-4,
-2.295620562705387E-4, -0.0012505553313125165, 9.679192928269636E-5,
-4.529759471821197E-4, 0.01162786445974299, 2.1573486863433563E-4,
-0.0025483366872868546, 0.0]

This then means that the sData matrix will have 0s in this column
following multiplication. However, when I change testKmeansSVD() to run the
solver without the clean step, and load the raw eigenvectors into 'p', i.e.:
solver.run(testData, output, tmp, sampleData.size(), sampleDimension,
false, desiredRank);

'p' now has values other than 0 in the last column, e.g. here's the first
row:

[-0.02236546375417089, 0.0051677900486854144, -0.00498439866649932,
0.0018666209551644673, 0.4313115409222268, 7.672659010256923E-4,
-2.295620562705387E-4, -0.0012505553313125165, 9.679192928269636E-5,
-4.529759471821197E-4, 0.01162786445974299, 2.1573486863433563E-4,
-0.0025483366872868546, -0.04870849090364153]

I'm guessing there's a problem with the clean step here, or is this normal
behaviour?

FYI I noticed the problem when running the solver + clean on my own data,
and then running the Dirichlet clusterer on the reduced data. I found that
after a couple of iterations, things started to go wrong with Dirichlet as
the following code in UncommonDistribution.rMultinom() was being called:

     // can't happen except for round-off error so we don't care what we
     // return here
     return 0;

I suspect this might be associated with the fact that the last column in
my reduced data matrix is 0, although I haven't confirmed it yet.
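
One way to confirm the zero column before clustering is a small check like the sketch below. It uses plain double arrays rather than Mahout's Matrix type, and the names (ZeroColumnSketch, times, zeroColumns) are hypothetical; it only illustrates that an all-zero column in 'p' carries through a.times(p) into the reduced data.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper using plain arrays; not part of Mahout.
class ZeroColumnSketch {

  // Dense matrix product a (n x k) times p (k x m).
  static double[][] times(double[][] a, double[][] p) {
    int n = a.length;
    int k = p.length;
    int m = p[0].length;
    double[][] out = new double[n][m];
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < m; j++) {
        double sum = 0.0;
        for (int x = 0; x < k; x++) {
          sum += a[i][x] * p[x][j];
        }
        out[i][j] = sum;
      }
    }
    return out;
  }

  // Indices of the columns whose entries are all exactly 0.0.
  static List<Integer> zeroColumns(double[][] m) {
    List<Integer> zeros = new ArrayList<>();
    for (int j = 0; j < m[0].length; j++) {
      boolean allZero = true;
      for (double[] row : m) {
        if (row[j] != 0.0) {
          allZero = false;
          break;
        }
      }
      if (allZero) {
        zeros.add(j);
      }
    }
    return zeros;
  }

  public static void main(String[] args) {
    double[][] a = {{1.0, 2.0, 3.0}, {4.0, 5.0, 6.0}};
    // The last column of p is all zeros, as with the clean eigenvectors.
    double[][] p = {{0.5, 0.0}, {-1.0, 0.0}, {2.0, 0.0}};
    System.out.println(zeroColumns(times(a, p)));
  }
}
```

Running the same kind of check on the real sData before handing it to the clusterer would show whether the zero column is present in the input, independently of what Dirichlet does with it afterwards.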

Thanks,

Derek
