Re: Using SVD with Canopy/KMeans

Jake Mannix Mon, 20 Sep 2010 10:48:59 -0700

Hey Derek,

  I'll have to look into it in more detail.  So far, I've gone with the "if
the eigenvectors pass the EigenVerifier's test, I'm down with them" style of
black-box thinking (I mean, who am I to say what they should be like: if
they satisfy A.v = a * v to a high degree of accuracy, they're
eigenvectors!), but this does look pretty fishy.
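
To make that check concrete, here is a minimal sketch in plain Java rather
than the actual Mahout code. The class and method names (EigenCheckSketch,
cosAngle, passes, prune) are hypothetical, and treating a zero vector as a
failure is an assumption of the sketch. The threshold test mirrors the
"Math.abs(1 - cosAngle) < maxError" condition quoted from pruneEigens()
further down the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; not the Mahout EigenVerificationJob. */
final class EigenCheckSketch {

  /** Returns A * v for a dense row-major matrix. */
  static double[] times(double[][] a, double[] v) {
    double[] result = new double[a.length];
    for (int i = 0; i < a.length; i++) {
      double sum = 0.0;
      for (int j = 0; j < v.length; j++) {
        sum += a[i][j] * v[j];
      }
      result[i] = sum;
    }
    return result;
  }

  /** Plain dot product of two equal-length vectors. */
  static double dot(double[] x, double[] y) {
    double sum = 0.0;
    for (int i = 0; i < x.length; i++) {
      sum += x[i] * y[i];
    }
    return sum;
  }

  /**
   * Cosine of the angle between A*v and v. It is 1 when A*v is a positive
   * multiple of v. Assumption: a zero v (or zero A*v) yields 0.0 so that it
   * fails the check instead of producing NaN.
   */
  static double cosAngle(double[][] a, double[] v) {
    double[] av = times(a, v);
    double denom = Math.sqrt(dot(av, av)) * Math.sqrt(dot(v, v));
    return denom == 0.0 ? 0.0 : dot(av, v) / denom;
  }

  /** Rayleigh quotient (v . A v) / (v . v): the "a" in A.v = a * v. */
  static double eigenvalueEstimate(double[][] a, double[] v) {
    return dot(times(a, v), v) / dot(v, v);
  }

  /** Same shape as the condition quoted from pruneEigens(). */
  static boolean passes(double[][] a, double[] v, double maxError) {
    return Math.abs(1 - cosAngle(a, v)) < maxError;
  }

  /** Keeps only the candidates that pass the cosine test. */
  static List<double[]> prune(double[][] a, List<double[]> candidates, double maxError) {
    List<double[]> kept = new ArrayList<>();
    for (double[] v : candidates) {
      if (passes(a, v, maxError)) {
        kept.add(v);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    double[][] a = {{2.0, 1.0}, {1.0, 2.0}};
    List<double[]> candidates = Arrays.asList(
        new double[] {1.0, 1.0},   // eigenvector of a
        new double[] {1.0, -1.0},  // eigenvector of a
        new double[] {1.0, 0.0});  // not an eigenvector of a
    List<double[]> kept = prune(a, candidates, 0.05);
    System.out.println("kept " + kept.size() + " of " + candidates.size());
  }
}
```

Note that a test of this shape only accepts vectors where A*v points the
same way as v, so in this sketch a vector with a negative eigenvalue would
be rejected as well.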


  -jake

On Mon, Sep 20, 2010 at 10:41 AM, Derek O'Callaghan <[email protected]
> wrote:

> Hi Jeff, Jake,
>
> I think there may still be an issue with the clean step, even allowing for
> desiredRank-1 eigenvectors being created. If we run Jeff's
> TestClusterDumper.testKmeansSVD() as is:
>
> - desiredRank = 15
> - desiredRank -1 (14) raw eigenvectors are created
> - After the clean step, the last (14th) clean eigenvector always contains
> 0s. This means that we're left with 13 eigenvectors containing non-zero
> values.
>
> If I change the following:
>
>
> solver.run(testData, output, tmp, sampleData.size(), sampleDimension,
> false, desiredRank, 0.5, 0.0, true);
> Path cleanEigenvectors = new Path(output,
> EigenVerificationJob.CLEAN_EIGENVECTORS);
>
> to:
>
> solver.run(testData, output, tmp, sampleData.size(), sampleDimension,
> false, desiredRank);
> Path cleanEigenvectors = new Path(output,
> DistributedLanczosSolver.RAW_EIGENVECTORS);
>
> with desiredRank = 14, I get 14 eigenvectors with non-zero values. This is
> what I expect, allowing for the fact that desiredRank - 1 eigenvectors are
> returned.
>
> It seems that after generating the raw, and then the clean eigenvectors, I
> consistently get desiredRank-2 eigenvectors with non-zero values, with the
> desiredRank-1 eigenvector having all zero values after the clean step. E.g.
> if I change desiredRank to 10, I'll get 8 eigenvectors with non-zero values
> after running the raw+clean, with eigenvector 9 containing all zeros,
> whereas I get 9 non-zero eigenvectors if I just run raw on its own. If you
> inspect 'p' at the "Matrix sData = a.times(p);" line, you'll see this.
>
> I get this result both with Jeff's test data and my own. Stepping through
> the code, it seems to always find one vector in the for loop of
> EigenVerificationJob.pruneEigens() for which "Math.abs(1 -
> entry.getValue().getCosAngle()) < maxError" is false, and so it isn't added
> to the prunedEigenMeta list.
>
> Is this expected behaviour? Or, is there an issue given that one
> eigenvector always contains 0s after the clean, leaving you with
> desiredRank-2 eigenvectors? Apologies if there's no issue here, and it's
> just a lack of understanding on my part.
>
> Thanks,
>
> Derek
>
>
> On 20/09/10 16:21, Jake Mannix wrote:
>
>> That last "eigenvector" is, for reasons not entirely clear even to me,
>> *not* an eigenvector, as the output of EigenVerificationJob will show you
>> if you remove that "-1".
>>
>> The most sensible patch is to take the user's "desiredRank" and add one to
>> it, and leave the code otherwise unchanged.
>>
>>   -jake
>>
>> On Mon, Sep 20, 2010 at 7:22 AM, Jeff Eastman<[email protected]
>> >wrote:
>>
>>
>>
>>>  Hi Derek,
>>>
>>> I think this is caused by the fact that the SVD output seems to emit only
>>> desiredRank-1 eigenvectors in the rawEigenvectors directory. When that is
>>> transposed it would yield a p matrix with zero entries in the last column
>>> that you have observed. The code that's doing this is in
>>> DistributedLanczosSolver.serializeOutput() and the line responsible is:
>>>
>>>    for (int i = 0; i < eigenVectors.numRows() - 1; i++) {
>>>
>>> I thought that curious but don't understand Lanczos well enough yet to be
>>> too critical. Perhaps you could try removing the -1 and see if it
>>> improves your results.
>>>
>>>
>>>
>>> On 9/18/10 9:58 AM, Derek O'Callaghan wrote:
>>>
>>>
>>>
>>>> Hi Jeff,
>>>>
>>>> I've been trying out the latest version of the svd code in
>>>> TestClusterDumper this week (actually I'm using my modified version of
>>>> it as I mentioned in my original post at the start of the thread, with
>>>> your latest changes). I suspect there's a problem with the
>>>> EigenVerificationJob called from the svd solver. Looking at
>>>> TestClusterDumper.testKmeansSVD(), using:
>>>>
>>>> solver.run(testData, output, tmp, sampleData.size(), sampleDimension,
>>>> false, desiredRank, 0.5, 0.0, true);
>>>>
>>>> The generated 'p' matrix (read from the clean eigenvectors file) will
>>>> always have the value 0 for the (desiredRank - 1) column in each row.
>>>> E.g., here's the first row:
>>>>
>>>> [-0.02236546375417089, 0.0051677900486854144, -0.00498439866649932,
>>>> 0.0018666209551644673, 0.4313115409222268, 7.672659010256923E-4,
>>>> -2.295620562705387E-4, -0.0012505553313125165, 9.679192928269636E-5,
>>>> -4.529759471821197E-4, 0.01162786445974299, 2.1573486863433563E-4,
>>>> -0.0025483366872868546, 0.0]
>>>>
>>>> This then means that the sData matrix will have 0s in this column
>>>> following multiplication. However, when I change testKmeansSVD() to run
>>>> the solver without the clean step, and load the raw eigenvectors into
>>>> 'p', i.e.
>>>>
>>>> solver.run(testData, output, tmp, sampleData.size(), sampleDimension,
>>>> false, desiredRank);
>>>>
>>>> 'p' now has values other than 0 in the last column, e.g. here's the
>>>> first row:
>>>>
>>>> [-0.02236546375417089, 0.0051677900486854144, -0.00498439866649932,
>>>> 0.0018666209551644673, 0.4313115409222268, 7.672659010256923E-4,
>>>> -2.295620562705387E-4, -0.0012505553313125165, 9.679192928269636E-5,
>>>> -4.529759471821197E-4, 0.01162786445974299, 2.1573486863433563E-4,
>>>> -0.0025483366872868546, -0.04870849090364153]
>>>>
>>>> I'm guessing there's a problem with the clean step here, or is this
>>>> normal behaviour?
>>>>
>>>> FYI I noticed the problem when running the solver + clean on my own
>>>> data, and then running the Dirichlet clusterer on the reduced data. I
>>>> found that after a couple of iterations, things started to go wrong with
>>>> Dirichlet as the following code in UncommonDistribution.rMultinom() was
>>>> being called:
>>>>
>>>>     // can't happen except for round-off error so we don't care what we return here
>>>>     return 0;
>>>>
>>>> I suspect this might be associated with the fact that the last column in
>>>> my reduced data matrix is 0, although I haven't confirmed it yet.
>>>>
>>>> Thanks,
>>>>
>>>> Derek
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>
