Re: Using SVD with Canopy/KMeans

Jake Mannix Mon, 20 Sep 2010 08:21:58 -0700

That last "eigenvector" is, for reasons not entirely clear even to me, *not*
an eigenvector, as the output of EigenVerificationJob will show you if you
remove that "-1".


The most sensible patch is to take the user's "desiredRank" and add one to
it, and leave the code otherwise unchanged.
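
To make the idea concrete, here is a minimal sketch in plain Java. It is not
the actual Mahout code: fakeSolve() is a hypothetical stand-in for the Lanczos
solver, keepAllButLast() only mimics the "numRows() - 1" loop bound quoted
below, and the rank and dimension values are arbitrary.

```java
import java.util.Arrays;

public class RankPlusOneSketch {

    // Mimics the quoted loop bound in serializeOutput(): every row except
    // the last one is kept, because the last one is not a true eigenvector.
    static double[][] keepAllButLast(double[][] vectors) {
        int kept = Math.max(0, vectors.length - 1);
        return Arrays.copyOf(vectors, kept);
    }

    // Hypothetical stand-in for the solver: returns `rank` placeholder rows
    // of length `dim`. The values carry no meaning; only the row count matters.
    static double[][] fakeSolve(int rank, int dim) {
        double[][] rows = new double[rank][dim];
        for (int i = 0; i < rank; i++) {
            rows[i][i % dim] = 1.0;
        }
        return rows;
    }

    // Proposed patch: ask the solver for desiredRank + 1 vectors internally,
    // so that dropping the last one still leaves desiredRank of them.
    static double[][] solveWithPatch(int desiredRank, int dim) {
        return keepAllButLast(fakeSolve(desiredRank + 1, dim));
    }

    public static void main(String[] args) {
        int desiredRank = 3; // arbitrary illustrative value
        int dim = 5;         // arbitrary illustrative value
        System.out.println("unpatched rows: "
            + keepAllButLast(fakeSolve(desiredRank, dim)).length);
        System.out.println("patched rows:   "
            + solveWithPatch(desiredRank, dim).length);
    }
}
```

With the patch the caller gets back as many vectors as were asked for, and the
"- 1" in the serialization loop can stay exactly where it is.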

  -jake

On Mon, Sep 20, 2010 at 7:22 AM, Jeff Eastman <[email protected]> wrote:

>  Hi Derek,
>
> I think this is caused by the fact that the SVD output seems to emit only
> desiredRank-1 eigenvectors in the rawEigenvectors directory. When that is
> transposed, it would yield a p matrix with zeros in the last column, as
> you have observed. The code that's doing this is in
> DistributedLanczosSolver.serializeOutput() and the line responsible is:
>
>    for (int i = 0; i < eigenVectors.numRows() - 1; i++) {
>
> I thought that curious but don't understand Lanczos well enough yet to be
> too critical. Perhaps you could try removing the -1 and see if it improves
> your results.
>
>
>
> On 9/18/10 9:58 AM, Derek O'Callaghan wrote:
>
>> Hi Jeff,
>>
>> I've been trying out the latest version of the svd code in
>> TestClusterDumper this week (actually I'm using my modified version of it as
>> I mentioned in my original post at the start of the thread, with your latest
>> changes). I suspect there's a problem with the EigenVerificationJob called
>> from the svd solver. Looking at TestClusterDumper.testKmeansSVD(), using:
>>
>> solver.run(testData, output, tmp, sampleData.size(), sampleDimension,
>> false, desiredRank, 0.5, 0.0, true);
>>
>> The generated 'p' matrix (read from the clean eigenvectors file) will
>> always have the value 0 for the (desiredRank - 1) column in each row. E.g.,
>> here's the first row:
>>
>> [-0.02236546375417089, 0.0051677900486854144, -0.00498439866649932,
>> 0.0018666209551644673, 0.4313115409222268, 7.672659010256923E-4,
>> -2.295620562705387E-4, -0.0012505553313125165, 9.679192928269636E-5,
>> -4.529759471821197E-4, 0.01162786445974299, 2.1573486863433563E-4,
>> -0.0025483366872868546, 0.0]
>>
>> This then means that the sData matrix will have 0s in this column
>> following multiplication. However, when I change testKmeansSVD() to run the
>> solver without the clean step, and load the raw eigenvectors into 'p', i.e.:
>> solver.run(testData, output, tmp, sampleData.size(), sampleDimension,
>> false, desiredRank);
>>
>> 'p' now has values other than 0 in the last column, e.g. here's the first
>> row:
>>
>> [-0.02236546375417089, 0.0051677900486854144, -0.00498439866649932,
>> 0.0018666209551644673, 0.4313115409222268, 7.672659010256923E-4,
>> -2.295620562705387E-4, -0.0012505553313125165, 9.679192928269636E-5,
>> -4.529759471821197E-4, 0.01162786445974299, 2.1573486863433563E-4,
>> -0.0025483366872868546, -0.04870849090364153]
>>
>> I'm guessing there's a problem with the clean step here, or is this normal
>> behaviour?
>>
>> FYI I noticed the problem when running the solver + clean on my own data,
>> and then running the Dirichlet clusterer on the reduced data. I found that
>> after a couple of iterations, things started to go wrong with Dirichlet as
>> the following code in UncommonDistribution.rMultinom() was being called:
>>
>>     // can't happen except for round-off error so we don't care what we
>>     // return here
>>     return 0;
>>
>> I suspect this might be associated with the fact that the last column in
>> my reduced data matrix is 0, although I haven't confirmed it yet.
>>
>> Thanks,
>>
>> Derek
>>
>
>


-- 
  -jake
