Re: Using SVD with Canopy/KMeans

Derek O'Callaghan Sat, 18 Sep 2010 06:58:42 -0700

Hi Jeff,

I've been trying out the latest version of the SVD code in TestClusterDumper 
this week (actually my modified version of it, as I mentioned in my original 
post at the start of the thread, with your latest changes applied). I suspect 
there's a problem with the EigenVerificationJob called from the SVD solver. 
Looking at TestClusterDumper.testKmeansSVD(), using:


solver.run(testData, output, tmp, sampleData.size(), sampleDimension, false, 
desiredRank, 0.5, 0.0, true);

The generated 'p' matrix (read from the clean eigenvectors file) will always 
have the value 0 in the last column, index (desiredRank - 1), of every row. 
E.g., here's the first row:

[-0.02236546375417089, 0.0051677900486854144, -0.00498439866649932, 
0.0018666209551644673, 0.4313115409222268, 7.672659010256923E-4, 
-2.295620562705387E-4, -0.0012505553313125165, 9.679192928269636E-5, 
-4.529759471821197E-4, 0.01162786445974299, 2.1573486863433563E-4, 
-0.0025483366872868546, 0.0]

This then means that the sData matrix will have 0s in this column following 
multiplication. However, when I change testKmeansSVD() to run the solver 
without the clean step, and load the raw eigenvectors into 'p', i.e.:

solver.run(testData, output, tmp, sampleData.size(), sampleDimension, false, 
desiredRank);

'p' now has values other than 0 in the last column, e.g. here's the first row:

[-0.02236546375417089, 0.0051677900486854144, -0.00498439866649932, 
0.0018666209551644673, 0.4313115409222268, 7.672659010256923E-4, 
-2.295620562705387E-4, -0.0012505553313125165, 9.679192928269636E-5, 
-4.529759471821197E-4, 0.01162786445974299, 2.1573486863433563E-4, 
-0.0025483366872868546, -0.04870849090364153]

I'm guessing there's a problem with the clean step here, or is this normal 
behaviour?
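
To make the propagation concrete, here's a minimal plain-Java sketch. It is not Mahout code: plain double[][] arrays stand in for the Matrix classes, and the class and method names (ZeroColumnDemo, isZeroColumn, times) are made up for illustration. It only shows that an all-zero column in 'p' gives an all-zero column in the product, whatever the data matrix holds:

```java
// Hypothetical illustration: plain arrays stand in for Mahout's Matrix classes.
class ZeroColumnDemo {

  // Returns true if every entry in the given column is exactly 0.0.
  static boolean isZeroColumn(double[][] m, int col) {
    for (double[] row : m) {
      if (row[col] != 0.0) {
        return false;
      }
    }
    return true;
  }

  // Standard dense product: (n x d) times (d x k) gives (n x k).
  static double[][] times(double[][] a, double[][] p) {
    int n = a.length;
    int d = p.length;
    int k = p[0].length;
    double[][] out = new double[n][k];
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < k; j++) {
        double sum = 0.0;
        for (int x = 0; x < d; x++) {
          sum += a[i][x] * p[x][j];
        }
        out[i][j] = sum;
      }
    }
    return out;
  }

  public static void main(String[] args) {
    double[][] a = {{1.0, 2.0}, {3.0, 4.0}, {5.0, 6.0}};
    // Last column of p is all zeros, like the cleaned eigenvectors above.
    double[][] p = {{0.5, -0.25, 0.0}, {0.1, 0.3, 0.0}};
    double[][] s = times(a, p);
    System.out.println("last column of p zero:   " + isZeroColumn(p, 2));
    System.out.println("last column of a*p zero: " + isZeroColumn(s, 2));
  }
}
```

A check like isZeroColumn could be run on both the cleaned and the raw eigenvectors to confirm where the zeros first appear.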

FYI I noticed the problem when running the solver + clean on my own data, and 
then running the Dirichlet clusterer on the reduced data. I found that after a 
couple of iterations, things started to go wrong with Dirichlet as the 
following code in UncommonDistribution.rMultinom() was being called:

    // can't happen except for round-off error so we don't care what we return here
    return 0;

I suspect this might be associated with the fact that the last column in my 
reduced data matrix is 0, although I haven't confirmed it yet.
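
For illustration only, here's a rough sketch of how a sampler of this general shape can reach its fall-through line for reasons other than round-off. This is not the Mahout implementation: the class MultinomialFallThroughDemo and its sample() method are hypothetical, and it is only an assumption on my part that the zero column ends up producing all-zero or NaN weights (I haven't traced that far). The sketch returns -1 instead of 0 so the fall-through is visible:

```java
import java.util.Random;

// Hypothetical sketch of an unnormalized multinomial sampler; this is NOT
// Mahout's code, just the general pattern the quoted comment suggests.
class MultinomialFallThroughDemo {

  // Draws an index with probability proportional to weights[i].
  // Returns -1 (instead of silently returning 0) when the loop falls through.
  static int sample(double[] weights, Random rng) {
    double total = 0.0;
    for (double w : weights) {
      total += w;
    }
    double p = rng.nextDouble() * total;
    for (int i = 0; i < weights.length; i++) {
      if (p < weights[i]) {
        return i;
      }
      p -= weights[i];
    }
    return -1; // reached when total is 0 or NaN, or through round-off
  }

  public static void main(String[] args) {
    Random rng = new Random(42);
    // Well-formed weights: normally returns an index in 0..2.
    System.out.println(sample(new double[] {0.2, 0.3, 0.5}, rng));
    // All-zero weights: p is 0, and 0 < 0 is never true, so this falls through.
    System.out.println(sample(new double[] {0.0, 0.0, 0.0}, rng));
    // A NaN weight makes every comparison false, so this falls through too.
    System.out.println(sample(new double[] {Double.NaN, 0.4, 0.6}, rng));
  }
}
```

If something like this is what's happening, logging the weights whenever the fall-through line is hit should show whether they are zero or NaN.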

Thanks,

Derek

----- Original Message -----
From: Jeff Eastman <[email protected]>
Date: Tuesday, September 14, 2010 6:45 pm
Subject: Re: Using SVD with Canopy/KMeans
To: [email protected]

>  Here's the new set of mahout svd arguments. Entries --cleansvd,
> --maxError, --minEigenvalue and --inMemory have been added in r997007.
> See the new tests in TestDistributedLanczosSolverCLI for examples of
> both forms:
> 
>   --input (-i) input                      Path to job input directory.
>   --output (-o) output                    The directory pathname for output.
>   --numRows (-nr) numRows                 Number of rows of the input matrix
>   --numCols (-nc) numCols                 Number of columns of the input matrix
>   --rank (-r) rank                        Desired decomposition rank (note:
>                                           only roughly 1/4 to 1/3 of these will
>                                           have the top portion of the spectrum)
>   --symmetric (-sym) symmetric            Is the input matrix square and
>                                           symmetric?
>   --cleansvd (-cl) cleansvd               Run the EigenVerificationJob to clean
>                                           the eigenvectors after SVD
>   --maxError (-err) maxError              Maximum acceptable error
>   --minEigenvalue (-mev) minEigenvalue    Minimum eigenvalue to keep the vector
>                                           for
>   --inMemory (-mem) inMemory              Buffer eigen matrix into memory (if
>                                           you have enough!)
>   --help (-h)                             Print out help
>   --tempDir tempDir                       Intermediate output directory
>   --startPhase startPhase                 First phase to run
>   --endPhase endPhase                     Last phase to run
> 
> On 9/14/10 6:55 AM, Jake Mannix wrote:
> >I guess the main thing I'd want to happen in combining EVJ and DLS is to
> >make sure that the final output (changing the semantics of the CLI param is
> >ok) is clear, with it either being the output of EVJ (if that is used), or
> >DLS (if EVJ is not used).  If that can be done, go for it!
> >
> >   -jake
> >
> >On Tue, Sep 14, 2010 at 6:30 AM, Jeff Eastman<[email protected]> wrote:
> >>  Jake, I see you are on line. I'm inclined to push forward on this despite
> >>the adjustments to DLS --output semantics. Agreed?
> >>
> >>
> >>On 9/13/10 10:34 AM, Jeff Eastman wrote:
> >>
> >>>  r996599 completed the first part. Several additional arguments to EVJ.run
> >>>need to be added to DLS (maxError, minEigenValue, inMemory, also the
> >>>--cleansvd flag itself). Also DLS interprets --output as the
> >>>outputEigenVectorPath and not as the generic output directory so DLS.run()
> >>>will need another argument too. Still want to do this?
> >>>
> >>>On 9/12/10 2:19 PM, Jake Mannix wrote:
> >>>
> >>>>>+1 on folding EigenVerificationJob into DistributedLanczosSolver. Or, at
> >>>>>least implement a job() method on EVJ.
> >>>>>
> >>>>  +1 for having the latter, with a boolean flag in DLS to optionally call
> >>>>EVJ after it's done.
> >>>>
> >>>
>
