Hi Jeff,
I've been trying out the latest version of the svd code in TestClusterDumper
this week (actually I'm using my modified version of it as I mentioned in my
original post at the start of the thread, with your latest changes). I suspect
there's a problem with the EigenVerificationJob called from the svd solver.
Looking at TestClusterDumper.testKmeansSVD(), using:
solver.run(testData, output, tmp, sampleData.size(), sampleDimension, false,
desiredRank, 0.5, 0.0, true);
The generated 'p' matrix (read from the clean eigenvectors file) will always
have the value 0 for the (desiredRank - 1) column in each row. E.g., here's the
first row:
[-0.02236546375417089, 0.0051677900486854144, -0.00498439866649932,
0.0018666209551644673, 0.4313115409222268, 7.672659010256923E-4,
-2.295620562705387E-4, -0.0012505553313125165, 9.679192928269636E-5,
-4.529759471821197E-4, 0.01162786445974299, 2.1573486863433563E-4,
-0.0025483366872868546, 0.0]
This then means that the sData matrix will have 0s in this column following
multiplication. However, when I change testKmeansSVD() to run the solver
without the clean step, and load the raw eigenvectors into 'p', i.e.:
solver.run(testData, output, tmp, sampleData.size(), sampleDimension, false,
desiredRank);
'p' now has values other than 0 in the last column, e.g. here's the first row:
[-0.02236546375417089, 0.0051677900486854144, -0.00498439866649932,
0.0018666209551644673, 0.4313115409222268, 7.672659010256923E-4,
-2.295620562705387E-4, -0.0012505553313125165, 9.679192928269636E-5,
-4.529759471821197E-4, 0.01162786445974299, 2.1573486863433563E-4,
-0.0025483366872868546, -0.04870849090364153]
I'm guessing there's a problem with the clean step here, or is this normal
behaviour?
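
For anyone who wants to check their own output for the same symptom, something along these lines should do it. This is only a minimal sketch on plain double[][] arrays rather than Mahout's Matrix classes; the class name ZeroColumnCheck, the method zeroColumns and the tolerance value are my own assumptions for illustration, not anything in the Mahout code base.

```java
import java.util.ArrayList;
import java.util.List;

public class ZeroColumnCheck {

    // Returns the indices of all columns whose entries are within eps of zero
    // in every row. An empty matrix yields an empty list.
    static List<Integer> zeroColumns(double[][] matrix, double eps) {
        List<Integer> result = new ArrayList<>();
        if (matrix.length == 0) {
            return result;
        }
        int numCols = matrix[0].length;
        for (int col = 0; col < numCols; col++) {
            boolean allZero = true;
            for (double[] row : matrix) {
                if (Math.abs(row[col]) > eps) {
                    allZero = false;
                    break;
                }
            }
            if (allZero) {
                result.add(col);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy stand-ins for 'p' (hypothetical values, not real solver output).
        double[][] cleaned = {{-0.022, 0.431, 0.0}, {0.011, -0.300, 0.0}};
        double[][] raw = {{-0.022, 0.431, -0.048}, {0.011, -0.300, 0.020}};
        System.out.println("cleaned: " + zeroColumns(cleaned, 1e-12));
        System.out.println("raw:     " + zeroColumns(raw, 1e-12));
    }
}
```

Running the same check on both the cleaned and the raw 'p' makes it easy to see whether the zero column only appears after the clean step.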
FYI I noticed the problem when running the solver + clean on my own data, and
then running the Dirichlet clusterer on the reduced data. I found that after a
couple of iterations, things started to go wrong with Dirichlet as the
following code in UncommonDistribution.rMultinom() was being called:
// can't happen except for round-off error so we don't care what we return here
return 0;
I suspect this might be associated with the fact that the last column in my
reduced data matrix is 0, although I haven't confirmed it yet.
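
To illustrate what I mean by that fallback being reached, here is a simplified stand-in for a multinomial draw. It is not the actual UncommonDistribution.rMultinom() source; the class name, the method signature and the -1 sentinel are all assumptions made for this sketch. It walks the cumulative probabilities until they exceed a uniform draw, and only falls through when the probabilities are degenerate (for example they sum to zero or contain NaN).

```java
public class MultinomialFallbackSketch {

    // Hypothetical sentinel so a caller can tell that the fallback was hit.
    static final int FALLBACK = -1;

    // Picks index i with probability probabilities[i], given a uniform draw
    // u in [0, 1). Returns FALLBACK if no cumulative sum ever exceeds u.
    static int sample(double[] probabilities, double u) {
        double cumulative = 0.0;
        for (int i = 0; i < probabilities.length; i++) {
            cumulative += probabilities[i];
            // Any comparison with NaN is false, so a NaN entry also
            // prevents this branch from ever being taken afterwards.
            if (u < cumulative) {
                return i;
            }
        }
        return FALLBACK;
    }

    public static void main(String[] args) {
        System.out.println(sample(new double[] {0.2, 0.8}, 0.5));
        System.out.println(sample(new double[] {0.0, 0.0}, 0.5));
        System.out.println(sample(new double[] {Double.NaN, 0.5}, 0.1));
    }
}
```

If the real code behaves similarly, then hitting the fallback repeatedly would point at degenerate probabilities rather than round-off, but I haven't verified that this is what happens with my data.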
Thanks,
Derek
----- Original Message -----
From: Jeff Eastman <[email protected]>
Date: Tuesday, September 14, 2010 6:45 pm
Subject: Re: Using SVD with Canopy/KMeans
To: [email protected]
> Here's the new set of mahout svd arguments. Entries --cleansvd,
> --maxError, --minEigenvalue and --inMemory have been added in r997007.
> See the new tests in TestDistributedLanczosSolverCLI for examples of
> both forms:
>
> --input (-i) input                    Path to job input directory.
> --output (-o) output                  The directory pathname for output.
> --numRows (-nr) numRows               Number of rows of the input matrix
> --numCols (-nc) numCols               Number of columns of the input matrix
> --rank (-r) rank                      Desired decomposition rank (note:
>                                       only roughly 1/4 to 1/3 of these will
>                                       have the top portion of the spectrum)
> --symmetric (-sym) symmetric          Is the input matrix square and
>                                       symmetric?
> --cleansvd (-cl) cleansvd             Run the EigenVerificationJob to clean
>                                       the eigenvectors after SVD
> --maxError (-err) maxError            Maximum acceptable error
> --minEigenvalue (-mev) minEigenvalue  Minimum eigenvalue to keep the vector for
> --inMemory (-mem) inMemory            Buffer eigen matrix into memory (if
>                                       you have enough!)
> --help (-h)                           Print out help
> --tempDir tempDir                     Intermediate output directory
> --startPhase startPhase               First phase to run
> --endPhase endPhase                   Last phase to run
>
> On 9/14/10 6:55 AM, Jake Mannix wrote:
> > I guess the main thing I'd want to happen in combining EVJ and DLS is to
> > make sure that the final output (changing the semantics of the CLI param
> > is ok) is clear, with it either being the output of EVJ (if that is
> > used), or DLS (if EVJ is not used). If that can be done, go for it!
> >
> > -jake
> >
> > On Tue, Sep 14, 2010 at 6:30 AM, Jeff Eastman <[email protected]> wrote:
> >
> >> Jake, I see you are on line. I'm inclined to push forward on this
> >> despite the adjustments to DLS --output semantics. Agreed?
> >>
> >> On 9/13/10 10:34 AM, Jeff Eastman wrote:
> >>
> >>> r996599 completed the first part. Several additional arguments to
> >>> EVJ.run need to be added to DLS (maxError, minEigenValue, inMemory,
> >>> also the --cleansvd flag itself). Also DLS interprets --output as the
> >>> outputEigenVectorPath and not as the generic output directory, so
> >>> DLS.run() will need another argument too. Still want to do this?
> >>>
> >>> On 9/12/10 2:19 PM, Jake Mannix wrote:
> >>>
> >>>>> +1 on folding EigenVerificationJob into DistributedLanczosSolver.
> >>>>> Or, at least implement a job() method on EVJ.
> >>>>
> >>>> +1 for having the latter, with a boolean flag in DLS to optionally
> >>>> call EVJ after it's done.