Re: Using SVD with Canopy/KMeans

Grant Ingersoll Tue, 14 Sep 2010 10:55:21 -0700

Can you update the wiki?

-Grant


On Sep 14, 2010, at 1:44 PM, Jeff Eastman wrote:

> Here's the new set of mahout svd arguments. Entries --cleansvd, --maxError, 
> --minEigenvalue and --inMemory have been added in r997007. See the new tests 
> in TestDistributedLanczosSolverCLI for examples of both forms:
> 
>  --input (-i) input                      Path to job input directory.
>  --output (-o) output                    The directory pathname for output.
>  --numRows (-nr) numRows                 Number of rows of the input matrix
>  --numCols (-nc) numCols                 Number of columns of the input matrix
>  --rank (-r) rank                        Desired decomposition rank (note:
>                                          only roughly 1/4 to 1/3 of these will
>                                          have the top portion of the spectrum)
>  --symmetric (-sym) symmetric            Is the input matrix square and
>                                          symmetric?
>  --cleansvd (-cl) cleansvd               Run the EigenVerificationJob to clean
>                                          the eigenvectors after SVD
>  --maxError (-err) maxError              Maximum acceptable error
>  --minEigenvalue (-mev) minEigenvalue    Minimum eigenvalue to keep the vector
>                                          for
>  --inMemory (-mem) inMemory              Buffer eigen matrix into memory (if
>                                          you have enough!)
>  --help (-h)                             Print out help
>  --tempDir tempDir                       Intermediate output directory
>  --startPhase startPhase                 First phase to run
>  --endPhase endPhase                     Last phase to run
> 
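The options above can be assembled into a single command line. A minimal sketch in Python, under these assumptions: the launcher is invoked as `mahout svd`, the boolean-style options take the literal values "true"/"false", and all paths and numbers are placeholders. The helper name build_svd_command is hypothetical and not part of Mahout.

```python
import shlex


def build_svd_command(input_dir, output_dir, num_rows, num_cols, rank,
                      symmetric=False, clean=False, max_error=None,
                      min_eigenvalue=None, in_memory=None):
    """Assemble an argument list for the svd job from the options listed above.

    Hypothetical helper: only the flag names come from the option list;
    the "true"/"false" value syntax is an assumption.
    """
    def as_flag(value):
        return "true" if value else "false"

    cmd = [
        "mahout", "svd",
        "--input", str(input_dir),
        "--output", str(output_dir),
        "--numRows", str(num_rows),
        "--numCols", str(num_cols),
        "--rank", str(rank),
        "--symmetric", as_flag(symmetric),
    ]
    if clean:
        # The cleanup-related options are only passed along with --cleansvd.
        cmd += ["--cleansvd", "true"]
        if max_error is not None:
            cmd += ["--maxError", str(max_error)]
        if min_eigenvalue is not None:
            cmd += ["--minEigenvalue", str(min_eigenvalue)]
        if in_memory is not None:
            cmd += ["--inMemory", as_flag(in_memory)]
    return cmd


# Placeholder paths and sizes, for illustration only.
cmd = build_svd_command("/tmp/svd/in", "/tmp/svd/out", 1000, 200, 20,
                        clean=True, max_error=0.05, min_eigenvalue=0.0,
                        in_memory=True)
print(shlex.join(cmd))
```

Leaving out clean=True yields the plain solver form without the cleanup options.
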
> On 9/14/10 6:55 AM, Jake Mannix wrote:
>> I guess the main thing I'd want to happen in combining EVJ and DLS is to
>> make sure that the final output (changing the semantics of the CLI param is
>> ok) is clear, with it either being the output of EVJ (if that is used), or
>> DLS (if EVJ is not used).  If that can be done, go for it!
>> 
>>   -jake
>> 
>> On Tue, Sep 14, 2010 at 6:30 AM, Jeff Eastman <[email protected]> wrote:
>> 
>>>  Jake, I see you are online. I'm inclined to push forward on this despite
>>> the adjustments to DLS --output semantics. Agreed?
>>> 
>>> 
>>> On 9/13/10 10:34 AM, Jeff Eastman wrote:
>>> 
>>>>  r996599 completed the first part. Several additional arguments to EVJ.run
>>>> need to be added to DLS (maxError, minEigenvalue, inMemory, and also the
>>>> --cleansvd flag itself). Also DLS interprets --output as the
>>>> outputEigenVectorPath and not as the generic output directory so DLS.run()
>>>> will need another argument too. Still want to do this?
>>>> 
>>>> On 9/12/10 2:19 PM, Jake Mannix wrote:
>>>> 
>>>>>> +1 on folding EigenVerificationJob into DistributedLanczosSolver. Or, at
>>>>>> least implement a job() method on EVJ.
>>>>>
>>>>> +1 for having the latter, with a boolean flag in DLS to optionally call
>>>>> EVJ after it's done.
>>>>> 
>>>> 
> 

--------------------------
Grant Ingersoll
http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
