Hi Ted,

Regarding your comment: "For clustering purposes, you probably don't even
need SVD here ..."

I was just experimenting with vectors that have 20K dimensions, but my end
goal is to run the SVD on n-gram vectors with roughly 380K dimensions. Do
you still think SVD is not needed in that situation? My thought was to get
the n-gram vectors down to a more manageable size, and SVD seemed like what
I needed.

Cheers,
Tim

On Mon, Mar 14, 2011 at 3:56 PM, Timothy Potter <thelabd...@gmail.com> wrote:

> Thanks for the clarification Jake.
>
> The end goal is to run the SVD against my n-gram vectors, which have 380K
> dimensions.
>
> I'll update the wiki once I have this working.
>
> Tim
>
>
> On Mon, Mar 14, 2011 at 1:09 PM, Jake Mannix <jake.man...@gmail.com> wrote:
>
>>
>> On Sun, Mar 13, 2011 at 6:47 PM, Timothy Potter <thelabd...@gmail.com> wrote:
>>
>>> Looking for a little clarification with using SVD to reduce dimensions of
>>> my
>>> vectors for clustering ...
>>>
>>> Using the ASF mail archives for Mahout-588, I have 6,076,937 tfidf
>>> vectors
>>> with 20,444 dimensions. I successfully run Mahout SVD on the vectors
>>> using:
>>>
>>> bin/mahout svd -i
>>> /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors \
>>>    -o /asf-mail-archives/mahout-0.4/svd \
>>>    --rank 100 --numCols 20444 --numRows 6076937 --cleansvd true
>>>
>>> This produced 87 eigenvectors of size 20,444. I'm not clear as to why
>>> only
>>> 87, but I'm assuming that has something to do with Lanczos???
>>>
>>
>> Hi Timothy,
>>
>>   The LanczosSolver looks for 100 eigenvectors, but then does some cleanup
>> after the fact: convergence issues and numeric overflow can cause some
>> eigenvectors to show up twice, so the last step in Mahout SVD is to remove
>> these spurious eigenvectors, and also any which just don't appear to be
>> "eigen" enough (i.e., they don't satisfy the eigenvector criterion with
>> high enough fidelity).
>>
>>   If you really need more eigenvectors, you can try re-running with
>> rank=150,
>> and then take the top 100 out of however many you get out.
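(A rough numpy sketch of the fidelity check Jake describes, purely illustrative: for the matrix B = A^T A that Lanczos implicitly works on, a vector is "eigen enough" when its residual is tiny relative to its eigenvalue.)

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((40, 30))
B = A.T @ A                          # the symmetric matrix Lanczos implicitly solves

vals, vecs = np.linalg.eigh(B)       # eigh returns eigenvalues in ascending order
v, lam = vecs[:, -1], vals[-1]       # top eigenpair

# Eigenvector criterion: ||B v - lambda v|| should be tiny relative to |lambda|.
# A "spurious" Lanczos output would fail this test.
residual = np.linalg.norm(B @ v - lam * v) / abs(lam)
print(residual < 1e-8)
```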
>>
>> So then I proceeded to transpose the SVD output using:
>>>
>>> bin/mahout transpose -i /mnt/dev/svd/cleanEigenvectors --numCols 20444
>>> --numRows 87
>>>
>>> Next, I tried to run transpose on my original vectors using:
>>>
>>> transpose -i
>>> /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors
>>> --numCols 20444 --numRows 6076937
>>>
>>>
>> So the problem with this is that tfidf-vectors is a
>> SequenceFile<Text,VectorWritable> - which is fine as input to
>> DistributedLanczosSolver (which just needs <Writable,VectorWritable>
>> pairs), but not so fine for being treated as a "matrix" - you need to run
>> the RowIdJob on these tfidf-vectors first.  This will normalize your
>> SequenceFile<Text,VectorWritable> into a
>> SequenceFile<IntWritable,VectorWritable> and a
>> SequenceFile<IntWritable,Text> (where the original one is the join of
>> these new ones, on the new int key).
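(As an aside, the split-and-join that RowIdJob performs can be sketched in plain Python; the dict names here are illustrative, not Mahout's API.)

```python
# Emulate RowIdJob: split a {docId: vector} mapping into an int-keyed
# matrix plus an int -> docId index; joining on the int key recovers
# the original Text-keyed data.
docs = {"doc-a": [0.1, 0.2], "doc-b": [0.3, 0.0], "doc-c": [0.0, 0.5]}

matrix = {}   # plays the role of SequenceFile<IntWritable,VectorWritable>
index = {}    # plays the role of SequenceFile<IntWritable,Text>
for row, (doc_id, vec) in enumerate(docs.items()):
    matrix[row] = vec
    index[row] = doc_id

# Join the two files on the int key to reconstruct the original mapping.
rejoined = {index[row]: matrix[row] for row in matrix}
print(rejoined == docs)
```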
>>
>> Hope that helps.
>>
>>   -jake
>>
>
>
