Hi Ted, Regarding your comment: "For clustering purposes, you probably don't even need SVD here ..."
I was just experimenting with vectors that have 20K dimensions, but my end goal was to run the SVD on vectors with n-grams that have roughly 380K dimensions. Do you still think SVD is not needed for this situation? My thought was to get the n-gram vectors down to a more manageable size and SVD seemed like what I needed?

Cheers,
Tim

On Mon, Mar 14, 2011 at 3:56 PM, Timothy Potter <thelabd...@gmail.com> wrote:

> Thanks for the clarification Jake.
>
> The end goal is to run the SVD against my n-gram vectors, which have 380K
> dimensions.
>
> I'll update the wiki once I have this working.
>
> Tim
>
> On Mon, Mar 14, 2011 at 1:09 PM, Jake Mannix <jake.man...@gmail.com> wrote:
>
>> On Sun, Mar 13, 2011 at 6:47 PM, Timothy Potter <thelabd...@gmail.com> wrote:
>>
>>> Looking for a little clarification with using SVD to reduce dimensions of
>>> my vectors for clustering ...
>>>
>>> Using the ASF mail archives for Mahout-588, I have 6,076,937 tfidf vectors
>>> with 20,444 dimensions. I successfully run Mahout SVD on the vectors using:
>>>
>>> bin/mahout svd -i /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors \
>>>   -o /asf-mail-archives/mahout-0.4/svd \
>>>   --rank 100 --numCols 20444 --numRows 6076937 --cleansvd true
>>>
>>> This produced 87 eigenvectors of size 20,444. I'm not clear as to why only
>>> 87, but I'm assuming that has something to do with Lanczos???
>>
>> Hi Timothy,
>>
>> The LanczosSolver looks for 100 eigenvectors, but then does some cleanup
>> after the fact: convergence issues and numeric overflow can cause some
>> eigenvectors to show up twice - the last step in Mahout SVD is to remove
>> these spurious eigenvectors (and also any which just don't appear to be
>> "eigen" enough, i.e., they don't satisfy the eigenvector criterion with
>> high enough fidelity).
>>
>> If you really need more eigenvectors, you can try re-running with rank=150,
>> and then take the top 100 out of however many you get out.
>>
>>> So then I proceeded to transpose the SVD output using:
>>>
>>> bin/mahout transpose -i /mnt/dev/svd/cleanEigenvectors --numCols 20444 --numRows 87
>>>
>>> Next, I tried to run transpose on my original vectors using:
>>>
>>> transpose -i /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors --numCols 20444 --numRows 6076937
>>
>> So the problem with this is that the tfidf-vectors is a
>> SequenceFile<Text,VectorWritable> - which is fine for input into
>> DistributedLanczosSolver (which just needs <Writable,VectorWritable> pairs),
>> but not so fine for being really considered a "matrix" - you need to run
>> the RowIdJob on these tfidf-vectors first. This will normalize your
>> SequenceFile<Text,VectorWritable> into a
>> SequenceFile<IntWritable,VectorWritable> and a
>> SequenceFile<IntWritable,Text> (where the original one is the join of
>> these new ones, on the new int key).
>>
>> Hope that helps.
>>
>> -jake
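
For anyone following along, here is a rough Python sketch of what Jake describes for the RowIdJob step: Text-keyed vectors are split into an int-keyed "matrix" plus an int-to-Text index, and the original can be recovered by joining the two on the int key. This is not Mahout code. The function names (`assign_row_ids`, `rejoin`, `project_rows`) and the toy data are made up for illustration. The projection at the end is only my assumption about where the transpose steps were heading (multiplying each row by the eigenvectors to get reduced vectors); the thread does not spell that out.

```python
def assign_row_ids(keyed_vectors):
    """Split (text_key, vector) pairs into int-keyed rows and an int -> text index.

    Hypothetical stand-in for the idea behind RowIdJob, not its implementation.
    """
    matrix_rows = {}  # int row id -> vector
    doc_index = {}    # int row id -> original text key
    for row_id, (text_key, vector) in enumerate(keyed_vectors):
        matrix_rows[row_id] = list(vector)
        doc_index[row_id] = text_key
    return matrix_rows, doc_index


def rejoin(matrix_rows, doc_index):
    """Join the two outputs on the int key to recover (text_key, vector) pairs."""
    return [(doc_index[i], matrix_rows[i]) for i in sorted(matrix_rows)]


def project_rows(matrix_rows, eigenvectors):
    """Assumed follow-up step: dot each row with each eigenvector.

    eigenvectors is a list of k vectors, each as long as a row, so every
    row of length num_cols becomes a reduced vector of length k.
    """
    reduced = {}
    for row_id, row in matrix_rows.items():
        reduced[row_id] = [
            sum(r * e for r, e in zip(row, eigenvector))
            for eigenvector in eigenvectors
        ]
    return reduced


if __name__ == "__main__":
    # Toy stand-in for the tfidf-vectors: 2 documents, 4 dimensions.
    docs = [("msg-a", [1.0, 0.0, 2.0, 0.0]), ("msg-b", [0.0, 3.0, 0.0, 1.0])]
    rows, index = assign_row_ids(docs)
    print(index)
    print(rejoin(rows, index) == docs)
    # Two made-up "eigenvectors" of size 4, so each document shrinks to 2 values.
    print(project_rows(rows, [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]))
```

The int keys matter because operations like transpose need to address rows by position, which a Text key cannot provide; the separate index keeps the mapping back to the original document ids.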