You can restrict the term set by applying "minDf" & "maxDFPercent" filters.
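For example, tacking the filters onto your existing command (a sketch -- I'm going from memory on the exact flag spellings, so double-check against the Driver's usage output):

java org.apache.mahout.utils.vectors.Driver --dir /LUCENE/ind --output \
  /user/florian/index-vectors-01 --field content --dictOut \
  /user/florian/index-dict-01 --weight TF --minDF 10 --maxDFPercent 50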
The idea behind the parameters is that terms occurring too frequently or too rarely are not very useful. If you set "minDf" to 10, a term has to appear in at least 10 documents in the index. Similarly, if "maxDFPercent" is set to 50, all terms appearing in more than 50% of the documents are ignored. These two parameters prune the term set drastically; I wouldn't be surprised if the term set shrinks to less than 10% of the original set. (There's a rough sketch of the check at the bottom of this mail.) Since the vector generation code keeps a term -> doc-freq map in memory, the memory footprint is now at a "manageable" level. Vector generation will also be faster, since there are fewer features per vector.

BTW, how slow is vector generation? I don't have exact figures with me, but on a single box I recall it being higher than 50 vectors per second.

--shashi

On Tue, Jul 21, 2009 at 12:10 AM, Florian Leibert<[email protected]> wrote:
> Hi,
> I'm trying to create vectors with Mahout as explained in
> http://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text,
> however I keep running out of heap. My heap is set to 2 GB already and I use
> these parameters:
> "java org.apache.mahout.utils.vectors.Driver --dir /LUCENE/ind --output
> /user/florian/index-vectors-01 --field content --dictOut
> /user/florian/index-dict-01 --weight TF".
>
> My index currently is about 6 GB large. Is there any way to compute the
> vectors in a distributed manner? What's the largest index someone has
> created vectors from?
>
> Thanks!
>
> Florian
>
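P.S. In case it helps, the two filters amount to roughly this check over the term -> doc-freq map. This is just an illustration of the thresholds, not the actual Mahout code; the class and method names here are made up:

import java.util.HashMap;
import java.util.Map;

public class DfPruning {
  // Keep a term only if it appears in at least minDf documents
  // and in no more than maxDfPercent percent of all documents.
  static Map<String, Integer> prune(Map<String, Integer> docFreqs,
                                    int numDocs, int minDf, int maxDfPercent) {
    Map<String, Integer> kept = new HashMap<String, Integer>();
    for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
      int df = e.getValue();
      // 100L * df <= maxDfPercent * numDocs avoids integer overflow
      // and is the same as df / numDocs <= maxDfPercent / 100.
      if (df >= minDf && 100L * df <= (long) maxDfPercent * numDocs) {
        kept.put(e.getKey(), df);
      }
    }
    return kept;
  }
}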
