Hi Shashi,

Great, I'm trying the settings maxDFPercent 50 and minDF 4; I have a lot of
very short documents, some of which can be very descriptive. I'm also thinking
I should have used the StopWordAnalyzer in Lucene when creating the index,
since that way the creation of the vectors would be much faster.
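As a rough illustration of that indexing change, here is a minimal sketch,
assuming a Lucene 2.x-era API (StandardAnalyzer drops common English stop
words by default; StopAnalyzer is the stricter alternative). Only the
/LUCENE/ind path and the "content" field come from this thread; everything
else is illustrative:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.FSDirectory;

  public class StopWordIndexer {
    public static void main(String[] args) throws Exception {
      // StandardAnalyzer filters out common English stop words by default,
      // so far fewer terms end up in the index (and later in the vectors).
      IndexWriter writer = new IndexWriter(
          FSDirectory.getDirectory("/LUCENE/ind"),
          new StandardAnalyzer(),
          true,                                  // create a new index
          IndexWriter.MaxFieldLength.UNLIMITED);

      Document doc = new Document();
      // Term vectors are stored so the Lucene-to-Mahout vector Driver
      // can read per-document term frequencies from the "content" field.
      doc.add(new Field("content", "some very short document text",
          Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES));
      writer.addDocument(doc);

      writer.optimize();
      writer.close();
    }
  }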
Yesterday it took about 8 hours to process these vectors on a quad-core
machine with 4 GB of heap, using the sequence file writer. I assume the
bottleneck might have been the constant transfer into HDFS, which is why I'm
using the plain file writer now. It has been running on my 6 GB index for
about 90 minutes; while yesterday's vector sequence file was 3 GB (without
filtering), the JSON file is already at 16 GB (with filtering), which I
attribute to the compression of the sequence file... I'm trying to allot some
time to turn the vector creation process into an M/R job, if nobody else is
working on that at the moment...

Florian

On Mon, Jul 20, 2009 at 10:46 PM, Shashikant Kore <[email protected]> wrote:

> You can restrict the term set by applying the "minDf" & "maxDFPercent"
> filters.
>
> The idea behind the parameters is that terms occurring too frequently or
> too rarely are not very useful. If you set the "minDf" parameter to 10, a
> term has to appear in at least 10 documents in the index. Similarly, if
> "maxDFPercent" is set to 50, all terms appearing in more than 50% of the
> documents are ignored.
>
> These two parameters prune the term set drastically. I wouldn't be
> surprised if the term set shrinks to less than 10% of the original set.
> Since the vector generation code keeps the term->doc-freq map in memory,
> the memory footprint is now at a "manageable" level. Also, vector
> generation will be faster as there are fewer features per vector.
>
> BTW, how slow is vector generation? I don't have exact figures with me,
> but on a single box I recall it being higher than 50 vectors per second.
>
> --shashi
>
> On Tue, Jul 21, 2009 at 12:10 AM, Florian Leibert <[email protected]> wrote:
> > Hi,
> > I'm trying to create vectors with Mahout as explained in
> > http://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text,
> > however I keep running out of heap. My heap is set to 2 GB already and
> > I use these parameters:
> > "java org.apache.mahout.utils.vectors.Driver --dir /LUCENE/ind --output
> > /user/florian/index-vectors-01 --field content --dictOut
> > /user/florian/index-dict-01 --weight TF".
> >
> > My index is currently about 6 GB. Is there any way to compute the
> > vectors in a distributed manner? What's the largest index someone has
> > created vectors from?
> >
> > Thanks!
> >
> > Florian
> >
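To make the pruning rule described above concrete, here is a small
illustrative sketch in Java of what minDf = 4 and maxDFPercent = 50 mean for
a term -> document-frequency map. This is not the actual Mahout code, just
the rule the two parameters express, using the settings mentioned at the top
of this thread:

  import java.util.HashMap;
  import java.util.HashSet;
  import java.util.Map;
  import java.util.Set;

  public class DfPruningSketch {
    // Keep only terms that appear in at least minDf documents and in at
    // most maxDFPercent percent of all documents.
    public static Set<String> prune(Map<String, Integer> docFreqs,
                                    int numDocs, int minDf, int maxDFPercent) {
      Set<String> kept = new HashSet<String>();
      for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
        int df = e.getValue();
        boolean frequentEnough = df >= minDf;
        boolean notTooFrequent = 100.0 * df / numDocs <= maxDFPercent;
        if (frequentEnough && notTooFrequent) {
          kept.add(e.getKey());
        }
      }
      return kept;
    }

    public static void main(String[] args) {
      Map<String, Integer> docFreqs = new HashMap<String, Integer>();
      docFreqs.put("the", 95);     // too frequent for maxDFPercent = 50
      docFreqs.put("hadoop", 12);  // kept
      docFreqs.put("zxqv", 1);     // too rare for minDf = 4
      System.out.println(prune(docFreqs, 100, 4, 50)); // prints [hadoop]
    }
  }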
