You can restrict the term set by applying "minDf" & "maxDFPercent" filters.
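For example, tacking the filters onto your existing command (a sketch -- I'm going from memory on the exact flag spellings, so double-check against the Driver's usage output):

java org.apache.mahout.utils.vectors.Driver --dir /LUCENE/ind --output \
  /user/florian/index-vectors-01 --field content --dictOut \
  /user/florian/index-dict-01 --weight TF --minDF 10 --maxDFPercent 50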
The idea behind the parameters is that terms occurring too frequently or too rarely are not very useful. If you set "minDf" to 10, a term has to appear in at least 10 documents in the index. Similarly, if "maxDFPercent" is set to 50, all terms appearing in more than 50% of the documents are ignored. These two parameters prune the term set drastically; I wouldn't be surprised if the term set shrinks to less than 10% of the original set. (There's a rough sketch of the check at the bottom of this mail.) Since the vector generation code keeps a term -> doc-freq map in memory, the memory footprint is now at a "manageable" level. Vector generation will also be faster, since there are fewer features per vector.

BTW, how slow is vector generation? I don't have exact figures with me, but on a single box I recall it being higher than 50 vectors per second.

--shashi

On Tue, Jul 21, 2009 at 12:10 AM, Florian Leibert<[email protected]> wrote:
> Hi,
> I'm trying to create vectors with Mahout as explained in
> http://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text,
> however I keep running out of heap. My heap is set to 2 GB already and I use
> these parameters:
> "java org.apache.mahout.utils.vectors.Driver --dir /LUCENE/ind --output
> /user/florian/index-vectors-01 --field content --dictOut
> /user/florian/index-dict-01 --weight TF".
>
> My index currently is about 6 GB large. Is there any way to compute the
> vectors in a distributed manner? What's the largest index someone has
> created vectors from?
>
> Thanks!
>
> Florian
>
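P.S. In case it helps, the two filters amount to roughly this check over the term -> doc-freq map. This is just an illustration of the thresholds, not the actual Mahout code; the class and method names here are made up:

import java.util.HashMap;
import java.util.Map;

public class DfPruning {
  // Keep a term only if it appears in at least minDf documents
  // and in no more than maxDfPercent percent of all documents.
  static Map<String, Integer> prune(Map<String, Integer> docFreqs,
                                    int numDocs, int minDf, int maxDfPercent) {
    Map<String, Integer> kept = new HashMap<String, Integer>();
    for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
      int df = e.getValue();
      // 100L * df <= maxDfPercent * numDocs avoids integer overflow
      // and is the same as df / numDocs <= maxDfPercent / 100.
      if (df >= minDf && 100L * df <= (long) maxDfPercent * numDocs) {
        kept.put(e.getKey(), df);
      }
    }
    return kept;
  }
}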
