On Jul 20, 2009, at 2:40 PM, Florian Leibert wrote:

Hi,
I'm trying to create vectors with Mahout as explained in
http://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text , however I keep running out of heap. My heap is set to 2 GB already and I use
these parameters:
"java org.apache.mahout.utils.vectors.Driver --dir /LUCENE/ind --output
/user/florian/index-vectors-01 --field content --dictOut
/user/florian/index-dict-01 --weight TF".

Hmm, 6 GB isn't all that large, but most of the memory usage is going to come from the CachedTermInfo, which loads all the terms into memory. It sits behind an interface that can be implemented in other, slower ways, but we'll have to change the Driver program to allow for that.

How many unique terms do you have in the content field?
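
The unique-term count matters because a cache that holds every term in memory grows roughly linearly with it. A back-of-the-envelope sketch of that relationship follows. The class name, the per-entry overhead, and the example numbers are all assumptions for illustration, not measurements of CachedTermInfo:

```java
// Hypothetical helper, not part of Mahout: rough heap estimate for
// keeping every unique term of a field in an in-memory map.
public class TermCacheEstimate {

    /**
     * @param uniqueTerms           number of distinct terms in the field
     * @param avgTermChars          assumed average term length in characters
     * @param perEntryOverheadBytes assumed per-entry cost (object headers,
     *                              map entry, stored values); a guess
     */
    static long estimateBytes(long uniqueTerms, int avgTermChars,
                              int perEntryOverheadBytes) {
        // Assumes 2 bytes per character (UTF-16 backed strings).
        long charBytes = 2L * avgTermChars;
        return uniqueTerms * (charBytes + perEntryOverheadBytes);
    }

    public static void main(String[] args) {
        // Example inputs only; substitute the real term count.
        long bytes = estimateBytes(10_000_000L, 8, 100);
        System.out.println("Estimated term cache: "
                + bytes / (1024L * 1024L) + " MB");
    }
}
```

If an estimate like this lands near the configured heap, the term cache alone could account for the out-of-memory errors.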

Do you have java -Xmx2000M set as the heap size?
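
One quick way to confirm that the -Xmx setting actually reached the JVM is to print the maximum heap from inside Java. A minimal sketch; the class name HeapCheck is hypothetical and not part of Mahout:

```java
// Hypothetical helper: reports the maximum heap the running JVM will use.
public class HeapCheck {

    /** Maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Running it as "java -Xmx2000M HeapCheck" should report a value close to 2000 MB. A much smaller number would suggest the flag is not being passed to the JVM that runs the Driver.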



My index currently is about 6 GB large. Is there any way to compute the
vectors in a distributed manner?

There will be, but there isn't yet, I suspect.


What's the largest index someone has
created vectors from?

It's pretty new code. I've only tested it on relatively small indexes (a few hundred MB) so far, but the only gating issue memory-wise is the CachedTermInfo.

Sorry I don't have better answers, but I am willing to help improve it. I will try to use some bigger indexes soon.

-Grant
