This just means "out of memory": the dictionary is too big. It's nothing in
particular to do with the number, size, or rate of objects allocated. I
don't know whether a different implementation would be appreciably smaller
-- these are already primitive-based, specialized implementations.

My dumb question is: isn't this just a consequence of the implementation
trying to store an unbounded number (well, up to trillions) of entries in
an in-memory map?

Beyond this I don't know anything about the implementation.
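For what it's worth, remedy (1) below (passing an initial capacity) would at
least avoid the rehash-driven doubling that the stack trace points at. A
minimal sketch of the idea in plain Java -- using java.util.HashMap for
illustration, though I believe Mahout's OpenObjectIntHashMap has a similar
initial-capacity constructor (an assumption worth checking):

```java
import java.util.HashMap;
import java.util.Map;

public class PresizeDemo {
    // Pick an initial capacity large enough that 'expected' entries fit
    // under the load factor, so loading never triggers a rehash.
    static int capacityFor(int expected, float loadFactor) {
        return (int) Math.ceil(expected / loadFactor) + 1;
    }

    public static void main(String[] args) {
        // ~11M dictionary terms, per the report below (illustrative figure)
        int expectedEntries = 11_000_000;
        int cap = capacityFor(expectedEntries, 0.75f);

        // Presized map: no intermediate rehash/doubling while loading,
        // which is where the OOM is being thrown in rehash().
        Map<String, Integer> dict = new HashMap<>(cap, 0.75f);

        // Small sample load, just to show the usage pattern.
        for (int i = 0; i < 1000; i++) {
            dict.put("term" + i, i);
        }
        System.out.println(dict.size());
    }
}
```

The rehash matters here because a growing open hash map briefly holds both
the old and the new (doubled) backing arrays, so peak memory during a
rehash can be roughly 3x the steady-state size; presizing sidesteps that
peak entirely.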


On Wed, Nov 7, 2012 at 3:17 PM, Grant Ingersoll <gsing...@apache.org> wrote:

> Hi,
>
> We're hitting OOMs while running vectorization during dictionary loading
> in TFPartialVectorReducer.  We have the dictionary chunk size set to
> 100 (the minimum) and have about 11M items in the dictionary (bigrams are
> on) and our heap size is set to 12 GB.   We haven't debugged deeply yet,
> but the OOM routinely occurs in the rehash method:
> 2012-11-07 04:34:04,750 FATAL org.apache.hadoop.mapred.Child: Error
> running child : java.lang.OutOfMemoryError: Java heap space
>         at
> org.apache.mahout.math.map.OpenObjectIntHashMap.rehash(OpenObjectIntHashMap.java:430)
>         at
> org.apache.mahout.math.map.OpenObjectIntHashMap.put(OpenObjectIntHashMap.java:383)
>         at
> org.apache.mahout.vectorizer.term.TFPartialVectorReducer.setup(TFPartialVectorReducer.java:131)
>         at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
>         at
> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
>         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:416)
>         at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>         at org.apache.hadoop.mapred.Child.main(Child.java:249)
>
> I'm also guessing (haven't turned on GC logs yet) that the GC simply can't
> keep up w/ the allocations, but perhaps there is also a bug somewhere in
> the dictionary code and the dictionary is corrupt and it's misreading the
> size.  I can share the dictionary privately if anyone wants to look at it,
> but I can't share it publicly.
>
> Has anyone else seen this?  Is my understanding correct?
>
> I can see a couple of remedies:
> 1.  Pass in an initial capacity and see if we can better control the size
> of the allocation
> 2. Switch to Lucene's FST for dictionaries:  The tradeoff would be a much
> smaller dictionary (10 GB of wikipedia in Lucene is roughly a 250K
> dictionary size) and very little  deserialization (the dictionary is all
> byte arrays) at the cost of lookups in a given mapper.  However, that
> latter cost would likely be more than made up for by the fact that in most
> situations, one would only need 1 dictionary chunk, thereby eliminating
> several MapReduce iterations.  The other downside/upside is that we would
> need to go to Lucene 4.
>
> Thoughts?
>
> -Grant
