This just means "out of memory": the dictionary is too big. It has nothing in particular to do with the number, size, or rate of objects allocated. I don't know that a different implementation would be appreciably smaller -- these are already specialized, primitive-based implementations.
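If it helps, here is a rough sketch of what Grant's remedy 1 below (passing an explicit initial capacity) could look like. It assumes the Colt-derived OpenObjectIntHashMap keeps its (int initialCapacity) constructor and that the entry count can be known or estimated up front -- both assumptions, not a description of the current reducer code. The point is that rehash() allocates a bigger table and re-inserts every entry, so the peak footprint while growing is roughly old table plus new table; sizing the map once avoids that spike.

    // Sketch only, not TFPartialVectorReducer's actual code: pre-size the map so
    // put() never has to call rehash(). Growing allocates a larger table and
    // copies every entry, so peak heap use during rehash is roughly
    // old table + new table.
    import org.apache.mahout.math.map.OpenObjectIntHashMap;

    public class PresizedDictionarySketch {
      public static void main(String[] args) {
        // ~11M terms per Grant's numbers; in the reducer this would have to be
        // read or estimated from the dictionary chunk (illustrative value here).
        int expectedEntries = 11000000;
        // assumes the (int initialCapacity) constructor the Colt-derived maps expose
        OpenObjectIntHashMap<String> dictionary =
            new OpenObjectIntHashMap<String>(expectedEntries);
        // the real loading loop would iterate the dictionary.file-* chunks;
        // a single put stands in for it here
        dictionary.put("example bigram", 42);
        System.out.println(dictionary.get("example bigram"));
      }
    }

Even so, pre-sizing only removes the transient doubling; 11M String keys plus the backing arrays are still going to be a sizable chunk of heap on their own.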
My dumb question is, is this not just a consequence of the implementation trying to store an unbounded number (well, up to trillions) of entries in a map in memory? Beyond this I don't know anything about the implementation.

On Wed, Nov 7, 2012 at 3:17 PM, Grant Ingersoll <gsing...@apache.org> wrote:
> Hi,
>
> We're hitting OOMs while running vectorization during dictionary loading
> in TFPartialVectorReducer. We have the dictionary chunk size set to
> 100 (the minimum) and have about 11M items in the dictionary (bigrams are
> on) and our heap size is set to 12 GB. We haven't debugged deeply yet,
> but the OOM routinely occurs in the rehash method:
>
> 2012-11-07 04:34:04,750 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space
>         at org.apache.mahout.math.map.OpenObjectIntHashMap.rehash(OpenObjectIntHashMap.java:430)
>         at org.apache.mahout.math.map.OpenObjectIntHashMap.put(OpenObjectIntHashMap.java:383)
>         at org.apache.mahout.vectorizer.term.TFPartialVectorReducer.setup(TFPartialVectorReducer.java:131)
>         at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
>         at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
>         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:416)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>         at org.apache.hadoop.mapred.Child.main(Child.java:249)
>
> I'm also guessing (haven't turned on GC logs yet) that the GC simply can't
> keep up w/ the allocations, but perhaps there is also a bug somewhere in
> the dictionary code and the dictionary is corrupt and it's misreading the
> size. I can share the dictionary privately if anyone wants to look at it,
> but I can't share it publicly.
>
> Has anyone else seen this? Is my understanding correct?
>
> I can see a couple of remedies:
> 1. Pass in an initial capacity and see if we can better control the size
> of the allocation.
> 2. Switch to Lucene's FST for dictionaries: the tradeoff would be a much
> smaller dictionary (10 GB of wikipedia in Lucene is roughly a 250K
> dictionary size) and very little deserialization (the dictionary is all
> byte arrays) at the cost of lookups in a given mapper. However, that
> latter cost would likely be more than made up for by the fact that in most
> situations one would only need 1 dictionary chunk, thereby eliminating
> several MapReduce iterations. The other downside/upside is that we would
> need to go to Lucene 4.
>
> Thoughts?
>
> -Grant
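For what it's worth, a rough sketch of what Grant's option 2 could look like against the Lucene 4 FST API (Builder, PositiveIntOutputs, Util). Treat it as an assumption-laden illustration rather than a patch: the exact signatures have shifted a bit across 4.x releases (this is written against the 4.0-era API), and terms have to be fed to the builder in sorted order, which the dictionary chunks would need to guarantee.

    // Sketch only: a term -> id dictionary as a Lucene 4 FST. The whole structure
    // is a compact byte-oriented automaton, so there is almost no per-entry object
    // overhead and essentially no deserialization cost when it is loaded.
    import java.util.Map;
    import java.util.SortedMap;
    import java.util.TreeMap;

    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.IntsRef;
    import org.apache.lucene.util.fst.Builder;
    import org.apache.lucene.util.fst.FST;
    import org.apache.lucene.util.fst.PositiveIntOutputs;
    import org.apache.lucene.util.fst.Util;

    public class FstDictionarySketch {
      public static void main(String[] args) throws Exception {
        // ids start at 1 to stay clear of PositiveIntOutputs' 0 == "no output" convention
        SortedMap<String, Long> terms = new TreeMap<String, Long>();
        terms.put("apache mahout", 1L);
        terms.put("hadoop reducer", 2L);

        PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton(true);
        Builder<Long> builder = new Builder<Long>(FST.INPUT_TYPE.BYTE1, outputs);
        IntsRef scratch = new IntsRef();
        for (Map.Entry<String, Long> e : terms.entrySet()) {
          // inputs must arrive in lexicographic (byte) order
          builder.add(Util.toIntsRef(new BytesRef(e.getKey()), scratch), e.getValue());
        }
        FST<Long> fst = builder.finish();

        // lookup: null means the term is not in the dictionary
        Long id = Util.get(fst, new BytesRef("hadoop reducer"));
        System.out.println(id);  // prints 2
      }
    }

The attractive property for the vectorizer, if this sketch holds up, is that the finished FST can be written out and read back as a single blob, so one compact copy could be loaded per JVM instead of rebuilding an 11M-entry hash map in every reducer's setup().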