Yeah, good point. Will double-check on that.

On Nov 7, 2012, at 2:13 PM, Sean Owen wrote:
> Oh, 11M bigrams. Well, I can't see how that would come near running through
> 12GB of heap, even half of it.
> Are you guys sure that the child workers are actually being allowed to use
> 12GB of heap? There are lots of places to put the "mapred.child.java.opts"
> parameter that don't actually do anything, which I have learned by making
> that mistake about 3 times, every which way.
>
>
> On Wed, Nov 7, 2012 at 7:04 PM, David Arthur <mum...@gmail.com> wrote:
>
>> I see the same type of exception later on in the KMeans driver:
>>
>> https://gist.github.com/15c918acd2583e4ac54f
>>
>> This is using the same large dataset that Grant mentioned. I should
>> clarify that it's not 11M terms, but 11M bigrams after pruning.
>>
>> 242,646 docs
>> 172,502,741 tokens
>>
>> Cheers
>> -David
>>
>> On Nov 7, 2012, at 12:06 PM, Grant Ingersoll wrote:
>>
>>> It's throwing it in the config of the Reducer, so it's not likely the
>>> vector, but it could be.
>>>
>>> Once we went back to unigrams, the OOM in that spot went away.
>>>
>>> On Nov 7, 2012, at 12:00 PM, Robin Anil wrote:
>>>
>>>> Haven't seen the code in a while, but AFAIR the reducer is not loading
>>>> any dictionary. We chunk the dictionary to create partial vectors. I
>>>> think you just have a huge vector.
>>>>
>>>> On Nov 7, 2012 10:50 AM, "Sean Owen" <sro...@gmail.com> wrote:
>>>>
>>>>> It's a trie? Yeah, that could be a big win. It gets tricky with
>>>>> Unicode, but I imagine there is a lot of gain even so.
>>>>> "Bigrams over 11M terms" jumped out too as a place to start.
>>>>> (I don't see any particular backwards compatibility issue with
>>>>> Lucene 3 to even worry about.)
>>>>>
>>>
>>> --------------------------------------------
>>> Grant Ingersoll
>>> http://www.lucidworks.com
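For readers hitting the same pitfall Sean describes: on Hadoop 1.x, the setting only reaches the child JVMs when it lands in the job configuration before submission; setting it only in the client's environment, or in a config file the TaskTrackers never read, silently does nothing. A minimal sketch of two places where it does take effect (assuming Hadoop 1.x property names; the `-Xmx12g` value is illustrative):

```xml
<!-- mapred-site.xml on the TaskTracker nodes: cluster-wide default
     for the heap of child (map/reduce) JVMs -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx12g</value>
</property>
```

Alternatively, it can be passed per job on the command line via GenericOptionsParser, e.g. `hadoop jar ... -Dmapred.child.java.opts=-Xmx12g`, provided the tool honors generic options and the cluster does not mark the property final. Checking the actual `-Xmx` on a running child process (e.g. with `ps` on a worker node) is the surest way to confirm the value stuck.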