Yeah, good point. Will double-check on that.

On Nov 7, 2012, at 2:13 PM, Sean Owen wrote:

> Oh, 11M bigrams. Well I can't see how that would come near running through
> 12GB of heap, even half of it.
> Are you guys sure that the child workers are actually being allowed to use
> 12GB heap? There are lots of places to put the "mapred.child.java.opts"
> parameter that don't actually do anything, which I have learned by making
> that mistake about 3 times every which way.
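One quick way to confirm what the child JVM actually got (a sketch, not Mahout code) is to log `Runtime.maxMemory()` from inside the task, e.g. from a mapper's setup():

```java
// Sketch: print the max heap actually granted to this JVM. Run the same
// check from a map/reduce task to verify mapred.child.java.opts took effect;
// if it prints ~200 MB instead of ~12 GB, the setting isn't reaching the child.
public class HeapCheck {
    public static void main(String[] args) {
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap (MB): " + (maxBytes / (1024 * 1024)));
    }
}
```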
> 
> 
> On Wed, Nov 7, 2012 at 7:04 PM, David Arthur <mum...@gmail.com> wrote:
> 
>> I see the same type of exception later on in the KMeans driver
>> 
>> https://gist.github.com/15c918acd2583e4ac54f
>> 
>> This is using the same large dataset that Grant mentioned. I should
>> clarify that it's not 11M terms, but 11M bigrams after pruning.
>> 
>> 242,646 docs
>> 172,502,741 tokens
>> 
>> Cheers
>> -David
>> 
>> On Nov 7, 2012, at 12:06 PM, Grant Ingersoll wrote:
>> 
>>> It's throwing it in the config of the Reducer, so it's not likely the
>>> vector, but it could be.
>>> 
>>> Once we went back to unigrams, the OOM in that spot went away.
>>> 
>>> On Nov 7, 2012, at 12:00 PM, Robin Anil wrote:
>>> 
>>>> I haven't seen the code in a while, but AFAIR the reducer is not loading
>>>> any dictionary. We chunk the dictionary to create partial vectors. I
>>>> think you just have a huge vector.
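The chunked-dictionary idea can be sketched roughly like this (toy code with hypothetical names, not Mahout's actual DictionaryVectorizer): only one slice of the dictionary is resident at a time, and each slice contributes a partial vector that gets merged into the result.

```java
import java.util.*;

// Toy sketch: instead of loading an 11M-term dictionary at once, walk it
// in fixed-size chunks, keeping only one chunk's term->id map in memory,
// and accumulate the partial counts into one (sparse) vector.
public class ChunkedVectorizer {
    public static Map<Integer, Integer> vectorize(List<String> doc,
                                                  List<String> dictionary,
                                                  int chunkSize) {
        Map<Integer, Integer> vector = new TreeMap<>();
        for (int start = 0; start < dictionary.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, dictionary.size());
            // Only this chunk of the dictionary is held in memory.
            Map<String, Integer> chunk = new HashMap<>();
            for (int i = start; i < end; i++) {
                chunk.put(dictionary.get(i), i);
            }
            for (String token : doc) {
                Integer id = chunk.get(token);
                if (id != null) {
                    vector.merge(id, 1, Integer::sum);
                }
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> dict = Arrays.asList("apache", "mahout", "kmeans", "vector");
        List<String> doc = Arrays.asList("mahout", "vector", "mahout");
        System.out.println(vectorize(doc, dict, 2)); // {1=2, 3=1}
    }
}
```

Note that even with chunking, the *output* vector's cardinality is still the full dictionary size, which is why a huge term count can blow up downstream (e.g. in KMeans) regardless.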
>>>> On Nov 7, 2012 10:50 AM, "Sean Owen" <sro...@gmail.com> wrote:
>>>> 
>>>>> It's a trie? Yeah, that could be a big win. It gets tricky with
>>>>> Unicode, but I imagine there is a lot of gain even so.
>>>>> "Bigrams over 11M terms" jumped out at me too as a place to start.
>>>>> (I don't see any particular backwards-compatibility issue with Lucene
>>>>> 3 to even worry about.)
>>>>> 
>>> 
>>> --------------------------------------------
>>> Grant Ingersoll
>>> http://www.lucidworks.com
>>> 
>>> 
>>> 
>>> 
>> 
>> 

