Re: Lucene 4.0 memory usage during indexing - is this expected?

Michael McCandless Wed, 03 Oct 2012 08:50:54 -0700

I wish I could remember/find the Jira issue here ... there was one
fairly recently.


Are you really sure you're not turning over threads that are coming
through Lucene...?  High thread turnover causes challenges for
ThreadLocals ...
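
To illustrate the thread-turnover concern, here is a minimal sketch that uses only the JDK and no Lucene classes. The `Components` class and the `countComponents` helper are hypothetical stand-ins invented for this example. `Components` plays the role of an expensive per-thread analysis chain cached in a ThreadLocal; this is not Lucene's actual code. The sketch only counts how many cached instances get created under each threading model.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class ThreadLocalTurnoverDemo {

    // Hypothetical stand-in for an expensive per-thread analysis chain.
    static final class Components {
        final byte[] buffers = new byte[1024];
    }

    // Runs `tasks` units of work and returns how many Components
    // instances the ThreadLocal had to create along the way.
    static int countComponents(boolean useFixedPool, int poolSize, int tasks)
            throws InterruptedException {
        AtomicInteger created = new AtomicInteger();
        ThreadLocal<Components> cache = ThreadLocal.withInitial(() -> {
            created.incrementAndGet();
            return new Components();
        });
        // Each task just touches the per-thread cache, as an indexing
        // call would when it asks for its reusable components.
        Runnable work = () -> cache.get();

        if (useFixedPool) {
            // Threads are reused, so each one initializes its slot once.
            ExecutorService pool = Executors.newFixedThreadPool(poolSize);
            for (int i = 0; i < tasks; i++) {
                pool.submit(work);
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } else {
            // High turnover: every task runs on a brand-new thread,
            // so every task pays for a fresh Components instance.
            for (int i = 0; i < tasks; i++) {
                Thread t = new Thread(work);
                t.start();
                t.join();
            }
        }
        return created.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("fixed pool:          " + countComponents(true, 4, 100));
        System.out.println("new thread per task: " + countComponents(false, 4, 100));
    }
}
```

With a fixed pool, the number of cached instances is bounded by the pool size. With a new thread per task it tracks the number of tasks, and how quickly the dead threads' entries are reclaimed is up to the JVM.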

Do you have a lot of fields?  Are you using PerFieldAnalyzerWrapper...?

Mike McCandless

http://blog.mikemccandless.com

On Wed, Oct 3, 2012 at 10:45 AM,  <[email protected]> wrote:
> There's a fixed-sized thread pool involved in doing the indexing, of a size 
> that depends on the machine parameters.
> Karl
>
> -----Original Message-----
> From: ext Michael McCandless [mailto:[email protected]]
> Sent: Wednesday, October 03, 2012 10:43 AM
> To: Wright Karl (Nokia-LC/Boston)
> Subject: Re: Lucene 4.0 memory usage during indexing - is this expected?
>
> This is no good!
>
> Can you send an email to dev@?  This sounds very familiar ... and I had 
> thought we committed a fix for it ... hopefully Uwe or Robert can remember 
> what it was!
>
> Do you create new threads frequently, to do indexing?  Rather than pulling 
> from a fixed pool?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Wed, Oct 3, 2012 at 8:32 AM,  <[email protected]> wrote:
>> Hi Mike,
>>
>>
>>
>> I've got a technical question for you.
>>
>>
>>
>> For background, we've been building a new address search engine on top
>> of Lucene 4.0.  The main customization involves a chain of custom
>> analyzers etc, and it all works quite well.  Or at least it did until
>> I added 7m more documents to the list.  At that point the indexing
>> process began to run out of memory, even though we were giving it some
>> 20GB.  Only some 12GB of that is accounted for in our part of the world.
>>
>>
>>
>> Looking at an Eclipse MAT dump, the main thing that still seems to
>> grow over time is the set of TokenStreamComponent objects that are being held
>> indirectly by org.apache.lucene.index.FieldInvertState objects.  The
>> number of FieldInvertState objects grows and grows.  By the middle of
>> the indexing process, there are 30 of these, and each one of these
>> seems to hold onto one TokenStreamComponent per field.  (Each
>> TokenStreamComponent in turn holds onto a whole pile of things like
>> ICU tokenizers etc, so there's a strong multiplicative factor
>> involved, which in the end winds up holding about 10GB of memory for
>> those 30 objects.)
>>
>>
>>
>> The question: Why does the number of FieldInvertState objects grow
>> over time during indexing?  Are these associated in some way with
>> segments?  Is this expected behavior?
>>
>>
>>
>> Thanks!
>>
>> Karl
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>

