Re: Lucene 4.0 memory usage during indexing - is this expected?

Michael McCandless Wed, 03 Oct 2012 13:55:24 -0700

Phew, thanks for bringing closure!

Mike McCandless


http://blog.mikemccandless.com
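
For readers skimming the thread, the resolution quoted below (a reused record structure that was never cleared, so each record grew while the number of objects stayed flat) can be sketched roughly as follows. This is only an illustration under stated assumptions: the Record class, its tokens list, and the process method are hypothetical names made up for this example, not Lucene classes and not the code from the application discussed here.

```java
import java.util.ArrayList;
import java.util.List;

public class RecordReuseSketch {

    /** Hypothetical reusable record; NOT a Lucene class. */
    static final class Record {
        final List<String> tokens = new ArrayList<>();

        void clear() {
            tokens.clear();
        }
    }

    /**
     * Fills one shared record once per document and returns the record's
     * size after the last document has been processed.
     */
    static int process(int docs, int tokensPerDoc, boolean clearBetweenDocs) {
        Record record = new Record(); // a single instance, reused for every document
        for (int d = 0; d < docs; d++) {
            if (clearBetweenDocs) {
                record.clear(); // the step that was missing in the leaky variant
            }
            for (int t = 0; t < tokensPerDoc; t++) {
                record.tokens.add("doc" + d + "-tok" + t);
            }
        }
        return record.tokens.size();
    }

    public static void main(String[] args) {
        // Same number of Record objects in both runs (one), very different sizes.
        System.out.println("without clear: " + process(1000, 5, false)); // 5000
        System.out.println("with clear:    " + process(1000, 5, true));  // 5
    }
}
```

In a heap dump, the leaky variant shows a constant instance count with a steadily growing retained size, which matches the symptom reported in the thread.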

On Wed, Oct 3, 2012 at 2:12 PM,  <[email protected]> wrote:
> Mystery resolved; the problem was due to an ever-increasing record size, 
> which was in turn due to a record structure that was never being cleared.  
> This caused it to appear as if the total allocation of structures used for 
> analysis was steadily growing.  But the number of such entities did NOT grow, 
> which is what gave away the solution.
>
> Thanks for the hints, and sorry for the confusion.
>
> Karl
>
> -----Original Message-----
> From: Wright Karl (Nokia-LC/Boston)
> Sent: Wednesday, October 03, 2012 12:41 PM
> To: [email protected]
> Subject: RE: Lucene 4.0 memory usage during indexing - is this expected?
>
> Threads are managed via an executor service with a fixed-size thread pool, 
> of size 16 on this machine.
>
> There are not a lot of fields in the schema (a half dozen).  We do use 
> PerFieldAnalyzerWrapper.
>
> I'm still grappling with the MAT reports; it's possible of course that we're 
> holding onto something unexpected, or even that we have a fragmentation 
> situation.  Stay tuned.
>
> Karl
>
> -----Original Message-----
> From: ext Michael McCandless [mailto:[email protected]]
> Sent: Wednesday, October 03, 2012 11:50 AM
> To: [email protected]
> Subject: Re: Lucene 4.0 memory usage during indexing - is this expected?
>
> I wish I could remember/find the Jira issue here ... there was one fairly 
> recently.
>
> Are you really sure you're not turning over threads that are coming through 
> Lucene...?  High thread turnover causes challenges for ThreadLocals ...
>
> Do you have a lot of fields?  Are you using PerFieldAnalyzerWrapper...?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Wed, Oct 3, 2012 at 10:45 AM,  <[email protected]> wrote:
>> There's a fixed-size thread pool involved in doing the indexing, of a size 
>> that depends on the machine parameters.
>> Karl
>>
>> -----Original Message-----
>> From: ext Michael McCandless [mailto:[email protected]]
>> Sent: Wednesday, October 03, 2012 10:43 AM
>> To: Wright Karl (Nokia-LC/Boston)
>> Subject: Re: Lucene 4.0 memory usage during indexing - is this expected?
>>
>> This is no good!
>>
>> Can you send an email to dev@?  This sounds very familiar ... and I had 
>> thought we committed a fix for it ... hopefully Uwe or Robert can remember 
>> what it was!
>>
>> Do you create new threads frequently, to do indexing?  Rather than pulling 
>> from a fixed pool?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Wed, Oct 3, 2012 at 8:32 AM,  <[email protected]> wrote:
>>> Hi Mike,
>>>
>>> I've got a technical question for you.
>>>
>>> For background, we've been building a new address search engine on
>>> top of Lucene 4.0.  The main customization involves a chain of custom
>>> analyzers etc, and it all works quite well.  Or at least it did until
>>> I added 7m more documents to the list.  At that point the indexing
>>> process began to run out of memory, even though we were giving it
>>> some 20GB.  Only some 12GB of that is accounted for in our part of the 
>>> world.
>>>
>>> Looking at an Eclipse MAT dump, the main thing that still seems to
>>> grow over time is the set of TokenStreamComponents objects that are
>>> being held indirectly by org.apache.lucene.index.FieldInvertState
>>> objects.  The number of FieldInvertState objects grows and grows.  By
>>> the middle of the indexing process, there are 30 of these, and each
>>> one seems to hold onto one TokenStreamComponents instance per field.
>>> (Each TokenStreamComponents instance in turn holds onto a whole pile
>>> of things like ICU tokenizers etc, so there's a strong multiplicative
>>> factor involved, which in the end winds up holding about 10GB of
>>> memory for those 30 objects.)
>>>
>>>
>>> over time during indexing?  Are these associated in some way with
>>> segments?  Is this expected behavior?
>>>
>>>
>>>
>>> Karl
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
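
The questions earlier in the thread about thread turnover and fixed pools come down to how per-thread caches behave. A minimal sketch, assuming a plain java.lang.ThreadLocal as a stand-in for whatever per-thread caching the analysis chain uses (the class and method names here are hypothetical, and this is not Lucene's actual code): with a fixed pool, the number of cached instances is bounded by the pool size, whereas creating a new thread per task creates a new cached instance per task.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ThreadLocalReuseSketch {

    /** Runs the tasks on a fixed pool; returns how many per-thread values were created. */
    static int countCreatedWithPool(int poolSize, int tasks) {
        AtomicInteger created = new AtomicInteger();
        // Stand-in for an expensive per-thread object such as an analysis chain.
        ThreadLocal<Object> perThread = ThreadLocal.withInitial(() -> {
            created.incrementAndGet();
            return new Object();
        });
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> perThread.get());
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return created.get();
    }

    /** Runs each task on a brand-new thread; returns how many per-thread values were created. */
    static int countCreatedWithNewThreads(int tasks) {
        AtomicInteger created = new AtomicInteger();
        ThreadLocal<Object> perThread = ThreadLocal.withInitial(() -> {
            created.incrementAndGet();
            return new Object();
        });
        for (int i = 0; i < tasks; i++) {
            Thread t = new Thread(() -> perThread.get());
            t.start();
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return created.get();
    }

    public static void main(String[] args) {
        // At most 16 values for the pool; one value per task for new threads.
        System.out.println("fixed pool of 16: " + countCreatedWithPool(16, 1000));
        System.out.println("thread per task:  " + countCreatedWithNewThreads(1000));
    }
}
```

With the fixed pool of 16 threads described in the thread, this pattern alone would not explain unbounded growth, which is consistent with the cause turning out to be elsewhere.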
