[ https://issues.apache.org/jira/browse/LUCENE-6814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906797#comment-14906797 ]

Alex Chow commented on LUCENE-6814:
-----------------------------------

{quote}
Tokenstream consumer workflow is clearly defined:
https://lucene.apache.org/core/5_3_1/core/org/apache/lucene/analysis/TokenStream.html
The last step is close(). There is nothing confusing, just RTFM.
{quote}

Yeah, it just took a little while to fully understand how things interact with 
reuse ("expert mode").

bq. I think it's unlikely Lucene will have another 4.10.x release, and ES is 
releasing 2.0.0 (using Lucene 5.3.x) shortly.

That's pretty unfortunate. The tracker suggests that there are some patches 
slated for 4.10.5. Is 4.10 dead now?

I can't find what the plan for 1.x is once 2.0 hits GA. 1.7 was cut in July, so 
that release will probably be supported through early 2017. I'll probably have 
to push for a forked PatternTokenizer to get this fixed in 1.x?

bq. Can you describe what impact you're seeing from this bug? How many 
{{PatternTokenizer}} instances is ES keeping in your case, how large are your 
docs, etc.? You could probably lower the ES bulk indexing thread pool size (if 
you don't in fact need so much concurrency) to reduce the impact of the bug...

Our setup has 24 bulk indexing threads, and at peak we go through about 18 
tasks/s (8k documents indexed per bulk request) per node with 21 nodes and 168 
indices.

We've been making an effort to reduce heap sizes (from 22gb to 8gb, using 3x 
the nodes), and {{PatternTokenizer}} ends up retaining around 2gb before we get 
into a GC death spiral (CMS with initiating occupancy=75%), and it would 
otherwise grow a bit more. The biggest {{PatternTokenizer}} instances get to 
about 3-4mb. {{SegmentReaders}} occupy 3.5gb per node, so there's not much 
headroom to work with unless we increase the heap by at least 1.5x. This pretty 
much destroys horizontal scaling; we end up scaling more... diagonally, with 
volume depending on the data?

I'm pretty surprised nobody has really noticed this so far. Does everyone just 
like huge heaps (or just not use {{PatternAnalyzer}})?

bq. I think this bug means {{PatternTokenizer}} holds onto the max sized doc it 
ever saw in heap right? Does {{StringBuilder}} ever reduce its allocated space 
by itself...

I think this ends up growing to the max sized field. As far as I can tell, 
{{trimToSize}} is the only way for {{StringBuilder}} to shrink its buffer, and 
even then the documentation suggests it isn't guaranteed (lots of "may")...
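
The retention is easy to demonstrate; the sizes here are arbitrary, and the 
post-trim capacity is what OpenJDK happens to do rather than anything the spec 
promises:

{code:java}
public class StringBuilderRetention {
  public static void main(String[] args) {
    StringBuilder sb = new StringBuilder();
    sb.append(new char[4 * 1024 * 1024]); // simulate one huge document (~8mb of char data)
    sb.setLength(0);                      // clears the contents...
    System.out.println(sb.capacity());    // ...but capacity is still >= 4194304
    sb.trimToSize();                      // javadoc only says it "may" reduce storage
    System.out.println(sb.capacity());    // 0 on OpenJDK, but not guaranteed by the spec
  }
}
{code}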

> PatternTokenizer should free heap after it's done
> -------------------------------------------------
>
>                 Key: LUCENE-6814
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6814
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: Trunk, 5.4
>
>         Attachments: LUCENE-6814.patch, LUCENE-6814.patch
>
>
> Caught by Alex Chow in this Elasticsearch issue: 
> https://github.com/elastic/elasticsearch/issues/13721
> Today, PatternTokenizer reuses a single StringBuilder, but it doesn't free 
> its heap usage after tokenizing is done.  We can either stop reusing, or ask 
> it to {{.trimToSize}} when we are done ...
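
For what it's worth, the {{trimToSize}} option would presumably look something 
like this inside {{PatternTokenizer#close}} (a sketch under my assumptions; I'm 
guessing the reused buffer field is called {{str}}, and the attached patches 
are authoritative):

{code:java}
// Sketch only -- assumes PatternTokenizer's reused buffer is the field 'str'.
@Override
public void close() throws IOException {
  try {
    super.close();
  } finally {
    str.setLength(0);  // forget the previous document's characters
    str.trimToSize();  // ask StringBuilder to release the oversized array
  }
}
{code}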



