[
https://issues.apache.org/jira/browse/LUCENE-6814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906797#comment-14906797
]
Alex Chow commented on LUCENE-6814:
-----------------------------------
{quote}
Tokenstream consumer workflow is clearly defined:
https://lucene.apache.org/core/5_3_1/core/org/apache/lucene/analysis/TokenStream.html
The last step is close(). There is nothing confusing, just RTFM.
{quote}
Yeah, it just took a little while to fully understand how things interact with
reuse ("expert mode").
bq. I think it's unlikely Lucene will have another 4.10.x release, and ES is
releasing 2.0.0 (using Lucene 5.3.x) shortly.
That's pretty unfortunate. The tracker suggests that there are some patches
slated for 4.10.5. Is 4.10 dead now?
I can't find any stated plan for 1.x once 2.0 hits GA. 1.7 was cut in July, so
that release will probably be supported through early 2017. I'll probably have
to push for a forked PatternTokenizer to get a fix into 1.x?
bq. Can you describe what impact you're seeing from this bug? How many
{{PatternTokenizer}} instances is ES keeping in your case, how large are your
docs, etc.? You could probably lower the ES bulk indexing thread pool size (if
you don't in fact need so much concurrency) to reduce the impact of the bug...
Our setup has 24 bulk indexing threads, and at peak we go through about 18
tasks/s (8k documents indexed per bulk request) per node with 21 nodes and 168
indices.
We've been making an effort to reduce heap sizes (from 22gb to 8gb using 3x
nodes), and {{PatternTokenizer}} ends up retaining around 2gb before we get
into a GC death spiral (CMS with initiating occupancy=75%); it would otherwise
grow a bit more. The biggest {{PatternTokenizer}} instances get to about 3-4mb.
{{SegmentReaders}} occupy 3.5gb per node, so there's not much room to work with
unless we want to increase heap by at least 1.5x. This pretty much destroys
horizontal scaling: because the retained heap grows with data volume, adding
nodes isn't enough, and we end up scaling diagonally (more nodes *and* bigger
heaps).
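(Rough arithmetic, assuming Lucene's default reuse strategy of one
{{TokenStreamComponents}} per thread per {{Analyzer}} instance and one analyzer
per index: 168 indices x 24 bulk threads is on the order of 4,000 cached
{{PatternTokenizer}} instances per node, so an average of only ~500kb of stuck
buffer each is enough to explain ~2gb of retained heap.)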
I'm pretty surprised nobody has really noticed this so far. Does everyone just
like huge heaps (or just not use {{PatternAnalyzer}})?
bq. I think this bug means {{PatternTokenizer}} holds onto the max sized doc it
ever saw in heap right? Does {{StringBuilder}} ever reduce its allocated space
by itself...
I think this ends up growing to the max-sized field rather than the max-sized
doc. As far as I can tell, {{trimToSize}} is the only way for {{StringBuilder}}
to shrink its buffer, and even then the documentation suggests it isn't
guaranteed (lots of "may")...
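A quick demonstration of the buffer behavior (exact capacities are
implementation-dependent):
{code:java}
StringBuilder sb = new StringBuilder();
sb.append(new char[1 << 20]);       // backing array grows to >= 1M chars (~2mb)
sb.setLength(0);                    // clears the contents (what reuse does)...
System.out.println(sb.capacity());  // ...but capacity is still >= 1048576
sb.trimToSize();                    // javadoc: "Attempts to reduce storage used"
System.out.println(sb.capacity());  // typically 0 now, though not guaranteed
{code}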
> PatternTokenizer should free heap after it's done
> -------------------------------------------------
>
> Key: LUCENE-6814
> URL: https://issues.apache.org/jira/browse/LUCENE-6814
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: Trunk, 5.4
>
> Attachments: LUCENE-6814.patch, LUCENE-6814.patch
>
>
> Caught by Alex Chow in this Elasticsearch issue:
> https://github.com/elastic/elasticsearch/issues/13721
> Today, PatternTokenizer reuses a single StringBuilder, but it doesn't free
> its heap usage after tokenizing is done. We can either stop reusing, or ask
> it to {{.trimToSize}} when we are done ...
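A minimal sketch of the {{trimToSize}} option described above (hypothetical,
not necessarily the committed patch; {{str}} is PatternTokenizer's reused
buffer field):
{code:java}
@Override
public void close() throws IOException {
  try {
    super.close();
  } finally {
    str.setLength(0);  // drop the contents of the reused buffer...
    str.trimToSize();  // ...and ask StringBuilder to release the backing array
  }
}
{code}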