[ 
https://issues.apache.org/jira/browse/LUCENE-6814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906550#comment-14906550
 ] 

Michael McCandless commented on LUCENE-6814:
--------------------------------------------

Commenting on the parts you edited away, because they raise good points of confusion!

bq. I'm curious why shouldn't there be some trimming in `end()` as well? 
Or is a `TokenStream` meant to be used only once (no multiple `reset()`, 
`incrementToken()`, `end()` on the same `TokenStream`)?

The {{TokenStream}} API is confusing :)

I started with {{end}} here too (it seemed correct), but it turns out {{close}} 
is also called (internally, in Lucene's {{IndexWriter}}) after all tokens are 
iterated for a single input, and {{close}} is called even on exception (while 
{{end}} is not necessarily, I think).

The {{TokenStream}} instance is typically thread-private, and re-used (for that 
one thread) for analysis on future docs.
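The consumer-side call order described above can be sketched with a toy stand-in class (this is a hypothetical illustration of the contract, not Lucene's actual {{TokenStream}} API, which uses attributes rather than returning tokens directly):

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for the TokenStream lifecycle: reset() before the first
// incrementToken(), end() after the last token, and close() always --
// even when an exception escapes the loop.
class ToyTokenStream {
    private String[] tokens;
    private int pos;

    void setReader(String input) { tokens = input.split("\\s+"); }

    void reset() { pos = 0; }

    boolean incrementToken() { return pos < tokens.length; }

    String current() { return tokens[pos++]; }

    void end() { /* record final offset state here */ }

    void close() { /* release per-document resources; runs even on exception */ }
}

public class TokenStreamLifecycleDemo {
    // Consumer loop mirroring how a single per-thread instance is driven
    // once per document, then reused for the next one.
    static List<String> consume(ToyTokenStream ts, String input) {
        List<String> out = new ArrayList<>();
        ts.setReader(input);
        try {
            ts.reset();
            while (ts.incrementToken()) {
                out.add(ts.current());
            }
            ts.end();       // may be skipped if an exception is thrown above
        } finally {
            ts.close();     // always called, matching the guarantee described
        }
        return out;
    }

    public static void main(String[] args) {
        ToyTokenStream ts = new ToyTokenStream();
        // Same instance reused across "documents", like a per-thread Tokenizer.
        System.out.println(consume(ts, "hello world"));
        System.out.println(consume(ts, "reused again"));
    }
}
```

This is why {{close}}, not {{end}}, is the safe place to free per-document state: the {{finally}} position is the only one guaranteed to run.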

bq. Elasticsearch seems to never reinstantiate Tokenizers and just reuses them 
for each field in an index, though I may be wrong. Or elasticsearch is using 
TokenStream the wrong way?

ES uses Lucene's {{Analyzer}} (well, {{DelegatingAnalyzerWrapper}}, I think), 
which (by default) reuses the {{Tokenizer}} instance per thread.

bq. It'd be great if this can get added to 4.10 so elasticsearch 1.x can pull 
it in too.

I think it's unlikely Lucene will have another 4.10.x release, and ES is 
releasing 2.0.0 (using Lucene 5.3.x) shortly.

Can you describe what impact you're seeing from this bug?  How many 
{{PatternTokenizer}} instances is ES keeping in your case, how large are your 
docs, etc.?  You could probably lower the ES bulk indexing thread pool size (if 
you don't in fact need so much concurrency) to reduce the impact of the bug ...

I think this bug means {{PatternTokenizer}} holds onto heap proportional to the 
largest doc it ever saw, right?  Does {{StringBuilder}} ever reduce its 
allocated space by itself...
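A quick pure-JDK check answers that question: clearing a {{StringBuilder}} with {{setLength(0)}} keeps the grown backing array, and only an explicit {{trimToSize}} releases it (a minimal demo, not the PatternTokenizer code itself):

```java
public class StringBuilderCapacityDemo {
    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();

        // Simulate tokenizing one very large document.
        for (int i = 0; i < 1_000_000; i++) {
            sb.append('x');
        }
        int grown = sb.capacity();      // backing array >= 1,000,000 chars

        // Clearing the contents does NOT shrink the backing array.
        sb.setLength(0);
        int afterClear = sb.capacity(); // same as grown

        // trimToSize() is what actually releases the excess heap.
        sb.trimToSize();
        int afterTrim = sb.capacity();  // shrinks to the current length (0)

        System.out.println(grown + " " + afterClear + " " + afterTrim);
    }
}
```

So a reused instance retains the peak allocation forever unless something calls {{trimToSize}} after tokenizing is done, which is exactly the fix this issue proposes.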

> PatternTokenizer should free heap after it's done
> -------------------------------------------------
>
>                 Key: LUCENE-6814
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6814
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: Trunk, 5.4
>
>         Attachments: LUCENE-6814.patch, LUCENE-6814.patch
>
>
> Caught by Alex Chow in this Elasticsearch issue: 
> https://github.com/elastic/elasticsearch/issues/13721
> Today, PatternTokenizer reuses a single StringBuilder, but it doesn't free 
> its heap usage after tokenizing is done.  We can either stop reusing, or ask 
> it to {{.trimToSize}} when we are done ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
