[ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748072#action_12748072 ]
Uwe Schindler commented on LUCENE-1859:
---------------------------------------

The problem is that the buffer could only be shrunk once per document, when the TokenStream's reset() is called (which happens before each new document). To achieve this, every TokenStream would have to notify the term attribute in reset() to shrink its buffer, which is impractical. On the other hand, shrinking on each set would mean the check (and possible reallocation) runs for every token (in what you call the inner loop).

I agree that tokens will normally not grow very large (if they do, something is wrong in the tokenization). Even something like KeywordTokenizer, which creates only a single token, has an upper limit on the term size (as far as I know).

I would set this to Minor and not take care of it before 2.9. The problem of possibly large buffers existed even in older versions, with Token as the attribute implementation. It is the same problem as keeping an ArrayList around for a very long time: it also only grows but never automatically shrinks.

> TermAttributeImpl's buffer will never "shrink" if it grows too big
> ------------------------------------------------------------------
>
>                 Key: LUCENE-1859
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1859
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>
> This was previously an issue with Token as well.
> If a TermAttributeImpl is populated with a very long buffer, it will never
> be able to reclaim this memory.
> Obviously, it can be argued that Tokenizers should never emit "large"
> tokens; however, TermAttributeImpl should have a reasonable static
> "MAX_BUFFER_SIZE" such that if the term buffer grows bigger than this, it
> shrinks back down to this size once the next token smaller than
> MAX_BUFFER_SIZE is set.
> I don't think I have actually encountered issues with this yet, but it
> seems that with multiple indexing threads you could end up with a
> char[Integer.MAX_VALUE] per thread (in the very worst case).
> Perhaps growTermBuffer should have the logic to shrink if the buffer is
> currently larger than MAX_BUFFER_SIZE and the requested size is less than
> MAX_BUFFER_SIZE.
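As a minimal sketch of the proposed shrink-on-set behavior (the class name, constant values, and method shapes below are illustrative assumptions, not Lucene's actual TermAttributeImpl code), the resize logic could look like this:

    // Sketch of the proposed shrink-on-set behavior; names and constants
    // here are hypothetical, not Lucene's API.
    public class ShrinkingTermBuffer {
      private static final int MIN_BUFFER_SIZE = 10;        // initial allocation
      private static final int MAX_BUFFER_SIZE = 16 * 1024; // assumed shrink threshold

      private char[] termBuffer = new char[MIN_BUFFER_SIZE];
      private int termLength;

      // Ensures the buffer can hold newSize chars; shrinks an oversized
      // buffer back down as soon as a normal-sized token arrives.
      private void resizeTermBuffer(int newSize) {
        if (termBuffer.length < newSize) {
          // Growth path: at least double to amortize reallocation across tokens.
          termBuffer = new char[Math.max(newSize, termBuffer.length * 2)];
        } else if (termBuffer.length > MAX_BUFFER_SIZE && newSize <= MAX_BUFFER_SIZE) {
          // Proposed shrink path: drop the oversized array so it can be collected.
          termBuffer = new char[Math.max(newSize, MIN_BUFFER_SIZE)];
        }
      }

      public void setTermBuffer(char[] buffer, int offset, int length) {
        resizeTermBuffer(length);
        System.arraycopy(buffer, offset, termBuffer, 0, length);
        termLength = length;
      }
    }

Because the check runs on every setTermBuffer call (the inner loop mentioned above), it costs only a length comparison in the common case, while still releasing a pathologically large buffer on the very next small token.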