[
https://issues.apache.org/jira/browse/LUCENE-8651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Daniel Meehl updated LUCENE-8651:
---------------------------------
Attachment: LUCENE-8650-2.patch
> Tokenizer implementations can't be reset
> ----------------------------------------
>
> Key: LUCENE-8651
> URL: https://issues.apache.org/jira/browse/LUCENE-8651
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Daniel Meehl
> Priority: Major
> Attachments: LUCENE-8650-2.patch
>
>
> The fine print here is that they can't be reset without calling setReader()
> before every call to reset(). The reason is that Tokenizer violates the
> contract set out by TokenStream.reset(), which reads:
> "Resets this stream to a clean state. Stateful implementations must implement
> this method so that they can be reused, just as if they had been created
> fresh."
> A Tokenizer implementation's reset() method can't reset in that manner because
> Tokenizer.close() drops the reference to the underlying Reader, a change made
> for LUCENE-2387. The catch-22 here is that we don't want to keep a Reader
> around unnecessarily (memory leak), but we would like to be able to reset()
> if necessary.
> The patches include an integration test that attempts to use a
> ConcatenatingTokenStream to join an input TokenStream with a KeywordTokenizer
> TokenStream. This test fails with an IllegalStateException thrown by
> Tokenizer.ILLEGAL_STATE_READER.
>
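The lifecycle problem described above can be illustrated with a small, self-contained sketch. This is not Lucene's code: ReaderLifecycleSketch, SketchTokenizer, nextToken() and consume() are hypothetical names, and the class only models the behaviour the report describes (the Reader reference is swapped for a placeholder once the stream is closed, so a later reset() has nothing to rewind to unless setReader() is called again).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ReaderLifecycleSketch {

    /** Simplified stand-in for a keyword-style tokenizer; NOT Lucene's actual class. */
    static class SketchTokenizer {
        // Placeholder that fails on any read; modelled on the role that
        // Tokenizer.ILLEGAL_STATE_READER plays in the report (assumption).
        private static final Reader ILLEGAL_STATE_READER = new Reader() {
            @Override public int read(char[] buf, int off, int len) {
                throw new IllegalStateException("reader not set or stream already closed");
            }
            @Override public void close() {}
        };

        private Reader input = ILLEGAL_STATE_READER;
        private Reader inputPending = ILLEGAL_STATE_READER;
        private boolean done;

        void setReader(Reader reader) {
            inputPending = reader;
        }

        // reset() can only pick up a Reader handed over by setReader().
        void reset() {
            input = inputPending;
            inputPending = ILLEGAL_STATE_READER;
            done = false;
        }

        /** Returns the whole input as one token, then null once exhausted. */
        String nextToken() {
            if (done) return null;
            done = true;
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[256];
            try {
                for (int n; (n = input.read(buf, 0, buf.length)) != -1; ) {
                    sb.append(buf, 0, n);
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return sb.toString();
        }

        // Releasing the Reader avoids the leak but also loses the input.
        void close() {
            try {
                input.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            input = inputPending = ILLEGAL_STATE_READER;
        }
    }

    /** One full pass: reset, read the single token, close. */
    static String consume(SketchTokenizer t) {
        t.reset();
        String token = t.nextToken();
        t.close();
        return token;
    }

    public static void main(String[] args) {
        SketchTokenizer t = new SketchTokenizer();
        t.setReader(new StringReader("first pass"));
        System.out.println(consume(t));

        try {
            consume(t); // second pass without setReader()
        } catch (IllegalStateException e) {
            System.out.println("second pass failed: " + e.getMessage());
        }

        t.setReader(new StringReader("second pass"));
        System.out.println(consume(t));
    }
}
```

In this model the second pass fails with an IllegalStateException and only succeeds after setReader() supplies a fresh Reader, which is the constraint a wrapping stream would have to work around if it resets its inputs without access to their Readers.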