[
https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alan Woodward resolved LUCENE-8650.
-----------------------------------
Resolution: Fixed
Fix Version/s: 7.7
> ConcatenatingTokenStream does not end() nor reset() properly
> ------------------------------------------------------------
>
> Key: LUCENE-8650
> URL: https://issues.apache.org/jira/browse/LUCENE-8650
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/analysis
> Reporter: Dan Meehl
> Assignee: Alan Woodward
> Priority: Major
> Fix For: 7.7
>
> Attachments: LUCENE-8650-1.patch, LUCENE-8650-2.patch,
> LUCENE-8650-3.patch, LUCENE-8650.patch
>
>
> All (I think) TokenStream implementations set a "final offset" after calling
> super.end() in their end() methods. ConcatenatingTokenStream fails to do
> this. As a result, its final offset is not readable, DefaultIndexingChain in
> turn fails to set lastStartOffset properly, and indexing breaks in ways that
> can include unsearchable content or IllegalStateExceptions.
> ConcatenatingTokenStream also fails to reset() properly. Specifically, it
> does not set its currentSource and offsetIncrement back to 0. Because of
> this, copyField directives (in the schema) do not work, and content becomes
> unsearchable.
> I've created a few patches that illustrate the problem and then provide a fix.
> The first patch enhances TestConcatenatingTokenStream to check the
> finalOffset, which, as the test shows, ends up being 0.
> I created the next patch separately because it includes extra classes, used
> for testing, that Lucene may or may not want to merge in. This patch adds an
> integration test that loads some content into the 'text' field; the schema
> then copies it to 'content' using a copyField directive. The test searches
> the content field for the loaded text and fails to find it, even though the
> field does contain the content. Flip the debug flag to see a nicer printout
> of the response and of what's in the index. Note that the added class I
> alluded to is KeywordTokenStream. It had to be added because of another
> (ultimately unrelated) problem: ConcatenatingTokenStream cannot concatenate
> Tokenizers, because Tokenizer violates the contract put forth by
> TokenStream.reset(). That separate problem warrants its own ticket, though.
> Ultimately, KeywordTokenStream may be useful to others and could be
> considered for adding to the repo.
> The third patch finally fixes ConcatenatingTokenStream by storing and setting
> a finalOffset as the last task in the end() method, and resetting
> currentSource, offsetIncrement and finalOffset when reset() is called.
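The end()/reset() pattern described above can be sketched in plain Java. This is not the real Lucene code: ConcatStream and Source below are hypothetical stand-ins that model only the currentSource / offsetIncrement / finalOffset bookkeeping the patches fix, without the actual TokenStream and attribute machinery.

```java
import java.util.List;

class ConcatDemo {
    // Hypothetical stand-in for a sub-stream: a list of tokens plus the
    // offset just past the last character the source produced.
    static class Source {
        final String[] tokens;
        final int endOffset;
        Source(String[] tokens, int endOffset) {
            this.tokens = tokens;
            this.endOffset = endOffset;
        }
    }

    // Hypothetical stand-in for ConcatenatingTokenStream's bookkeeping.
    static class ConcatStream {
        private final List<Source> sources;
        private int currentSource;   // index of the source being consumed
        private int currentToken;    // position within that source
        private int offsetIncrement; // cumulative offsets of finished sources
        private int finalOffset;     // what end() should expose

        ConcatStream(List<Source> sources) { this.sources = sources; }

        // Returns the next token, or null when all sources are exhausted.
        String incrementToken() {
            while (currentSource < sources.size()) {
                Source s = sources.get(currentSource);
                if (currentToken < s.tokens.length) {
                    return s.tokens[currentToken++];
                }
                // Source exhausted: fold its length into the running offset.
                offsetIncrement += s.endOffset;
                currentSource++;
                currentToken = 0;
            }
            return null;
        }

        // The fix: record the final offset when the stream ends, so the
        // consumer (e.g. the indexing chain) can read it afterwards.
        void end() {
            finalOffset = offsetIncrement;
        }

        // The fix: reset ALL iteration state, not just the token position,
        // so the stream can be consumed again (e.g. for a copyField).
        void reset() {
            currentSource = 0;
            currentToken = 0;
            offsetIncrement = 0;
            finalOffset = 0;
        }

        int getFinalOffset() { return finalOffset; }
    }
}
```

Without the reset() lines zeroing currentSource and offsetIncrement, a second pass over the stream would start past the end and yield nothing, which mirrors the copyField symptom described above.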
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]