[ 
https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746700#comment-16746700
 ] 

Daniel Meehl commented on LUCENE-8650:
--------------------------------------

Relates to LUCENE-2387, which is the root cause of Tokenizer 
implementations not handling reset() properly.

> ConcatenatingTokenStream does not end() nor reset() properly
> ------------------------------------------------------------
>
>                 Key: LUCENE-8650
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8650
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>            Reporter: Daniel Meehl
>            Assignee: Alan Woodward
>            Priority: Major
>         Attachments: LUCENE-8650-1.patch, LUCENE-8650-2.patch, 
> LUCENE-8650-3.patch
>
>
> All (I think) TokenStream implementations set a "final offset" after calling 
> super.end() in their end() methods. ConcatenatingTokenStream fails to do 
> this. Because of this, its final offset is not readable and 
> DefaultIndexingChain in turn fails to set the lastStartOffset properly. This 
> results in problems with indexing which can include unsearchable content or 
> IllegalStateExceptions.
>  
> ConcatenatingTokenStream also fails to reset() properly. Specifically, it 
> does not set its currentSource and offsetIncrement back to 0. Because of 
> this, copyField directives (in the schema) do not work and content becomes 
> unsearchable.
>
> I've created a few patches that illustrate the problem and then provide a fix.
> The first patch enhances the TestConcatenatingTokensStream to check for 
> finalOffset, which as you can see ends up being 0.
>
> I created the next patch separately because it includes extra classes used 
> for the testing that Lucene may or may not want to merge in. This patch adds 
> an integration test that loads some content into the 'text' field. The schema 
> then copies it to 'content' using a copyField directive. The test searches in 
> the content field for the loaded text and fails to find it even though the 
> field does contain the content. Flip the debug flag to see a nicer printout 
> of the response and what's in the index. Notice that the added class I 
> alluded to is KeywordTokenStream. This class had to be added because of 
> another (ultimately unrelated) problem: ConcatenatingTokenStream cannot 
> concatenate Tokenizers. This is because Tokenizer violates the contract put 
> forth by TokenStream.reset(). This separate problem warrants its own ticket. 
> Even so, KeywordTokenStream may be useful to others and could be considered 
> for addition to the repo.
>
> The third patch finally fixes ConcatenatingTokenStream by storing and setting 
> a finalOffset as the last task in the end() method, and resetting 
> currentSource, offsetIncrement and finalOffset when reset() is called.
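
For readers without the patches at hand, the behaviour described for the third patch can be sketched in plain Java. This is a simplified model under stated assumptions, not the Lucene code: WordSource and ConcatenatingSketch are hypothetical classes that do not extend TokenStream, tokens are reduced to their end offsets, and only the field names currentSource, offsetIncrement and finalOffset are taken from the description above.

```java
/** Hypothetical stand-in for one input stream; not Lucene's TokenStream API. */
final class WordSource {
    private final String[] words;
    private final int length;   // length of the underlying text
    private int pos;            // index of the next word to emit
    private int offset;         // end offset of the last word emitted

    WordSource(String text) {
        this.words = text.split(" ");
        this.length = text.length();
    }

    void reset() {
        pos = 0;
        offset = 0;
    }

    /** Returns the end offset of the next word, or -1 when exhausted. */
    int next() {
        if (pos >= words.length) {
            return -1;
        }
        int start = (pos == 0) ? 0 : offset + 1; // skip the single space
        offset = start + words[pos].length();
        pos++;
        return offset;
    }

    /** Final offset of this source: the length of its text. */
    int finalOffset() {
        return length;
    }
}

/** Hypothetical model of the fixed end()/reset() behaviour. */
final class ConcatenatingSketch {
    private final WordSource[] sources;
    private int currentSource;    // index of the source being consumed
    private int offsetIncrement;  // sum of final offsets of finished sources
    private int finalOffset;      // set in end(), readable afterwards

    ConcatenatingSketch(WordSource... sources) {
        this.sources = sources;
    }

    void reset() {
        for (WordSource s : sources) {
            s.reset();
        }
        // The three assignments the issue says must happen on reset().
        currentSource = 0;
        offsetIncrement = 0;
        finalOffset = 0;
    }

    /** Returns the next end offset shifted by offsetIncrement, or -1. */
    int next() {
        while (currentSource < sources.length) {
            int end = sources[currentSource].next();
            if (end >= 0) {
                return offsetIncrement + end;
            }
            // Current source is exhausted: fold in its final offset, move on.
            offsetIncrement += sources[currentSource].finalOffset();
            currentSource++;
        }
        return -1;
    }

    /** Assumes all sources were consumed; records the combined final offset. */
    void end() {
        finalOffset = offsetIncrement;
    }

    int finalOffset() {
        return finalOffset;
    }
}
```

Consuming the sketch twice, with reset() before each pass and end() after it, yields the same offsets and the same final offset on both passes. If the three assignments in reset() are removed, the second pass returns no tokens at all, which is analogous to the copyField symptom reported above.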



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
