[jira] [Comment Edited] (LUCENE-8650) ConcatenatingTokenStream does not end() nor reset() properly

2019-01-18 Thread Daniel Meehl (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746705#comment-16746705
 ] 

Daniel Meehl edited comment on LUCENE-8650 at 1/19/19 1:00 AM:
---

[~romseygeek] Yes I will. The core issue is that Tokenizer implementations end 
up clearing their Reader when they close() and thus can never reset() without 
setting a new Reader.
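A minimal, self-contained sketch of the problem described above (mock class, not Lucene's actual Tokenizer API): close() releases the Reader, so a later reset() has nothing to read from until setReader() is called again.

```java
import java.io.Reader;
import java.io.StringReader;

// Hypothetical stand-in for a Tokenizer, for illustration only.
class TokenizerSketch {
    private Reader input;

    TokenizerSketch(Reader input) { this.input = input; }

    void close() {
        input = null; // Lucene's Tokenizer similarly releases its Reader here
    }

    void reset() {
        if (input == null) {
            // Without a new Reader being set, reset() cannot succeed.
            throw new IllegalStateException("Reader was cleared by close()");
        }
    }

    void setReader(Reader r) { input = r; }
}
```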


was (Author: dmeehl):
[~romseygeek] Yes I will. The core issue is that Tokenizer implementations end 
up clearing their Reader when they end() and thus can never reset() without 
setting a new Reader.

> ConcatenatingTokenStream does not end() nor reset() properly
> 
>
> Key: LUCENE-8650
> URL: https://issues.apache.org/jira/browse/LUCENE-8650
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Daniel Meehl
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8650-1.patch, LUCENE-8650-2.patch, 
> LUCENE-8650-3.patch
>
>
> All (I think) TokenStream implementations set a "final offset" after calling 
> super.end() in their end() methods. ConcatenatingTokenStream fails to do 
> this. Because of this, its final offset is not readable and 
> DefaultIndexingChain in turn fails to set the lastStartOffset properly. This 
> results in problems with indexing which can include unsearchable content or 
> IllegalStateExceptions.
> ConcatenatingTokenStream also fails to reset() properly. Specifically, it 
> does not set its currentSource and offsetIncrement back to 0. Because of 
> this, copyField directives (in the schema) do not work and content becomes 
> unsearchable.
> I've created a few patches that illustrate the problem and then provide a fix.
> The first patch enhances TestConcatenatingTokenStream to check for 
> finalOffset, which as you can see ends up being 0.
> I created the next patch separately because it includes extra classes used 
> for the testing that Lucene may or may not want to merge in. This patch adds 
> an integration test that loads some content into the 'text' field. The schema 
> then copies it to 'content' using a copyField directive. The test searches in 
> the content field for the loaded text and fails to find it even though the 
> field does contain the content. Flip the debug flag to see a nicer printout 
> of the response and what's in the index. Notice that the added class I 
> alluded to is KeywordTokenStream. This class had to be added because of 
> another (ultimately unrelated) problem: ConcatenatingTokenStream cannot 
> concatenate Tokenizers. This is because Tokenizer violates the contract put 
> forth by TokenStream.reset(). This separate problem warrants its own ticket, 
> though. However, ultimately KeywordTokenStream may be useful to others and 
> could be considered for adding to the repo.
> The third patch finally fixes ConcatenatingTokenStream by storing and setting 
> a finalOffset as the last task in the end() method, and resetting 
> currentSource, offsetIncrement and finalOffset when reset() is called.
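The fix described in the third patch can be sketched with a self-contained mock (plain fields instead of Lucene's TokenStream/OffsetAttribute API; names like consumeAll are illustrative, not Lucene methods): end() records the final offset so the indexing chain can read it, and reset() puts currentSource, offsetIncrement and finalOffset back to 0 so the stream can be replayed, e.g. for a copyField directive.

```java
// Hypothetical sketch of the patched behavior, not the real class.
class ConcatenatingStreamSketch {
    private final int[][] sources;   // each source: its token end offsets
    private int currentSource = 0;   // index of the source being consumed
    private int offsetIncrement = 0; // offset shift applied to later sources
    private int finalOffset = 0;     // recorded by end(), cleared by reset()

    ConcatenatingStreamSketch(int[][] sources) { this.sources = sources; }

    // Consume every source in order, tracking the concatenated end offset.
    void consumeAll() {
        for (; currentSource < sources.length; currentSource++) {
            int last = offsetIncrement;
            for (int endOffset : sources[currentSource]) {
                last = offsetIncrement + endOffset;
            }
            offsetIncrement = last;
        }
    }

    // end(): store the final offset as the last step, as the patch does,
    // so the equivalent of DefaultIndexingChain can read it.
    int end() {
        finalOffset = offsetIncrement;
        return finalOffset;
    }

    // reset(): return currentSource, offsetIncrement and finalOffset to 0
    // so the stream can be consumed again from the start.
    void reset() {
        currentSource = 0;
        offsetIncrement = 0;
        finalOffset = 0;
    }
}
```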



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-8650) ConcatenatingTokenStream does not end() nor reset() properly

2019-01-18 Thread Daniel Meehl (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746751#comment-16746751
 ] 

Daniel Meehl edited comment on LUCENE-8650 at 1/18/19 10:52 PM:


[~romseygeek], I filed that ticket here: LUCENE-8651


was (Author: dmeehl):
[~romseygeek] Filed that ticket here: LUCENE-8651
