[jira] [Commented] (LUCENE-8650) ConcatenatingTokenStream does not end() nor reset() properly

2019-01-28 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753821#comment-16753821
 ] 

ASF subversion and git services commented on LUCENE-8650:
--------------------------------------------------------------

Commit 7713a4f2458c77de08193dc548807b9e90214aaf in lucene-solr's branch 
refs/heads/master from Alan Woodward
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=7713a4f ]

LUCENE-8650: Fix end() and reset() in ConcatenatingTokenStream


> ConcatenatingTokenStream does not end() nor reset() properly
> -------------------------------------------------------------
>
> Key: LUCENE-8650
> URL: https://issues.apache.org/jira/browse/LUCENE-8650
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Dan Meehl
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8650-1.patch, LUCENE-8650-2.patch, 
> LUCENE-8650-3.patch, LUCENE-8650.patch
>
>
> All (I think) TokenStream implementations set a "final offset" after calling 
> super.end() in their end() methods. ConcatenatingTokenStream fails to do 
> this. Because of this, its final offset is not readable and 
> DefaultIndexingChain in turn fails to set the lastStartOffset properly. This 
> results in problems with indexing which can include unsearchable content or 
> IllegalStateExceptions.
> ConcatenatingTokenStream also fails to reset() properly. Specifically, it 
> does not set its currentSource and offsetIncrement back to 0. Because of 
> this, copyField directives (in the schema) do not work and content becomes 
> unsearchable.
> I've created a few patches that illustrate the problem and then provide a fix.
> The first patch enhances the TestConcatenatingTokenStream to check for 
> finalOffset, which as you can see ends up being 0.
> I created the next patch separately because it includes extra classes used 
> for the testing that Lucene may or may not want to merge in. This patch adds 
> an integration test that loads some content into the 'text' field. The schema 
> then copies it to 'content' using a copyField directive. The test searches in 
> the content field for the loaded text and fails to find it even though the 
> field does contain the content. Flip the debug flag to see a nicer printout 
> of the response and what's in the index. Notice that the added class I 
> alluded to is KeywordTokenStream. This class had to be added because of 
> another (ultimately unrelated) problem: ConcatenatingTokenStream cannot 
> concatenate Tokenizers. This is because Tokenizer violates the contract put 
> forth by TokenStream.reset(). That separate problem warrants its own ticket. 
> Ultimately, though, KeywordTokenStream may be useful to others and 
> could be considered for adding to the repo.
> The third patch finally fixes ConcatenatingTokenStream by storing and setting 
> a finalOffset as the last task in the end() method, and resetting 
> currentSource, offsetIncrement and finalOffset when reset() is called.
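For reference, a minimal sketch of the shape of that fix, reusing the field names from the description above (currentSource, offsetIncrement, finalOffset); the committed change may differ in detail:

{code:java}
// Sketch only: methods inside ConcatenatingTokenStream, per the
// description above, not necessarily the committed source.
@Override
public void end() throws IOException {
  super.end();
  // expose the final offset so consumers such as DefaultIndexingChain
  // can read it after end()
  offsetAtt.setOffset(finalOffset, finalOffset);
}

@Override
public void reset() throws IOException {
  super.reset();
  // without these resets, a second consumption of the stream
  // (e.g. via a copyField) starts from stale state
  currentSource = 0;
  offsetIncrement = 0;
  finalOffset = 0;
}
{code}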






[jira] [Commented] (LUCENE-8650) ConcatenatingTokenStream does not end() nor reset() properly

2019-01-28 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753820#comment-16753820
 ] 

ASF subversion and git services commented on LUCENE-8650:
--------------------------------------------------------------

Commit f062a18ae71642b831af2026748e74a2e78b1e7b in lucene-solr's branch 
refs/heads/branch_8x from Alan Woodward
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=f062a18 ]

LUCENE-8650: Fix end() and reset() in ConcatenatingTokenStream


> ConcatenatingTokenStream does not end() nor reset() properly
> -------------------------------------------------------------
>
> Key: LUCENE-8650
> URL: https://issues.apache.org/jira/browse/LUCENE-8650
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Dan Meehl
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8650-1.patch, LUCENE-8650-2.patch, 
> LUCENE-8650-3.patch, LUCENE-8650.patch
>
>
> All (I think) TokenStream implementations set a "final offset" after calling 
> super.end() in their end() methods. ConcatenatingTokenStream fails to do 
> this. Because of this, its final offset is not readable and 
> DefaultIndexingChain in turn fails to set the lastStartOffset properly. This 
> results in problems with indexing which can include unsearchable content or 
> IllegalStateExceptions.
> ConcatenatingTokenStream also fails to reset() properly. Specifically, it 
> does not set its currentSource and offsetIncrement back to 0. Because of 
> this, copyField directives (in the schema) do not work and content becomes 
> unsearchable.
> I've created a few patches that illustrate the problem and then provide a fix.
> The first patch enhances the TestConcatenatingTokenStream to check for 
> finalOffset, which as you can see ends up being 0.
> I created the next patch separately because it includes extra classes used 
> for the testing that Lucene may or may not want to merge in. This patch adds 
> an integration test that loads some content into the 'text' field. The schema 
> then copies it to 'content' using a copyField directive. The test searches in 
> the content field for the loaded text and fails to find it even though the 
> field does contain the content. Flip the debug flag to see a nicer printout 
> of the response and what's in the index. Notice that the added class I 
> alluded to is KeywordTokenStream. This class had to be added because of 
> another (ultimately unrelated) problem: ConcatenatingTokenStream cannot 
> concatenate Tokenizers. This is because Tokenizer violates the contract put 
> forth by TokenStream.reset(). That separate problem warrants its own ticket. 
> Ultimately, though, KeywordTokenStream may be useful to others and 
> could be considered for adding to the repo.
> The third patch finally fixes ConcatenatingTokenStream by storing and setting 
> a finalOffset as the last task in the end() method, and resetting 
> currentSource, offsetIncrement and finalOffset when reset() is called.






[jira] [Commented] (LUCENE-8650) ConcatenatingTokenStream does not end() nor reset() properly

2019-01-28 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753819#comment-16753819
 ] 

ASF subversion and git services commented on LUCENE-8650:
--------------------------------------------------------------

Commit 874ff046dfd00cbbc618dfb696f44988adae14b3 in lucene-solr's branch 
refs/heads/branch_7x from Alan Woodward
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=874ff04 ]

LUCENE-8650: Fix end() and reset() in ConcatenatingTokenStream


> ConcatenatingTokenStream does not end() nor reset() properly
> -------------------------------------------------------------
>
> Key: LUCENE-8650
> URL: https://issues.apache.org/jira/browse/LUCENE-8650
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Dan Meehl
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8650-1.patch, LUCENE-8650-2.patch, 
> LUCENE-8650-3.patch, LUCENE-8650.patch
>
>
> All (I think) TokenStream implementations set a "final offset" after calling 
> super.end() in their end() methods. ConcatenatingTokenStream fails to do 
> this. Because of this, its final offset is not readable and 
> DefaultIndexingChain in turn fails to set the lastStartOffset properly. This 
> results in problems with indexing which can include unsearchable content or 
> IllegalStateExceptions.
> ConcatenatingTokenStream also fails to reset() properly. Specifically, it 
> does not set its currentSource and offsetIncrement back to 0. Because of 
> this, copyField directives (in the schema) do not work and content becomes 
> unsearchable.
> I've created a few patches that illustrate the problem and then provide a fix.
> The first patch enhances the TestConcatenatingTokenStream to check for 
> finalOffset, which as you can see ends up being 0.
> I created the next patch separately because it includes extra classes used 
> for the testing that Lucene may or may not want to merge in. This patch adds 
> an integration test that loads some content into the 'text' field. The schema 
> then copies it to 'content' using a copyField directive. The test searches in 
> the content field for the loaded text and fails to find it even though the 
> field does contain the content. Flip the debug flag to see a nicer printout 
> of the response and what's in the index. Notice that the added class I 
> alluded to is KeywordTokenStream. This class had to be added because of 
> another (ultimately unrelated) problem: ConcatenatingTokenStream cannot 
> concatenate Tokenizers. This is because Tokenizer violates the contract put 
> forth by TokenStream.reset(). That separate problem warrants its own ticket. 
> Ultimately, though, KeywordTokenStream may be useful to others and 
> could be considered for adding to the repo.
> The third patch finally fixes ConcatenatingTokenStream by storing and setting 
> a finalOffset as the last task in the end() method, and resetting 
> currentSource, offsetIncrement and finalOffset when reset() is called.






[jira] [Commented] (LUCENE-8650) ConcatenatingTokenStream does not end() nor reset() properly

2019-01-21 Thread Alan Woodward (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16748069#comment-16748069
 ] 

Alan Woodward commented on LUCENE-8650:
--------------------------------------------------------------

I've attached an updated patch that includes your fix for offsets, and also 
handles final position increments (to allow for things like stop words being 
removed at the end of a tokenstream).

> Why set the start and end to the same value?

Convention, mainly - the end offset is the only meaningful value here, but 
OffsetAttribute doesn't allow you to set start and end individually.
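As an illustration, the conventional end() shape being described looks roughly like this (a sketch only; posIncAtt, offsetAtt, finalOffset and skippedPositions are stand-in names, not the committed code):

{code:java}
@Override
public void end() throws IOException {
  super.end(); // resets attributes to their end-of-stream defaults
  // restore position increments skipped at the end of the stream,
  // e.g. trailing stop words that were removed
  posIncAtt.setPositionIncrement(posIncAtt.getPositionIncrement() + skippedPositions);
  // only the end offset is meaningful here, but OffsetAttribute#setOffset
  // sets start and end together, hence the start == end convention
  offsetAtt.setOffset(finalOffset, finalOffset);
}
{code}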

> ConcatenatingTokenStream does not end() nor reset() properly
> -------------------------------------------------------------
>
> Key: LUCENE-8650
> URL: https://issues.apache.org/jira/browse/LUCENE-8650
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Dan Meehl
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8650-1.patch, LUCENE-8650-2.patch, 
> LUCENE-8650-3.patch, LUCENE-8650.patch
>
>
> All (I think) TokenStream implementations set a "final offset" after calling 
> super.end() in their end() methods. ConcatenatingTokenStream fails to do 
> this. Because of this, its final offset is not readable and 
> DefaultIndexingChain in turn fails to set the lastStartOffset properly. This 
> results in problems with indexing which can include unsearchable content or 
> IllegalStateExceptions.
> ConcatenatingTokenStream also fails to reset() properly. Specifically, it 
> does not set its currentSource and offsetIncrement back to 0. Because of 
> this, copyField directives (in the schema) do not work and content becomes 
> unsearchable.
> I've created a few patches that illustrate the problem and then provide a fix.
> The first patch enhances the TestConcatenatingTokenStream to check for 
> finalOffset, which as you can see ends up being 0.
> I created the next patch separately because it includes extra classes used 
> for the testing that Lucene may or may not want to merge in. This patch adds 
> an integration test that loads some content into the 'text' field. The schema 
> then copies it to 'content' using a copyField directive. The test searches in 
> the content field for the loaded text and fails to find it even though the 
> field does contain the content. Flip the debug flag to see a nicer printout 
> of the response and what's in the index. Notice that the added class I 
> alluded to is KeywordTokenStream. This class had to be added because of 
> another (ultimately unrelated) problem: ConcatenatingTokenStream cannot 
> concatenate Tokenizers. This is because Tokenizer violates the contract put 
> forth by TokenStream.reset(). That separate problem warrants its own ticket. 
> Ultimately, though, KeywordTokenStream may be useful to others and 
> could be considered for adding to the repo.
> The third patch finally fixes ConcatenatingTokenStream by storing and setting 
> a finalOffset as the last task in the end() method, and resetting 
> currentSource, offsetIncrement and finalOffset when reset() is called.






[jira] [Commented] (LUCENE-8650) ConcatenatingTokenStream does not end() nor reset() properly

2019-01-18 Thread Daniel Meehl (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746828#comment-16746828
 ] 

Daniel Meehl commented on LUCENE-8650:
--------------------------------------------------------------

I do have one more question related to the fix. I noticed that most of the 
other TokenStream implementations end up setting their final offset like the 
code below. Why do they do this? Why set the start and end to the same value? 
In my patch, I set the start to 0 and the end to finalOffset, because this 
seems like the correct thing to do.

{{offsetAtt.setOffset(finalOffset, finalOffset);}}
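Side by side, the two variants under discussion (illustrative only):

{code:java}
// the convention used by most TokenStream implementations in end():
offsetAtt.setOffset(finalOffset, finalOffset);

// the variant from the patch, anchoring the start at 0:
offsetAtt.setOffset(0, finalOffset);
{code}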

> ConcatenatingTokenStream does not end() nor reset() properly
> -------------------------------------------------------------
>
> Key: LUCENE-8650
> URL: https://issues.apache.org/jira/browse/LUCENE-8650
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Daniel Meehl
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8650-1.patch, LUCENE-8650-2.patch, 
> LUCENE-8650-3.patch
>
>
> All (I think) TokenStream implementations set a "final offset" after calling 
> super.end() in their end() methods. ConcatenatingTokenStream fails to do 
> this. Because of this, its final offset is not readable and 
> DefaultIndexingChain in turn fails to set the lastStartOffset properly. This 
> results in problems with indexing which can include unsearchable content or 
> IllegalStateExceptions.
> ConcatenatingTokenStream also fails to reset() properly. Specifically, it 
> does not set its currentSource and offsetIncrement back to 0. Because of 
> this, copyField directives (in the schema) do not work and content becomes 
> unsearchable.
> I've created a few patches that illustrate the problem and then provide a fix.
> The first patch enhances the TestConcatenatingTokenStream to check for 
> finalOffset, which as you can see ends up being 0.
> I created the next patch separately because it includes extra classes used 
> for the testing that Lucene may or may not want to merge in. This patch adds 
> an integration test that loads some content into the 'text' field. The schema 
> then copies it to 'content' using a copyField directive. The test searches in 
> the content field for the loaded text and fails to find it even though the 
> field does contain the content. Flip the debug flag to see a nicer printout 
> of the response and what's in the index. Notice that the added class I 
> alluded to is KeywordTokenStream. This class had to be added because of 
> another (ultimately unrelated) problem: ConcatenatingTokenStream cannot 
> concatenate Tokenizers. This is because Tokenizer violates the contract put 
> forth by TokenStream.reset(). That separate problem warrants its own ticket. 
> Ultimately, though, KeywordTokenStream may be useful to others and 
> could be considered for adding to the repo.
> The third patch finally fixes ConcatenatingTokenStream by storing and setting 
> a finalOffset as the last task in the end() method, and resetting 
> currentSource, offsetIncrement and finalOffset when reset() is called.






[jira] [Commented] (LUCENE-8650) ConcatenatingTokenStream does not end() nor reset() properly

2019-01-18 Thread Daniel Meehl (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746751#comment-16746751
 ] 

Daniel Meehl commented on LUCENE-8650:
--------------------------------------------------------------

[~romseygeek] Filed that ticket here: LUCENE-8651

> ConcatenatingTokenStream does not end() nor reset() properly
> -------------------------------------------------------------
>
> Key: LUCENE-8650
> URL: https://issues.apache.org/jira/browse/LUCENE-8650
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Daniel Meehl
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8650-1.patch, LUCENE-8650-2.patch, 
> LUCENE-8650-3.patch
>
>
> All (I think) TokenStream implementations set a "final offset" after calling 
> super.end() in their end() methods. ConcatenatingTokenStream fails to do 
> this. Because of this, its final offset is not readable and 
> DefaultIndexingChain in turn fails to set the lastStartOffset properly. This 
> results in problems with indexing which can include unsearchable content or 
> IllegalStateExceptions.
>  
> ConcatenatingTokenStream also fails to reset() properly. Specifically, it 
> does not set its currentSource and offsetIncrement back to 0. Because of 
> this, copyField directives (in the schema) do not work and content becomes 
> unsearchable.
> I've created a few patches that illustrate the problem and then provide a fix.
> The first patch enhances the TestConcatenatingTokenStream to check for 
> finalOffset, which as you can see ends up being 0.
> I created the next patch separately because it includes extra classes used 
> for the testing that Lucene may or may not want to merge in. This patch adds 
> an integration test that loads some content into the 'text' field. The schema 
> then copies it to 'content' using a copyField directive. The test searches in 
> the content field for the loaded text and fails to find it even though the 
> field does contain the content. Flip the debug flag to see a nicer printout 
> of the response and what's in the index. Notice that the added class I 
> alluded to is KeywordTokenStream. This class had to be added because of 
> another (ultimately unrelated) problem: ConcatenatingTokenStream cannot 
> concatenate Tokenizers. This is because Tokenizer violates the contract put 
> forth by TokenStream.reset(). That separate problem warrants its own ticket. 
> Ultimately, though, KeywordTokenStream may be useful to others and 
> could be considered for adding to the repo.
> The third patch finally fixes ConcatenatingTokenStream by storing and setting 
> a finalOffset as the last task in the end() method, and resetting 
> currentSource, offsetIncrement and finalOffset when reset() is called.






[jira] [Commented] (LUCENE-8650) ConcatenatingTokenStream does not end() nor reset() properly

2019-01-18 Thread Daniel Meehl (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746705#comment-16746705
 ] 

Daniel Meehl commented on LUCENE-8650:
--------------------------------------------------------------

[~romseygeek] Yes I will. The core issue is that Tokenizer implementations end 
up clearing their Reader when they end() and thus can never reset() without 
setting a new Reader.
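A compact sketch of the lifecycle being described (WhitespaceTokenizer stands in for any Tokenizer; illustrative, not from the patch):

{code:java}
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

public class TokenizerReuseSketch {
  public static void main(String[] args) throws IOException {
    Tokenizer tok = new WhitespaceTokenizer();
    tok.setReader(new StringReader("some content"));
    tok.reset();
    while (tok.incrementToken()) {
      // consume tokens
    }
    tok.end();
    tok.close(); // by the end of the lifecycle the Reader is gone

    // tok.reset();  // would fail: reset() alone cannot restart the stream
    tok.setReader(new StringReader("more content")); // a fresh Reader is required
    tok.reset();     // now legal again
  }
}
{code}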

> ConcatenatingTokenStream does not end() nor reset() properly
> -------------------------------------------------------------
>
> Key: LUCENE-8650
> URL: https://issues.apache.org/jira/browse/LUCENE-8650
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Daniel Meehl
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8650-1.patch, LUCENE-8650-2.patch, 
> LUCENE-8650-3.patch
>
>
> All (I think) TokenStream implementations set a "final offset" after calling 
> super.end() in their end() methods. ConcatenatingTokenStream fails to do 
> this. Because of this, its final offset is not readable and 
> DefaultIndexingChain in turn fails to set the lastStartOffset properly. This 
> results in problems with indexing which can include unsearchable content or 
> IllegalStateExceptions.
>  
> ConcatenatingTokenStream also fails to reset() properly. Specifically, it 
> does not set its currentSource and offsetIncrement back to 0. Because of 
> this, copyField directives (in the schema) do not work and content becomes 
> unsearchable.
> I've created a few patches that illustrate the problem and then provide a fix.
> The first patch enhances the TestConcatenatingTokenStream to check for 
> finalOffset, which as you can see ends up being 0.
> I created the next patch separately because it includes extra classes used 
> for the testing that Lucene may or may not want to merge in. This patch adds 
> an integration test that loads some content into the 'text' field. The schema 
> then copies it to 'content' using a copyField directive. The test searches in 
> the content field for the loaded text and fails to find it even though the 
> field does contain the content. Flip the debug flag to see a nicer printout 
> of the response and what's in the index. Notice that the added class I 
> alluded to is KeywordTokenStream. This class had to be added because of 
> another (ultimately unrelated) problem: ConcatenatingTokenStream cannot 
> concatenate Tokenizers. This is because Tokenizer violates the contract put 
> forth by TokenStream.reset(). That separate problem warrants its own ticket. 
> Ultimately, though, KeywordTokenStream may be useful to others and 
> could be considered for adding to the repo.
> The third patch finally fixes ConcatenatingTokenStream by storing and setting 
> a finalOffset as the last task in the end() method, and resetting 
> currentSource, offsetIncrement and finalOffset when reset() is called.






[jira] [Commented] (LUCENE-8650) ConcatenatingTokenStream does not end() nor reset() properly

2019-01-18 Thread Daniel Meehl (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746700#comment-16746700
 ] 

Daniel Meehl commented on LUCENE-8650:
--------------------------------------------------------------

Relates to LUCENE-2387 because that's the root cause of Tokenizer 
implementations not reset()'ing properly.

> ConcatenatingTokenStream does not end() nor reset() properly
> -------------------------------------------------------------
>
> Key: LUCENE-8650
> URL: https://issues.apache.org/jira/browse/LUCENE-8650
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Daniel Meehl
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8650-1.patch, LUCENE-8650-2.patch, 
> LUCENE-8650-3.patch
>
>
> All (I think) TokenStream implementations set a "final offset" after calling 
> super.end() in their end() methods. ConcatenatingTokenStream fails to do 
> this. Because of this, its final offset is not readable and 
> DefaultIndexingChain in turn fails to set the lastStartOffset properly. This 
> results in problems with indexing which can include unsearchable content or 
> IllegalStateExceptions.
>  
> ConcatenatingTokenStream also fails to reset() properly. Specifically, it 
> does not set its currentSource and offsetIncrement back to 0. Because of 
> this, copyField directives (in the schema) do not work and content becomes 
> unsearchable.
> I've created a few patches that illustrate the problem and then provide a fix.
> The first patch enhances the TestConcatenatingTokenStream to check for 
> finalOffset, which as you can see ends up being 0.
> I created the next patch separately because it includes extra classes used 
> for the testing that Lucene may or may not want to merge in. This patch adds 
> an integration test that loads some content into the 'text' field. The schema 
> then copies it to 'content' using a copyField directive. The test searches in 
> the content field for the loaded text and fails to find it even though the 
> field does contain the content. Flip the debug flag to see a nicer printout 
> of the response and what's in the index. Notice that the added class I 
> alluded to is KeywordTokenStream. This class had to be added because of 
> another (ultimately unrelated) problem: ConcatenatingTokenStream cannot 
> concatenate Tokenizers. This is because Tokenizer violates the contract put 
> forth by TokenStream.reset(). That separate problem warrants its own ticket. 
> Ultimately, though, KeywordTokenStream may be useful to others and 
> could be considered for adding to the repo.
> The third patch finally fixes ConcatenatingTokenStream by storing and setting 
> a finalOffset as the last task in the end() method, and resetting 
> currentSource, offsetIncrement and finalOffset when reset() is called.






[jira] [Commented] (LUCENE-8650) ConcatenatingTokenStream does not end() nor reset() properly

2019-01-18 Thread Alan Woodward (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746701#comment-16746701
 ] 

Alan Woodward commented on LUCENE-8650:
--------------------------------------------------------------

Thanks for opening the issue! Your fix looks good; I'll commit it early next 
week.
{quote}ConcatenatingTokenStream cannot concatenate Tokenizers
{quote}
This surprises me - can you open another issue with a test illustrating the 
problem?

> ConcatenatingTokenStream does not end() nor reset() properly
> -------------------------------------------------------------
>
> Key: LUCENE-8650
> URL: https://issues.apache.org/jira/browse/LUCENE-8650
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Daniel Meehl
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8650-1.patch, LUCENE-8650-2.patch, 
> LUCENE-8650-3.patch
>
>
> All (I think) TokenStream implementations set a "final offset" after calling 
> super.end() in their end() methods. ConcatenatingTokenStream fails to do 
> this. Because of this, its final offset is not readable and 
> DefaultIndexingChain in turn fails to set the lastStartOffset properly. This 
> results in problems with indexing which can include unsearchable content or 
> IllegalStateExceptions.
>  
> ConcatenatingTokenStream also fails to reset() properly. Specifically, it 
> does not set its currentSource and offsetIncrement back to 0. Because of 
> this, copyField directives (in the schema) do not work and content becomes 
> unsearchable.
> I've created a few patches that illustrate the problem and then provide a fix.
> The first patch enhances the TestConcatenatingTokenStream to check for 
> finalOffset, which as you can see ends up being 0.
> I created the next patch separately because it includes extra classes used 
> for the testing that Lucene may or may not want to merge in. This patch adds 
> an integration test that loads some content into the 'text' field. The schema 
> then copies it to 'content' using a copyField directive. The test searches in 
> the content field for the loaded text and fails to find it even though the 
> field does contain the content. Flip the debug flag to see a nicer printout 
> of the response and what's in the index. Notice that the added class I 
> alluded to is KeywordTokenStream. This class had to be added because of 
> another (ultimately unrelated) problem: ConcatenatingTokenStream cannot 
> concatenate Tokenizers. This is because Tokenizer violates the contract put 
> forth by TokenStream.reset(). That separate problem warrants its own ticket. 
> Ultimately, though, KeywordTokenStream may be useful to others and 
> could be considered for adding to the repo.
> The third patch finally fixes ConcatenatingTokenStream by storing and setting 
> a finalOffset as the last task in the end() method, and resetting 
> currentSource, offsetIncrement and finalOffset when reset() is called.


