subject:"\[jira\] \[Comment Edited\] \(LUCENE\-8651\) Tokenizer implementations can't be reset"

[jira] [Comment Edited] (LUCENE-8651) Tokenizer implementations can't be reset

2019-01-22 Thread Dan Meehl (JIRA)



[ 
https://issues.apache.org/jira/browse/LUCENE-8651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748798#comment-16748798
 ] 

Dan Meehl edited comment on LUCENE-8651 at 1/22/19 5:30 PM:


This is basically the same solution I came to in LUCENE-8650 (patch2). I ended 
up calling mine KeywordTokenStream to keep the naming in line because it 
matches what KeywordTokenizer does. 

Honestly though, the lifecycle of a Tokenizer still feels wrong to me. All 
other TokenStreams have a reset(), incrementToken(), end(), close() lifecycle. 
But Tokenizer has an extra setReader() in there, and the consumer must know 
that it's a Tokenizer and then call the extra step (assuming it even has access 
to the Reader). It feels to me like Tokenizer should have to conform to the 
same lifecycle steps as every other TokenStream. Or at least, if that can't be 
true, Tokenizer implementations should be able to set their reader by 
overriding reset(). This currently can't be done because inputPending and 
setReader() and ILLEGAL_STATE_READER are final. If this could be done then one 
could construct a Tokenizer implementation that conformed to the TokenStream 
lifecycle and then the consumer doesn't have to know anything about Tokenizer. 
After all, that is the point of an abstraction like this: If the consumer takes 
a TokenStream, then it knows what the lifecycle is. 

If the lifecycle of Tokenizer is to stay the same, I'd like to propose a 
documentation update on TokenStream and Tokenizer. I can take a swing at that 
and post a patch if you'd like.


was (Author: dmeehl):
This is basically the same solution I came to in LUCENE-8650 (patch2). I ended 
up calling mine KeywordTokenStream to keep the naming in line because it 
matches what KeywordTokenizer does. 

Honestly though, the lifecycle of a Tokenizer still feels wrong to me. All 
other TokenStreams have a reset(), incrementToken(), end(), close() lifecycle. 
But Tokenizer has an extra setReader() in there, and the consumer must know 
that it's a Tokenizer and therefore must call the extra step (assuming it even 
has access to the Reader). It feels to me like Tokenizer should have to conform 
to the same lifecycle steps as every other TokenStream. Or at least, if that 
can't be true, Tokenizer implementations should be able to set their reader by 
overriding reset(). This currently can't be done because inputPending and 
setReader() and ILLEGAL_STATE_READER are final. If this could be done then one 
could construct a Tokenizer implementation that conformed to the TokenStream 
lifecycle and then the consumer doesn't have to know anything about Tokenizer. 
After all, that is the point of an abstraction like this: If the consumer takes 
a TokenStream, then it knows what the lifecycle is. 

If the lifecycle of Tokenizer is to stay the same, I'd like to propose a 
documentation update on TokenStream and Tokenizer. I can take a swing at that 
and post a patch if you'd like.

> Tokenizer implementations can't be reset
> 
>
> Key: LUCENE-8651
> URL: https://issues.apache.org/jira/browse/LUCENE-8651
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Dan Meehl
>Priority: Major
> Attachments: LUCENE-8650-2.patch, LUCENE-8651.patch, LUCENE-8651.patch
>
>
> The fine print here is that they can't be reset without calling setReader() 
> every time before reset() is called. The reason for this is that Tokenizer 
> violates the contract put forth by TokenStream.reset() which is the following:
> "Resets this stream to a clean state. Stateful implementations must implement 
> this method so that they can be reused, just as if they had been created 
> fresh."
> Tokenizer implementation's reset function can't reset in that manner because 
> their Tokenizer.close() removes the reference to the underlying Reader 
> because of LUCENE-2387. The catch-22 here is that we don't want to 
> unnecessarily keep around a Reader (memory leak) but we would like to be able 
> to reset() if necessary.
> The patches include an integration test that attempts to use a 
> ConcatenatingTokenStream to join an input TokenStream with a KeywordTokenizer 
> TokenStream. This test fails with an IllegalStateException thrown by 
> Tokenizer.ILLEGAL_STATE_READER.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Comment Edited] (LUCENE-8651) Tokenizer implementations can't be reset

2019-01-21 Thread Dan Meehl (JIRA)



[ 
https://issues.apache.org/jira/browse/LUCENE-8651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748123#comment-16748123
 ] 

Dan Meehl edited comment on LUCENE-8651 at 1/21/19 5:46 PM:


The initial task was that I needed to wrap an incoming TokenStream with 
anchors. I tried to use ConcatenatingTokenStream with KeywordTokenizer as the 
anchors. My particular use case has a copyField directive on the field that I'm 
wrapping. Because of the copyField, DefaultIndexingChain is going to try to 
reset my TokenStream, which in turn resets the underlying TokenStream(s). This 
means that reset gets called on my KeywordTokenizer(s) and causes an 
IllegalStateException to be thrown.

It seems like ConcatenatingTokenStream should be able to use any type of 
TokenStream, otherwise I would have expected it to take a list of TokenFilter 
instead of a list of TokenStream. But, if the resulting 
ConcatenatingTokenStream is going to be used more than once, as it is in a 
copyField scenario, then it will fail if one of the underlying TokenStreams is 
a Tokenizer.

I'm not really sure what I'm proposing as a fix, to be honest. One one hand, it 
seems like TokenStreams were meant to be resettable. On the other I understand 
your point about Tokenizers and Readers.

Possible Solutions:

Perhaps Tokenizers should store captured state to be reused later? This would 
allow Tokenizer to fulfill the contract in TokenStream.

It looks like Tokenizers originating from the schema (WhiteSpaceTokenizer, 
StandardTokenizer, etc) end up turning into Field$StringTokenStream by the time 
they get to the DefaultIndexingChain. That is why they don't fail with the same 
IllegalStateException. Maybe ConcatenatingTokenStream needs to convert 
underlying Tokenizers into a non Reader based TokenStream?

 


was (Author: dmeehl):
The initial task was that I needed to wrap an incoming TokenStream with 
anchors. I tried to use KeywordTokenizer as the anchors. My particular use case 
has a copyField directive on the field that I'm wrapping. Because of the 
copyField, DefaultIndexingChain is going to try to reset my TokenStream, which 
in turn resets the underlying TokenStream(s). This means that reset gets called 
on my KeywordTokenizer(s) and causes an IllegalStateException to be thrown.

It seems like ConcatenatingTokenStream should be able to use any type of 
TokenStream, otherwise I would have expected it to take a list of TokenFilter 
instead of a list of TokenStream. But, if the resulting 
ConcatenatingTokenStream is going to be used more than once, as it is in a 
copyField scenario, then it will fail if one of the underlying TokenStreams is 
a Tokenizer.

I'm not really sure what I'm proposing as a fix, to be honest. One one hand, it 
seems like TokenStreams were meant to be resettable. On the other I understand 
your point about Tokenizers and Readers.

Possible Solutions:

Perhaps Tokenizers should store captured state to be reused later? This would 
allow Tokenizer to fulfill the contract in TokenStream.

It looks like Tokenizers originating from the schema (WhiteSpaceTokenizer, 
StandardTokenizer, etc) end up turning into Field$StringTokenStream by the time 
they get to the DefaultIndexingChain. That is why they don't fail with the same 
IllegalStateException. Maybe ConcatenatingTokenStream needs to convert 
underlying Tokenizers into a non Reader based TokenStream?

 

> Tokenizer implementations can't be reset
> 
>
> Key: LUCENE-8651
> URL: https://issues.apache.org/jira/browse/LUCENE-8651
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Dan Meehl
>Priority: Major
> Attachments: LUCENE-8650-2.patch, LUCENE-8651.patch
>
>
> The fine print here is that they can't be reset without calling setReader() 
> every time before reset() is called. The reason for this is that Tokenizer 
> violates the contract put forth by TokenStream.reset() which is the following:
> "Resets this stream to a clean state. Stateful implementations must implement 
> this method so that they can be reused, just as if they had been created 
> fresh."
> Tokenizer implementation's reset function can't reset in that manner because 
> their Tokenizer.close() removes the reference to the underlying Reader 
> because of LUCENE-2387. The catch-22 here is that we don't want to 
> unnecessarily keep around a Reader (memory leak) but we would like to be able 
> to reset() if necessary.
> The patches include an integration test that attempts to use a 
> ConcatenatingTokenStream to join an input TokenStream with a KeywordTokenizer 
> TokenStream. This test fails with an IllegalStateException thrown by 
> Tokenizer.ILLEGAL_STATE_READER.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Comment Edited] (LUCENE-8651) Tokenizer implementations can't be reset

2019-01-21 Thread Dan Meehl (JIRA)



[ 
https://issues.apache.org/jira/browse/LUCENE-8651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748123#comment-16748123
 ] 

Dan Meehl edited comment on LUCENE-8651 at 1/21/19 5:38 PM:


The initial task was that I needed to wrap an incoming TokenStream with 
anchors. I tried to use KeywordTokenizer as the anchors. My particular use case 
has a copyField directive on the field that I'm wrapping. Because of the 
copyField, DefaultIndexingChain is going to try to reset my TokenStream, which 
in turn resets the underlying TokenStream(s). This means that reset gets called 
on my KeywordTokenizer(s) and causes an IllegalStateException to be thrown.

It seems like ConcatenatingTokenStream should be able to use any type of 
TokenStream, otherwise I would have expected it to take a list of TokenFilter 
instead of a list of TokenStream. But, if the resulting 
ConcatenatingTokenStream is going to be used more than once, as it is in a 
copyField scenario, then it will fail if one of the underlying TokenStreams is 
a Tokenizer.

I'm not really sure what I'm proposing as a fix, to be honest. One one hand, it 
seems like TokenStreams were meant to be resettable. On the other I understand 
your point about Tokenizers and Readers.

Possible Solutions:

Perhaps Tokenizers should store captured state to be reused later? This would 
allow Tokenizer to fulfill the contract in TokenStream.

It looks like Tokenizers originating from the schema (WhiteSpaceTokenizer, 
StandardTokenizer, etc) end up turning into Field$StringTokenStream by the time 
they get to the DefaultIndexingChain. That is why they don't fail with the same 
IllegalStateException. Maybe ConcatenatingTokenStream needs to convert 
underlying Tokenizers into a non Reader based TokenStream?

 


was (Author: dmeehl):
The initial task was that I needed to wrap an incoming TokenStream with 
anchors. I tried to use KeywordTokenizer as the anchors. My particular use case 
has a copyField directive on the field that I'm wrapping. Because of the 
copyField, DefaultIndexingChain is going to try to reset my TokenStream, which 
in turn resets the underlying TokenStream(s). This means that reset gets called 
on my KeywordTokenizer(s) and causes an IllegalStateException to be thrown.

 

It seems like ConcatenatingTokenStream should be able to use any type of 
TokenStream, otherwise I would have expected it to take a list of TokenFilter 
instead of a list of TokenStream. But, if the resulting 
ConcatenatingTokenStream is going to be used more than once, as it is in a 
copyField scenario, then it will fail if one of the underlying TokenStreams is 
a Tokenizer.

I'm not really sure what I'm proposing as a fix, to be honest. One one hand, it 
seems like TokenStreams were meant to be resettable. On the other I understand 
your point about Tokenizers and Readers.

Possible Solutions:

Perhaps Tokenizers should store captured state to be reused later? This would 
allow Tokenizer to fulfill the contract in TokenStream.

It looks like Tokenizers originating from the schema (WhiteSpaceTokenizer, 
StandardTokenizer, etc) end up turning into Field$StringTokenStream by the time 
they get to the DefaultIndexingChain. That is why they don't fail with the same 
IllegalStateException. Maybe ConcatenatingTokenStream needs to convert 
underlying Tokenizers into a non Reader based TokenStream?

 

> Tokenizer implementations can't be reset
> 
>
> Key: LUCENE-8651
> URL: https://issues.apache.org/jira/browse/LUCENE-8651
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Dan Meehl
>Priority: Major
> Attachments: LUCENE-8650-2.patch, LUCENE-8651.patch
>
>
> The fine print here is that they can't be reset without calling setReader() 
> every time before reset() is called. The reason for this is that Tokenizer 
> violates the contract put forth by TokenStream.reset() which is the following:
> "Resets this stream to a clean state. Stateful implementations must implement 
> this method so that they can be reused, just as if they had been created 
> fresh."
> Tokenizer implementation's reset function can't reset in that manner because 
> their Tokenizer.close() removes the reference to the underlying Reader 
> because of LUCENE-2387. The catch-22 here is that we don't want to 
> unnecessarily keep around a Reader (memory leak) but we would like to be able 
> to reset() if necessary.
> The patches include an integration test that attempts to use a 
> ConcatenatingTokenStream to join an input TokenStream with a KeywordTokenizer 
> TokenStream. This test fails with an IllegalStateException thrown by 
> Tokenizer.ILLEGAL_STATE_READER.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Comment Edited] (LUCENE-8651) Tokenizer implementations can't be reset

2019-01-18 Thread Daniel Meehl (JIRA)



[ 
https://issues.apache.org/jira/browse/LUCENE-8651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746759#comment-16746759
 ] 

Daniel Meehl edited comment on LUCENE-8651 at 1/19/19 12:58 AM:


As a little more of an explanation, all I did here was to replace the 
KeywordTokenStream (from the 1st patch) to a KeywordTokenizer. This causes the 
test to fail with an IllegalStateException because the KeywordTokenizer has 
it's close() and then reset() methods called which swaps out the previously set 
reader for the Tokenizer.ILLEGAL_STATE_READER.


was (Author: dmeehl):
As a little more of an explanation, all I did here was to replace the 
KeywordTokenStream (from the 1st patch) to a KeywordTokenizer. This causes the 
test to fail with an IllegalStateException because the KeywordTokenizer has 
it's end and then reset methods called which swaps out the previously set 
reader for the Tokenizer.ILLEGAL_STATE_READER.

> Tokenizer implementations can't be reset
> 
>
> Key: LUCENE-8651
> URL: https://issues.apache.org/jira/browse/LUCENE-8651
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Daniel Meehl
>Priority: Major
> Attachments: LUCENE-8650-2.patch, LUCENE-8651.patch
>
>
> The fine print here is that they can't be reset without calling setReader() 
> every time before reset() is called. The reason for this is that Tokenizer 
> violates the contract put forth by TokenStream.reset() which is the following:
> "Resets this stream to a clean state. Stateful implementations must implement 
> this method so that they can be reused, just as if they had been created 
> fresh."
> Tokenizer implementation's reset function can't reset in that manner because 
> their Tokenizer.close() removes the reference to the underlying Reader 
> because of LUCENE-2387. The catch-22 here is that we don't want to 
> unnecessarily keep around a Reader (memory leak) but we would like to be able 
> to reset() if necessary.
> The patches include an integration test that attempts to use a 
> ConcatenatingTokenStream to join an input TokenStream with a KeywordTokenizer 
> TokenStream. This test fails with an IllegalStateException thrown by 
> Tokenizer.ILLEGAL_STATE_READER.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Comment Edited] (LUCENE-8651) Tokenizer implementations can't be reset

2019-01-18 Thread Daniel Meehl (JIRA)



[ 
https://issues.apache.org/jira/browse/LUCENE-8651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746748#comment-16746748
 ] 

Daniel Meehl edited comment on LUCENE-8651 at 1/18/19 10:57 PM:


Since this was related to LUCENE-8650, I piggy-backed on the 2nd patch in that 
ticket to make things easier. I hope that's not a problem. This means that to 
run this test, you should apply both patches: 8650 first then 8651.


was (Author: dmeehl):
Since this was related to LUCENE-8650, I piggybacked on the 2nd patch in that 
ticket to make things easier. I hope that's not a problem.

> Tokenizer implementations can't be reset
> 
>
> Key: LUCENE-8651
> URL: https://issues.apache.org/jira/browse/LUCENE-8651
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Daniel Meehl
>Priority: Major
> Attachments: LUCENE-8650-2.patch, LUCENE-8651.patch
>
>
> The fine print here is that they can't be reset without calling setReader() 
> every time before reset() is called. The reason for this is that Tokenizer 
> violates the contract put forth by TokenStream.reset() which is the following:
> "Resets this stream to a clean state. Stateful implementations must implement 
> this method so that they can be reused, just as if they had been created 
> fresh."
> Tokenizer implementation's reset function can't reset in that manner because 
> their Tokenizer.end() removes the reference to the underlying Reader because 
> of LUCENE-2387. The catch-22 here is that we don't want to unnecessarily keep 
> around a Reader (memory leak) but we would like to be able to reset() if 
> necessary.
> The patches include an integration test that attempts to use a 
> ConcatenatingTokenStream to join an input TokenStream with a KeywordTokenizer 
> TokenStream. This test fails with an IllegalStateException thrown by 
> Tokenizer.ILLEGAL_STATE_READER.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Comment Edited] (LUCENE-8651) Tokenizer implementations can't be reset

[jira] [Comment Edited] (LUCENE-8651) Tokenizer implementations can't be reset

[jira] [Comment Edited] (LUCENE-8651) Tokenizer implementations can't be reset

[jira] [Comment Edited] (LUCENE-8651) Tokenizer implementations can't be reset

[jira] [Comment Edited] (LUCENE-8651) Tokenizer implementations can't be reset

5 matches

Site Navigation

Mail list logo

Footer information