inconsistency of tokenstream.end() with OffsetLimitTokenFilter and 
LimitTokenCountFilter
----------------------------------------------------------------------------------------

                 Key: LUCENE-3088
                 URL: https://issues.apache.org/jira/browse/LUCENE-3088
             Project: Lucene - Java
          Issue Type: Bug
            Reporter: Robert Muir


In LUCENE-3064, we added some state and checks to MockTokenizer to validate 
that consumers are properly using the TokenStream workflow (described here: 
http://lucene.apache.org/java/3_0_3/api/core/org/apache/lucene/analysis/TokenStream.html).

One inconsistency involves the following steps of that workflow:
4. The consumer calls incrementToken() until it returns false consuming the 
attributes after each call.
5. The consumer calls end() so that any end-of-stream operations can be 
performed.
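
As a rough sketch, the two steps quoted above look like this as a consumer 
loop. To keep the example runnable without Lucene on the classpath, 
SketchStream and SketchWhitespaceTokenizer are hypothetical stand-ins that 
only mimic the shape of the TokenStream contract; they are not the Lucene API, 
and the whitespace splitting is an assumption made for illustration.

```java
// Hypothetical stand-in for the TokenStream contract; NOT the Lucene API.
interface SketchStream {
    void reset();
    boolean incrementToken();   // advances to the next token; false when exhausted
    String term();              // current token text (stands in for attributes)
    void end();                 // end-of-stream operations, e.g. recording the final offset
    int finalOffset();          // only meaningful after end()
    void close();
}

// Toy whitespace tokenizer (an assumption, for illustration only).
class SketchWhitespaceTokenizer implements SketchStream {
    private final String input;
    private int pos;
    private String term;
    private int finalOffset = -1;

    SketchWhitespaceTokenizer(String input) { this.input = input; }

    public void reset() { pos = 0; term = null; finalOffset = -1; }

    public boolean incrementToken() {
        // Skip separators, then read one run of non-space characters.
        while (pos < input.length() && input.charAt(pos) == ' ') pos++;
        if (pos >= input.length()) return false;
        int start = pos;
        while (pos < input.length() && input.charAt(pos) != ' ') pos++;
        term = input.substring(start, pos);
        return true;
    }

    public String term() { return term; }
    public void end() { finalOffset = pos; }
    public int finalOffset() { return finalOffset; }
    public void close() { }
}

class WorkflowSketch {
    // Follows the documented order and returns the final offset.
    static int consume(SketchStream ts, StringBuilder out) {
        ts.reset();
        while (ts.incrementToken()) {          // step 4: until it returns false
            out.append(ts.term()).append('|');
        }
        ts.end();                              // step 5: only after exhaustion
        int offset = ts.finalOffset();
        ts.close();
        return offset;
    }
}
```

In this toy, the final offset equals the input length only because end() runs 
after incrementToken() has returned false.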

In the case of these limiting filters, end() is called on the Tokenizer *before* 
its incrementToken() has returned false. This is a little strange for a couple 
of reasons. One is that the tokenizer might not even be "ready" for end(), e.g. 
it might be coded so that end() only works correctly if it has been entirely 
consumed. The other problem, of course, is that the finalOffset, the main use 
of end(), will most often be wrong in this case, so multi-valued field 
highlighting will not work.
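
A minimal sketch of the problem, assuming a toy count-limiting filter that 
forwards end() straight to the wrapped tokenizer. ToyTokenizer, ToyLimitFilter 
and EarlyEndSketch are hypothetical names that only mimic the contract; this is 
not the actual LimitTokenCountFilter or OffsetLimitTokenFilter code.

```java
// Hypothetical stand-ins that mimic the TokenStream contract; NOT the Lucene API.
class ToyTokenizer {
    private final String input;
    private int pos;
    boolean exhausted;      // set once incrementToken() has returned false
    boolean endedEarly;     // set if end() arrives before exhaustion
    int finalOffset = -1;

    ToyTokenizer(String input) { this.input = input; }

    boolean incrementToken() {
        while (pos < input.length() && input.charAt(pos) == ' ') pos++;
        if (pos >= input.length()) { exhausted = true; return false; }
        while (pos < input.length() && input.charAt(pos) != ' ') pos++;
        return true;
    }

    void end() {
        endedEarly = !exhausted;
        finalOffset = pos;  // the true end of input only if fully consumed
    }
}

// Toy limiting filter: stops after maxTokens and passes end() through unchanged.
class ToyLimitFilter {
    private final ToyTokenizer in;
    private final int maxTokens;
    private int count;

    ToyLimitFilter(ToyTokenizer in, int maxTokens) {
        this.in = in;
        this.maxTokens = maxTokens;
    }

    boolean incrementToken() {
        if (count >= maxTokens) return false;   // the tokenizer never sees its own end
        if (!in.incrementToken()) return false;
        count++;
        return true;
    }

    void end() { in.end(); }
}

class EarlyEndSketch {
    // A well-behaved consumer: loop until false, then call end().
    static ToyTokenizer run(String text, int maxTokens) {
        ToyTokenizer tok = new ToyTokenizer(text);
        ToyLimitFilter filter = new ToyLimitFilter(tok, maxTokens);
        while (filter.incrementToken()) { }
        filter.end();
        return tok;
    }
}
```

With a limit of 1 on "one two three", the consumer follows the workflow 
correctly, yet the toy tokenizer receives end() before it is exhausted and 
records a final offset that stops after the first token instead of at the end 
of the input.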

We should probably figure out a way to address the inconsistency. Some ideas 
are:
# fixing the javadocs, perhaps documenting that end() could be called at any 
time, and accepting the fact that the finalOffset will be wrong.
# the limiting filters could consume the rest of the tokens in a while 
(incrementToken()) loop to ensure totally proper behavior.
# the limiting filters could do something tricky like override end() so that 
it is not invoked on the Tokenizer in a surprising state. This is still evil 
but perhaps less evil than calling it "out of order".
# ...
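
To make the second and third ideas concrete, here is a rough sketch of each. 
All class names (DrainTokenizer, DrainingLimitSketch, GuardedEndLimitSketch) 
are hypothetical stand-ins that mimic the contract rather than the Lucene API, 
and neither variant is a proposed patch.

```java
// Hypothetical stand-ins that mimic the TokenStream contract; NOT the Lucene API.
class DrainTokenizer {
    private final String input;
    private int pos;
    boolean exhausted;      // set once incrementToken() has returned false
    boolean endedEarly;     // set if end() arrives before exhaustion
    int finalOffset = -1;   // stays -1 if end() is never called

    DrainTokenizer(String input) { this.input = input; }

    boolean incrementToken() {
        while (pos < input.length() && input.charAt(pos) == ' ') pos++;
        if (pos >= input.length()) { exhausted = true; return false; }
        while (pos < input.length() && input.charAt(pos) != ' ') pos++;
        return true;
    }

    void end() { endedEarly = !exhausted; finalOffset = pos; }
}

// Idea 2: once the limit is hit, consume the remaining tokens before reporting false.
class DrainingLimitSketch {
    private final DrainTokenizer in;
    private final int maxTokens;
    private int count;
    private boolean done;

    DrainingLimitSketch(DrainTokenizer in, int maxTokens) {
        this.in = in;
        this.maxTokens = maxTokens;
    }

    boolean incrementToken() {
        if (done) return false;
        if (count < maxTokens && in.incrementToken()) { count++; return true; }
        if (count >= maxTokens) {
            while (in.incrementToken()) { }     // discard the rest of the input
        }
        done = true;
        return false;
    }

    void end() { in.end(); }                    // the tokenizer is exhausted by now
}

// Idea 3: only forward end() if the wrapped tokenizer really ran out of tokens.
class GuardedEndLimitSketch {
    private final DrainTokenizer in;
    private final int maxTokens;
    private int count;
    private boolean inputExhausted;

    GuardedEndLimitSketch(DrainTokenizer in, int maxTokens) {
        this.in = in;
        this.maxTokens = maxTokens;
    }

    boolean incrementToken() {
        if (inputExhausted || count >= maxTokens) return false;
        if (!in.incrementToken()) { inputExhausted = true; return false; }
        count++;
        return true;
    }

    void end() {
        if (inputExhausted) in.end();           // otherwise end() is swallowed
    }
}
```

In this toy, the draining variant leaves the tokenizer with a correct final 
offset at the cost of tokenizing the whole input. The guarded variant never 
calls end() in a surprising state, but it also produces no final offset when 
the limit cuts the stream short.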

