[
https://issues.apache.org/jira/browse/LUCENE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13478494#comment-13478494
]
Hoss Man commented on LUCENE-4489:
----------------------------------
In SOLR-3961 rmuir and I were discussing the tests for LimitTokenCountFilter.
It seems to me that there are at least two things that should be improved in the
tests, and one possible improvement to the code itself...
A) fix TestLimitTokenCountAnalyzer to use MockTokenizer...
{quote}
bq. 3) the closest thing I see to a test of LimitTokenCountFilter is
TestLimitTokenCountAnalyzer - I realize now the reason its
testLimitTokenCountAnalyzer doesn't get the same failure is because it wraps
WhitespaceAnalyzer/StandardAnalyzer - should those be changed to use
MockTokenizer?
I think we should always do this!
{quote}
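Roughly what I have in mind is something like the following (just a sketch against the 4.x-era analysis API, not a patch - the exact Analyzer/MockTokenizer signatures vary by Lucene version):
{code:java}
// sketch only: build the test analyzer around MockTokenizer instead of
// WhitespaceAnalyzer/StandardAnalyzer so its consumer-workflow checks are active
Analyzer inner = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    return new TokenStreamComponents(
        new MockTokenizer(reader, MockTokenizer.WHITESPACE, false));
  }
};
Analyzer limited = new LimitTokenCountAnalyzer(inner, 2);
// any test path that actually truncates will now trip MockTokenizer's
// state-machine checks unless the filter (or the test) accounts for that
{code}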
B) TestLimitTokenCountAnalyzer.testLimitTokenCountIndexWriter smells fishy to
me, still discussing with rmuir...
{quote}
bq. 4) TestLimitTokenCountAnalyzer also has a testLimitTokenCountIndexWriter
that uses MockAnalyzer w/o calling setEnableChecks(false), which seems like it
should trigger the same failure I got since it uses MockTokenizer. But in
general that test looks suspicious, as it seems to add exactly the number of
tokens that the limit is configured for, and then asserts that the last token
is in the index - it never actually triggers the limiting logic since exactly
the allowed number of tokens is used.
Then that's fine, because when LimitTokenCountFilter consumes the whole stream,
it's a "good consumer". It's only when it actually truncates that it breaks the
tokenstream contract.
{quote}
My concern wasn't that I thought that test was fishy just because it didn't
call setEnableChecks(false) -- my concern was that I don't see the point of the
test at all as it's currently written, because it doesn't actually trigger the
token limiting behavior (if it did, it would _have_ to call
setEnableChecks(false)) ... so what exactly is it supposed to be testing?
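To make it concrete, the kind of test I'd expect to see there would actually exceed the limit and disable the checks. Just a sketch, not a patch - helper signatures like newIndexWriterConfig differ between Lucene versions:
{code:java}
// sketch: actually exceed the configured limit so the truncation path runs
MockAnalyzer mock = new MockAnalyzer(random());
mock.setEnableChecks(false); // truncating filters are not "good consumers"
Analyzer a = new LimitTokenCountAnalyzer(mock, 3);

Directory dir = newDirectory();
IndexWriter writer = new IndexWriter(dir, newIndexWriterConfig(a));
Document doc = new Document();
doc.add(new TextField("f", "one two three four five", Field.Store.NO));
writer.addDocument(doc);
writer.close();

IndexReader reader = DirectoryReader.open(dir);
assertEquals(1, reader.docFreq(new Term("f", "three"))); // last token kept
assertEquals(0, reader.docFreq(new Term("f", "four")));  // past the limit, dropped
reader.close();
dir.close();
{code}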
C) going back to a comment rmuir made earlier in SOLR-3961...
bq. That's because LimitTokenCountFilter violates the workflow of the standard
tokenstream API ... it cuts off its consumer and calls end() even while that
consumer still has more tokens: this could easily cause unexpected side effects
and bugs. ... But we knew about this anyway
This makes me wonder if it would be worthwhile to add an option to
LimitTokenCountFilter to make it always consume all the tokens from the stream
it wraps -- it could be off by default so it would behave exactly like today,
but if you are wrapping a TokenStream where it's really important to you that
every token is consumed before end() is called, you could set it appropriately.
Would that make sense?
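Very roughly, I'm picturing something like this - sketch only, the consumeAllTokens name and the exact shape are just illustrative:
{code:java}
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public final class LimitTokenCountFilter extends TokenFilter {
  private final int maxTokenCount;
  private final boolean consumeAllTokens; // off by default: behave exactly like today
  private int tokenCount = 0;
  private boolean exhausted = false;

  public LimitTokenCountFilter(TokenStream in, int maxTokenCount, boolean consumeAllTokens) {
    super(in);
    this.maxTokenCount = maxTokenCount;
    this.consumeAllTokens = consumeAllTokens;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (exhausted) {
      return false;
    } else if (tokenCount < maxTokenCount) {
      if (input.incrementToken()) {
        tokenCount++;
        return true;
      }
      exhausted = true;
      return false;
    } else {
      // limit reached: optionally drain the wrapped stream so it is fully
      // consumed before end() is called, then report that we're done
      while (consumeAllTokens && input.incrementToken()) { /* discard */ }
      exhausted = true;
      return false;
    }
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    tokenCount = 0;
    exhausted = false;
  }
}
{code}
With a flag like that turned on, MockTokenizer-based tests could presumably keep checks enabled for the consume-everything case and only call setEnableChecks(false) when truncation is actually expected.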
> improve LimitTokenCountFilter and/or its tests
> -----------------------------------------------
>
> Key: LUCENE-4489
> URL: https://issues.apache.org/jira/browse/LUCENE-4489
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Hoss Man
>
> spinning off a discussion about LimitTokenCountFilter and its tests from
> SOLR-3961 (which was about a specific bug in the LimitTokenCountFilterFactory)