On Wed, Apr 27, 2022 at 11:58 AM Michael McCandless
<[email protected]> wrote:
>
> Maybe we should make a dedicated exception class (instead of the generic
> IllegalArgumentException) for this situation and catch it in this test? Or
> change this test to index synthetic (randomly generated) text instead? But
> all other tests that pull from LineFileDocs will also face this same risk ...
>
> Or I'm also fine with purging all such insanely long terms from all of
> our LineFileDocs too. But I do think that's stepping away from a realistic
> problem our users do sometimes encounter.
>
> Another option is to fix the LineFileDocs.java test class to take an optional
> boolean to filter out such insanely long terms, and some tests could
> explicitly choose to still include them and catch the exception.
>
I don't think we need to add an option to LineFileDocs; we should just
fix our indexing. The text of this big term starts with something like
'}}{{{{{substc|}}}{{{1' (sorry for typos).
But the text doesn't get split up at all because of MockAnalyzer ("act
like whitespace with lowercasing"). Unlike our *REAL ANALYZERS*,
though, MockTokenizer has no term limits. If a user were indexing
this crap with WhitespaceTokenizer or StandardTokenizer, they
wouldn't experience this issue.
I think we should fix MockTokenizer.
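
To make the idea concrete, here is a rough sketch in plain Java, not
Lucene code, of what "whitespace with lowercasing" plus a term limit
could look like. Everything in it is an assumption for illustration:
the class name CappedWhitespaceSplitter is made up, the cap of 255 is
just a value picked for the demo, and splitting an over-long run into
chunks (rather than dropping it) is only one possible behavior.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical stand-in, not part of Lucene: splits on whitespace,
 * lowercases, and never emits a token longer than maxTokenLength.
 * Works on UTF-16 chars for simplicity, so a surrogate pair could be
 * split across two chunks.
 */
public final class CappedWhitespaceSplitter {
    private CappedWhitespaceSplitter() {}

    public static List<String> tokenize(String text, int maxTokenLength) {
        if (maxTokenLength < 1) {
            throw new IllegalArgumentException("maxTokenLength must be >= 1");
        }
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                // Whitespace ends the current token, if there is one.
                flush(current, tokens);
            } else {
                current.append(Character.toLowerCase(c));
                // Assumed policy: cut the run once it reaches the cap.
                if (current.length() == maxTokenLength) {
                    flush(current, tokens);
                }
            }
        }
        flush(current, tokens);
        return tokens;
    }

    private static void flush(StringBuilder current, List<String> tokens) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        // One giant whitespace-free run, built from the fragment quoted above.
        String hugeTerm = "}}{{{{{substc|}}}{{{1".repeat(5000);
        List<String> tokens = tokenize("some text " + hugeTerm + " more text", 255);
        int longest = 0;
        for (String t : tokens) {
            longest = Math.max(longest, t.length());
        }
        System.out.println("tokens=" + tokens.size() + " longest=" + longest);
    }
}
```

With a cap like this in the tokenizer, no single term can grow past
the limit, so the giant run never reaches the indexer as one term.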