On Wed, Apr 27, 2022 at 11:58 AM Michael McCandless
<[email protected]> wrote:
>
> Maybe we should make a dedicated exception class (instead of the generic 
> IllegalArgumentException) for this situation and catch it in this test?  Or 
> change this test to index synthetic (randomly generated) text instead?  But 
> all other tests that pull from LineFileDocs will also face this same risk ...
>
> Or I'm also fine with also purging all such insanely long terms from all of 
> our LineFileDocs too.  But I do think that's stepping away from a realistic 
> problem our users do sometimes encounter.
>
> Another option is to fix the LineFileDocs.java test class to take an optional 
> boolean to filter out such insanely long terms, and some tests could 
> explicitly choose to still include them and catch the exception.
>

I don't think we need to add an option to LineFileDocs; we should just
fix our indexing. The text of this big term starts with something like
'}}{{{{{substc|}}}{{{1' (sorry for typos).

But the text doesn't get split up at all because of MockAnalyzer ("act
like whitespace with lowercasing"). And unlike our *REAL ANALYZERS*,
MockTokenizer has no term length limit. If a user were indexing this
crap with WhitespaceTokenizer or StandardTokenizer, they wouldn't hit
this issue.
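
To make the difference concrete, here is a rough, self-contained sketch
of "whitespace with lowercasing" tokenization with and without a term
length cap. It is not Lucene code: TermLimitSketch, tokenize, and the
maxTokenLength parameter are hypothetical names for illustration, and
cutting an overlong run into chunks is an assumption of the sketch, not
a claim about how any particular real tokenizer handles it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
final class TermLimitSketch {

    // Splits on whitespace and lowercases. A run of non-whitespace chars
    // longer than maxTokenLength is cut into chunks of at most that length.
    // Works on UTF-16 chars, which is enough for the ASCII junk shown here.
    static List<String> tokenize(String text, int maxTokenLength) {
        if (maxTokenLength < 1) {
            throw new IllegalArgumentException("maxTokenLength must be >= 1");
        }
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                flush(current, tokens);          // whitespace ends the token
            } else {
                current.append(Character.toLowerCase(c));
                if (current.length() >= maxTokenLength) {
                    flush(current, tokens);      // cap reached: emit a chunk
                }
            }
        }
        flush(current, tokens);                  // emit any trailing token
        return tokens;
    }

    private static void flush(StringBuilder current, List<String> tokens) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        // One long whitespace-free run of markup junk, like the big term.
        String junk = "}}{{{{{substc|}}}{{{1".repeat(5000);

        // No cap: the whole run comes out as a single enormous term.
        System.out.println(tokenize(junk, Integer.MAX_VALUE).size());

        // With a cap of 255 chars, no emitted term is longer than 255.
        System.out.println(tokenize(junk, 255).size());
    }
}
```

With no cap, the junk comes out as one giant term, which is the case
the test is tripping over; with a cap, every emitted term stays short.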

I think we should fix MockTokenizer.