On Wed, Apr 27, 2022 at 12:34 PM Robert Muir <[email protected]> wrote:

> On Wed, Apr 27, 2022 at 11:58 AM Michael McCandless
> <[email protected]> wrote:
> >
> > Maybe we should make a dedicated exception class (instead of the generic
> > IllegalArgumentException) for this situation and catch it in this test?  Or
> > change this test to index synthetic (randomly generated) text instead?  But
> > all other tests that pull from LineFileDocs will also face this same risk
> > ...
> >
> > Or I'm also fine with purging all such insanely long terms from all
> > of our LineFileDocs too.  But I do think that's stepping away from a
> > realistic problem our users do sometimes encounter.
> >
> > Another option is to fix the LineFileDocs.java test class to take an
> > optional boolean to filter out such insanely long terms, and some tests
> > could explicitly choose to still include them and catch the exception.
> >
>
> I don't think we need to add an option to LineFileDocs; we should just
> fix our indexing. The text of this big term starts with something like
> '}}{{{{{substc|}}}{{{1' (sorry for typos).
>
> But text doesn't get split up at all because of MockAnalyzer ("act
> like whitespace with lowercasing"). Except, unlike our *REAL
> ANALYZERS*, MockTokenizer has no term limits. If a user was indexing
> this crap with WhitespaceTokenizer or StandardTokenizer then they
> wouldn't experience this issue.
>
> I think we should fix MockTokenizer.
>

+1 to fix MockTokenizer!
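
To make the idea concrete, here is a rough, standalone sketch of what "act
like whitespace with lowercasing, but with a term limit" could look like.
This is not the MockTokenizer code and does not use any Lucene classes; the
class name LimitedWhitespaceSplit, the tokenize method, and the choice to
start a new token once the limit is reached (rather than dropping the
overlong text) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: splits on whitespace, lowercases,
// and never emits a token longer than maxTokenLen chars.
class LimitedWhitespaceSplit {

  static List<String> tokenize(String text, int maxTokenLen) {
    if (maxTokenLen < 1) {
      throw new IllegalArgumentException("maxTokenLen must be >= 1");
    }
    List<String> tokens = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      if (Character.isWhitespace(c)) {
        // Whitespace ends the current token, if there is one.
        flush(current, tokens);
      } else {
        // Simplification: works per char, so supplementary code points
        // are not treated specially.
        current.append(Character.toLowerCase(c));
        if (current.length() == maxTokenLen) {
          // Limit reached: emit what we have and continue with a new token.
          flush(current, tokens);
        }
      }
    }
    flush(current, tokens);
    return tokens;
  }

  // Adds the buffered token to the list (if non-empty) and resets the buffer.
  private static void flush(StringBuilder current, List<String> tokens) {
    if (current.length() > 0) {
      tokens.add(current.toString());
      current.setLength(0);
    }
  }

  public static void main(String[] args) {
    // The markup-like run in the middle is broken into chunks of at most 8 chars.
    System.out.println(tokenize("Foo }}{{{{{substc|}}}{{{1 bar", 8));
  }
}
```

With a limit like this in place, a long run of unbroken wiki markup comes
out as several bounded tokens instead of one enormous term, so the tests
that pull from LineFileDocs should no longer hit the
IllegalArgumentException for that reason.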

Mike McCandless

http://blog.mikemccandless.com
