On Wed, Apr 27, 2022 at 12:34 PM Robert Muir <[email protected]> wrote:
> On Wed, Apr 27, 2022 at 11:58 AM Michael McCandless
> <[email protected]> wrote:
> >
> > Maybe we should make a dedicated exception class (instead of the generic
> > IllegalArgumentException) for this situation and catch it in this test? Or
> > change this test to index synthetic (randomly generated) text instead? But
> > all other tests that pull from LineFileDocs will also face this same risk
> > ...
> >
> > Or I'm also fine with also purging all such insanely long terms from all
> > of our LineFileDocs too. But I do think that's stepping away from a
> > realistic problem our users do sometimes encounter.
> >
> > Another option is to fix the LineFileDocs.java test class to take an
> > optional boolean to filter out such insanely long terms, and some tests
> > could explicitly choose to still include them and catch the exception.
> >
>
> I don't think we need to add an option to LineFileDocs, we should just
> fix our indexing. The text to this big term starts with something like
> '}}{{{{{substc|}}}{{{1' (sorry for typos).
>
> But text doesn't get split up at all because of MockAnalyzer ("act
> like whitespace with lowercasing"). Except, unlike our *REAL
> ANALYZERS*, MockTokenizer has no term limits. If a user was indexing
> this crap with WhitespaceTokenizer or StandardTokenizer then they
> wouldn't experience this issue.
>
> I think we should fix MockTokenizer.
>

+1 to fix MockTokenizer!

Mike McCandless

http://blog.mikemccandless.com
