I opened https://issues.apache.org/jira/browse/LUCENE-10541 to figure out what we can do about this tricky situation!
Thank you Dawid and Rob for trying to iterate here. Let's continue our discussion on the issue?

Mike McCandless
http://blog.mikemccandless.com

On Wed, Apr 27, 2022 at 12:59 PM Michael McCandless <[email protected]> wrote:

> On Wed, Apr 27, 2022 at 12:34 PM Robert Muir <[email protected]> wrote:
>
>> On Wed, Apr 27, 2022 at 11:58 AM Michael McCandless
>> <[email protected]> wrote:
>> >
>> > Maybe we should make a dedicated exception class (instead of the
>> > generic IllegalArgumentException) for this situation and catch it in
>> > this test? Or change this test to index synthetic (randomly
>> > generated) text instead? But all other tests that pull from
>> > LineFileDocs will also face this same risk ...
>> >
>> > Or I'm also fine with purging all such insanely long terms from all
>> > of our LineFileDocs too. But I do think that's stepping away from a
>> > realistic problem our users do sometimes encounter.
>> >
>> > Another option is to fix the LineFileDocs.java test class to take an
>> > optional boolean to filter out such insanely long terms, and some
>> > tests could explicitly choose to still include them and catch the
>> > exception.
>>
>> I don't think we need to add an option to LineFileDocs; we should just
>> fix our indexing. The text of this big term starts with something like
>> '}}{{{{{substc|}}}{{{1' (sorry for typos).
>>
>> But the text doesn't get split up at all because of MockAnalyzer ("act
>> like whitespace with lowercasing"). Except, unlike our *REAL
>> ANALYZERS*, MockTokenizer has no term limits. If a user was indexing
>> this crap with WhitespaceTokenizer or StandardTokenizer then they
>> wouldn't experience this issue.
>>
>> I think we should fix MockTokenizer.
>
> +1 to fix MockTokenizer!
>
> Mike McCandless
> http://blog.mikemccandless.com
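To illustrate the fix being proposed: the idea is that a test tokenizer should enforce a maximum token length, the way real analyzers do, so that pathological terms never reach the indexer (which rejects over-long terms with an IllegalArgumentException). Below is a minimal standalone sketch, not actual Lucene or MockTokenizer code; the class name `BoundedWhitespaceTokenizer` and the default limit of 255 characters are assumptions for illustration (255 mirrors the commonly cited default cap in real tokenizers).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Standalone sketch (NOT actual Lucene code) of a tokenizer that "acts
// like whitespace with lowercasing", as MockAnalyzer is described in the
// thread, but additionally drops any token longer than a configurable
// limit -- the behavior the thread proposes adding to MockTokenizer.
public class BoundedWhitespaceTokenizer {
    // Assumed default; real tokenizers commonly cap tokens at 255 chars.
    public static final int DEFAULT_MAX_TOKEN_LENGTH = 255;

    private final int maxTokenLength;

    public BoundedWhitespaceTokenizer(int maxTokenLength) {
        this.maxTokenLength = maxTokenLength;
    }

    public BoundedWhitespaceTokenizer() {
        this(DEFAULT_MAX_TOKEN_LENGTH);
    }

    // Split on whitespace, lowercase each token, and silently discard
    // tokens that exceed maxTokenLength so they never reach indexing.
    public List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String tok : text.split("\\s+")) {
            if (!tok.isEmpty() && tok.length() <= maxTokenLength) {
                tokens.add(tok.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }
}
```

With a small limit, a wiki-markup blob like the one quoted in the thread would simply be filtered out while ordinary words pass through lowercased, so tests pulling from LineFileDocs would no longer trip the indexer's term-length check.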
