On Tue, Apr 26, 2022 at 9:27 AM Dawid Weiss <[email protected]> wrote:

> > Yeah I agree: unit tests should strive to only and precisely fail for
> true bugs, not false alarms like this.  I'm sorry this failure wasted your
> time Dawid!
>
> Time's not wasted if we decide how to proceed on this.
>

+1, thanks :)


> > So we should fix this failure somehow: Remove all too-big terms from all
> test LineFileDocs sources (the small version in git and the large nightly
> version)?;  Switch this unit test to not use LineFileDocs?;  Make this
> exception a checked exception (not good to make all users pay the price of
> the rare exception)?  Change IndexWriter to silently drop these terms?
>
> I've seen people indexing weird files, even binary files - this
> exception does make sense to me - an insane input yields a sane
> response... I don't know what the right fix should be though. If we
> keep the current code as is then perhaps we can at least detect this
> particular type of exception and not fail the test? Alternatively,
> clean up the input data so that it doesn't have such enormous terms?
>

Maybe we should make a dedicated exception class (instead of the generic
IllegalArgumentException) for this situation and catch it in this test?  Or
change this test to index synthetic (randomly generated) text instead?  But
all other tests that pull from LineFileDocs will also face this same risk
...
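To make the first idea concrete, here is a minimal sketch (not Lucene's actual API) of how a test could recognize this particular failure and tolerate it instead of failing; the message substring is an assumption about what IndexWriter reports today:

```java
// Sketch: recognize the too-long-term IllegalArgumentException by message.
// The "immense term" marker is an assumption about IndexWriter's message text.
public class ImmenseTermCheck {
  static final String IMMENSE_TERM_MARKER = "immense term";

  static boolean isImmenseTermFailure(Throwable t) {
    return t instanceof IllegalArgumentException
        && t.getMessage() != null
        && t.getMessage().contains(IMMENSE_TERM_MARKER);
  }

  public static void main(String[] args) {
    Throwable hit = new IllegalArgumentException(
        "Document contains at least one immense term in field=\"body\"");
    System.out.println(isImmenseTermFailure(hit));                        // true
    System.out.println(isImmenseTermFailure(new RuntimeException("x")));  // false
  }
}
```

A dedicated exception subclass would of course make this check robust instead of string-matching.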

Or I'm also fine with purging all such insanely long terms from all of our
LineFileDocs too.  But I do think that's stepping away from a realistic
problem our users do sometimes encounter.

Another option is to fix the LineFileDocs.java test class to take an
optional boolean to filter out such insanely long terms, and some tests
could explicitly choose to still include them and catch the exception.
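An illustrative sketch of that filter (not the real LineFileDocs code): drop any line containing a whitespace-delimited token whose UTF-8 encoding exceeds the limit.  32766 matches IndexWriter.MAX_TERM_LENGTH; note a real analyzer may tokenize differently than a simple whitespace split:

```java
import java.nio.charset.StandardCharsets;

// Sketch of an optional filter for insanely long terms. The limit is in
// UTF-8 bytes, matching IndexWriter.MAX_TERM_LENGTH (32766); the helper
// name and whitespace tokenization are illustrative only.
public class LongTermFilter {
  static final int MAX_TERM_BYTES = 32766;

  static boolean hasImmenseTerm(String line) {
    for (String token : line.split("\\s+")) {
      if (token.getBytes(StandardCharsets.UTF_8).length > MAX_TERM_BYTES) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    StringBuilder huge = new StringBuilder();
    for (int i = 0; i < 40000; i++) huge.append('x');
    System.out.println(hasImmenseTerm("a perfectly normal line"));  // false
    System.out.println(hasImmenseTerm("prefix " + huge));           // true
  }
}
```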

I do think it is incredible that it took soooooo many years to uncover this
lurking massive term in the nightly Wikipedia LineFileDocs!!  I wonder how
many such massive terms are lurking in this file :)  This is like the
search for Nessie.

Also, I wonder if we should lower this limit -- are there really users that
need to index such massive (up to ~32 KB) terms?  Maybe it should be an
option on IWC, that defaults to something more sane, but apps could
increase it if they really must index incredibly long terms?
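Something like this purely hypothetical sketch of the knob; neither the method name nor the default below exists in Lucene today, and the hard cap of 32766 is the current fixed limit in UTF-8 bytes:

```java
// Hypothetical IWC-style option: a saner default max term length that apps
// can raise, bounded by today's hard limit (32766 UTF-8 bytes). Names and
// the 1024-byte default are illustrative, not an existing Lucene API.
public class MaxTermLengthOption {
  public static final int HARD_MAX_BYTES = 32766;
  private int maxTermLengthBytes = 1024;  // hypothetical saner default

  public MaxTermLengthOption setMaxTermLengthBytes(int bytes) {
    if (bytes < 1 || bytes > HARD_MAX_BYTES) {
      throw new IllegalArgumentException(
          "maxTermLengthBytes must be in [1, " + HARD_MAX_BYTES + "], got " + bytes);
    }
    this.maxTermLengthBytes = bytes;
    return this;
  }

  public int getMaxTermLengthBytes() {
    return maxTermLengthBytes;
  }

  public static void main(String[] args) {
    MaxTermLengthOption opt = new MaxTermLengthOption();
    System.out.println(opt.getMaxTermLengthBytes());  // 1024
    System.out.println(opt.setMaxTermLengthBytes(HARD_MAX_BYTES).getMaxTermLengthBytes());  // 32766
  }
}
```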

I'll open an issue and let's continue discussing there?

Mike McCandless

http://blog.mikemccandless.com
