LOL, thanks for getting to the root cause, Dawid!

The thing is, such screwed-up text is a fact of life for many Lucene
applications -- they accidentally try to ingest massive terms, or index
base64 as if it were text, etc.  I think it's healthy for us to also test
Lucene on such content and make sure we don't have some other bug creep in
where Lucene reacts badly, e.g. causing index corruption because this
IllegalArgumentException was thrown.

This seems to be quite rare -- maybe our (large, nightly) enwiki sample has
only a few such too-massive terms, and this particular test + random seed
hit the jackpot.
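
If we did want to find (or skip) such lines up front, as Dawid suggests
below, a rough sketch of the check could look like the following.  This is
only an illustration: the class and method names are made up, it is not
LineFileDocs code, and it assumes plain whitespace tokenization, which is
not necessarily what the test's analyzer does.  The 32766 limit is the one
reported in the exception message quoted below.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Lucene: flags lines that contain a
// whitespace-delimited token whose UTF-8 encoding exceeds a byte limit.
class ImmenseTermCheck {
  // Same value as the max term length reported in the exception message.
  static final int MAX_TERM_BYTES = 32766;

  /** Returns true if any token in the line is longer than maxBytes in UTF-8. */
  static boolean hasImmenseTerm(String line, int maxBytes) {
    // Assumption: tokens are separated by whitespace only.
    for (String token : line.split("\\s+")) {
      if (token.getBytes(StandardCharsets.UTF_8).length > maxBytes) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    String normal = "a perfectly ordinary line of text";
    // One token that is a single byte over the limit.
    String immense = "prefix " + "{".repeat(MAX_TERM_BYTES + 1) + " suffix";
    System.out.println(hasImmenseTerm(normal, MAX_TERM_BYTES));   // false
    System.out.println(hasImmenseTerm(immense, MAX_TERM_BYTES));  // true
  }
}
```

The length is measured in UTF-8 bytes rather than chars, since that is what
the limit in the exception message refers to; a term of non-ASCII characters
can exceed it with far fewer than 32766 chars.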

Mike McCandless

http://blog.mikemccandless.com


On Fri, Apr 22, 2022 at 6:04 PM Dawid Weiss <[email protected]> wrote:

> And, for the record - indeed enwiki contains an odd field with a
> super-long term that looks like this:
>
> 13:24:08.000
> {{{{{substc|}}}{{{1}}}|{{{p1n|}}}={{{p1v|}}}|{{{p2n|}}}={{{p2v|}}}|{{{p3n|}}}={{{p3v|}}}|{{{p4n|}}}={{{p4v|}}}|{{{p5n|}}}={{{p5v|}}}|{{{p6n|}}}={{{p6v|}}}|{{{p7n|}}}={{{p7v|}}}|{{{p8n|}}}={{{p8v|}}}|{{{p9n|}}}={{{p9v|}}}|{{{p10n|}}}={{{p10v|}}}|{{{mun|1}}}=1680}}{{{{{substc|}}}{{{1}}}|{{{p1n|}}}={{{p1v|}}}|{{{p2n|}}}={{{p2v|}}}|{{{p3n|}}}={{{p3v|}}}|{{{p4n|}}}={{{p4v|}}}|{{{p5n|}}}={{{p5v|}}}|{{{p6n|}}}={{{p6v|}}}|{{{p7n|}}}={{{p7v|}}}|{{{p8n|}}}={{{p8v|}}}|{{{p9n|}}}={{{p9v|}}}|{{{p10n|}}}={{{p10v|}}}|{{{mun|1}}}=738}}{{{{{substc|}}}{{{1}}}|{{{p1n|}}}={{{p1v|}}}|{{{p2n|}}}={{{p2v|}}}|{{{p3n|}}}={{{p3v|}}}|{{{p4n|}}}={{{p4v|}}}|{{{p5n|}}}={{{p5v|}}}|{{{p6n|}}}={{{p6v|}}}|{{{p7n|}}}={{{p7v|}}}|{{{p8n|}}}={{{p8v|}}}|{{{p9n|}}}={{{p9v|}}}|{{{p10n|}}}={{{p10v|}}}|{{{mun|1}}}=358}}{{{{{substc|}}}{{{1}}}|{{{p1n|}}}={{{p1v|}}}|{{{p2n|}}}={{{p2v|}}}|{{{p3n|}}}={{{p3v|}}}|{{{p4n|}}}={{{p4v|}}}|{{{p5n|}}}={{{p5v|}}}|{{{p6n|}}}={{{p6v|}}}|{{{p7n|}}}={{{p7v|}}}|{{{p8n|}}}={{{p8v|}}}|{{{p9n|}}}={{{p9v|}}}|{{{p10n|}}}={{{p10v|}}}|{{{mun|1}}}=197}}{{{{{substc|}}}{{{1}}}|{{{p1n|}}}={{{p1v|}}}|{{{p2n|}}}={{{p2v|}}}|{{{p3n|}}}={{{p3v|}}}|{{{p4n|}}}={{{p4v|}}}|{{{p5n|}}}={{{p5v|}}}|{{{p6n|}}}={{{p6v|}}}|{{{p7n|}}}={{{p7v|}}}|{{{p8n|}}}={{{p8v|}}}|{{{p9n|}}}={{{p9v|}}}|{{{p10n|}}}={{{p10v|}}}|{{{mun|1}}}=305}}{{{{{substc|}}}{{{1}}}|{{{p1n|}}}={{{p1v|}}}|{{{p2n|}}}={{{p2v|}}}|{{{p3n|}}}={{{p3v|}}}|{{{p4n|}}}={{{p4v|}}}|{{{p5n|}}}={{{p5v|}}}|{{{p6n|}}}={{{p6v|}}}|{{{p7n|}}}={{{p7v|}}}|{{{p8n|}}}={{{p8v|}}}|{{{p9n|}}}={{{p9v|}}}|{{{p10n|}}}={{{p10v|}}}|{{{mun|1}}}=59}}{{{{{substc|}}}{{{1}}}|{{{p1n|}}}={{{p1v|}}}|{{{p2n|}}}={{{p2v|}}}|{{{p3n|}}}={{{p3v|}}}|{{{p4n|}}}={{{p4v|}}}|{{{p5n|}}}={{{p5v|}}}|{{{p6n|}}}={{{p6v|}}}|{{{p7n|}}}={{{p7v|}}}|{{{p8n|}}}={{{p8v|}}}|{{{p9n|}}}={{{p9v|}}}|{{{p10n|}}}={{{p10v|}}}|{{{mun|1}}}=482}}{{{{{substc|}}}{{{1}}}|{{{p1n|}}}={{{p1v|}}}|{{{p2n|}}}={{{p2v|}}}|{{{p3n|}}}={{{p3v|}}}|{{{p4n|}}}={{{p4v|}}}|{{{p5n|}}}={{{p5v|}}}|{{{p6n|}}}={{{p6v|}}}|{{{p7n|}}
}={{{p7v|}}}|{{{p8n|}}}={{{p8v|}}}|{{{p9n|}}}={{{p9v|}}}|{{{p10n|}}}={{{p10v|}}}|{{{mun|1}}}=613}}{{{{{substc|}}}{{{1}}}|{{{p1n|}}}={{{p1v|}}}|{{{p2n|}}}={{{p2v|}}}|{{{p3n|}}}={{{p3v|}}}|{{{p4n|}}}={{{p4v|}}}|{{{p5n|}}}={{{p5v|}}}|{{{p6n|}}}={{{p6v|}}}|{{{p7n|}}}={{{p7v|}}}|{{{p8n|}}}={{{p8v|}}}|{{{p9n|}}}={{{p9v|}}}|{{{p10n|}}}={{{p10v|}}}|{{{mun|1}}}=361}}{{{{{substc|}}}{{{1}}}|{{{p1n|}}}={{{p1v|}}}|{{{p2n|}}}={{{p2v|}}}|{{{p3n|}}}={{{p3v|}}}|{{{p4n|}}}={{{p4v|}}}|{{{p5n|}}}={{{p5v|}}}|{{{p6n|}}}={{{p6v|}}}|{{{p7n|}}}={{{p7v|}}}|{{{p8n|}}}={{{p8v|}}}|{{{p9n|}}}={{{p9v|}}}|{{{p10n|}}}={{{p10v|}}}|{{{mun|1}}}=141}}{{{{{substc|}}}{{{1}}}|{{{p1n|}}}={{{p1v|}}}|{{{p2n|}}}={{{p2v|}}}|{{{p3n|}}}={{{p3v|}}}|{{{p4n|}}}={{{p4v|}}}|{{{p5n|}}}={{{p5v|}}}|{{{p6n|}}}={{{p6v|}}}|{{{p7n|}}}={{{p7v|}}}|{{{p8n|}}}={{{p8v|}}}|{{{p9n|}}}={{{p9v|}}}|{{{p10n|}}}={{{p10v|}}}|{{{mun|1}}}=34}}{{{{{substc|}}}{{{1}}}|{{{p1n|}}}={{{p1v|}}}|{{{p2n|}}}={{{p2v|}}}|{{{p3n|}}}={{{p3v|}}}|{{{p4n|}}}={{{p4v|}}}|{{{p5n|}}}={{{p5v|}}}|{{{p6n|}}}={{{p6v|}}}|{{{p7n|}}}={{{p7v|}}}|{{{p8n|}}}={{{p8v|}}}|{{{p9n|}}}={{{p9v|}}}|{{{p10n|}}}={{{p10v|}}}|{{{mun|1}}}=484}}{{{{{substc|}}}{{{1}}}|{{{p1n|}}}={{{p1v|}}}|{{{p2n|}}}={{{p2v|}}}|{{{p3n|}}}={{{p3v|}}}|{{{p4n|}}}={{{p4v|}}}|{{{p5n|}}}={{{p5v|}}}|{{{p6n|}}}={{{p6v|}}}|{{{p7n|}}}={{{p7v|}}}|{{{p8n|}}}={{{p8v|}}}|{{{p9n|}}}={{{p9v|}}}|{{{p10n|}}}={{{p10v|}}}|{{{mun|1}}}=1723}}{{{{{substc|}}}{{{1}}}|{{{p1n|}}}=
> [snip]
>
>
>
> On Fri, Apr 22, 2022 at 11:57 PM Dawid Weiss <[email protected]>
> wrote:
> >
> > This actually reproduces (if you download enwiki). I wonder if we
> > should tune LineFileDocs so that it avoids trying to add humongous
> > terms.
> >
> > D.
> >
> > On Wed, Apr 20, 2022 at 3:42 AM Apache Jenkins Server
> > <[email protected]> wrote:
> > >
> > > Build:
> https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-9.1/42/
> > >
> > > 1 tests failed.
> > > FAILED:  org.apache.lucene.index.TestAllFilesCheckIndexHeader.test
> > >
> > > Error Message:
> > > java.lang.IllegalArgumentException: Document contains at least one
> immense term in field="body" (whose UTF8 encoding is longer than the max
> length 32766), all of which were skipped.  Please correct the analyzer to
> not produce such terms.  The prefix of the first immense term is: '[125,
> 125, 123, 123, 123, 123, 123, 115, 117, 98, 115, 116, 99, 124, 125, 125,
> 125, 123, 123, 123, 49, 125, 125, 125, 124, 123, 123, 123, 112, 49]...',
> original message: bytes can be at most 32766 in length; got 94384
> > >
> > > Stack Trace:
> > > java.lang.IllegalArgumentException: Document contains at least one
> immense term in field="body" (whose UTF8 encoding is longer than the max
> length 32766), all of which were skipped.  Please correct the analyzer to
> not produce such terms.  The prefix of the first immense term is: '[125,
> 125, 123, 123, 123, 123, 123, 115, 117, 98, 115, 116, 99, 124, 125, 125,
> 125, 123, 123, 123, 49, 125, 125, 125, 124, 123, 123, 123, 112, 49]...',
> original message: bytes can be at most 32766 in length; got 94384
> > >         at
> __randomizedtesting.SeedInfo.seed([34ECEDA648B62DC2:BCB8D27CE64A403A]:0)
> > >         at
> org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1242)
> > >         at
> org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:729)
> > >         at
> org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:620)
> > >         at
> org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:241)
> > >         at
> org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:432)
> > >         at
> org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1531)
> > >         at
> org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1816)
> > >         at
> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1469)
> > >         at
> org.apache.lucene.tests.index.RandomIndexWriter.addDocument(RandomIndexWriter.java:222)
> > >         at
> org.apache.lucene.index.TestAllFilesCheckIndexHeader.test(TestAllFilesCheckIndexHeader.java:58)
> > >         at
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
> Method)
> > >         at
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > >         at
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > >         at java.base/java.lang.reflect.Method.invoke(Method.java:566)
> > >         at
> com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754)
> > >         at
> com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942)
> > >         at
> com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978)
> > >         at
> com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992)
> > >         at
> org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
> > >         at
> org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> > >         at
> org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
> > >         at
> org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
> > >         at
> org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
> > >         at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> > >         at
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> > >         at
> com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:370)
> > >         at
> com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:819)
> > >         at
> com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:470)
> > >         at
> com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:951)
> > >         at
> com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:836)
> > >         at
> com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:887)
> > >         at
> com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:898)
> > >         at
> org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> > >         at
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> > >         at
> org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
> > >         at
> com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> > >         at
> com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> > >         at
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> > >         at
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> > >         at
> org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
> > >         at
> org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> > >         at
> org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
> > >         at
> org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
> > >         at
> org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
> > >         at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> > >         at
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> > >         at
> com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:370)
> > >         at
> com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:826)
> > >         at java.base/java.lang.Thread.run(Thread.java:834)
> > >         Suppressed: java.lang.IllegalStateException: close() called in
> wrong state: INCREMENT
> > >                 at
> org.apache.lucene.tests.analysis.MockTokenizer.fail(MockTokenizer.java:136)
> > >                 at
> org.apache.lucene.tests.analysis.MockTokenizer.close(MockTokenizer.java:327)
> > >                 at
> org.apache.lucene.analysis.TokenFilter.close(TokenFilter.java:58)
> > >                 at
> org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1136)
> > >                 ... 48 more
> > > Caused by:
> org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes
> can be at most 32766 in length; got 94384
> > >         at
> app//org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:258)
> > >         at
> app//org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:193)
> > >         at
> app//org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1224)
> > >         ... 48 more
> > >
> > >
> > >
> > >
> > > Build Log:
> > > [...truncated 573 lines...]
> > > ERROR: The following test(s) have failed:
> > >   - org.apache.lucene.index.TestAllFilesCheckIndexHeader.test
> (:lucene:core)
> > >     Test output:
> /home/jenkins/jenkins-slave/workspace/Lucene/Lucene-NightlyTests-9.1/checkout/lucene/core/build/test-results/test/outputs/OUTPUT-org.apache.lucene.index.TestAllFilesCheckIndexHeader.txt
> > >     Reproduce with: gradlew :lucene:core:test --tests
> "org.apache.lucene.index.TestAllFilesCheckIndexHeader.test" -Ptests.jvms=4
> -Ptests.haltonfailure=false -Ptests.jvmargs=-XX:TieredStopAtLevel=1
> -Ptests.seed=34ECEDA648B62DC2 -Ptests.multiplier=2 -Ptests.nightly=true
> -Ptests.badapples=false -Ptests.file.encoding=ISO-8859-1
> -Ptests.linedocsfile=/home/jenkins/jenkins-slave/workspace/Lucene/Lucene-NightlyTests-9.1/test-data/enwiki.random.lines.txt
> > >
> > >
> > > BUILD SUCCESSFUL in 1h 49m 55s
> > > 223 actionable tasks: 223 executed
> > > Build step 'Invoke Gradle script' changed build result to SUCCESS
> > > Archiving artifacts
> > > java.lang.InterruptedException: no matches found within 10000
> > >         at
> hudson.FilePath$ValidateAntFileMask.hasMatch(FilePath.java:3079)
> > >         at
> hudson.FilePath$ValidateAntFileMask.invoke(FilePath.java:2958)
> > >         at
> hudson.FilePath$ValidateAntFileMask.invoke(FilePath.java:2939)
> > >         at hudson.FilePath$FileCallableWrapper.call(FilePath.java:3329)
> > > Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to
> lucene-solr-2
> > >                 at
> hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1797)
> > >                 at
> hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:356)
> > >                 at hudson.remoting.Channel.call(Channel.java:1001)
> > >                 at hudson.FilePath.act(FilePath.java:1165)
> > >                 at hudson.FilePath.act(FilePath.java:1154)
> > >                 at
> hudson.FilePath.validateAntFileMask(FilePath.java:2937)
> > >                 at
> hudson.tasks.ArtifactArchiver.perform(ArtifactArchiver.java:268)
> > >                 at
> hudson.tasks.BuildStepCompatibilityLayer.perform(BuildStepCompatibilityLayer.java:78)
> > >                 at
> hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
> > >                 at
> hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:806)
> > >                 at
> hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:755)
> > >                 at
> hudson.model.Build$BuildExecution.post2(Build.java:178)
> > >                 at
> hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:699)
> > >                 at hudson.model.Run.execute(Run.java:1913)
> > >                 at
> hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
> > >                 at
> hudson.model.ResourceController.execute(ResourceController.java:99)
> > >                 at hudson.model.Executor.run(Executor.java:432)
> > > Caused: hudson.FilePath$TunneledInterruptedException
> > >         at hudson.FilePath$FileCallableWrapper.call(FilePath.java:3331)
> > >         at hudson.remoting.UserRequest.perform(UserRequest.java:211)
> > >         at hudson.remoting.UserRequest.perform(UserRequest.java:54)
> > >         at hudson.remoting.Request$2.run(Request.java:376)
> > >         at
> hudson.remoting.InterceptingExecutorService.lambda$wrap$0(InterceptingExecutorService.java:78)
> > >         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> > >         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > >         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > >         at java.lang.Thread.run(Thread.java:748)
> > > Caused: java.lang.InterruptedException:
> java.lang.InterruptedException: no matches found within 10000
> > >         at hudson.FilePath.act(FilePath.java:1167)
> > >         at hudson.FilePath.act(FilePath.java:1154)
> > >         at hudson.FilePath.validateAntFileMask(FilePath.java:2937)
> > >         at
> hudson.tasks.ArtifactArchiver.perform(ArtifactArchiver.java:268)
> > >         at
> hudson.tasks.BuildStepCompatibilityLayer.perform(BuildStepCompatibilityLayer.java:78)
> > >         at
> hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
> > >         at
> hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:806)
> > >         at
> hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:755)
> > >         at hudson.model.Build$BuildExecution.post2(Build.java:178)
> > >         at
> hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:699)
> > >         at hudson.model.Run.execute(Run.java:1913)
> > >         at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
> > >         at
> hudson.model.ResourceController.execute(ResourceController.java:99)
> > >         at hudson.model.Executor.run(Executor.java:432)
> > > No artifacts found that match the file pattern
> "**/*.events,heapdumps/**,**/hs_err_pid*". Configuration error?
> > > Recording test results
> > > [Checks API] No suitable checks publisher found.
> > > Build step 'Publish JUnit test result report' changed build result to
> UNSTABLE
> > > Email was triggered for: Unstable (Test Failures)
> > > Sending email for trigger: Unstable (Test Failures)
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [email protected]
> > > For additional commands, e-mail: [email protected]
>
