[ https://issues.apache.org/jira/browse/LUCENE-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529839#comment-17529839 ]
ASF subversion and git services commented on LUCENE-10541:
----------------------------------------------------------
Commit e7684708935a859c2255912457f95a616010cfea in lucene's branch
refs/heads/branch_9x from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e7684708935 ]
LUCENE-10541: Test-framework: limit the default length of MockTokenizer tokens
to 255.
> What to do about massive terms in our Wikipedia EN LineFileDocs?
> ----------------------------------------------------------------
>
> Key: LUCENE-10541
> URL: https://issues.apache.org/jira/browse/LUCENE-10541
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Priority: Major
> Time Spent: 3h
> Remaining Estimate: 0h
>
> Spinoff from this fun build failure that [~dweiss] root caused:
> [https://lucene.markmail.org/thread/pculfuazll4oebra]
> Thank you and sorry [~dweiss]!!
> This test failure happened because the test case randomly indexed a chunk of
> the nightly (many GBs) LineFileDocs Wikipedia file that had a massive (> IW's
> ~32 KB limit) term, and IW threw an {{IllegalArgumentException}} failing the
> test.
> It's crazy that it took so long for Lucene's randomized tests to discover
> this too-massive term in Lucene's nightly benchmarks. It's like searching
> for Nessie, or
> [SETI|https://en.wikipedia.org/wiki/Search_for_extraterrestrial_intelligence].
> We need to prevent such false failures, somehow, and there are multiple
> options: fix this test to not use {{LineFileDocs}}, remove all "massive"
> terms from all tests (nightly and git) {{LineFileDocs}}, fix
> {{MockTokenizer}} to trim such ridiculous terms (I think this is the best
> option?), ...
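
To illustrate the option favored above (having the tokenizer cap the length of the terms it emits), here is a minimal standalone sketch in plain Java. It is not MockTokenizer's actual code and does not use any Lucene API: the class name TokenCapSketch, the tokenize method, and the chunking behavior are all assumptions made for illustration. Only the value 255 comes from the commit message. Note also that the sketch counts Java chars, whereas the IndexWriter limit mentioned above applies to the encoded term, so this shows the general idea only.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenCapSketch {
    // The cap named in the commit message; the constant name is hypothetical.
    static final int DEFAULT_MAX_TOKEN_LENGTH = 255;

    /**
     * Splits text on whitespace. Any run of non-whitespace longer than maxLen
     * is emitted as consecutive chunks of at most maxLen chars, so no single
     * token can exceed the cap. Lengths are counted in UTF-16 chars, so a
     * surrogate pair may be split across two chunks (ignored in this sketch).
     */
    static List<String> tokenize(String text, int maxLen) {
        if (maxLen <= 0) {
            throw new IllegalArgumentException("maxLen must be positive");
        }
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                flush(current, tokens);          // whitespace ends the token
            } else {
                current.append(c);
                if (current.length() == maxLen) {
                    flush(current, tokens);      // cap reached: emit and start a new token
                }
            }
        }
        flush(current, tokens);                  // emit any trailing token
        return tokens;
    }

    // Emits the buffered token, if any, and resets the buffer.
    private static void flush(StringBuilder current, List<String> tokens) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        // A 40,000-char "term", comparable to the massive terms discussed above.
        String massive = "x".repeat(40_000);
        List<String> tokens = tokenize("short " + massive + " tail", DEFAULT_MAX_TOKEN_LENGTH);
        int longest = tokens.stream().mapToInt(String::length).max().orElse(0);
        System.out.println(tokens.size() + " tokens, longest = " + longest);
    }
}
```

With a cap like this in the tokenizer, a randomly chosen chunk of the line file can no longer produce a term that the indexer rejects, and tests that want to exercise oversized terms can still request a larger limit explicitly.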