[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15866514#comment-15866514 ]
Steve Rowe commented on LUCENE-7465:
------------------------------------
My Jenkins found a reproducing seed on master for a TestRandomChains failure
that implicates the new tokenizer:
{noformat}
[junit4] Suite: org.apache.lucene.analysis.core.TestRandomChains
[junit4] 2> TEST FAIL: useCharFilter=false text='puzoh
\u6a8b\u59e2\u96aa\u85f0\u614a\u9010\u7782\u5547
\uef27\uda09\uddd2\u9b9c\u056e\u33f0 W\udb24\udce6>
\u2d12\u2d23\u2d05\u2d1c\u2d23 *\ud9f0\udc74\uea94\ub9c6 pev trjrbvcwb tzzntfd
y|)]){1 </p> gmabf'
[junit4] 2> TEST FAIL: useCharFilter=false text='puzoh
\u6a8b\u59e2\u96aa\u85f0\u614a\u9010\u7782\u5547
\uef27\uda09\uddd2\u9b9c\u056e\u33f0 W\udb24\udce6>
\u2d12\u2d23\u2d05\u2d1c\u2d23 *\ud9f0\udc74\uea94\ub9c6 pev trjrbvcwb tzzntfd
y|)]){1 </p> gmabf'
[junit4] 2> TEST FAIL: useCharFilter=false text='puzoh
\u6a8b\u59e2\u96aa\u85f0\u614a\u9010\u7782\u5547
\uef27\uda09\uddd2\u9b9c\u056e\u33f0 W\udb24\udce6>
\u2d12\u2d23\u2d05\u2d1c\u2d23 *\ud9f0\udc74\uea94\ub9c6 pev trjrbvcwb tzzntfd
y|)]){1 </p> gmabf'
[junit4] 2> feb 14, 2017 2:13:13 P.M.
com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler
uncaughtException
[junit4] 2> WARNING: Uncaught exception in thread:
Thread[Thread-17,5,TGRP-TestRandomChains]
[junit4] 2> java.lang.AssertionError: finalOffset expected:<79> but
was:<65>
[junit4] 2> at
__randomizedtesting.SeedInfo.seed([3ABEF2F287EE4968]:0)
[junit4] 2> at org.junit.Assert.fail(Assert.java:93)
[junit4] 2> at org.junit.Assert.failNotEquals(Assert.java:647)
[junit4] 2> at org.junit.Assert.assertEquals(Assert.java:128)
[junit4] 2> at org.junit.Assert.assertEquals(Assert.java:472)
[junit4] 2> at
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:293)
[junit4] 2> at
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:308)
[junit4] 2> at
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:312)
[junit4] 2> at
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:843)
[junit4] 2> at
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642)
[junit4] 2> at
org.apache.lucene.analysis.BaseTokenStreamTestCase.access$000(BaseTokenStreamTestCase.java:66)
[junit4] 2> at
org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:510)
[junit4] 2>
[junit4] 2> feb 14, 2017 2:13:13 P.M.
com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler
uncaughtException
[junit4] 2> WARNING: Uncaught exception in thread:
Thread[Thread-18,5,TGRP-TestRandomChains]
[junit4] 2> java.lang.AssertionError: finalOffset expected:<79> but
was:<65>
[junit4] 2> at
__randomizedtesting.SeedInfo.seed([3ABEF2F287EE4968]:0)
[junit4] 2> at org.junit.Assert.fail(Assert.java:93)
[junit4] 2> at org.junit.Assert.failNotEquals(Assert.java:647)
[junit4] 2> at org.junit.Assert.assertEquals(Assert.java:128)
[junit4] 2> at org.junit.Assert.assertEquals(Assert.java:472)
[junit4] 2> at
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:293)
[junit4] 2> at
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:308)
[junit4] 2> at
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:312)
[junit4] 2> at
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:843)
[junit4] 2> at
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642)
[junit4] 2> at
org.apache.lucene.analysis.BaseTokenStreamTestCase.access$000(BaseTokenStreamTestCase.java:66)
[junit4] 2> at
org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:510)
[junit4] 2>
[junit4] 2> feb 14, 2017 2:13:13 P.M.
com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler
uncaughtException
[junit4] 2> WARNING: Uncaught exception in thread:
Thread[Thread-19,5,TGRP-TestRandomChains]
[junit4] 2> java.lang.AssertionError: finalOffset expected:<79> but
was:<65>
[junit4] 2> at
__randomizedtesting.SeedInfo.seed([3ABEF2F287EE4968]:0)
[junit4] 2> at org.junit.Assert.fail(Assert.java:93)
[junit4] 2> at org.junit.Assert.failNotEquals(Assert.java:647)
[junit4] 2> at org.junit.Assert.assertEquals(Assert.java:128)
[junit4] 2> at org.junit.Assert.assertEquals(Assert.java:472)
[junit4] 2> at
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:293)
[junit4] 2> at
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:308)
[junit4] 2> at
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:312)
[junit4] 2> at
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:843)
[junit4] 2> at
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642)
[junit4] 2> at
org.apache.lucene.analysis.BaseTokenStreamTestCase.access$000(BaseTokenStreamTestCase.java:66)
[junit4] 2> at
org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:510)
[junit4] 2>
[junit4] 2> Exception from random analyzer:
[junit4] 2> charfilters=
[junit4] 2>
org.apache.lucene.analysis.MockCharFilter(java.io.StringReader@754b8c24)
[junit4] 2>
org.apache.lucene.analysis.charfilter.HTMLStripCharFilter(org.apache.lucene.analysis.MockCharFilter@6f0a7841)
[junit4] 2> tokenizer=
[junit4] 2>
org.apache.lucene.analysis.pattern.SimplePatternSplitTokenizer(org.apache.lucene.util.automaton.Automaton@aa8d0c)
[junit4] 2> filters=
[junit4] 2> offsetsAreCorrect=true
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestRandomChains
-Dtests.method=testRandomChainsWithLargeStrings -Dtests.seed=3ABEF2F287EE4968
-Dtests.slow=true -Dtests.locale=es-US -Dtests.timezone=America/Montreal
-Dtests.asserts=true -Dtests.file.encoding=UTF-8
[junit4] ERROR 1.17s J1 |
TestRandomChains.testRandomChainsWithLargeStrings <<<
[junit4] > Throwable #1: java.lang.RuntimeException: some thread(s) failed
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:562)
[junit4] > at
org.apache.lucene.analysis.core.TestRandomChains.testRandomChainsWithLargeStrings(TestRandomChains.java:880)
[junit4] > at java.lang.Thread.run(Thread.java:745)Throwable #2:
com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught
exception in thread: Thread[id=70, name=Thread-17, state=RUNNABLE,
group=TGRP-TestRandomChains]
[junit4] > Caused by: java.lang.AssertionError: finalOffset expected:<79>
but was:<65>
[junit4] > at
__randomizedtesting.SeedInfo.seed([3ABEF2F287EE4968]:0)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:293)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:308)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:312)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:843)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.access$000(BaseTokenStreamTestCase.java:66)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:510)Throwable
#3: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an
uncaught exception in thread: Thread[id=72, name=Thread-19, state=RUNNABLE,
group=TGRP-TestRandomChains]
[junit4] > Caused by: java.lang.AssertionError: finalOffset expected:<79>
but was:<65>
[junit4] > at
__randomizedtesting.SeedInfo.seed([3ABEF2F287EE4968]:0)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:293)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:308)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:312)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:843)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.access$000(BaseTokenStreamTestCase.java:66)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:510)Throwable
#4: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an
uncaught exception in thread: Thread[id=71, name=Thread-18, state=RUNNABLE,
group=TGRP-TestRandomChains]
[junit4] > Caused by: java.lang.AssertionError: finalOffset expected:<79>
but was:<65>
[junit4] > at
__randomizedtesting.SeedInfo.seed([3ABEF2F287EE4968]:0)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:293)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:308)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:312)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:843)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.access$000(BaseTokenStreamTestCase.java:66)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:510)
[junit4] 2> NOTE: test params are: codec=CheapBastard,
sim=RandomSimilarity(queryNorm=false): {}, locale=es-US,
timezone=America/Montreal
[junit4] 2> NOTE: Linux 4.1.0-custom2-amd64 amd64/Oracle Corporation
1.8.0_77 (64-bit)/cpus=16,threads=1,free=358995304,total=524288000
{noformat}
> Add a PatternTokenizer that uses Lucene's RegExp implementation
> ---------------------------------------------------------------
>
> Key: LUCENE-7465
> URL: https://issues.apache.org/jira/browse/LUCENE-7465
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: master (7.0), 6.5
>
> Attachments: LUCENE-7465.patch, LUCENE-7465.patch
>
>
> I think there are some nice benefits to a version of PatternTokenizer that
> uses Lucene's RegExp impl instead of the JDK's:
> * Lucene's RegExp is compiled to a DFA up front, so if a "too hard" RegExp
> is attempted the user discovers it up front instead of later on when a
> "lucky" document arrives
> * It processes the incoming characters as a stream, only pulling 128
> characters at a time, vs the existing {{PatternTokenizer}} which currently
> reads the entire string up front (this has caused heap problems in the past)
> * It should be fast.
> I named it {{SimplePatternTokenizer}}, and it still needs a factory and
> improved tests, but I think it's otherwise close.
> It currently does not take a {{group}} parameter because Lucene's RegExps
> don't yet implement sub group capture. I think we could add that at some
> point, but it's a bit tricky.
> This doesn't even have group=-1 support (split on the pattern, like
> String.split). I think if we did that we should maybe name it differently
> ({{SimplePatternSplitTokenizer}}?).
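
For readers following the failure, here is a minimal, self-contained sketch of the two ideas in play: reading the input in fixed-size chunks rather than all at once, and reporting a final offset that covers every character consumed, including trailing text that produces no token. This is not Lucene's SimplePatternSplitTokenizer. The class name StreamingSplitSketch, its Token and Result types, and the use of plain whitespace as the split pattern (instead of an automaton) are all assumptions made for illustration, and char filter offset correction is left out entirely.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch only; not Lucene code. Splits on whitespace. */
class StreamingSplitSketch {

    /** A token plus its start (inclusive) and end (exclusive) offsets. */
    static final class Token {
        final String text;
        final int start;
        final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** All tokens, plus the total number of characters consumed. */
    static final class Result {
        final List<Token> tokens = new ArrayList<>();
        int finalOffset;
    }

    /** Reads at most bufferSize characters per call to the reader. */
    static Result split(Reader in, int bufferSize) throws IOException {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        Result result = new Result();
        char[] buffer = new char[bufferSize];
        StringBuilder pending = new StringBuilder(); // token being built
        int pendingStart = 0;                        // offset of its first char
        int offset = 0;                              // chars consumed so far
        int n;
        while ((n = in.read(buffer, 0, buffer.length)) != -1) {
            for (int i = 0; i < n; i++) {
                char c = buffer[i];
                if (Character.isWhitespace(c)) {
                    // A delimiter ends the pending token, if there is one.
                    if (pending.length() > 0) {
                        result.tokens.add(
                            new Token(pending.toString(), pendingStart, offset));
                        pending.setLength(0);
                    }
                } else {
                    if (pending.length() == 0) {
                        pendingStart = offset;
                    }
                    pending.append(c);
                }
                offset++;
            }
        }
        // Flush a token that runs to the end of the input.
        if (pending.length() > 0) {
            result.tokens.add(new Token(pending.toString(), pendingStart, offset));
        }
        // The final offset counts everything read, not just the last token's end.
        result.finalOffset = offset;
        return result;
    }

    /** Convenience overload for in-memory text. */
    static Result split(String text, int bufferSize) {
        try {
            return split(new StringReader(text), bufferSize);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, StreamingSplitSketch.split("ab  cd ", 2) reads the seven characters two at a time, yields "ab" at [0,2) and "cd" at [4,6), and reports a final offset of 7 rather than 6. As far as I understand the test framework, the finalOffset assertion in the log above checks a property of this kind. The sketch does not attempt to reproduce or diagnose the actual 79 vs. 65 mismatch.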