[
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16486886#comment-16486886
]
Alan Woodward commented on LUCENE-8273:
---------------------------------------
The Elastic CI has found some reproducing seeds in TestRandomChains that look
like the following:
{code}
Suite: org.apache.lucene.analysis.core.TestRandomChains
01:47:39 [junit4] 2> Exception from random analyzer:
01:47:39 [junit4] 2> charfilters=
01:47:39 [junit4] 2>
org.apache.lucene.analysis.fa.PersianCharFilter(java.io.StringReader@36de1051)
01:47:39 [junit4] 2>
org.apache.lucene.analysis.charfilter.MappingCharFilter(org.apache.lucene.analysis.charfilter.NormalizeCharMap@31483c67,
org.apache.lucene.analysis.fa.PersianCharFilter@51a9d324)
01:47:39 [junit4] 2> tokenizer=
01:47:39 [junit4] 2>
org.apache.lucene.analysis.core.UnicodeWhitespaceTokenizer(org.apache.lucene.util.AttributeFactory$1@27232fb3,
35)
01:47:39 [junit4] 2> filters=ConditionalTokenFilter:
01:47:39 [junit4] 2>
org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter(OneTimeWrapper@5f621e45
term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,
org.apache.lucene.analysis.compound.hyphenation.HyphenationTree@40cdd67e)ConditionalTokenFilter:
01:47:39 [junit4] 2>
org.apache.lucene.analysis.in.IndicNormalizationFilter(OneTimeWrapper@2de2e47c
term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1)ConditionalTokenFilter:
01:47:39 [junit4] 2>
org.apache.lucene.analysis.MockRandomLookaheadTokenFilter(java.util.Random@4ced13ac,
OneTimeWrapper@7d30a80d
term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1)
01:47:39 [junit4] 2> NOTE: reproduce with: ant test
-Dtestcase=TestRandomChains -Dtests.method=testRandomChainsWithLargeStrings
-Dtests.seed=72E157E8E16C0F79 -Dtests.slow=true -Dtests.badapples=true
-Dtests.locale=en-US -Dtests.timezone=America/Anguilla -Dtests.asserts=true
-Dtests.file.encoding=US-ASCII
01:47:39 [junit4] FAILURE 0.57s J0 |
TestRandomChains.testRandomChainsWithLargeStrings <<<
01:47:39 [junit4] > Throwable #1: java.lang.AssertionError
01:47:39 [junit4] > at
__randomizedtesting.SeedInfo.seed([72E157E8E16C0F79:18BAE8F9B8222F8A]:0)
01:47:39 [junit4] > at
org.apache.lucene.analysis.LookaheadTokenFilter.peekToken(LookaheadTokenFilter.java:140)
{code}
The root cause is that LookaheadTokenFilter doesn't play well with
ConditionalTokenFilter when we have stacked tokens:
- CTF works by presenting the underlying TokenStream to its wrapped filter as a
series of snippets, demarcated by tokens that don't pass the {{shouldFilter()}}
test. When a new snippet is started (i.e. when a token that passes
{{shouldFilter()}} appears after one that doesn't) then {{reset()}} is called
on the delegate, and when it stops (i.e. when a token that doesn't pass
{{shouldFilter()}} appears) then {{end()}} is called.
- This means that if we have stacked tokens, with the first not passing
{{shouldFilter()}} and the second passing it, the wrapped filter can see a
TokenStream that has an initial position increment of 0.
- LookaheadTokenFilter has an explicit assertion that checks we don't have an
initial posInc of 0.
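To make the sequence above concrete, here is a toy model in plain Java. It does
not use the Lucene API: {{Tok}}, {{snippets()}} and the hyphen predicate are
hypothetical stand-ins, and the grouping only approximates what the delegate
sees between {{reset()}} and {{end()}}.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class SnippetSketch {
    /** Hypothetical stand-in for a token: just a term and a position increment. */
    public static final class Tok {
        public final String term;
        public final int posInc;

        public Tok(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }
    }

    /**
     * Groups consecutive tokens that pass the test into snippets. A token that
     * fails the test closes the current snippet (where end() would be called);
     * the next passing token opens a new one (where reset() would be called).
     */
    public static List<List<Tok>> snippets(List<Tok> stream, Predicate<Tok> shouldFilter) {
        List<List<Tok>> out = new ArrayList<>();
        List<Tok> current = null;
        for (Tok t : stream) {
            if (shouldFilter.test(t)) {
                if (current == null) {
                    current = new ArrayList<>();
                    out.add(current);
                }
                current.add(t);
            } else {
                current = null;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Stacked tokens: "wi-fi" sits at the same position as "wifi" (posInc 0).
        List<Tok> stream = Arrays.asList(
            new Tok("wifi", 1), new Tok("wi-fi", 0), new Tok("router", 1));
        // Assumed predicate: only terms containing a hyphen go to the delegate.
        List<List<Tok>> seen = snippets(stream, t -> t.term.contains("-"));
        Tok first = seen.get(0).get(0);
        System.out.println(first.term + " starts a snippet with posInc=" + first.posInc);
    }
}
```

In this model the only snippet starts with the stacked token, so the wrapped
filter's first token carries a posInc of 0, which is the condition the
assertion in LookaheadTokenFilter rejects.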
I think this can be fixed by adjusting the posInc when we're delegating: the
delegated snippet always starts with a posInc of 1, and the CTF then adjusts it
back down before the token is emitted.
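A rough sketch of that adjustment, again as a toy model rather than the actual
patch: {{Tok}}, {{bumpFor()}} and {{shiftFirst()}} are hypothetical names, the
delegate is assumed to pass tokens through unchanged, and a real filter that
drops or inserts tokens would need more bookkeeping than this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PosIncAdjustSketch {
    /** Hypothetical stand-in for a token: just a term and a position increment. */
    public static final class Tok {
        public final String term;
        public final int posInc;

        public Tok(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }
    }

    /** How much to add so the delegate sees a first posInc of at least 1. */
    public static int bumpFor(List<Tok> snippet) {
        // Assumes a non-empty snippet.
        return snippet.get(0).posInc == 0 ? 1 : 0;
    }

    /** Returns a copy of the tokens with delta added to the first posInc only. */
    public static List<Tok> shiftFirst(List<Tok> toks, int delta) {
        List<Tok> out = new ArrayList<>(toks);
        Tok first = out.get(0);
        out.set(0, new Tok(first.term, first.posInc + delta));
        return out;
    }

    public static void main(String[] args) {
        // A snippet that starts on a stacked token.
        List<Tok> snippet = Arrays.asList(new Tok("wi-fi", 0), new Tok("x-y", 1));
        int bump = bumpFor(snippet);
        // What the wrapped filter is handed: first posInc raised to 1.
        List<Tok> delegated = shiftFirst(snippet, bump);
        // What the conditional filter emits: the bump is taken back off.
        List<Tok> emitted = shiftFirst(delegated, -bump);
        System.out.println("delegate sees posInc=" + delegated.get(0).posInc
            + ", emitted posInc=" + emitted.get(0).posInc);
    }
}
```

The point of keeping the bump as a separate value is that the wrapped filter
never observes a leading posInc of 0, while the positions in the emitted stream
stay the same as in the input.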
> Add a ConditionalTokenFilter
> ----------------------------
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Alan Woodward
> Assignee: Alan Woodward
> Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch,
> LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch,
> LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch,
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch,
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter
> in such a way that it could optionally be bypassed based on the current state
> of the TokenStream. This could be used to, for example, only apply
> WordDelimiterFilter to terms that contain hyphens.