[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

Alan Woodward (JIRA) Wed, 23 May 2018 01:15:16 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16486886#comment-16486886
 ]


Alan Woodward commented on LUCENE-8273:
---------------------------------------

The Elastic CI has found some reproducible failures in TestRandomChains; a 
typical one looks like this:
{code}
Suite: org.apache.lucene.analysis.core.TestRandomChains
01:47:39    [junit4]   2> Exception from random analyzer:
01:47:39    [junit4]   2> charfilters=
01:47:39    [junit4]   2>   org.apache.lucene.analysis.fa.PersianCharFilter(java.io.StringReader@36de1051)
01:47:39    [junit4]   2>   org.apache.lucene.analysis.charfilter.MappingCharFilter(org.apache.lucene.analysis.charfilter.NormalizeCharMap@31483c67, org.apache.lucene.analysis.fa.PersianCharFilter@51a9d324)
01:47:39    [junit4]   2> tokenizer=
01:47:39    [junit4]   2>   org.apache.lucene.analysis.core.UnicodeWhitespaceTokenizer(org.apache.lucene.util.AttributeFactory$1@27232fb3, 35)
01:47:39    [junit4]   2> filters=ConditionalTokenFilter:
01:47:39    [junit4]   2>   org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter(OneTimeWrapper@5f621e45 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1, org.apache.lucene.analysis.compound.hyphenation.HyphenationTree@40cdd67e)ConditionalTokenFilter:
01:47:39    [junit4]   2>   org.apache.lucene.analysis.in.IndicNormalizationFilter(OneTimeWrapper@2de2e47c term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1)ConditionalTokenFilter:
01:47:39    [junit4]   2>   org.apache.lucene.analysis.MockRandomLookaheadTokenFilter(java.util.Random@4ced13ac, OneTimeWrapper@7d30a80d term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1)
01:47:39    [junit4]   2> NOTE: reproduce with: ant test -Dtestcase=TestRandomChains -Dtests.method=testRandomChainsWithLargeStrings -Dtests.seed=72E157E8E16C0F79 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=en-US -Dtests.timezone=America/Anguilla -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
01:47:39    [junit4] FAILURE 0.57s J0 | TestRandomChains.testRandomChainsWithLargeStrings <<<
01:47:39    [junit4]    > Throwable #1: java.lang.AssertionError
01:47:39    [junit4]    >       at __randomizedtesting.SeedInfo.seed([72E157E8E16C0F79:18BAE8F9B8222F8A]:0)
01:47:39    [junit4]    >       at org.apache.lucene.analysis.LookaheadTokenFilter.peekToken(LookaheadTokenFilter.java:140)
{code}

The root cause is that LookaheadTokenFilter doesn't play well with 
ConditionalTokenFilter (CTF) when the stream contains stacked tokens:
- CTF works by presenting the underlying TokenStream to its wrapped filter as a 
series of snippets, demarcated by tokens that don't pass the {{shouldFilter()}} 
test.  When a snippet starts (i.e. when a token that passes 
{{shouldFilter()}} follows one that doesn't), {{reset()}} is called on the 
delegate; when it ends (i.e. when a token that doesn't pass 
{{shouldFilter()}} appears), {{end()}} is called.
- This means that if we have stacked tokens, with the first not passing 
{{shouldFilter()}} and the second passing it, the wrapped filter can see a 
TokenStream whose first token has a position increment of 0.
- LookaheadTokenFilter explicitly asserts that the first token it sees does not 
have a posInc of 0, which is the AssertionError in {{peekToken()}} in the 
trace above.
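
To make the failure mode concrete, here is a minimal standalone sketch of the snippet splitting described above. It does not use the Lucene API: {{Token}}, {{snippets()}} and the sample terms are hypothetical stand-ins invented for illustration, and the predicate plays the role of {{shouldFilter()}}.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class SnippetSketch {
    /** Hypothetical stand-in for a token: only a term and a position increment. */
    static final class Token {
        final String term;
        final int posInc;
        Token(String term, int posInc) { this.term = term; this.posInc = posInc; }
    }

    /**
     * Splits the input into runs of consecutive tokens that pass shouldFilter.
     * Each run models one snippet the wrapped filter would see between reset() and end().
     */
    static List<List<Token>> snippets(List<Token> input, Predicate<Token> shouldFilter) {
        List<List<Token>> result = new ArrayList<>();
        List<Token> current = null;
        for (Token t : input) {
            if (shouldFilter.test(t)) {
                if (current == null) {      // a passing token after a non-passing one: new snippet
                    current = new ArrayList<>();
                    result.add(current);
                }
                current.add(t);
            } else {
                current = null;             // a non-passing token closes the current snippet
            }
        }
        return result;
    }

    /** "wi-fi" is stacked on "wifi" (posInc 0); only hyphenated terms pass the test. */
    static int firstPosIncOfFirstSnippet() {
        List<Token> input = new ArrayList<>();
        input.add(new Token("wifi", 1));
        input.add(new Token("wi-fi", 0));
        input.add(new Token("router", 1));
        List<List<Token>> s = snippets(input, t -> t.term.contains("-"));
        return s.get(0).get(0).posInc;
    }

    public static void main(String[] args) {
        System.out.println(firstPosIncOfFirstSnippet()); // prints 0
    }
}
```

The only snippet here starts with the stacked token, so the delegate's view of the stream begins with a posInc of 0, which is the condition the assertion rejects.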

I think this can be fixed by adjusting the posInc while we're delegating: the 
delegated snippet is presented with an initial posInc of 1, and CTF then 
adjusts it back downwards before the token is emitted.
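
A rough sketch of that adjustment, as standalone Java rather than the actual patch. The class and method names ({{PosIncAdjustSketch}}, {{presentToDelegate()}}, {{emit()}}) are hypothetical, and the sketch assumes the correction only needs to be applied to the first token that comes back out of the delegate.

```java
public class PosIncAdjustSketch {
    // Amount added to the snippet's first posInc that still has to be removed on output.
    private int adjustment = 0;

    /** Returns the posInc shown to the wrapped filter for the first token of a new snippet. */
    public int presentToDelegate(int originalPosInc) {
        if (originalPosInc == 0) {
            adjustment = 1;     // hide the stacked position from the delegate
            return 1;
        }
        adjustment = 0;
        return originalPosInc;
    }

    /** Returns the posInc emitted downstream for a token produced by the wrapped filter. */
    public int emit(int delegatePosInc) {
        int out = delegatePosInc - adjustment;
        adjustment = 0;         // the correction is applied once per snippet
        return out;
    }

    public static void main(String[] args) {
        PosIncAdjustSketch ctf = new PosIncAdjustSketch();
        System.out.println(ctf.presentToDelegate(0)); // prints 1
        System.out.println(ctf.emit(1));              // prints 0
        System.out.println(ctf.emit(1));              // prints 1
    }
}
```

Subtracting the adjustment, rather than forcing the output to 0, keeps any extra increment the delegate itself introduces, for example if it removes the first token of the snippet.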

> Add a ConditionalTokenFilter
> ----------------------------
>
>                 Key: LUCENE-8273
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8273
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>             Fix For: 7.4
>
>         Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, 
> LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, 
> LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.




