[ 
https://issues.apache.org/jira/browse/LUCENE-6977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102122#comment-15102122
 ] 

Markus Jelsma commented on LUCENE-6977:
---------------------------------------

Well, it is a bit difficult to explain. I've written a unit test that sends 
loads of random strings through the filter. Each input has a fixed output, named 
'stuff'. The filter config file looks like this:

{code}
random_input_a stuff
random_input_b stuff
random_input_c stuff
random_input_d stuff
random_input_e stuff
...
...
뮌?ﬡ뵄졂♪佞ፉ㥍薥 stuff
..
..
etc etc
{code}
All inputs listed should output 'stuff', no matter what. But it doesn't work for 
some random strings; they are somehow not recognized. For example, the randomly 
generated input 뮌?ﬡ뵄졂♪佞ፉ㥍薥 is not recognized: I do not get 'stuff' as output. 
Please run the testCrappyInputFailure() unit test. It inputs 50,000 random 
strings and usually fails at some point.
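
For readers without the patch at hand, the shape of such a check can be sketched roughly as follows. This is not the code from LUCENE-6977.patch: the class and method names are hypothetical, the random generator is plain java.util.Random rather than commons-lang3, and the lookup is passed in as a function so that a real filter could be plugged in. Failing inputs are printed as hex code units, since inputs like the one above do not always survive copy and paste.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.function.Function;

// Hypothetical harness; a sketch of the test idea, not the attached patch.
class RandomOverrideCheck {

  /** Random string of UTF-16 code units; may contain unpaired surrogates. */
  static String randomString(Random random, int length) {
    StringBuilder sb = new StringBuilder(length);
    for (int i = 0; i < length; i++) {
      char c;
      do {
        c = (char) random.nextInt(0x10000);
      } while (Character.isWhitespace(c)); // keep the input a single token
      sb.append(c);
    }
    return sb.toString();
  }

  /** Renders each UTF-16 code unit as four hex digits, space separated. */
  static String toHex(String s) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
      if (i > 0) {
        sb.append(' ');
      }
      sb.append(String.format("%04X", (int) s.charAt(i)));
    }
    return sb.toString();
  }

  /** Maps 'count' distinct random inputs to the same fixed output. */
  static Map<String, String> buildDictionary(long seed, int count, int length, String output) {
    Random random = new Random(seed); // fixed seed makes failures reproducible
    Map<String, String> dict = new LinkedHashMap<>();
    while (dict.size() < count) {
      dict.put(randomString(random, length), output);
    }
    return dict;
  }

  /** Returns every input for which the lookup did not give the expected output. */
  static List<String> findFailures(Map<String, String> dict, Function<String, String> lookup) {
    List<String> failures = new ArrayList<>();
    for (Map.Entry<String, String> e : dict.entrySet()) {
      if (!e.getValue().equals(lookup.apply(e.getKey()))) {
        failures.add(e.getKey());
      }
    }
    return failures;
  }

  public static void main(String[] args) {
    Map<String, String> dict = buildDictionary(42L, 50_000, 10, "stuff");
    // dict::get is only a stand-in; replace it with a call into the filter under test.
    List<String> failures = findFailures(dict, dict::get);
    System.out.println("failures: " + failures.size());
    for (String input : failures) {
      System.out.println(toHex(input));
    }
  }
}
```

With the map itself as the lookup, no failures are expected; the interesting run is the one where the lookup goes through the filter.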

> Possible bug in StemmerOverrideFilter / FST
> -------------------------------------------
>
>                 Key: LUCENE-6977
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6977
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 5.4
>            Reporter: Markus Jelsma
>            Priority: Minor
>             Fix For: 5.5
>
>         Attachments: LUCENE-6977.patch
>
>
> We ran across an issue in a custom token filter that, like the 
> StemmerOverrideFilter, relies on the FST. The issue is reproducible in the 
> StemmerOverrideFilter. I am not sure whether it is a real problem in the FST.
> I attached a patch with a unit test that is expected to fail. It uses random 
> input generated with some code from commons-lang3.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
