[
https://issues.apache.org/jira/browse/LUCENE-6977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102122#comment-15102122
]
Markus Jelsma commented on LUCENE-6977:
---------------------------------------
Well, it is a bit difficult to explain. I've written a unit test that sends
loads of random strings into the filter. Each input has a fixed output, named
'stuff'. The filter config file looks like this:
{code}
random_input_a stuff
random_input_b stuff
random_input_c stuff
random_input_d stuff
random_input_e stuff
...
...
뮌?ﬡ뵄졂♪佞ፉ㥍薥 stuff
..
..
etc etc
{code}
All inputs listed should output 'stuff', no matter what. But it doesn't work for
some random strings; they are somehow not recognized. For example, the randomly
generated input 뮌?ﬡ뵄졂♪佞ፉ㥍薥 is not recognized: I do not get 'stuff' as output.
Please run the testCrappyInputFailure() unit test. It inputs 50,000 random
strings and usually fails at some point.
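For readers without the patch at hand, the shape of such a check can be sketched
roughly as follows. This is not the code from LUCENE-6977.patch: the class and
method names are hypothetical, a plain map stands in for the FST-backed filter so
the sketch runs without Lucene, and the random strings are restricted to defined,
non-surrogate, non-whitespace code points, which may differ from what the
commons-lang3 based test generates.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.function.Function;

// Sketch only: all names here are hypothetical, and a map lookup stands in
// for the real filter. Plug the real lookup in via the Function parameter.
class RandomOverrideCheck {

    // Builds a string of `length` random code points, skipping unassigned
    // code points, surrogates, and whitespace (assumption: whitespace would
    // break a whitespace-separated config line).
    static String randomInput(Random random, int length) {
        StringBuilder sb = new StringBuilder();
        int appended = 0;
        while (appended < length) {
            int cp = random.nextInt(Character.MAX_CODE_POINT + 1);
            if (Character.isDefined(cp)
                    && Character.getType(cp) != Character.SURROGATE
                    && !Character.isWhitespace(cp)) {
                sb.appendCodePoint(cp);
                appended++;
            }
        }
        return sb.toString();
    }

    // Maps every generated input to the same fixed output, 'stuff'.
    static Map<String, String> buildDictionary(Random random, int count, int length) {
        Map<String, String> dictionary = new LinkedHashMap<>();
        for (int i = 0; i < count; i++) {
            dictionary.put(randomInput(random, length), "stuff");
        }
        return dictionary;
    }

    // Returns every input for which the lookup does not yield the expected output.
    static List<String> findMisses(Map<String, String> dictionary,
                                   Function<String, String> lookup) {
        List<String> misses = new ArrayList<>();
        for (Map.Entry<String, String> entry : dictionary.entrySet()) {
            if (!entry.getValue().equals(lookup.apply(entry.getKey()))) {
                misses.add(entry.getKey());
            }
        }
        return misses;
    }

    public static void main(String[] args) {
        Map<String, String> dictionary = buildDictionary(new Random(), 50_000, 10);
        // Stand-in lookup; replace with a call into the filter under test.
        List<String> misses = findMisses(dictionary, dictionary::get);
        System.out.println("inputs not recognized: " + misses.size());
    }
}
```

Collecting the misses instead of stopping at the first one makes it easier to
see whether the unrecognized inputs have anything in common.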
> Possible bug in StemmerOverrideFilter / FST
> -------------------------------------------
>
> Key: LUCENE-6977
> URL: https://issues.apache.org/jira/browse/LUCENE-6977
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/analysis
> Affects Versions: 5.4
> Reporter: Markus Jelsma
> Priority: Minor
> Fix For: 5.5
>
> Attachments: LUCENE-6977.patch
>
>
> We ran across an issue in a custom token filter that, like the
> StemmerOverrideFilter, relies on the FST. The issue is reproducible in the
> StemmerOverrideFilter. I am not sure whether it is a real problem in the FST.
> I attached a patch with a unit test that is going to fail. It uses random
> input with some code from commons-lang3.