The StandardAnalyzer tokenizer doesn't tokenize on all tokens when numbers are
present in the original string
-------------------------------------------------------------------------------------------------------------
Key: LUCENENET-354
URL: https://issues.apache.org/jira/browse/LUCENENET-354
Project: Lucene.Net
Issue Type: Bug
Environment: Lucene.Net 2.9.1
Reporter: Matt Dufrasne
I think there is a bug in the tokenizer in Lucene.Net 2.9.1, and it was
probably present in earlier versions as well. When indexing
"BB_HHH_FFFF5_SSSS", where one segment contains a digit, the following tokens
are returned:
"bb hhh_ffff5_ssss"
After some testing, I've found that the digit is the cause. If I input
"BB_HHH_FFFF_SSSS" (no digit), I get:
"bb hhh ffff ssss"
At this point, I'm leaning towards a tokenizer bug, unless a digit is supposed
to cause this behavior, but I fail to see why it would.
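
One possible explanation, offered as an assumption rather than a confirmed
diagnosis: the classic StandardTokenizer grammar in Java Lucene 2.x appears to
keep runs such as serial or model numbers together when segments are joined by
a single "_", "-", "/", "." or "," and every other segment contains a digit.
The Python sketch below is a simplified, ASCII-only model of that rule, not
the Lucene.Net code. The names naive_tokens, model_tokens and JOINERS are
hypothetical, and the model ignores the other grammar rules and the stop
filter. It reproduces both outputs from this report.

```python
import re

# Assumed joiner punctuation for the "number-like" rule (an assumption, see above).
JOINERS = "_-/.,"


def has_digit(segment):
    """True if the segment contains at least one digit."""
    return any(ch.isdigit() for ch in segment)


def naive_tokens(text):
    """What the report expects: split on every non-alphanumeric, lowercase."""
    return [m.group().lower() for m in re.finditer(r"[0-9A-Za-z]+", text)]


def model_tokens(text):
    """Simplified model: chain alphanumeric runs joined by one joiner
    character when every other run in the chain contains a digit, and
    take the longest such chain starting at each position."""
    runs = [(m.group(), m.start(), m.end())
            for m in re.finditer(r"[0-9A-Za-z]+", text)]
    tokens = []
    i = 0
    while i < len(runs):
        best = i  # index of the last run included in this token
        # parity 0: runs at even offsets need a digit; parity 1: odd offsets.
        for parity in (0, 1):
            if parity == 0 and not has_digit(runs[i][0]):
                continue
            j = i
            while j + 1 < len(runs):
                gap = text[runs[j][2]:runs[j + 1][1]]
                offset = j + 1 - i
                if len(gap) != 1 or gap not in JOINERS:
                    break
                if offset % 2 == parity and not has_digit(runs[j + 1][0]):
                    break
                j += 1
            best = max(best, j)
        tokens.append(text[runs[i][1]:runs[best][2]].lower())
        i = best + 1
    return tokens


if __name__ == "__main__":
    for sample in ("BB_HHH_FFFF5_SSSS", "BB_HHH_FFFF_SSSS"):
        print(sample, "model:", model_tokens(sample),
              "naive:", naive_tokens(sample))
```

In this model "BB" stands alone because "HHH" has no digit, while "HHH" is
followed by "_FFFF5", which does contain a digit, so the chain continues
through "_SSSS". If the real grammar works this way, the observed output would
be by design rather than a bug, but that needs checking against the Lucene.Net
2.9.1 tokenizer source.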