LowerCaseTokenizer Does Not Behave As One Might Expect (or Desire)--Given Its
Name
----------------------------------------------------------------------------------
Key: LUCENE-2644
URL: https://issues.apache.org/jira/browse/LUCENE-2644
Project: Lucene - Java
Issue Type: Bug
Components: Analysis
Affects Versions: 3.0.2
Reporter: Scott Gonyea
Fix For: 3.0.3, 3.1, Realtime Branch, 4.0
While I understand some of the reasons for its design, the original
LowerCaseTokenizer should have been named LowerCaseLetterTokenizer.
I feel that LowerCaseTokenizer makes too many assumptions about what to
tokenize, and I have therefore patched it. The *default* behavior remains
unchanged, to avoid breaking any existing implementations that rely on it.
I have changed LowerCaseTokenizer to extend CharTokenizer (rather than
LetterTokenizer). LetterTokenizer's functionality was merged into the default
behavior of LowerCaseTokenizer.
Getter/Setter methods have been added to the LowerCaseTokenizer class, allowing
you to turn tokenizing on or off for white space, numbers, and special
(non-alphanumeric) characters.
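
To illustrate the idea, here is a minimal standalone sketch of a lowercasing
tokenizer with such switches. It is not the attached patch and does not use the
Lucene API: the class name ConfigurableLowerCaseSketch, the setter names, and
the tokenize(String) helper are all hypothetical, chosen only for this example.
With every switch left at its default, only letters are kept as token
characters, which is meant to mirror the existing letter-only behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration only; not the patched Lucene class.
final class ConfigurableLowerCaseSketch {
    // Defaults: each character class acts as a token separator,
    // so only letters end up inside tokens.
    private boolean splitOnWhitespace = true;
    private boolean splitOnNumbers = true;
    private boolean splitOnSpecial = true;

    void setSplitOnWhitespace(boolean value) { splitOnWhitespace = value; }
    void setSplitOnNumbers(boolean value) { splitOnNumbers = value; }
    void setSplitOnSpecial(boolean value) { splitOnSpecial = value; }

    boolean isSplitOnWhitespace() { return splitOnWhitespace; }
    boolean isSplitOnNumbers() { return splitOnNumbers; }
    boolean isSplitOnSpecial() { return splitOnSpecial; }

    // Decides whether a code point belongs inside a token.
    boolean isTokenChar(int c) {
        if (Character.isLetter(c)) return true;
        if (Character.isWhitespace(c)) return !splitOnWhitespace;
        if (Character.isDigit(c)) return !splitOnNumbers;
        return !splitOnSpecial; // everything else counts as "special"
    }

    // Splits the text on non-token characters and lowercases what is kept.
    List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int c = text.codePointAt(i);
            i += Character.charCount(c);
            if (isTokenChar(c)) {
                current.appendCodePoint(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

Under these assumptions, tokenizing "Wi-Fi 802.11n" with the defaults yields
[wi, fi, n]; after calling setSplitOnNumbers(false), the same input yields
[wi, fi, 802, 11n], because digits are then kept as token characters.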
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.