Swap URL+Email recognizing StandardTokenizer and UAX29Tokenizer
---------------------------------------------------------------
Key: LUCENE-2763
URL: https://issues.apache.org/jira/browse/LUCENE-2763
Project: Lucene - Java
Issue Type: Improvement
Components: Analysis
Affects Versions: 3.1, 4.0
Reporter: Steven Rowe
Fix For: 3.1, 4.0
Currently, in addition to implementing the UAX#29 word boundary rules,
StandardTokenizer recognizes email addresses and URLs, but it provides no way
to turn this behavior off, nor to emit overlapping tokens for the components
(username from email address, hostname from URL, etc.).
UAX29Tokenizer should become StandardTokenizer, and the current
StandardTokenizer should be renamed to something like UAX29TokenizerPlusPlus.
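To make the difference between the two behaviors concrete, here is a toy
sketch in plain Java. It does not use Lucene and does not implement the UAX#29
rules: the class name TokenizerSketch, both method names, and the deliberately
naive regular expressions are assumptions made only for illustration. One
method splits on runs of letters and digits; the other additionally keeps
anything that looks like a URL or an email address as a single token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not Lucene code and not real UAX#29.
public class TokenizerSketch {
    // Crude stand-in for word splitting: runs of letters or digits.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    // Same as WORD, but URL and email alternatives are tried first so
    // that they win over plain words at the same start position.
    private static final Pattern WORD_URL_EMAIL = Pattern.compile(
        "https?://[^\\s]+"                                        // naive URL
        + "|[\\p{L}\\p{N}._%+-]+@[\\p{L}\\p{N}.-]+\\.[\\p{L}]{2,}" // naive email
        + "|[\\p{L}\\p{N}]+");                                     // plain word

    // Collect every non-overlapping match of the pattern, left to right.
    private static List<String> tokenize(Pattern pattern, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Word tokens only; URLs and emails are broken into their parts.
    public static List<String> wordsOnly(String text) {
        return tokenize(WORD, text);
    }

    // Word tokens, with URLs and email addresses kept whole.
    public static List<String> wordsUrlsEmails(String text) {
        return tokenize(WORD_URL_EMAIL, text);
    }

    public static void main(String[] args) {
        String text = "Mail joe@example.com or see http://lucene.apache.org/java";
        System.out.println(wordsOnly(text));
        System.out.println(wordsUrlsEmails(text));
    }
}
```

With the sample sentence, the first method yields joe, example, and com as
separate tokens, while the second yields joe@example.com as one token.
Neither produces both forms at once, which is the overlapping-token gap
described above.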
For rationale, see [the discussion at the reopened
LUCENE-2167|https://issues.apache.org/jira/browse/LUCENE-2167?focusedCommentId=12929325&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12929325].