some valid email address characters not correctly recognized
------------------------------------------------------------
Key: LUCENE-1556
URL: https://issues.apache.org/jira/browse/LUCENE-1556
Project: Lucene - Java
Issue Type: Bug
Components: Analysis
Affects Versions: 2.4.1
Reporter: Paul Nilsson
Priority: Trivial
The EMAIL expression in StandardTokenizerImpl.jflex misses some unusual but
valid characters in the local part (left-hand side) of the email address. This
causes an address to be broken into several tokens, for example:
[email protected] gets broken into "somename" and "[email protected]"
husband&[email protected] gets broken into "husband" and "[email protected]"
These seem to be occurring more often. The first seems to come from an
anti-spam trick you can use with Gmail (see:
http://labnol.blogspot.com/2007/08/gmail-plus-smart-trick-to-find-block.html).
I see the second in several domains, but a disproportionate number are from
talktalk.net, so I expect it's a signup suggestion from the service.
Perhaps a fix would be to change line 102 of StandardTokenizerImpl.jflex from:
EMAIL = {ALPHANUM} (("."|"-"|"_") {ALPHANUM})* "@" {ALPHANUM} (("."|"-")
{ALPHANUM})+
to
EMAIL = {ALPHANUM} (("."|"-"|"_"|"+"|"&") {ALPHANUM})* "@" {ALPHANUM}
(("."|"-") {ALPHANUM})+
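
As a rough illustration of the effect of this change, the two rules can be
approximated with java.util.regex. This is only a sketch under assumptions:
ALNUM below is ASCII letters and digits, which is narrower than the grammar's
{ALPHANUM}; the class name EmailRuleSketch, its helper methods, and the
example.com / example.net addresses are hypothetical; and a Java regex is not
the JFlex scanner, so it only shows which substrings each rule can cover.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: approximates the current and proposed EMAIL rules
// with java.util.regex. It is not the generated StandardTokenizerImpl.
final class EmailRuleSketch {
    // Assumption: ASCII-only stand-in for the grammar's {ALPHANUM}.
    static final String ALNUM = "[A-Za-z0-9]+";

    // Current rule: only ".", "-" and "_" may join parts of the local part.
    static final Pattern OLD_EMAIL = Pattern.compile(
        ALNUM + "(?:[._-]" + ALNUM + ")*@" + ALNUM + "(?:[.-]" + ALNUM + ")+");

    // Proposed rule: additionally allows "+" and "&" in the local part.
    static final Pattern NEW_EMAIL = Pattern.compile(
        ALNUM + "(?:[._+&-]" + ALNUM + ")*@" + ALNUM + "(?:[.-]" + ALNUM + ")+");

    // True if the rule covers the whole input, i.e. one EMAIL token.
    static boolean isSingleToken(Pattern rule, String input) {
        return rule.matcher(input).matches();
    }

    // First substring the rule matches, or null if there is none.
    static String firstMatch(Pattern rule, String input) {
        Matcher m = rule.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Made-up addresses in the two styles described in this report.
        String[] samples = { "user+tag@example.com", "a&b@example.net" };
        for (String s : samples) {
            System.out.println(s
                + " old rule matches: " + firstMatch(OLD_EMAIL, s)
                + ", new rule matches: " + firstMatch(NEW_EMAIL, s));
        }
    }
}
```

With the old pattern, only the part after the "+" or "&" can be matched as an
email, which mirrors the splitting described above; the new pattern covers the
whole address.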
I'm aware that the StandardTokenizer is meant to be a basic implementation
rather than an implementation of the full standard, but it is quite useful in
places and hopefully this would improve it slightly.