some valid email address characters not correctly recognized
------------------------------------------------------------

                 Key: LUCENE-1556
                 URL: https://issues.apache.org/jira/browse/LUCENE-1556
             Project: Lucene - Java
          Issue Type: Bug
          Components: Analysis
    Affects Versions: 2.4.1
            Reporter: Paul Nilsson
            Priority: Trivial

The EMAIL expression in StandardTokenizerImpl.jflex misses some unusual but valid characters in the left-hand side of the email address. This causes an address to be broken into several tokens, for example:

somename+s...@gmail.com gets broken into "somename" and "s...@gmail.com"
husband&w...@talktalk.net gets broken into "husband" and "w...@talktalk.net"

These seem to be occurring more often. The first seems to be because of an anti-spam trick you can use with Google (see: http://labnol.blogspot.com/2007/08/gmail-plus-smart-trick-to-find-block.html). I see the second in several domains, but a disproportionate number are from talktalk.net, so I expect it's a signup suggestion from the service.

Perhaps a fix would be to change line 102 of StandardTokenizerImpl.jflex from:

    EMAIL = {ALPHANUM} (("."|"-"|"_") {ALPHANUM})* "@" {ALPHANUM} (("."|"-") {ALPHANUM})+

to

    EMAIL = {ALPHANUM} (("."|"-"|"_"|"+"|"&") {ALPHANUM})* "@" {ALPHANUM} (("."|"-") {ALPHANUM})+

I'm aware that the StandardTokenizer is meant to be a basic implementation rather than an implementation of the full standard, but it is quite useful in places, and hopefully this would improve it slightly.
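
To make the effect of the proposed change easier to see, here is a rough sketch that mirrors the two EMAIL rules with java.util.regex instead of JFlex. It is only an approximation: the class EmailRuleSketch and its members are hypothetical names, ALPHANUM is assumed to mean "one or more Unicode letters or digits" (the real macro in the grammar may differ), the addresses are made-up examples, and a regex match is not the same thing as the tokenizer's longest-match behaviour across all of its rules.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: approximates the EMAIL rule
// from StandardTokenizerImpl.jflex with java.util.regex.
final class EmailRuleSketch {

    // Assumption: ALPHANUM is approximated as a run of Unicode letters/digits.
    private static final String ALPHANUM = "[\\p{L}\\p{N}]+";

    // Builds: ALPHANUM (sep ALPHANUM)* "@" ALPHANUM (("."|"-") ALPHANUM)+
    // where sep is any one of the given local-part separator characters.
    static Pattern emailRule(String localSeparators) {
        String local = ALPHANUM + "(?:[" + localSeparators + "]" + ALPHANUM + ")*";
        String domain = ALPHANUM + "(?:[.\\-]" + ALPHANUM + ")+";
        return Pattern.compile(local + "@" + domain);
    }

    // Separators accepted today: "." "-" "_"
    static final Pattern CURRENT = emailRule("._\\-");

    // Separators accepted after the proposed change: "." "-" "_" "+" "&"
    static final Pattern PROPOSED = emailRule("._+&\\-");

    // Returns the first substring the rule would accept as an address,
    // or null if there is none.
    static String firstMatch(Pattern rule, String text) {
        Matcher m = rule.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Made-up addresses using the two separators discussed in the report.
        String[] samples = { "user+tag@example.com", "alice&bob@example.com" };
        for (String s : samples) {
            System.out.println(s
                + " | current rule sees: " + firstMatch(CURRENT, s)
                + " | proposed rule sees: " + firstMatch(PROPOSED, s));
        }
    }
}
```

Under these assumptions, the current rule only picks up the part after the "+" or "&" as an address (for example "tag@example.com"), which matches the split described in the report, while the proposed rule accepts the whole string. A separator still has to be followed by an ALPHANUM run, so something like "user+@example.com" is not accepted as a whole by either rule.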