StandardAnalyzer and Email Addresses

Charlie Hubbard Thu, 16 Feb 2012 09:19:11 -0800

This is a pretty simple question to answer, but I have customers asking me
how this is supposed to work and I'm having trouble explaining it.  I have
an app that indexes emails, so there are plenty of email addresses in there.
Reading the StandardAnalyzer javadoc, it says it "recognizes" email
addresses when it is creating the token list.  What tokens will it produce
exactly?  What I'm seeing when I perform searches is that the email address
looks like it's being tokenized into its parts.  Searching by an email
address like:


to:charlie.hubb...@gmail.com

pulls back more hits than just the messages addressed to
charlie.hubb...@gmail.com.  Other messages with gmail.com in them are
also returned.  If I use the following:

to:charlie.hubbard

it finds messages with charlie.hubbard in them, both at gmail.com and at
other domains.  And if I search for a quoted string like

to:"charlie.hubb...@gmail.com"

it will pull back only emails addressed to that address.  Further evidence
that it tokenizes the parts of an email address is what happens if I search
for a very specific partial address like:

to:"charlie.hubbard+sometag"

That will pull back only emails addressed to that address, even though the
query is not a full email address, which leads me to think it indexes the
parts of email addresses separately.  Can someone explain this a little more?
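
To make that hypothesis concrete, here is a toy model in plain Java.  It is
not Lucene code and does not reproduce StandardAnalyzer's real rules: the
class EmailTokenSketch, its methods, the split at '@', '.' and '+', the
OR behaviour for unquoted queries, and the example.com addresses are all
assumptions made up for illustration.  It only shows that, if addresses were
split into parts, an unquoted query would match other messages at the same
domain while a quoted query would not.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class EmailTokenSketch {
    // Hypothetical stand-in for an analyzer that breaks an address into
    // parts: lower-case the text and split at '@', '.', '+' and whitespace.
    static List<String> splitIntoParts(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String part : text.toLowerCase().split("[@.+\\s]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Assumed behaviour of an unquoted query: a hit if ANY query token
    // occurs in the field.
    static boolean matchesAnyTerm(String field, String query) {
        List<String> fieldTokens = splitIntoParts(field);
        for (String q : splitIntoParts(query)) {
            if (fieldTokens.contains(q)) {
                return true;
            }
        }
        return false;
    }

    // Assumed behaviour of a quoted query: a hit only if the query tokens
    // occur consecutively and in order in the field.
    static boolean matchesPhrase(String field, String query) {
        List<String> fieldTokens = splitIntoParts(field);
        List<String> queryTokens = splitIntoParts(query);
        return !queryTokens.isEmpty()
                && Collections.indexOfSubList(fieldTokens, queryTokens) >= 0;
    }

    public static void main(String[] args) {
        String other = "bob@example.com";                   // hypothetical
        String mine = "alice.smith+sometag@example.com";    // hypothetical

        // Unquoted: "example" and "com" are shared, so the other message hits.
        System.out.println(matchesAnyTerm(other, "alice.smith@example.com"));
        // Quoted: the full token sequence is absent from the other message.
        System.out.println(matchesPhrase(other, "alice.smith@example.com"));
        // Quoted partial address: the three leading tokens are adjacent.
        System.out.println(matchesPhrase(mine, "alice.smith+sometag"));
    }
}
```

Under those assumptions the three lines printed are true, false and true,
which lines up with the search results described above.  It says nothing
about which tokens the real analyzer emits.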

I'm also having trouble with some emails that can't be pulled back by
username.  For example, searching for to:chubbard where the email was
addressed to chubb...@somedomain.com fails to show that email in the search
results.  I can't explain why that's happening.  In all of my tests I can't
reproduce it, and I think I might have to reindex everything: this index was
built with 2.4 and I upgraded to 3.1, so I'm worried it might be corrupted.

Thoughts?
