This is probably a simple question to answer, but I have customers asking me how this is supposed to work and I'm having trouble explaining it. I have an app that indexes emails, so there are plenty of email addresses in the index. The StandardAnalyzer javadoc says it "recognizes" email addresses when it creates the token list. What tokens will it produce, exactly? What I'm seeing when I perform searches is that the email address looks like it's being tokenized into its parts. A search for an email address like:
to:charlie.hubb...@gmail.com pulls back extra hits that were not addressed to charlie.hubb...@gmail.com. Other messages with gmail.com in them are returned as well. If I search for to:charlie.hubbard, I get messages with charlie.hubbard in them, at gmail.com and at other domains. If I search for a quoted string like to:"charlie.hubb...@gmail.com", it pulls back only emails addressed to that address.

Further evidence that it seems to tokenize the parts of an email address: if I search for something very specific like to:"charlie.hubbard+sometag", that pulls back only emails addressed to that address, even though the query is not a full email address. That leads me to think it is splitting email addresses into parts. Can someone explain this a little more?

I'm also having trouble with some emails that can't be found by username. For example, searching for to:chubbard, where the email was addressed to chubb...@somedomain.com, fails to return that email. I can't explain why that's happening, and I can't reproduce it in any of my tests. I think I might have to reindex everything, because this index was built with 2.4 and I have since upgraded to 3.1, so I'm worried it might be corrupted. Thoughts?