Hi,
We are developing a very large Lucene index that has a
field of email addresses. (Actually multiple fields with multiple
email addresses, but I'll simplify here.)
Each document will have one "email" field containing multiple email addresses.
I am indexing the email addresses using only WhitespaceAnalyzer, so as
to preserve the exact addresses and store multiple emails for one
document.
Example...
doc.add(new Field("email", "[email protected] [email protected] [email protected]",
Field.Store.YES, Field.Index.ANALYZED ));
Terms for this document will then be...
email:[email protected]
email:[email protected]
email:[email protected]
The problem I'm having is that these terms are rarely reused in other
documents. There is little overlap in email usage, and there are a
lot of very long email addresses. Because of this, the number of
terms in my index is very large, and I think it is causing performance
issues and bloating the index.
I think I'm not using Lucene optimally here.
A couple of questions...
1) Is there a way I can analyze these emails down to smaller terms but
still search for the exact email address? For instance, if I used a
different analyzer and broke these down to the terms "foo", "bar", and
"com", is Lucene able to find "email:[email protected]" without matching
"email:[email protected]"?
2) Does Lucene retain the positional information of tokens in the
index? Knowing this will help me answer question 1.
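
To make question 1 concrete, here is a toy sketch in plain Java of the
behaviour I am hoping for. It is not Lucene code and makes no claim about
how Lucene works internally. The class PositionalIndexSketch, its
tokenizer, and the sample addresses are all made up for illustration
(the addresses are guessed from the "foo", "bar", "com" terms above).
The idea is that each address is split into small terms, the position of
every term is recorded, and an address only matches when its terms occur
in the same order at consecutive positions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy positional inverted index for a single field (illustration only).
class PositionalIndexSketch {
    // term -> (docId -> positions at which the term occurs in that doc)
    private final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();

    // Hypothetical tokenizer: lowercase, split on anything that is not a letter or digit.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Index a field value, recording the position of every token.
    public void add(int docId, String fieldValue) {
        List<String> tokens = tokenize(fieldValue);
        for (int pos = 0; pos < tokens.size(); pos++) {
            postings.computeIfAbsent(tokens.get(pos), k -> new HashMap<>())
                    .computeIfAbsent(docId, k -> new ArrayList<>())
                    .add(pos);
        }
    }

    // True if the tokens of the phrase occur at consecutive positions in the doc.
    public boolean matchesPhrase(int docId, String phrase) {
        List<String> terms = tokenize(phrase);
        if (terms.isEmpty()) {
            return false;
        }
        Map<Integer, List<Integer>> first = postings.get(terms.get(0));
        if (first == null || !first.containsKey(docId)) {
            return false;
        }
        for (int start : first.get(docId)) {
            boolean ok = true;
            for (int i = 1; i < terms.size() && ok; i++) {
                Map<Integer, List<Integer>> p = postings.get(terms.get(i));
                ok = p != null && p.containsKey(docId)
                        && p.get(docId).contains(start + i);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    // Number of distinct terms held by the index.
    public int termCount() {
        return postings.size();
    }
}
```

With a document indexed as "foo@bar.com baz@example.org", a phrase
lookup for "foo@bar.com" succeeds while "bar@foo.com" does not, even
though both share the terms "foo", "bar", and "com". One caveat I can see
in this sketch: a phrase could match across the boundary between two
neighbouring addresses in the same field, so some kind of position gap
between addresses would be needed.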
Thanks,
Phil