: As far as indexing goes index each address in a separate un-tokenized : field not space delimited in a single field. It is also useful to put : the To; CC and BCC in a single field to enable you to search to email
INdexing email isn't something i've had to think about a lot in my life .. but if i were going to do it i would certianly have both header specific fields as well as a "recpipients" field containing To/Cc/Bcc and a "participants" field that also contained the From/Sender/X-Sender. I would add each address as a seperate Field instance, using a custom "EmailAnalyzer" with a really high position incriment gap. The Analyzer should index both the full input with no tokenization, as well as the input split on the @ symbol, and the input tokenized on any character in the set "_-.+" to the left of the @ and on "." to the right of the @ ... BUT: not the last "." So for the input "java-user@lucene.apache.org" the following tokenstream would be created... java-user@lucene.apache.org java user java-user lucene.apache.org lucene apache.org : you have sent to a person. I would also recommend you do some processing : on the Subject field to remove FW and RE this will allow you to search : by subject and pick up all emails in the thread. There are essays and essays and more essays on detecting/infering threads in email ... as I recall, JWZ has really written the definitive guide for this. -Hoss --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]