RE: indexing emails

Rob Staveley (Tom) Sat, 17 Jun 2006 00:42:42 -0700

Having spent a lot of time getting this wrong myself in an e-mail
indexer(!), I urge you to consider whether in your query interface you will
need to look for mail to "john*" rather than [EMAIL PROTECTED], because "john*"
may have been addressed to [EMAIL PROTECTED] or [EMAIL PROTECTED] If you index
only [EMAIL PROTECTED] (untokenised) you will have to use a PrefixQuery to look
for "john*", and you are liable to hit BooleanQuery.TooManyClauses problems,
if you have more than 1024 (or BooleanQuery.getMaxClauseCount()) e-mail
addresses in your index starting with "john".

I'm trying to figure out a good design for this now for my own e-mail
indexing application, considering also whether I should cater for searches
for "*smith*". I'm coming round to the realisation that WildCardQuery and
PrefixQuery are not great things to depend upon for getting e-mail addresses
from an index and the right thing to do is to break the address up into
natural tokens ('.' or '-') in one field and leave them intact in another
field. It isn't ideal; e-mail addresses with no separator between initials
or first names and last name still need a PrefixQuery or WildcardQuery, if
you want to search for last names, but it does make some queries possible
which would otherwise blow up.

-----Original Message-----
From: karl wettin [mailto:[EMAIL PROTECTED] 
Sent: 16 June 2006 21:13
To: java-user@lucene.apache.org
Subject: Re: indexing emails

On Fri, 2006-06-16 at 15:20 -0400, Michael J. Prichard wrote:
> I am working on indexing emails and want to have a "to" field.  I am 
> currently putting all the emails on one line seperated w/
spaces...example:
> 
> [EMAIL PROTECTED] [EMAIL PROTECTED] [EMAIL PROTECTED]
> 
> Then i index that with a StandardAnalyzer as follows:
> 
> doc.add(new Field("to", (String) itemContent.get("to"), 
> Field.Store.YES, Field.Index.UN_TOKENIZED));
> 
> Question is...is this the best way to do it?  I want to be able to 
> search for [EMAIL PROTECTED] and pick out just those Documents, etc.

You can either do it as above (but you want to TOKENIZE the field) or you
could create a new UN_TOKENIZED field for each email address.

The second will require less CPU as it does not involve any lexical
analysis. It will also create larger distance between the addresses in the
index (see span queries and term positions).

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

smime.p7s
Description: S/MIME cryptographic signature

RE: indexing emails

Reply via email to