Hi John,

Thanks for the detailed answer.

You wrote:
> If you're indexing a multipart/alternative bodypart then index all the
> MIME headers, but only index the content of the *first* bodypart.

Does this mean you index just the first file attachment?
What do you advise if you have to index multipart bodies (== more than one
file attachment)?
One Lucene document for each part (== file)?
How do you handle the queries?

Greetings
lude



On 8/15/06, John Haxby <[EMAIL PROTECTED]> wrote:

lude wrote:
> Does anybody have an idea what the best design approach is for realizing
> the following:
>
> The goal is to index emails and their corresponding file attachments.
> One email could contain for example:
I put a fair amount of thought into this when I was doing the design for
our mail server -- I know about mail :-)   After a little trial and
error I came up with the following scheme:

  1. All header fields indexed under their own name with the name
     converted to lower case.
  2. Almost all bodyparts indexed in a single field called BODY (in
     upper case)
  3. Meta-data such as SIZE, DELIVERY-DATE and similar indexed with
     uppercase fields
  4. Extensions for other bodypart-specific or application-specific
     fields indexed as something with an initial uppercase letter and
     at least one lowercase letter

That gives an extensible set of fields and does not require that the index
knows ahead of time what header fields will be present or relevant.   It
means that there are potentially a lot of fields: we're running at about
60 depending on the user.
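
To make the naming convention concrete, here is a minimal sketch in plain Java. The class, enum and method names (FieldNames, Kind, forHeader, classify) are hypothetical and only illustrate the four rules above; "Attachment-name" is an invented example of an extension field, not one taken from the real index.

```java
import java.util.Locale;

/** Sketch of the field-naming rules; all names here are hypothetical. */
final class FieldNames {
    enum Kind { HEADER, BUILT_IN, EXTENSION, UNKNOWN }

    /** Rule 1: a header field is indexed under its own name, lower-cased. */
    static String forHeader(String headerName) {
        return headerName.trim().toLowerCase(Locale.ROOT);
    }

    /** Classifies an index field name purely by the case of its letters. */
    static Kind classify(String field) {
        if (field.isEmpty()) {
            return Kind.UNKNOWN;
        }
        boolean hasUpper = false;
        boolean hasLower = false;
        for (char c : field.toCharArray()) {
            if (Character.isUpperCase(c)) hasUpper = true;
            if (Character.isLowerCase(c)) hasLower = true;
        }
        if (hasLower && !hasUpper) {
            return Kind.HEADER;     // e.g. "subject", "message-id"
        }
        if (hasUpper && !hasLower) {
            return Kind.BUILT_IN;   // rules 2 and 3: "BODY", "SIZE", "DELIVERY-DATE"
        }
        if (Character.isUpperCase(field.charAt(0))) {
            return Kind.EXTENSION;  // rule 4: initial capital plus a lowercase letter
        }
        return Kind.UNKNOWN;        // mixed case without an initial capital
    }
}
```

Because the kind of a field follows from its case alone, a new header or extension field can show up at any time without clashing with the built-in names.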

Some header fields are special.   The various message-id fields
(Message-Id, Resent-Message-Id, In-Reply-To and References) need to have
their message-ids carefully extracted and then indexed untokenized.
Recipient fields (to, cc, from, etc) need to be parsed and then have their
addresses re-assembled as a friendly-name and an RFC822 address -- the
reason for the re-assembly is that addresses can be presented in
equivalent but odd fashions.   Most header fields can have RFC2047
encoded text which needs to be decoded.
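
As an illustration of the message-id step, the sketch below pulls every <...> token out of a header value so that each id can be indexed as a single untokenized term. It is a simplification under stated assumptions: the regular expression is not a full RFC2822 msg-id parser (it ignores quoted strings and comments containing angle brackets), stripping the angle brackets is a choice made here, and the MessageIds class is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts message-ids from a header value. */
final class MessageIds {
    // Simplified pattern: text between '<' and '>' with no whitespace
    // and no nested angle brackets.
    private static final Pattern MSG_ID = Pattern.compile("<([^<>\\s]+)>");

    /** Returns the ids in order of appearance, without angle brackets. */
    static List<String> extract(String headerValue) {
        List<String> ids = new ArrayList<>();
        Matcher m = MSG_ID.matcher(headerValue);
        while (m.find()) {
            ids.add(m.group(1));
        }
        return ids;
    }
}
```

A References header holding several ids, possibly folded over multiple lines, then yields one term per id, so a search on any one of them finds the message.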

When indexing the bodyparts you need to be a little careful.   In
general, the MIME headers for each part are all indexed as other message
headers (content-id is a message-id field) and I also indexed the
canonical content type under a CONTENT-TYPE field, again to get rid of
fluff so that I can search for, say,
CONTENT-TYPE:application/vnd.ms-powerpoint to find all those annoyingly
huge messages :-)  An attached message probably doesn't want all its
headers indexed: subject is good; recipients are probably bad as it'll
confuse the normal search and give unexpected results; message-id fields
are almost certainly a bad idea.  If you're indexing a
multipart/alternative bodypart then index all the MIME headers, but only
index the content of the *first* bodypart.
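
The bodypart walk can be sketched roughly as follows. The Part class is a hypothetical stand-in for a parsed MIME part, not a JavaMail type, and text extraction from attachments is assumed to have happened already. Treating every other multipart/* container as "index all children" is an assumption based on "almost all bodyparts" going into BODY; only the multipart/alternative rule comes directly from the description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Sketch: collects the text destined for the BODY field. */
final class BodyCollector {

    /** Hypothetical stand-in for a parsed MIME part. */
    static final class Part {
        final String contentType;   // canonical type, e.g. "text/plain"
        final String text;          // extracted text for leaf parts, else null
        final List<Part> children;  // sub-parts of a multipart/*, else empty

        Part(String contentTypeHeader, String text, Part... children) {
            this.contentType = canonicalType(contentTypeHeader);
            this.text = text;
            this.children = Arrays.asList(children);
        }
    }

    /** Strips parameters ("fluff") and lower-cases the type/subtype. */
    static String canonicalType(String contentTypeHeader) {
        int semi = contentTypeHeader.indexOf(';');
        String type = semi < 0 ? contentTypeHeader
                               : contentTypeHeader.substring(0, semi);
        return type.trim().toLowerCase(Locale.ROOT);
    }

    /** Returns the texts to index under BODY, in document order. */
    static List<String> bodyTexts(Part root) {
        List<String> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    private static void collect(Part part, List<String> out) {
        if (part.contentType.equals("multipart/alternative")) {
            // The alternatives are renderings of the same content,
            // so only the first one is indexed.
            if (!part.children.isEmpty()) {
                collect(part.children.get(0), out);
            }
        } else if (part.contentType.startsWith("multipart/")) {
            // Assumption: other containers (e.g. multipart/mixed) index every part.
            for (Part child : part.children) {
                collect(child, out);
            }
        } else if (part.text != null) {
            out.add(part.text);
        }
    }
}
```

With this reading, a multipart/mixed message holding a text/plain plus text/html alternative and two attachments contributes the plain text and both attachment texts to BODY, while the HTML duplicate is skipped.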

Does that all make sense?  JavaMail is great for this, it's good at
parsing and extracting the content of messages.  However, it's not
enough to just read what I've said and the JavaMail doc.   You also need
to be intimately familiar with the MIME RFCs (I think the first one is
RFC2045, but they're not difficult to find as they're all around RFC2047)
and RFC2822, the message structure RFC itself.   If you just guess
because the structure is "obvious" you'll come unstuck.

jch

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

