Database mbox index design for headers containing email addresses

sebb Fri, 12 Nov 2021 07:19:40 -0800

Email addresses in headers may consist of two parts: the email address
itself, and the (optional) textual real name.


Each of these headers is currently stored as a single string, taken
directly from the email source.
This string may include multiple addresses, separated by commas and
(usually) line breaks.

I think we are agreed that these headers should be stored as an array
of individual addresses, rather than a single string.

Given that we would need to parse the string in order to generate the
arrays, I think it would make sense to further split each entry into
email address and real name. This would simplify anonymisation, and
there is already a Python method to do exactly this.

Note that splitting the header value is not just a matter of looking
for commas, as they may appear in quoted real names. The syntax is
complicated, so the parsing is best left to the mail library.
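
For illustration, here is a minimal sketch of the split using
email.utils.getaddresses from the Python standard library. I am
assuming this is the kind of method meant above. The helper name
split_address_header and the "name"/"address" keys are made up for
the example and are not existing code.

```python
from email.utils import getaddresses


def split_address_header(value):
    """Split a raw address header into a list of {name, address} dicts.

    Hypothetical helper: the dict keys are illustrative only.
    """
    # Unfold continuation lines so the header is a single logical line.
    unfolded = " ".join(value.splitlines())
    entries = []
    for name, address in getaddresses([unfolded]):
        if not address:
            # Skip entries the parser could not make sense of.
            continue
        entry = {"address": address}  # stored without the <> wrapper
        if name:
            entry["name"] = name      # real name only if present
        entries.append(entry)
    return entries


raw = '"Doe, Jane" <jane@example.org>,\n Bob <bob@example.org>, carol@example.org'
for entry in split_address_header(raw):
    print(entry)
```

The comma inside the quoted real name "Doe, Jane" stays within one
entry, which a plain split on commas would get wrong.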

I think this would mean a change to the database so that each address
in these headers is stored as multiple fields:
- real name (if present)
- email address (excluding <> wrapper)

Not sure it is worth storing the re-combined address with both parts,
unless it is needed for searching. What would be the best ES structure
to use?
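
As a starting point for discussion, one option might be an ES
"nested" field per header, with the real name as "text" and the
address as "keyword". The sketch below only builds the mapping as a
Python dict. The sub-field names and the address_mappings helper are
assumptions for illustration, not the current mbox mapping, and I have
not tested this against the existing index.

```python
import json

# One possible shape for a single address entry (assumed names).
ADDRESS_FIELD = {
    "type": "nested",  # keeps each name/address pair together for queries
    "properties": {
        "name": {"type": "text"},        # full-text search on real names
        "address": {"type": "keyword"},  # exact match on the address
    },
}

# Headers listed in this proposal.
ADDRESS_HEADERS = ("from", "to", "cc", "sender")


def address_mappings(headers=ADDRESS_HEADERS):
    """Build a mapping fragment giving every header the same structure."""
    return {"properties": {h: dict(ADDRESS_FIELD) for h in headers}}


print(json.dumps(address_mappings(), indent=2))
```

With "nested", a query for a given name and address only matches when
both belong to the same entry. If that is not needed, a plain object
field would be simpler.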

This would affect the following headers:
- from
- cc
- to
- sender (not currently stored in mbox, but I think it is needed)

There are some other headers that contain email addresses, but they
are not currently stored in the mbox index. If it becomes necessary,
they should use the same structure.

WDYT?

Sebb.
