Email addresses in headers may consist of two parts: the email address itself, and the (optional) textual real name.
These headers are currently stored as a single string, derived directly from the email source. This string may include multiple addresses, separated by commas and (usually) new lines.

I think we are agreed that these headers should be stored as an array of single emails, rather than a single string.

Given that we would need to parse the string in order to generate the arrays, I think it would make sense to further split the emails into mail address and real name. This would simplify anonymisation and there is already a Python method to do exactly this.

Note, splitting the header value is not just a matter of looking for commas, as they may appear in quoted real names. It is a complicated syntax, so is best left to the mail library.

I think this would mean a change to the database to treat the headers as multiple fields.
- real name (if present)
- email address (excluding <> wrapper)

Not sure it is worth storing the re-combined address with both parts, unless it is needed for searching.

What would be the best ES structure to use?

This would affect the following headers:
- from
- cc
- to
- sender (not currently stored in mbox, but I think it is needed)

There are some other headers that contain email addresses, but they are not currently stored in the mbox index. If it becomes necessary, they should use the same structure.

WDYT?

Sebb.
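
For illustration, here is a minimal sketch of the split described above. It assumes the Python method in question is email.utils.getaddresses from the standard library, which may not be the one actually intended. The function name split_addresses and the keys "name" and "address" are made up for this example and are not existing field names.

```python
from email.utils import getaddresses


def split_addresses(header_value):
    """Split a raw address header into a list of {name, address} dicts.

    Hypothetical helper: the function name and dict keys are
    illustrative only, not part of any existing schema.
    """
    # Join folded (multi-line) header values into a single line.
    unfolded = " ".join(header_value.splitlines())
    entries = []
    # getaddresses() copes with commas inside quoted real names and
    # returns (real name, address) pairs without the <> wrapper.
    for name, address in getaddresses([unfolded]):
        if not address:
            continue  # skip anything the parser could not read
        entries.append({"name": name or None, "address": address})
    return entries


if __name__ == "__main__":
    raw = '"Doe, John" <john@example.org>,\n Jane Roe <jane@example.org>, bare@example.org'
    for entry in split_addresses(raw):
        print(entry)
```

Each header would then hold a list of small objects rather than one string. A missing real name is stored as None here, which is only one possible choice.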
