Re: Are all emails properly archived?

Daniel Gruno Sat, 19 Mar 2022 06:39:07 -0700

On 19/03/2022 12.48, sebb wrote:

AFAICT, all distinct email sources are currently stored in the
database, because the id is derived from a hash of the source. (*)


However, that does not mean that they are recoverable.

The current database design requires that source entries are retrieved
via the corresponding mbox entry.
If a second email is received that hashes to the same mbox index, the
pointer back to the existing source entry will be overwritten.

Such duplicates are not unknown; mail transport glitches can result in
duplication of email content (but different ezmlm archive numbers and
some other headers).

In such cases, it is no longer possible to recover the original source.
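
To illustrate the overwrite described above, here is a minimal sketch
that models the two indices as plain Python dicts. It is not Pony Mail's
actual code: the names source_index, mbox_index, archive and the
"source_id" field are hypothetical, as is the X-Archive-Number header,
and SHA-256 merely stands in for whatever ID generator is configured.

```python
import hashlib

# Hypothetical stand-ins for the two indices: plain dicts keyed by document id.
source_index = {}
mbox_index = {}

def archive(raw: bytes, body: str):
    """Store the raw source, then the mbox entry that points back to it."""
    # The source id is a hash of the full raw message, so every distinct
    # source gets its own entry.
    sid = hashlib.sha256(raw).hexdigest()
    source_index[sid] = raw
    # The mbox id is modelled here as a hash of the content only, so two
    # deliveries of the same content collide on this key.
    mid = hashlib.sha256(body.encode("utf-8")).hexdigest()
    # Plain assignment: a second delivery replaces the earlier pointer.
    mbox_index[mid] = {"body": body, "source_id": sid}
    return mid, sid

# Same content delivered twice, differing only in an (assumed) header.
first = b"X-Archive-Number: 101\n\nSame body\n"
second = b"X-Archive-Number: 102\n\nSame body\n"
mid1, sid1 = archive(first, "Same body\n")
mid2, sid2 = archive(second, "Same body\n")

# Sources that no mbox entry points to any more.
reachable = {doc["source_id"] for doc in mbox_index.values()}
orphaned = set(source_index) - reachable
```

Both sources are still stored, but only the second is reachable through
the mbox entry; the first ends up in orphaned.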

I think we discussed this before. One solution is to change the
behavior at
https://github.com/apache/incubator-ponymail-foal/blob/master/tools/archiver.py#L728
- if an email is found to already exist with the same DKIM ID, we should
fetch it and append the new mbox_source ID to the existing document. As
ElasticSearch doesn't care if something is a string or an array of
strings, this should be fine. A check for the source could then perhaps
result in an HTTP 300 Multiple Choices response?
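
A rough sketch of that idea, using a plain dict in place of the mbox
index so it runs without a server. The function names and the
"source_id" field are assumptions for illustration, not the code in
archiver.py, and the status codes only mirror the suggestion above.

```python
def add_source_pointer(mbox_index: dict, mid: str, sid: str) -> None:
    """Append sid to the entry for mid instead of overwriting it."""
    doc = mbox_index.get(mid)
    if doc is None:
        # First time this content is seen: store a single string.
        mbox_index[mid] = {"source_id": sid}
        return
    existing = doc["source_id"]
    # The field may hold one string or a list of strings.
    ids = list(existing) if isinstance(existing, list) else [existing]
    if sid not in ids:
        ids.append(sid)
    doc["source_id"] = ids if len(ids) > 1 else ids[0]

def source_lookup(mbox_index: dict, mid: str):
    """Return (status, source ids): 404 unknown, 200 unique, 300 several."""
    doc = mbox_index.get(mid)
    if doc is None:
        return 404, []
    existing = doc["source_id"]
    ids = list(existing) if isinstance(existing, list) else [existing]
    return (300 if len(ids) > 1 else 200), ids

# Usage: two deliveries of the same content, then one unrelated lookup.
index = {}
add_source_pointer(index, "mid-a", "src-1")
status_single = source_lookup(index, "mid-a")
add_source_pointer(index, "mid-a", "src-2")
status_multi = source_lookup(index, "mid-a")
status_missing = source_lookup(index, "mid-b")
```

With this shape, the earlier source pointer survives a duplicate
delivery, and a caller can tell a unique source from an ambiguous one.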


I think this could be fixed, but until it is, I don't think Pony Mail
can be considered as a complete archival application, as it does not
give access to all the emails received by a mailing list.

We have to manage expectations and define what we mean by "archive". In
my world, Pony Mail exists as a searchable/interactive archive for users
to find _content_ and _intentions_, not necessarily as a bit-for-bit
verbatim backup for system administrators. If people wish to insure
against disasters, there are ways of doing that.

I find it sufficient that you can find your email in the right place;
it does not matter if it's technically a "de-duplicated duplicate".


Sebb
(*) discounting hash collisions, which should be vanishingly small
