On 19/03/2022 12.48, sebb wrote:
AFAICT, all distinct email sources are currently stored in the
database, because the id is derived from a hash of the source. (*)

However, that does not mean that they are recoverable.

The current database design requires that source entries are retrieved
via the corresponding mbox entry.
If a second email is received that hashes to the same mbox index, the
pointer back to the existing source entry will be overwritten.

Such duplicates are not unknown; mail transport glitches can result in
duplication of email content (but different ezmlm archive numbers and
some other headers).

In such cases, it is no longer possible to recover the original source.

I think we discussed this before. One solution is to change the behavior at https://github.com/apache/incubator-ponymail-foal/blob/master/tools/archiver.py#L728 - if an email is found to already exist with the same DKIM ID, we should fetch it and append the new mbox_source ID to the existing document. As Elasticsearch doesn't care whether a field holds a string or an array of strings, this should be fine. A check for the source could then perhaps result in an HTTP 300 Multiple Choices response?
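
To make the idea concrete, here is a rough sketch of append-instead-of-overwrite. It uses plain in-memory dicts as stand-ins for the mbox and source indices rather than real Elasticsearch calls, and the names `archive`, `source_lookup` and the `source_id` field are made up for illustration; they are not what archiver.py actually uses.

```python
import hashlib

# In-memory stand-ins for the two indices (assumption: illustration only,
# the real archiver talks to Elasticsearch).
mbox = {}    # document id (e.g. the DKIM-style ID) -> mbox document
source = {}  # hash of raw source -> raw source bytes


def archive(doc_id: str, raw_source: bytes) -> None:
    """Store the raw source and link it from the mbox document.

    If the mbox document already exists, the new source id is appended
    to the existing pointer(s) instead of replacing them.
    """
    source_id = hashlib.sha256(raw_source).hexdigest()
    source[source_id] = raw_source

    existing = mbox.get(doc_id)
    if existing is None:
        # First time we see this document: keep a plain string pointer.
        mbox[doc_id] = {"source_id": source_id}
        return

    known = existing["source_id"]
    if isinstance(known, str):
        known = [known]
    if source_id not in known:
        known.append(source_id)
    # A single pointer stays a string, several become a list.
    existing["source_id"] = known if len(known) > 1 else known[0]


def source_lookup(doc_id: str):
    """Return (HTTP-style status, list of source ids) for a document."""
    doc = mbox.get(doc_id)
    if doc is None:
        return 404, []
    ids = doc["source_id"]
    if isinstance(ids, str):
        ids = [ids]
    # More than one candidate source: 300 Multiple Choices.
    return (200 if len(ids) == 1 else 300), ids
```

With something along these lines, a second delivery of the same content no longer hides the first source: both ids stay reachable from the mbox document, and the lookup can tell the caller that there is more than one to choose from.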


I think this could be fixed, but until it is, I don't think Pony Mail
can be considered as a complete archival application, as it does not
give access to all the emails received by a mailing list.

We have to manage expectations and define what we mean by "archive". In my world, Pony Mail exists as a searchable/interactive archive for users to find _content_ and _intentions_, not necessarily as a bit-for-bit verbatim backup for system administrators. If people wish to insure against disasters, there are ways of doing that.

As long as you can find your email in the right place, I find that sufficient; it does not matter if it's technically a "de-duplicated duplicate".



Sebb
(*) discounting hash collisions, which should be vanishingly small
