On Sat, 19 Mar 2022 at 13:39, Daniel Gruno <[email protected]> wrote:
>
> On 19/03/2022 12.48, sebb wrote:
> > AFAICT, all distinct email sources are currently stored in the
> > database, because the id is derived from a hash of the source. (*)
> >
> > However, that does not mean that they are recoverable.
> >
> > The current database design requires that source entries are retrieved
> > via the corresponding mbox entry.
> > If a second email is received that hashes to the same mbox index, the
> > pointer back to the existing source entry will be overwritten.
> >
> > Such duplicates are not unknown; mail transport glitches can result in
> > duplication of email content (but different ezmlm archive numbers and
> > some other headers).
> >
> > In such cases, it is no longer possible to recover the original source.
>
> I think we discussed this before. One solution is to change the behavior
> at
> https://github.com/apache/incubator-ponymail-foal/blob/master/tools/archiver.py#L728
> - if an email is found to already exist with the same DKIM ID, we should
> fetch it and append the new mbox_source ID to the existing document. As
> ElasticSearch doesn't care if something is a string or an array of
> strings, this should be fine. A check for the source could then perhaps
> result in an HTTP 300 Multiple Choices response?

Yes, something like that should fix it.
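
As a rough sketch of the append-instead-of-overwrite idea: the code below
uses plain Python dicts as stand-ins for the mbox and source indices. The
names (archive, source_lookup, mbox_index, source_index) are made up for
illustration and are not taken from archiver.py, and SHA-256 is only an
assumed hash for the source id.

```python
import hashlib

# Hypothetical in-memory stand-ins for the mbox and source indices.
mbox_index = {}
source_index = {}


def archive(doc_id, raw_source):
    """Store raw_source and link it from the mbox entry doc_id."""
    # The source id is derived from a hash of the raw source (assumed SHA-256).
    source_id = hashlib.sha256(raw_source).hexdigest()
    source_index[source_id] = raw_source
    # Append to the list of sources instead of overwriting the pointer.
    entry = mbox_index.setdefault(doc_id, {"mbox_source": []})
    if source_id not in entry["mbox_source"]:
        entry["mbox_source"].append(source_id)
    return source_id


def source_lookup(doc_id):
    """Return (status, source ids): 200 for one source, 300 for several."""
    entry = mbox_index.get(doc_id)
    if entry is None:
        return 404, []
    ids = entry["mbox_source"]
    return (200 if len(ids) == 1 else 300), ids
```

With this shape, two deliveries of the same content that differ only in
transport headers end up as two source entries linked from one mbox
entry, and a lookup can report that there is more than one to choose from.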

> >
> > I think this could be fixed, but until it is, I don't think Pony Mail
> > can be considered as a complete archival application, as it does not
> > give access to all the emails received by a mailing list.
>
> We have to manage expectations and define what we mean by "archive". In

Exactly.

I expect an archive of an email list to contain the same emails as I
receive as a subscriber.

> my world, Pony Mail exists as a searchable/interactive archive for users
> to find _content_ and _intentions_, not necessarily as a bit-for-bit
> verbatim backup for system administrators. If people wish to insure
> against disasters, there are ways of doing that.

That is not the point.

> I find it sufficient that as long as you can find your email in the
> right place, it does not matter if it's technically a "de-duplicated
> duplicate".
>

I don't find it sufficient.

If I cannot find all my emails in the archive, I don't consider it complete.

> >
> > Sebb
> > (*) discounting hash collisions, which should be vanishingly rare
>
