On 7 October 2016 at 00:44, sebb <seb...@gmail.com> wrote:
> The id generator is used to create a key for the message database, and
> also to create a Permalink.
>
> Therefore, an id generator needs to fulfil the following design goals
> as a minimum:
> A) different messages have different IDs
> B) the same id is generated if the same message is re-processed
> C) equivalent messages have the same ID
>
> Goal A is needed to ensure that the database can contain every
> different message
> Goal B is needed to ensure that the database can be reloaded from the
> original source if necessary
> Goal C is needed to ensure that the database can be reloaded from an
> equivalent source, and to ensure that Permalinks are stable.
>
> None of the current id generator algorithms meet all of the above goals.
>
> The original and medium generators fail to meet goal A.
> The full generator fails to meet goal B (and therefore C).
>
> A sender can easily generate two messages with identical content; it
> is important to distinguish these.
>
> The Message-Id should help here.
>
> Message-Id is supposed to be unique, but in practice it may not be, so
> some additional fields need to be used to create the database id.
>
> For mailing lists the Return-Path will normally contain a unique id
> which is used to identify bounces.
> In theory this might be sufficient on its own. Indeed the path might
> be usable without hashing. However the early ASF mailing list software
> did not use unique Return-Paths.
>
> The existing mod_mbox solution uses a combination of Message-Id plus
> YYYYMM plus a list identifier. Have there ever been any collisions?

It's certainly possible for mod_mbox to contain multiple messages with
the same id.

For example:

From dev-return-15575-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org Tue Jun  7 20:08:04 2016
and
From dev-return-15577-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org Tue Jun  7 20:24:24 2016

both have the same id:

Message-Id: <bcd87d1a-bfb6-4f2a-92cc-af8f1d16b...@apache.org>

It's exactly the same message, which arrived twice on the same mailing
list about 16 minutes apart.
Maybe it was sent using Bcc as well as To: ?

It's not exactly a collision, but at present mod_mbox is able to store
both, whereas Pony Mail cannot (except when using the full generator,
which has other problems).
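
To make the difference concrete, here is a minimal Python sketch of why
a key built only from Message-Id and list collapses two deliveries of
the same message, while adding the Return-Path keeps them apart. This
is not Pony Mail or mod_mbox code: the function names, the use of
SHA-256 and the header values are all placeholders made up for
illustration.

```python
import hashlib

def _digest(*fields):
    """Hash header values, NUL-separated so adjacent fields cannot run together."""
    joined = "\x00".join(fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def key_without_return_path(msg):
    # Hypothetical key: Message-Id plus list only
    return _digest(msg["Message-Id"], msg["List-Id"])

def key_with_return_path(msg):
    # Hypothetical key: same fields plus the per-delivery Return-Path
    return _digest(msg["Message-Id"], msg["List-Id"], msg["Return-Path"])

# Two deliveries of one message (placeholder values, not real headers)
first = {
    "Message-Id": "<example-id@example.org>",
    "List-Id": "<dev.example.org>",
    "Return-Path": "<dev-return-1-archive@example.org>",
}
second = {**first, "Return-Path": "<dev-return-2-archive@example.org>"}

# Without Return-Path the two deliveries share one key
same = key_without_return_path(first) == key_without_return_path(second)
# With Return-Path each delivery gets its own key
different = key_with_return_path(first) != key_with_return_path(second)
print(same, different)
```

With the first key only one of the two deliveries could be stored; with
the second both can.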

I think it's important to store the full message history in the database.
For example, if one of the messages bounces, it would be odd if the
source of the bounce were not in the database.
Also the message sequences will be incomplete.
This is already the case for lists.apache.org. The mbox at

https://lists.apache.org/api/mbox.lua?list=d...@airavata.apache.org&date=2016-6

does not contain the message with sequence number 15575.

> How about using the following:
>
> Message-Id
> Date
> Return-Path

In this case the return-path can be used to distinguish the messages.

> List-Id

I think this must be the original List-Id, not any override.
Otherwise there may be problems with permalinks if a list name is
updated - the old permalink will no longer work.

> Whatever new algorithm is chosen, I think it's important that the
> format looks different from the existing ones. e.g. one could drop the
> <> around the list id.
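
As a rough sketch of the proposal (not an agreed design), a generator
over those four fields could look something like the Python below. The
name `proposed_id`, the choice of SHA-256, the NUL separator and the
`digest@list-id` layout are all assumptions for illustration; the header
values are placeholders. The list id passed in is meant to be the
original List-Id header, not an override, and it is emitted without the
surrounding <>.

```python
import hashlib
from email import message_from_string

def proposed_id(msg, list_id):
    """Hypothetical generator: hash Message-Id, Date, Return-Path and the
    original List-Id, then append the List-Id without angle brackets."""
    fields = [msg.get(name) or "" for name in ("Message-Id", "Date", "Return-Path")]
    bare_list_id = list_id.strip().strip("<>")
    fields.append(bare_list_id)
    digest = hashlib.sha256("\x00".join(fields).encode("utf-8")).hexdigest()
    return digest + "@" + bare_list_id

# Placeholder message, not a real archive entry
raw = (
    "Return-Path: <dev-return-1-archive@example.org>\n"
    "Message-Id: <example-id@example.org>\n"
    "Date: Tue, 7 Jun 2016 20:08:04 +0000\n"
    "List-Id: <dev.example.org>\n"
    "\n"
    "body\n"
)
msg = message_from_string(raw)
first_id = proposed_id(msg, msg["List-Id"])

# Re-processing the same source yields the same id (goal B)
again_id = proposed_id(message_from_string(raw), "<dev.example.org>")

# A second delivery with a different Return-Path yields a different id
second = message_from_string(raw.replace("dev-return-1", "dev-return-2"))
second_id = proposed_id(second, second["List-Id"])
```

Because only header values go into the hash, the id does not depend on
how the message was stored or re-serialised, provided those headers
survive unchanged.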
