On 13 October 2016 at 21:28, sebb <seb...@gmail.com> wrote:
> On 7 October 2016 at 00:44, sebb <seb...@gmail.com> wrote:
>> The id generator is used to create a key for the message database, and
>> also to create a Permalink.
>> Therefore, an id generator needs to fulfil the following design goals
>> as a minimum:
>> A) different messages have different IDs
>> B) the same id is generated if the same message is re-processed
>> C) equivalent messages have the same ID
>> Goal A is needed to ensure that the database can contain every different
>> message.
>> Goal B is needed to ensure that the database can be reloaded from the
>> original source if necessary.
>> Goal C is needed to ensure that the database can be reloaded from an
>> equivalent source, and to ensure that Permalinks are stable.
>> None of the current id generator algorithms meet all of the above goals.
>> The original and medium generators fail to meet goal A.
>> The full generator fails to meet goal B (and therefore C).
>> A sender can easily generate two messages with identical content; it
>> is important to distinguish these.
>> The Message-Id should help here.
>> Message-ID is supposed to be unique, but in practice it may not be, so
>> some additional fields need to be used to create the database id.
>> For mailing lists the Return-Path will normally contain a unique id
>> which is used to identify bounces.
>> In theory this might be sufficient on its own. Indeed the path might
>> be usable without hashing. However the early ASF mailing list software
>> did not use unique Return-Paths.
>> The existing mod_mbox solution uses a combination of Message-Id plus
>> YYYYMM plus a list identifier. Have there ever been any collisions?
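One way to check for collisions would be to group the messages by that
kind of key and report any key that occurs more than once. The following
is only a rough sketch: the key layout and the function names are my own
assumptions, not the actual mod_mbox code, and it assumes every message
has a parseable Date header.

```python
from collections import defaultdict
from email import message_from_string
from email.utils import parsedate_to_datetime


def parse_all(raw_messages):
    """Parse an iterable of raw RFC 5322 message strings."""
    return [message_from_string(raw) for raw in raw_messages]


def mbox_style_key(msg, list_id):
    """Hypothetical key: list id + YYYYMM (from the Date header) + Message-Id.

    Assumption: the month is taken from the Date header; real archives may
    instead use the month of the mbox file that holds the message.
    """
    when = parsedate_to_datetime(msg["Date"])
    message_id = (msg["Message-Id"] or "").strip()
    return (list_id, when.strftime("%Y%m"), message_id)


def find_collisions(messages, list_id):
    """Return {key: [message indexes]} for every key used more than once."""
    seen = defaultdict(list)
    for index, msg in enumerate(messages):
        seen[mbox_style_key(msg, list_id)].append(index)
    return {key: idxs for key, idxs in seen.items() if len(idxs) > 1}
```

Any key that maps to more than one index is either a true collision or
the same message delivered more than once.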
> It's certainly possible for mod_mbox to contain multiple messages with
> the same id.
> For example:
> Tue Jun 7 20:08:04 2016
> Tue Jun 7 20:24:24 2016
> both have the same id:
> Message-Id: <bcd87d1a-bfb6-4f2a-92cc-af8f1d16b...@apache.org>
> It's exactly the same message which arrived twice in the same mailing
> list a few seconds apart.
> Maybe it was sent using Bcc as well as To: ?
> It's not exactly a collision, but at present mod_mbox is able to store
> both whereas Pony Mail cannot (except if using the full generator,
> which has other problems).
> I think it's important to store the full message history in the database.
> For example, if one of the messages bounces, it would be odd if the
> source of the bounce were not in the database.
> Also the message sequences will be incomplete.
> This is the case for lists.a.o: the mbox
> does not have the message sequence number 15575.
>> How about using the following:
> In this case the return-path can be used to distinguish the messages.
Note that the path depends on where the message is stored:
So it will need adjusting to extract the parts that are the same for
all message sources, whether that is mod_mbox or Pony Mail or sent to
The format will presumably depend on the mailing list software that is used.
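As an illustration of that adjustment, here is a minimal sketch that
assumes an ezmlm-style Return-Path of the form
<list-return-NNN-recipient=domain@host>. The pattern and the function
name are assumptions for the example only; other list software will need
a different pattern. It keeps the list name, host and sequence number,
and drops the recipient part, which differs per subscriber or archiver.

```python
import re

# Assumed ezmlm-style layout: <list-return-NNN-recipient=domain@host>
_RETURN_PATH = re.compile(
    r"^<?(?P<list>[^@<>]+?)-return-(?P<seq>\d+)-(?P<rcpt>[^@<>]*)"
    r"@(?P<host>[^@<>]+?)>?$"
)


def stable_return_path(return_path):
    """Return (list, host, sequence) or None if the layout is not recognised.

    The recipient-specific part is deliberately discarded so that every
    copy of the same list message yields the same tuple.
    """
    match = _RETURN_PATH.match(return_path.strip())
    if match is None:
        return None
    return (match.group("list"), match.group("host").lower(), int(match.group("seq")))
```

Returning None for unrecognised paths lets the caller fall back to some
other field rather than silently hashing a recipient-specific value.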
> I think this must be the original List-Id, not any override.
> Otherwise there may be problems with permalinks if a list name is
> updated - the old permalink will no longer work.
>> Whatever new algorithm is chosen, I think it's important that the
>> format looks different from the existing ones. e.g. one could drop the
>> <> around the list id.
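To make the discussion concrete, here is a rough sketch of what such a
generator could look like. It is not a proposal for the final algorithm:
the choice of fields (Message-Id, a source-independent Return-Path key,
the original List-Id), the use of SHA-256 and the function name are all
assumptions for illustration. The only point taken from the above is
that the list id is written without the surrounding <>.

```python
import hashlib


def sketch_id(message_id, return_path_key, list_id):
    """Hypothetical id: hex digest of the chosen fields + '@' + bare list id.

    return_path_key is assumed to be the part of the Return-Path that is
    the same for all message sources.
    """
    bare_list_id = list_id.strip().strip("<>")
    digest = hashlib.sha256()
    for field in (message_id.strip(), return_path_key.strip(), bare_list_id):
        data = field.encode("utf-8")
        # Length-prefix each field so that ("ab", "c") and ("a", "bc")
        # cannot produce the same input to the hash.
        digest.update(len(data).to_bytes(4, "big"))
        digest.update(data)
    return "%s@%s" % (digest.hexdigest(), bare_list_id)
```

Because only header-derived values are hashed, re-processing the same
message gives the same id, while two deliveries with different
Return-Path keys get different ids.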