On 13 October 2016 at 21:28, sebb <[email protected]> wrote:
> On 7 October 2016 at 00:44, sebb <[email protected]> wrote:
>> The id generator is used to create a key for the message database, and
>> also to create a Permalink.
>>
>> Therefore, an id generator needs to fulfil the following design goals
>> as a minimum:
>> A) different messages have different IDs
>> B) the same id is generated if the same message is re-processed
>> C) equivalent messages have the same ID
>>
>> Goal A is needed to ensure that the database can contain every different
>> message
>> Goal B is needed to ensure that the database can be reloaded from the
>> original source if necessary
>> Goal C is needed to ensure that the database can be reloaded from an
>> equivalent source, and to ensure that Permalinks are stable.
>>
>> None of the current id generator algorithms meet all of the above goals.
>>
>> The original and medium generators fail to meet goal A.
>> The full generator fails to meet goal B (and therefore C).
>>
>> A sender can easily generate two messages with identical content; it
>> is important to distinguish these.
>>
>> The Message-Id should help here.
>>
>> Message-ID is supposed to be unique, in practice it may not be, so
>> some additional fields need to be used to create the database id.
>>
>> For mailing lists the Return-Path will normally contain a unique id
>> which is used to identify bounces.
>> In theory this might be sufficient on its own. Indeed the path might
>> be usable without hashing. However the early ASF mailing list software
>> did not use unique Return-Paths.
>>
>> The existing mod_mbox solution uses a combination of Message-Id plus
>> YYYYMM plus a list identifier. Have there ever been any collisions?
>
> It's certainly possible for mod_mbox to contain multiple messages with
> the same id.
>
> For example:
>
> From
> dev-return-15575-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org
> Tue Jun 7 20:08:04 2016
> and
> From
> dev-return-15577-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org
> Tue Jun 7 20:24:24 2016
>
> both have the same id:
>
> Message-Id: <[email protected]>
>
> It's exactly the same message which arrived twice in the same mailing
> list a few seconds apart.
> Maybe it was sent using Bcc as well as To: ?
>
> It's not exactly a collision, but at present mod_mbox is able to store
> both whereas Pony Mail cannot (except if using the full generator,
> which has other problems)
>
> I think it's important to store the full message history in the database.
> For example, if one of the messages bounces, it would be odd if the
> source of the bounce were not in the database.
> Also the message sequences will be incomplete.
> This is the case for lists.a.o, the mbox
>
> https://lists.apache.org/api/[email protected]&date=2016-6
>
> does not have the message sequence number 15575
>
>> How about using the following:
>>
>> Message-Id
>> Date
>> Return-Path
>
> In this case the return-path can be used to distinguish the messages.
Note that the path depends on where the message is stored:

Return-Path: <dev-return-15577-archive-asf-public=cust-asf.ponee...@airavata.apache.org>

as against

Return-Path: <dev-return-15577-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org>

So it will need adjusting to extract the parts that are the same for
all message sources, whether that is mod_mbox or Pony Mail or sent to
another subscriber.
The format will presumably depend on the mailing list software that is used.

>> List-Id
>
> I think this must be the original List-Id, not any override.
> Otherwise there may be problems with permalinks if a list name is
> updated - the old permalink will no longer work.
>
>> Whatever new algorithm is chosen, I think it's important that the
>> format looks different from the existing ones. e.g. one could drop the
>> <> around the list id.
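
As a rough illustration of the Return-Path adjustment and the proposed
field combination, here is a minimal Python sketch. It is not existing
Pony Mail code. The function names, the regular expression, the choice of
SHA-256 and the output format are all assumptions made for the example,
and the pattern is inferred only from the two Return-Path values quoted
above, so other mailing list software would need its own rule.

```python
import hashlib
import re

# Assumed shape, inferred from the two examples in this thread:
#   <list>-return-<sequence>-<recipient-specific part>@<list domain>
RETURN_PATH_RE = re.compile(
    r"^<?(?P<list>[^@<>]+?)-return-(?P<seq>\d+)-"
    r"(?P<rcpt>[^@<>]*)@(?P<domain>[^@<>]+)>?$"
)


def stable_return_path(return_path):
    """Return the recipient-independent part of a Return-Path, or None."""
    m = RETURN_PATH_RE.match(return_path.strip())
    if m is None:
        return None  # unrecognised format, e.g. different list software
    return "{}-return-{}@{}".format(
        m.group("list"), m.group("seq"), m.group("domain").lower()
    )


def generate_id(message_id, date, return_path, list_id):
    """Hypothetical generator hashing the four fields proposed above."""
    # Fall back to the raw path when the format is not recognised.
    path = stable_return_path(return_path) or return_path.strip()
    # Drop the <> around the list id so the result looks different
    # from ids that keep the brackets.
    bare_list_id = list_id.strip().strip("<>")
    parts = [message_id.strip(), date.strip(), path, bare_list_id]
    digest = hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()
    return "{}@{}".format(digest, bare_list_id)
```

With this rule the two copies of message 15577 (archive and mod_mbox)
reduce to the same stable path and so get the same id, while 15575 and
15577 get different ids even though they share a Message-Id. The List-Id
passed in would have to be the original one, not an override.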
