It just occurs to me that there is another aspect to be considered: list renaming. This might affect both the unique id and Permalinks.
If the list id is embedded in the Permalink hash, I think it would be
very difficult to honour existing links.

This is in contrast with the link format as used by mod_mbox, where
the list id is exposed in the URL, for example:

http://mail-archives.apache.org/mod_mbox/incubator-any23-dev/201209.mbox/<message-id>
http://mail-archives.apache.org/mod_mbox/any23-dev/201209.mbox/<message-id>

It would be easy enough to redirect one URL to the other - if necessary.
(In this case, the original lists have been retained)

How can list renaming be managed in Pony Mail?
Or is it not allowed?

On 14 October 2016 at 00:21, sebb <[email protected]> wrote:
> On 13 October 2016 at 23:14, Daniel Gruno <[email protected]> wrote:
>> On 10/14/2016 12:12 AM, sebb wrote:
>>> On 13 October 2016 at 21:28, sebb <[email protected]> wrote:
>>>> On 7 October 2016 at 00:44, sebb <[email protected]> wrote:
>>>>> The id generator is used to create a key for the message database, and
>>>>> also to create a Permalink.
>>>>>
>>>>> Therefore, an id generator needs to fulfil the following design goals
>>>>> as a minimum:
>>>>> A) different messages have different IDs
>>>>> B) the same id is generated if the same message is re-processed
>>>>> C) equivalent messages have the same ID
>>>>>
>>>>> Goal A is needed to ensure that the database can contain every different
>>>>> message
>>>>> Goal B is needed to ensure that the database can be reloaded from the
>>>>> original source if necessary
>>>>> Goal C is needed to ensure that the database can be reloaded from an
>>>>> equivalent source, and to ensure that Permalinks are stable.
>>>>>
>>>>> None of the current id generator algorithms meet all of the above goals.
>>>>>
>>>>> The original and medium generators fail to meet goal A.
>>>>> The full generator fails to meet goal B (and therefore C).
>>>>>
>>>>> A sender can easily generate two messages with identical content; it
>>>>> is important to distinguish these.
>>>>>
>>>>> The Message-Id should help here.
>>>>>
>>>>> Message-ID is supposed to be unique, in practice it may not be, so
>>>>> some additional fields need to be used to create the database id.
>>>>>
>>>>> For mailing lists the Return-Path will normally contain a unique id
>>>>> which is used to identify bounces.
>>>>> In theory this might be sufficient on its own. Indeed the path might
>>>>> be usable without hashing. However the early ASF mailing list software
>>>>> did not use unique Return-Paths.
>>>>>
>>>>> The existing mod_mbox solution uses a combination of Message-Id plus
>>>>> YYYYMM plus a list identifier. Have there ever been any collisions?
>>>>
>>>> It's certainly possible for mod_mbox to contain multiple messages with
>>>> the same id.
>>>>
>>>> For example:
>>>>
>>>> From
>>>> dev-return-15575-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org
>>>> Tue Jun 7 20:08:04 2016
>>>> and
>>>> From
>>>> dev-return-15577-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org
>>>> Tue Jun 7 20:24:24 2016
>>>>
>>>> both have the same id:
>>>>
>>>> Message-Id: <[email protected]>
>>>>
>>>> It's exactly the same message which arrived twice in the same mailing
>>>> list a few seconds apart.
>>>> Maybe it was sent using Bcc as well as To: ?
>>>>
>>>> It's not exactly a collision, but at present mod_mbox is able to store
>>>> both whereas Pony Mail cannot (except if using the full generator,
>>>> which has other problems)
>>>>
>>>> I think it's important to store the full message history in the database.
>>>> For example, if one of the messages bounces, it would be odd if the
>>>> source of the bounce were not in the database.
>>>> Also the message sequences will be incomplete.
>>>> This is the case for lists.a.o, the mbox
>>>>
>>>> https://lists.apache.org/api/[email protected]&date=2016-6
>>>>
>>>> does not have the message sequence number 15575
>>>>
>>>>> How about using the following:
>>>>>
>>>>> Message-Id
>>>>> Date
>>>>> Return-Path
>>>>
>>>> In this case the return-path can be used to distinguish the messages.
>>>
>>> Note that the path depends on where the message is stored:
>>>
>>> Return-Path:
>>> <dev-return-15577-archive-asf-public=cust-asf.ponee...@airavata.apache.org>
>>> as against
>>> Return-Path:
>>> <dev-return-15577-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org>
>>>
>>> So it will need adjusting to extract the parts that are the same for
>>> all message sources, whether that is mod_mbox or Pony Mail or sent to
>>> another subscriber.
>>
>>
>> How about we:
>> 1) select a larger list of headers that we know to be the same across
>> the same email sent to different MTAs, and combine them for the ID
>> generation.
>
> Sort of.
>
> AFAICT the *only* value that can distinguish multiple postings (of
> which there seem to be quite a lot) is the mailing list sequence
> number, which in ezmlm is only available in the Return-Path.
>
> However the sequence could perhaps be reset - so it seems risky to
> rely on that plus the list id alone.
> Also the earliest messages did not have sequence numbers.
>
> So there needs to be another way to identify distinct messages.
> In theory, that is the Message-Id.
> Even if it is not completely unique, it should be OK in combination
> with the sequence number.
>
> If the Message-Id does not exist or is very poor, then I think the
> only solution is to look at most - if not all - the parts of a message
> that a user can vary.
> This is what the full id generation does, except that it also includes
> the MTA-specific headers, causing problems with repeatability.
>
> There is another aspect to this. At present the generation is done
> using the parsed mail, rather than the original.
> If there is any chance that the format can change between releases of
> the library, then this may destroy repeatability.
> Likewise if a different library is ever used.
> We may need to additionally ensure that all headers are in a canonical
> form before use, in case MTAs decide to vary the layout.
>
>> 2) Store some more headers in the doc :)
>
> Those might be useful for searching, but each posting needs its own
> unique id otherwise it cannot be stored in the first place.
>
>>>
>>> The format will presumably depend on the mailing list software that is used.
>>>
>>>>> List-Id
>>>>
>>>> I think this must be the original List-Id, not any override.
>>>> Otherwise there may be problems with permalinks if a list name is
>>>> updated - the old permalink will no longer work.
>>>>
>>>>> Whatever new algorithm is chosen, I think it's important that the
>>>>> format looks different from the existing ones. e.g. one could drop the
>>>>> <> around the list id.
>>
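
For illustration only, here is a minimal Python sketch of the kind of
generator discussed in the quoted thread. It hashes Message-Id, a
canonicalised Date, the ezmlm sequence number taken from the Return-Path,
and the List-Id. It is not what Pony Mail does today. The function names
(sequence_number, canonical_date, sketch_id), the choice of SHA-256 and
the "-return-NNNN-" pattern are all assumptions made for the example, and
the List-Id handling assumes a bare <...> value with no description
phrase.

```python
import hashlib
import re
from datetime import timezone
from email import message_from_string
from email.utils import parsedate_to_datetime


def sequence_number(return_path):
    """Pull the ezmlm-style sequence number out of a Return-Path.

    Assumes the 'list-return-NNNN-recipient' layout seen in the examples
    in the thread; returns '' when no number is present (early messages).
    """
    match = re.search(r"-return-(\d+)-", return_path or "")
    return match.group(1) if match else ""


def canonical_date(date_header):
    """Normalise a Date header to UTC ISO 8601, else fall back to raw text."""
    try:
        parsed = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        return (date_header or "").strip()
    if parsed.tzinfo is None:  # '-0000' dates come back naive; treat as UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat()


def sketch_id(raw_message):
    """Hash only the fields that should be identical across message sources."""
    msg = message_from_string(raw_message)
    fields = [
        (msg.get("Message-Id") or "").strip(),
        canonical_date(msg.get("Date")),
        # Only the sequence number is kept: the rest of the Return-Path
        # depends on which subscriber or archive received the copy.
        sequence_number(msg.get("Return-Path")),
        # Hypothetical simplification: drop the <> around the list id.
        (msg.get("List-Id") or "").strip().strip("<>"),
    ]
    return hashlib.sha256("\n".join(fields).encode("utf-8")).hexdigest()
```

With this sketch, two copies of the same posting delivered to different
archives would get the same id, while a duplicate posting with a
different sequence number would get a different one. Since List-Id is
part of the hash, a list rename would still change the id unless the
original List-Id is the one that is used.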
