On 18 October 2016 at 10:49, Daniel Gruno <[email protected]> wrote:
> On 10/18/2016 11:43 AM, sebb wrote:
>> It just occurs to me that there is another aspect to be considered:
>> list renaming.
>> This might affect both the unique id and Permalinks.
>>
>> If the list id is embedded in the Permalink hash, I think it would be
>> very difficult to honour existing links.
>>
>> This is in contrast with the link format as used by mod_mbox, where
>> the list id is exposed in the URL, for example:
>>
>> http://mail-archives.apache.org/mod_mbox/incubator-any23-dev/201209.mbox/<message-id>
>> http://mail-archives.apache.org/mod_mbox/any23-dev/201209.mbox/<message-id>
>>
>> It would be easy enough to redirect one URL to the other - if necessary.
>> (In this case, the original lists have been retained)
>
> Generally not an issue.
> when you rename a list in Pony Mail, you keep the original ID and
> source, you only touch the list value in elasticsearch. permalinks will
> remain the same, but point to the new list name (and keep the original
> in the source).
>
> There are of course edge cases where you have to reimport and you force
> a list ID on the command line, which may change the generated ID.
>
> If we scrap the list id for a moment, we are left with, at least, the
> following constants:
>
> Sender
> Date
> Subject
> Body
> Message ID
>
> The problem with the above is, that an email with the same values in all
> fields may have been sent to multiple lists, so ideally we'd keep
> separate copies for each list....which gets tricky without adding the
> list ID to this. A workaround would be to force the ID generator to
> always use the original list ID in the source for generating the ID, and
> only use the forced list ID if no list was found.
>
> Thoughts?
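
A minimal sketch of that workaround as I read it, assuming the archiver has both the List-Id parsed from the message source and any value forced on the command line. The function and parameter names are hypothetical, not Pony Mail's:

```python
def list_id_for_hashing(source_list_id, forced_list_id):
    """Pick the list id that feeds the id generator.

    Prefer the List-Id found in the original message source; use the
    forced (command-line) value only when the source has none.
    """
    original = (source_list_id or "").strip()
    if original:
        return original
    return (forced_list_id or "").strip()
```

With this, a re-import that forces a new list name would still hash the original List-Id, so the generated id should not change unless the source had no List-Id at all.
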
Need to think more about this, but it occurs to me now that the ES id
and Permalink have similar, but not identical, requirements.
I think this is where some of the issues originate.
One might need to consider using separate algorithms.

>>
>> How can list renaming be managed in Pony Mail?
>> Or is it not allowed?
>>
>>
>> On 14 October 2016 at 00:21, sebb <[email protected]> wrote:
>>> On 13 October 2016 at 23:14, Daniel Gruno <[email protected]> wrote:
>>>> On 10/14/2016 12:12 AM, sebb wrote:
>>>>> On 13 October 2016 at 21:28, sebb <[email protected]> wrote:
>>>>>> On 7 October 2016 at 00:44, sebb <[email protected]> wrote:
>>>>>>> The id generator is used to create a key for the message database, and
>>>>>>> also to create a Permalink.
>>>>>>>
>>>>>>> Therefore, an id generator needs to fulfil the following design goals
>>>>>>> as a minimum:
>>>>>>> A) different messages have different IDs
>>>>>>> B) the same id is generated if the same message is re-processed
>>>>>>> C) equivalent messages have the same ID
>>>>>>>
>>>>>>> Goal A is needed to ensure that the database can contain every
>>>>>>> different message
>>>>>>> Goal B is needed to ensure that the database can be reloaded from the
>>>>>>> original source if necessary
>>>>>>> Goal C is needed to ensure that the database can be reloaded from an
>>>>>>> equivalent source, and to ensure that Permalinks are stable.
>>>>>>>
>>>>>>> None of the current id generator algorithms meet all of the above goals.
>>>>>>>
>>>>>>> The original and medium generators fail to meet goal A.
>>>>>>> The full generator fails to meet goal B (and therefore C).
>>>>>>>
>>>>>>> A sender can easily generate two messages with identical content; it
>>>>>>> is important to distinguish these.
>>>>>>>
>>>>>>> The Message-Id should help here.
>>>>>>>
>>>>>>> Message-ID is supposed to be unique, in practice it may not be, so
>>>>>>> some additional fields need to be used to create the database id.
>>>>>>>
>>>>>>> For mailing lists the Return-Path will normally contain a unique id
>>>>>>> which is used to identify bounces.
>>>>>>> In theory this might be sufficient on its own. Indeed the path might
>>>>>>> be usable without hashing. However the early ASF mailing list software
>>>>>>> did not use unique Return-Paths.
>>>>>>>
>>>>>>> The existing mod_mbox solution uses a combination of Message-Id plus
>>>>>>> YYYYMM plus a list identifier. Have there ever been any collisions?
>>>>>>
>>>>>> It's certainly possible for mod_mbox to contain multiple messages with
>>>>>> the same id.
>>>>>>
>>>>>> For example:
>>>>>>
>>>>>> From
>>>>>> dev-return-15575-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org
>>>>>> Tue Jun 7 20:08:04 2016
>>>>>> and
>>>>>> From
>>>>>> dev-return-15577-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org
>>>>>> Tue Jun 7 20:24:24 2016
>>>>>>
>>>>>> both have the same id:
>>>>>>
>>>>>> Message-Id: <[email protected]>
>>>>>>
>>>>>> It's exactly the same message which arrived twice in the same mailing
>>>>>> list a few seconds apart.
>>>>>> Maybe it was sent using Bcc as well as To: ?
>>>>>>
>>>>>> It's not exactly a collision, but at present mod_mbox is able to store
>>>>>> both whereas Pony Mail cannot (except if using the full generator,
>>>>>> which has other problems)
>>>>>>
>>>>>> I think it's important to store the full message history in the database.
>>>>>> For example, if one of the messages bounces, it would be odd if the
>>>>>> source of the bounce were not in the database.
>>>>>> Also the message sequences will be incomplete.
>>>>>> This is the case for lists.a.o, the mbox
>>>>>>
>>>>>> https://lists.apache.org/api/[email protected]&date=2016-6
>>>>>>
>>>>>> does not have the message sequence number 15575
>>>>>>
>>>>>>> How about using the following:
>>>>>>>
>>>>>>> Message-Id
>>>>>>> Date
>>>>>>> Return-Path
>>>>>>
>>>>>> In this case the return-path can be used to distinguish the messages.
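
To illustrate how the two postings above could be told apart, here is a rough sketch that pulls the list name and sequence number out of an ezmlm-style path. It assumes the `<list>-return-<seq>-...` layout seen in the examples; the function name is hypothetical, and older messages without a sequence number simply yield None:

```python
import re

# Assumed ezmlm layout: <list>-return-<seq>-<recipient-specific part>@<host>
_SEQ = re.compile(r"^<?([^@\s]+?)-return-(\d+)-")


def list_and_sequence(return_path):
    """Return (list_name, sequence_number), or None if no sequence is present."""
    m = _SEQ.match(return_path.strip())
    if m is None:
        return None
    return m.group(1), int(m.group(2))
```

Only the leading part of the path is matched, so whatever follows the sequence number is ignored.
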
>>>>>
>>>>> Note that the path depends on where the message is stored:
>>>>>
>>>>> Return-Path:
>>>>> <dev-return-15577-archive-asf-public=cust-asf.ponee...@airavata.apache.org>
>>>>> as against
>>>>> Return-Path:
>>>>> <dev-return-15577-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org>
>>>>>
>>>>> So it will need adjusting to extract the parts that are the same for
>>>>> all message sources, whether that is mod_mbox or Pony Mail or sent to
>>>>> another subscriber.
>>>>
>>>>
>>>> How about we:
>>>> 1) select a larger list of headers that we know to be the same across
>>>> the same email sent to different MTAs, and combine them for the ID
>>>> generation.
>>>
>>> Sort of.
>>>
>>> AFAICT the *only* value that can distinguish multiple postings (of
>>> which there seem to be quite a lot) is the mailing list sequence
>>> number, which in ezmlm is only available in the Return-Path.
>>>
>>> However the sequence could perhaps be reset - so it seems risky to
>>> rely on that plus the list id alone.
>>> Also the earliest messages did not have sequence numbers.
>>>
>>> So there needs to be another way to identify distinct messages.
>>> In theory, that is the Message-Id.
>>> Even if it is not completely unique, it should be OK in combination
>>> with the sequence number.
>>>
>>> If the Message-Id does not exist or is very poor, then I think the
>>> only solution is to look at most - if not all - the parts of a message
>>> that a user can vary.
>>> This is what the full id generation does, except that it also includes
>>> the MTA-specific headers, causing problems with repeatability.
>>>
>>> There is another aspect to this. At present the generation is done
>>> using the parsed mail, rather than the original.
>>> If there is any chance that the format can change between releases of
>>> the library, then this may destroy repeatability.
>>> Likewise if a different library is ever used.
>>> We may need to additionally ensure that all headers are in a canonical
>>> form before use, in case MTAs decide to vary the layout.
>>>
>>>> 2) Store some more headers in the doc :)
>>>
>>> Those might be useful for searching, but each posting needs its own
>>> unique id otherwise it cannot be stored in the first place.
>>>
>>>>>
>>>>> The format will presumably depend on the mailing list software that is
>>>>> used.
>>>>>
>>>>>>> List-Id
>>>>>>
>>>>>> I think this must be the original List-Id, not any override.
>>>>>> Otherwise there may be problems with permalinks if a list name is
>>>>>> updated - the old permalink will no longer work.
>>>>>>
>>>>>>> Whatever new algorithm is chosen, I think it's important that the
>>>>>>> format looks different from the existing ones. e.g. one could drop the
>>>>>>> <> around the list id.
>>>>
>
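
Pulling the points above together, a rough sketch of an id built only from fields that should be identical across MTAs, with header values put into a canonical form first. It assumes the sequence number is passed in as a string and that SHA-256 is acceptable; the function names, the 40-character truncation and the `-` separator are illustrative choices, not an agreed format:

```python
import hashlib
import re


def canonical(value):
    """Unfold a header value: collapse any run of whitespace to one space."""
    return re.sub(r"\s+", " ", value or "").strip()


def sketch_id(message_id, date, seq, original_list_id):
    """Hash MTA-independent fields; every argument is a string (or None)."""
    fields = [canonical(f) for f in (message_id, date, seq)]
    # NUL is not expected inside a header value, so use it as the separator.
    digest = hashlib.sha256("\0".join(fields).encode("utf-8")).hexdigest()
    # Drop the <> around the list id so the result looks unlike existing ids.
    list_part = canonical(original_list_id).strip("<>")
    return digest[:40] + "-" + list_part
```

Because the list id sits outside the hash here, a renamed list would only change the suffix, which might make redirecting old Permalinks easier. Whether the ES id and the Permalink should share this form at all is still open.
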
