On 12/10/2016 12:46 PM, sebb wrote:
> Another issue I have noticed - From_ mangling.
>
> Exported mbox files must use From_ mangling to be usable as input.
> However the commonest method (mboxo) only mangles a plain From_, so
> one cannot then distinguish From_ from >From_.
> [It looks like the ASF mod_mbox archives use mboxo]
>
> This means that the live message as received by the archiver may not
> be the same as the unmangled mbox entry.
>
> The current importer does not unmangle >From_ lines, which means that
> re-importation of a live message will change the MID (and the
> Permalink)
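To make the ambiguity concrete, here is a minimal sketch of mboxo mangling next to the mboxrd variant, which quotes already-quoted lines once more and is therefore reversible. The function names are made up for illustration and this is not the Pony Mail importer code.

```python
import re

def mangle_mboxo(body):
    # mboxo: only lines starting with a plain "From " get a ">" prefix.
    return re.sub(r"^From ", ">From ", body, flags=re.MULTILINE)

def unmangle_mboxo(body):
    # Naive inverse: strip one ">" from every ">From " line.
    return re.sub(r"^>From ", "From ", body, flags=re.MULTILINE)

def mangle_mboxrd(body):
    # mboxrd: any line matching ">*From " gets one extra ">".
    return re.sub(r"^(>*From )", r">\1", body, flags=re.MULTILINE)

def unmangle_mboxrd(body):
    # Inverse: remove exactly one ">" from lines matching ">+From ".
    return re.sub(r"^>(>*From )", r"\1", body, flags=re.MULTILINE)

original = "From the start\n>From a quoted line\nplain text\n"

# mboxo leaves the already-quoted line alone, so both lines look alike
# after mangling and the round trip loses the original ">".
mboxo_round_trip = unmangle_mboxo(mangle_mboxo(original))

# mboxrd keeps the two cases apart, so the round trip is exact.
mboxrd_round_trip = unmangle_mboxrd(mangle_mboxrd(original))
```

With mboxo the round trip turns the second line into "From a quoted line", so any id hashed over the body would differ from the one computed for the live message.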
That would only be in the case of using the full source for generating
the ID, right? I don't see the medium/short generators being affected
here. On that note, yeah, we need to figure out how to improve the
medium generator just a smidgen :)

> Likewise once the importer is fixed to do the unmangling,
> re-importation of previously imported messages will change the MID
> (and the Permalink)
> There will still be the issue of >From_ lines in original messages,
> but that should hopefully only affect a few messages (quoted lines
> have a space after the >)
>
>
> On 8 December 2016 at 13:25, sebb <seb...@gmail.com> wrote:
>> I have just discovered that there are some duplicate ezmlm message
>> index numbers in the announce@a.o archives.
>> This means we cannot rely on the number being unique, though the From
>> line is almost certainly unique if the date is included.
>> But unless the same data is present in all copies of the message,
>> hashes derived from it won't be stable.
>>
>> 200103.mbox:From announce-return-32-apmail-announce-archive=apache....@apache.org Fri Mar 09 20:26:49 2001
>> 200103.mbox:From announce-return-33-apmail-announce-archive=apache....@apache.org Mon Mar 12 12:19:04 2001
>> 200103.mbox:From announce-return-34-apmail-announce-archive=apache....@apache.org Mon Mar 12 19:09:39 2001
>> 200103.mbox:From announce-return-35-apmail-announce-archive=apache....@apache.org Wed Mar 28 17:46:40 2001
>> 200104.mbox:From announce-return-36-apmail-announce-archive=apache....@apache.org Mon Apr 09 19:25:39 2001
>> 200105.mbox:From announce-return-37-apmail-announce-archive=apache....@apache.org Tue May 01 21:14:21 2001
>> 200105.mbox:From announce-return-38-apmail-announce-archive=apache....@apache.org Tue May 01 21:59:08 2001
>> 200105.mbox:From announce-return-39-apmail-announce-archive=apache....@apache.org Sat May 12 22:11:46 2001
>> 200105.mbox:From announce-return-40-apmail-announce-archive=apache....@apache.org Mon May 21 22:51:04 2001
>> 200105.mbox:From announce-return-41-apmail-announce-archive=apache....@apache.org Tue May 22 17:12:40 2001
>> 200105.mbox:From announce-return-42-apmail-announce-archive=apache....@apache.org Thu May 31 06:14:01 2001
>> 200106.mbox:From announce-return-43-apmail-announce-archive=apache....@apache.org Fri Jun 01 14:31:52 2001
>> 200107.mbox:From announce-return-44-apmail-announce-archive=apache....@apache.org Tue Jul 17 18:56:30 2001
>> 200110.mbox:From announce-return-45-apmail-announce-archive=apache....@apache.org Sun Oct 14 03:23:08 2001
>> 200111.mbox:From announce-return-46-apmail-announce-archive=apache....@apache.org Fri Nov 16 00:13:49 2001
>>
>> 200210.mbox:From announce-return-32-apmail-announce-archive=apache....@apache.org Thu Oct 03 19:20:20 2002
>> 200210.mbox:From announce-return-33-apmail-announce-archive=apache....@apache.org Fri Oct 04 19:01:01 2002
>> 200211.mbox:From announce-return-34-apmail-announce-archive=apache....@apache.org Tue Nov 05 20:29:51 2002
>> 200211.mbox:From announce-return-35-apmail-announce-archive=apache....@apache.org Wed Nov 27 19:39:07 2002
>> 200301.mbox:From announce-return-36-apmail-announce-archive=apache....@apache.org Mon Jan 20 23:34:23 2003
>> 200301.mbox:From announce-return-37-apmail-announce-archive=apache....@apache.org Sun Jan 26 08:21:21 2003
>> 200302.mbox:From announce-return-38-apmail-announce-archive=apache....@apache.org Mon Feb 24 16:35:58 2003
>> 200303.mbox:From announce-return-39-apmail-announce-archive=apache....@apache.org Mon Mar 03 15:44:33 2003
>> 200303.mbox:From announce-return-40-apmail-announce-archive=apache....@apache.org Mon Mar 17 15:55:44 2003
>> 200305.mbox:From announce-return-41-apmail-announce-archive=apache....@apache.org Wed May 28 16:40:34 2003
>> 200307.mbox:From announce-return-42-apmail-announce-archive=apache....@apache.org Wed Jul 09 12:06:06 2003
>> 200307.mbox:From announce-return-43-apmail-announce-archive=apache....@apache.org Fri Jul 18 13:48:09 2003
>> 200308.mbox:From announce-return-44-apmail-announce-archive=apache....@apache.org Tue Aug 05 16:10:28 2003
>> 200308.mbox:From announce-return-45-apmail-announce-archive=apache....@apache.org Wed Aug 13 10:26:40 2003
>> 200308.mbox:From announce-return-46-apmail-announce-archive=apache....@apache.org Fri Aug 15 10:21:03 2003
>>
>> I don't know what happened to create duplicates partway through the sequence.
>> The sequence proper started here:
>>
>> 200201.mbox:From announce-return-1-apmail-announce-archive=apache....@apache.org Fri Jan 11 20:30:21 2002
>>
>> Prior to Jan 2002 AFAICT there are only numbers 32-46 as shown above
>> in the 2001 archives.
>>
>>
>> On 16 November 2016 at 21:50, sebb <seb...@gmail.com> wrote:
>>> Just discovered an issue which affects the Permalinks.
>>>
>>> If the archiver msgbody() function fails to detect a text message
>>> body, it will return null.
>>> If the message has an attachment, then the archiver will generate an
>>> entry for it.
>>> However the short and medium id generators will fail when accessing
>>> the body.
>>>
>>> The mid will then revert to the previously calculated value.
>>> This is derived from the list id and the 'archived-at' header.
>>> However the archived-at header is added by the archiver itself if necessary.
>>> So if the message ever needs to be reloaded from elsewhere, a new
>>> mid/Permalink value will likely be generated.
>>> The Permalink will stop working unless the original entry is kept.
>>>
>>> Unfortunately the fallback mids have exactly the same format as the
>>> medium generator (sha224@lid).
>>> This makes it impossible to determine which Permalinks are affected
>>> from the link alone.
>>> However any existing mbox entries which have body: null will be using
>>> the fallback mid.
>>>
>>>
>>> On 25 October 2016 at 19:50, sebb <seb...@gmail.com> wrote:
>>>> Unfortunately it appears that Message-Ids are not always generated
>>>> (*), so there needs to be an equivalent that can be used.
>>>> Furthermore, some emails may have more than one Message-Id [+]
>>>>
>>>> The original message fields (including full body) are not enough to
>>>> uniquely id a message in the mailing list stream.
>>>> For that one needs something like a sequence number.
>>>> However some early mailboxes don't include the ezmlm sequence number.
>>>>
>>>> However AFAIK every message that reaches a mailing list must have been
>>>> delivered to the mailserver mailbox.
>>>> So the mailing list message *instance* should be identifiable using
>>>> either the sequence number, or failing that, some or all of the
>>>> routing information by which it arrived at the mailbox.
>>>>
>>>> And one can identify the mail *content* using the Message-Id if
>>>> present (failing that, a hash of the full e-mail).
>>>>
>>>> This needs some more work and some worked example messages.
>>>>
>>>> Sorry to keep going on about this, but AFAICT currently Pony Mail does
>>>> not have an algorithm that satisfies all the requirements for use as
>>>> an ES id.
>>>>
>>>> (*) Automated services and older e-mails don't always have them.
>>>> Such mails don't appear in mod_mbox listings though they are present
>>>> in the raw mbox files.
>>>> [+] mod_mbox appears to use the first Message-Id it finds.
>>>>
>>>> On 22 October 2016 at 23:02, sebb <seb...@gmail.com> wrote:
>>>>> On 18 October 2016 at 10:49, Daniel Gruno <humbed...@apache.org> wrote:
>>>>>> On 10/18/2016 11:43 AM, sebb wrote:
>>>>>>> It just occurs to me that there is another aspect to be considered:
>>>>>>> list renaming.
>>>>>>> This might affect both the unique id and Permalinks.
>>>>>>>
>>>>>>> If the list id is embedded in the Permalink hash, I think it would be
>>>>>>> very difficult to honour existing links.
>>>>>>>
>>>>>>> This is in contrast with the link format as used by mod_mbox, where
>>>>>>> the list id is exposed in the URL, for example:
>>>>>>>
>>>>>>> http://mail-archives.apache.org/mod_mbox/incubator-any23-dev/201209.mbox/<message-id>
>>>>>>> http://mail-archives.apache.org/mod_mbox/any23-dev/201209.mbox/<message-id>
>>>>>>>
>>>>>>> It would be easy enough to redirect one URL to the other - if necessary.
>>>>>>> (In this case, the original lists have been retained)
>>>>>>
>>>>>> Generally not an issue.
>>>>>> When you rename a list in Pony Mail, you keep the original ID and
>>>>>> source, you only touch the list value in elasticsearch. Permalinks will
>>>>>> remain the same, but point to the new list name (and keep the original
>>>>>> in the source).
>>>>>
>>>>> I see, that's probably OK.
>>>>>
>>>>>> There are of course edge cases where you have to reimport and you force
>>>>>> a list ID on the command line, which may change the generated ID.
>>>>>
>>>>> That could be a problem both for the ES id and the Permalink.
>>>>> If the id uses the new LID, it will result in a new MID.
>>>>> If the old message is still present, there will be two copies.
>>>>> If the old message is deleted, the Permalink will break.
>>>>>
>>>>> I think we do need to allow for re-imports.
>>>>> This makes generation of the MID tricky.
>>>>>
>>>>>> If we scrap the list id for a moment, we are left with, at least, the
>>>>>> following constants:
>>>>>>
>>>>>> Sender
>>>>>> Date
>>>>>> Subject
>>>>>> Body
>>>>>
>>>>> (I assume this includes the attachments)
>>>>>
>>>>>> Message ID
>>>>>>
>>>>>> The problem with the above is, that an email with the same values in all
>>>>>> fields may have been sent to multiple lists, so ideally we'd keep
>>>>>> separate copies for each list....which gets tricky without adding the
>>>>>> list ID to this. A workaround would be to force the ID generator to
>>>>>> always use the original list ID in the source for generating the ID, and
>>>>>> only use the forced list ID if no list was found.
>>>>>
>>>>> The identical message may be sent to the same list twice.
>>>>> I've seen this already on some of the lists (I think I mentioned this
>>>>> already).
>>>>> It looks like a mail client sometimes duplicates the message.
>>>>> Maybe it can also occur if the destination list has an alias and the
>>>>> message is sent to the alias as well.
>>>>>
>>>>> So the list id is not sufficient to identify the specific message.
>>>>>
>>>>> However mailing list software has to identify bounces.
>>>>> This is usually done by adding a unique id to the Return-Path.
>>>>> In the case of ezmlm this is a sequence number.
>>>>>
>>>>> That will be present in mails sent to subscribers, and is present in
>>>>> the mbox files (apart from some very early ones).
>>>>>
>>>>> If present, it should be a constant for a particular message on a
>>>>> specific mailing list.
>>>>>
>>>>>> Thoughts?
>>>>>
>>>>> The MID needs to be unique to ensure that all messages can be stored OK
>>>>> in ES.
>>>>> Unwanted duplicates cause message loss.
>>>>>
>>>>> The Permalink needs to be persistent.
>>>>> If there are multiple Permalinks for the same message that is not a
>>>>> problem.
>>>>> If a single Permalink applies to multiple messages, that is not ideal,
>>>>> but at least there is a chance of recovering the message.
>>>>> But if a Permalink disappears, then it is a big problem.
>>>>>
>>>>> Although the mod_mbox solution is not perfect, its Permalinks have the
>>>>> advantage that the Message-ID and an idea of the List-name can be got
>>>>> from the link.
>>>>> This means that there is good chance of being able to find the
>>>>> matching message(s) in almost any mail archive, should the originals
>>>>> be unavailable.
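As an illustration of that recoverability, the mod_mbox link layout quoted earlier in the thread can be split back into its parts with nothing but the standard library. This is only a sketch: the function name is invented, the message id in the example is hypothetical, and it assumes the id is percent-encoded in the URL.

```python
from urllib.parse import urlsplit, unquote

def parse_mod_mbox_link(url):
    # Expected path layout: /mod_mbox/<list>/<YYYYMM>.mbox/<message-id>
    path = urlsplit(url).path
    _, prefix, list_name, mbox, message_id = path.split("/", 4)
    if prefix != "mod_mbox" or not mbox.endswith(".mbox"):
        raise ValueError("not a mod_mbox message link: " + url)
    return {
        "list": list_name,
        "month": mbox[: -len(".mbox")],
        "message_id": unquote(message_id),
    }

# Hypothetical message id, percent-encoded as it would appear in a link.
link = ("http://mail-archives.apache.org/mod_mbox/any23-dev/"
        "201209.mbox/%3C1234@example.org%3E")
parts = parse_mod_mbox_link(link)
```

The list name, month and Message-ID recovered this way are enough to search another archive for the same message; a bare hash in the link carries none of that.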
>>>>>
>>>>> With the current Pony Mail links, if a message is lost from the
>>>>> database, AFAICT the only way to recover it is to re-import all the
>>>>> messages and hope the same ids are re-generated.
>>>>> Since there have been at least two different generator algorithms
>>>>> used, and the generator was/is sensitive to the host time zone, this
>>>>> will likely be a lengthy process.
>>>>> The long links do at least include the list-id, which would reduce the
>>>>> work somewhat.
>>>>>
>>>>> So I think the Permalink needs to be similar to the current mod_mbox
>>>>> solution.
>>>>> The MID does not need to be directly related, so long as it is unique
>>>>> and stable.
>>>>>
>>>>>>>
>>>>>>> How can list renaming be managed in Pony Mail?
>>>>>>> Or is it not allowed?
>>>>>>>
>>>>>>>
>>>>>>> On 14 October 2016 at 00:21, sebb <seb...@gmail.com> wrote:
>>>>>>>> On 13 October 2016 at 23:14, Daniel Gruno <humbed...@apache.org> wrote:
>>>>>>>>> On 10/14/2016 12:12 AM, sebb wrote:
>>>>>>>>>> On 13 October 2016 at 21:28, sebb <seb...@gmail.com> wrote:
>>>>>>>>>>> On 7 October 2016 at 00:44, sebb <seb...@gmail.com> wrote:
>>>>>>>>>>>> The id generator is used to create a key for the message database,
>>>>>>>>>>>> and also to create a Permalink.
>>>>>>>>>>>>
>>>>>>>>>>>> Therefore, an id generator needs to fulfil the following design
>>>>>>>>>>>> goals as a minimum:
>>>>>>>>>>>> A) different messages have different IDs
>>>>>>>>>>>> B) the same id is generated if the same message is re-processed
>>>>>>>>>>>> C) equivalent messages have the same ID
>>>>>>>>>>>>
>>>>>>>>>>>> Goal A is needed to ensure that the database can contain every
>>>>>>>>>>>> different message
>>>>>>>>>>>> Goal B is needed to ensure that the database can be reloaded from
>>>>>>>>>>>> the original source if necessary
>>>>>>>>>>>> Goal C is needed to ensure that the database can be reloaded from
>>>>>>>>>>>> an equivalent source, and to ensure that Permalinks are stable.
>>>>>>>>>>>>
>>>>>>>>>>>> None of the current id generator algorithms meet all of the above
>>>>>>>>>>>> goals.
>>>>>>>>>>>>
>>>>>>>>>>>> The original and medium generators fail to meet goal A.
>>>>>>>>>>>> The full generator fails to meet goal B (and therefore C).
>>>>>>>>>>>>
>>>>>>>>>>>> A sender can easily generate two messages with identical content;
>>>>>>>>>>>> it is important to distinguish these.
>>>>>>>>>>>>
>>>>>>>>>>>> The Message-Id should help here.
>>>>>>>>>>>>
>>>>>>>>>>>> Message-ID is supposed to be unique, but in practice it may not be,
>>>>>>>>>>>> so some additional fields need to be used to create the database id.
>>>>>>>>>>>>
>>>>>>>>>>>> For mailing lists the Return-Path will normally contain a unique id
>>>>>>>>>>>> which is used to identify bounces.
>>>>>>>>>>>> In theory this might be sufficient on its own. Indeed the path
>>>>>>>>>>>> might be usable without hashing. However the early ASF mailing list
>>>>>>>>>>>> software did not use unique Return-Paths.
>>>>>>>>>>>>
>>>>>>>>>>>> The existing mod_mbox solution uses a combination of Message-Id
>>>>>>>>>>>> plus YYYYMM plus a list identifier. Have there ever been any
>>>>>>>>>>>> collisions?
>>>>>>>>>>>
>>>>>>>>>>> It's certainly possible for mod_mbox to contain multiple messages
>>>>>>>>>>> with the same id.
>>>>>>>>>>>
>>>>>>>>>>> For example:
>>>>>>>>>>>
>>>>>>>>>>> From dev-return-15575-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org Tue Jun 7 20:08:04 2016
>>>>>>>>>>> and
>>>>>>>>>>> From dev-return-15577-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org Tue Jun 7 20:24:24 2016
>>>>>>>>>>>
>>>>>>>>>>> both have the same id:
>>>>>>>>>>>
>>>>>>>>>>> Message-Id: <bcd87d1a-bfb6-4f2a-92cc-af8f1d16b...@apache.org>
>>>>>>>>>>>
>>>>>>>>>>> It's exactly the same message which arrived twice in the same
>>>>>>>>>>> mailing list a few seconds apart.
>>>>>>>>>>> Maybe it was sent using Bcc as well as To: ?
>>>>>>>>>>>
>>>>>>>>>>> It's not exactly a collision, but at present mod_mbox is able to
>>>>>>>>>>> store both whereas Pony Mail cannot (except if using the full
>>>>>>>>>>> generator, which has other problems)
>>>>>>>>>>>
>>>>>>>>>>> I think it's important to store the full message history in the
>>>>>>>>>>> database.
>>>>>>>>>>> For example, if one of the messages bounces, it would be odd if the
>>>>>>>>>>> source of the bounce were not in the database.
>>>>>>>>>>> Also the message sequences will be incomplete.
>>>>>>>>>>> This is the case for lists.a.o, the mbox
>>>>>>>>>>>
>>>>>>>>>>> https://lists.apache.org/api/mbox.lua?list=d...@airavata.apache.org&date=2016-6
>>>>>>>>>>>
>>>>>>>>>>> does not have the message sequence number 15575
>>>>>>>>>>>
>>>>>>>>>>>> How about using the following:
>>>>>>>>>>>>
>>>>>>>>>>>> Message-Id
>>>>>>>>>>>> Date
>>>>>>>>>>>> Return-Path
>>>>>>>>>>>
>>>>>>>>>>> In this case the return-path can be used to distinguish the
>>>>>>>>>>> messages.
>>>>>>>>>>
>>>>>>>>>> Note that the path depends on where the message is stored:
>>>>>>>>>>
>>>>>>>>>> Return-Path: <dev-return-15577-archive-asf-public=cust-asf.ponee...@airavata.apache.org>
>>>>>>>>>> as against
>>>>>>>>>> Return-Path: <dev-return-15577-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org>
>>>>>>>>>>
>>>>>>>>>> So it will need adjusting to extract the parts that are the same for
>>>>>>>>>> all message sources, whether that is mod_mbox or Pony Mail or sent to
>>>>>>>>>> another subscriber.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> How about we:
>>>>>>>>> 1) select a larger list of headers that we know to be the same across
>>>>>>>>> the same email sent to different MTAs, and combine them for the ID
>>>>>>>>> generation.
>>>>>>>>
>>>>>>>> Sort of.
>>>>>>>>
>>>>>>>> AFAICT the *only* value that can distinguish multiple postings (of
>>>>>>>> which there seem to be quite a lot) is the mailing list sequence
>>>>>>>> number, which in ezmlm is only available in the Return-Path.
>>>>>>>>
>>>>>>>> However the sequence could perhaps be reset - so it seems risky to
>>>>>>>> rely on that plus the list id alone.
>>>>>>>> Also the earliest messages did not have sequence numbers.
>>>>>>>>
>>>>>>>> So there needs to be another way to identify distinct messages.
>>>>>>>> In theory, that is the Message-Id.
>>>>>>>> Even if it is not completely unique, it should be OK in combination
>>>>>>>> with the sequence number.
>>>>>>>>
>>>>>>>> If the Message-Id does not exist or is very poor, then I think the
>>>>>>>> only solution is to look at most - if not all - the parts of a message
>>>>>>>> that a user can vary.
>>>>>>>> This is what the full id generation does, except that it also includes
>>>>>>>> the MTA-specific headers, causing problems with repeatability.
>>>>>>>>
>>>>>>>> There is another aspect to this. At present the generation is done
>>>>>>>> using the parsed mail, rather than the original.
>>>>>>>> If there is any chance that the format can change between releases of
>>>>>>>> the library, then this may destroy repeatability.
>>>>>>>> Likewise if a different library is ever used.
>>>>>>>> We may need to additionally ensure that all headers are in a canonical
>>>>>>>> form before use, in case MTAs decide to vary the layout.
>>>>>>>>
>>>>>>>>> 2) Store some more headers in the doc :)
>>>>>>>>
>>>>>>>> Those might be useful for searching, but each posting needs its own
>>>>>>>> unique id otherwise it cannot be stored in the first place.
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The format will presumably depend on the mailing list software that
>>>>>>>>>> is used.
>>>>>>>>>>
>>>>>>>>>>>> List-Id
>>>>>>>>>>>
>>>>>>>>>>> I think this must be the original List-Id, not any override.
>>>>>>>>>>> Otherwise there may be problems with permalinks if a list name is
>>>>>>>>>>> updated - the old permalink will no longer work.
>>>>>>>>>>>
>>>>>>>>>>>> Whatever new algorithm is chosen, I think it's important that the
>>>>>>>>>>>> format looks different from the existing ones. e.g. one could drop
>>>>>>>>>>>> the <> around the list id.
>>>>>>>>>
>>>>>>
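Pulling the quoted proposal together, a rough sketch of an id built from Message-Id, Date, a normalised Return-Path and the original List-Id might look like the following. Everything here is an assumption for illustration, not an existing Pony Mail generator: the regex for the ezmlm Return-Path layout, the function names, the choice of SHA-256, and the output format. The recipient parts of the example addresses are replaced with example.org placeholders, and the Message-Id and Date values are hypothetical.

```python
import hashlib
import re

# Assumed ezmlm layout: <list>-return-<seq>-<recipient>@<domain>
EZMLM_RETURN_PATH = re.compile(
    r"^<?(?P<list>[^@<>]+?)-return-(?P<seq>\d+)-[^@<>]*@(?P<domain>[^@<>]+)>?$"
)

def stable_return_path(return_path):
    # Keep only the parts that are the same for every recipient of a posting.
    match = EZMLM_RETURN_PATH.match(return_path.strip())
    if match is None:
        return ""  # e.g. very early messages without a sequence number
    return "{}-return-{}@{}".format(
        match["list"], match["seq"], match["domain"].lower())

def make_mid(message_id, date, return_path, list_id):
    # The Date header is hashed verbatim, so the host time zone plays no part.
    lid = list_id.strip().strip("<>").lower()
    fields = [message_id.strip(), date.strip(),
              stable_return_path(return_path), lid]
    digest = hashlib.sha256("\n".join(fields).encode("utf-8")).hexdigest()
    # No <> around the list id, so the format differs from the existing ones.
    return digest + "@" + lid

msg_id = "<1234@example.org>"              # hypothetical
date = "Tue, 7 Jun 2016 20:08:04 +0000"    # hypothetical
lid = "<dev.airavata.apache.org>"
pony = "<dev-return-15577-archive-asf-public=cust-asf.example.org@airavata.apache.org>"
mbox = "<dev-return-15577-apmail-airavata-dev-archive=airavata.example.org@airavata.apache.org>"
dupe = "<dev-return-15575-apmail-airavata-dev-archive=airavata.example.org@airavata.apache.org>"

same_posting = make_mid(msg_id, date, pony, lid) == make_mid(msg_id, date, mbox, lid)
distinct_postings = make_mid(msg_id, date, dupe, lid) != make_mid(msg_id, date, mbox, lid)
```

Under these assumptions the same posting gets one id regardless of which archive it was read from (goals B and C), while the duplicate posting with sequence number 15575 gets its own id (goal A). Messages without a sequence number fall back to Message-Id plus Date, so two identical postings of such a message would still collide.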