On 12/10/2016 12:46 PM, sebb wrote:
> Another issue I have noticed - From_ mangling.
> 
> Exported mbox files must use From_ mangling to be usable as input.
> However the commonest method (mboxo) only mangles a plain From_, so
> one cannot then distinguish From_ from >From_.
> [It looks like the ASF mod_mbox archives use mboxo]
> 
> This means that the live message as received by the archiver may not
> be the same as the unmangled mbox entry.
> 
> The current importer does not unmangle >From_ lines, which means that
> re-importation of a live message will change the MID (and the
> Permalink)

That would only be in the case of using the full source for generating
the ID, right? I don't see the medium/short generators being affected here.

On that note, yeah, we need to figure out how to improve the medium
generator just a smidgen :)
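
To make the mangling problem concrete, here is a minimal sketch. The helper
names are made up for illustration and are not taken from the Pony Mail
importer. It shows why mboxo cannot be reversed reliably, next to the mboxrd
variant (which also escapes lines that already start with ">From "), and how
a sha224 over the full source changes as a result.

```python
import hashlib
import re


def mangle_mboxo(body: str) -> str:
    # mboxo: only lines starting with a plain "From " get a ">" prefix.
    return re.sub(r"^From ", ">From ", body, flags=re.MULTILINE)


def unmangle_mboxo(body: str) -> str:
    # Strips one ">" from every ">From " line, including lines where the
    # sender really did write ">From ".
    return re.sub(r"^>From ", "From ", body, flags=re.MULTILINE)


def mangle_mboxrd(body: str) -> str:
    # mboxrd: any line matching ">*From " gets one more ">", so it is reversible.
    return re.sub(r"^(>*From )", r">\1", body, flags=re.MULTILINE)


def unmangle_mboxrd(body: str) -> str:
    return re.sub(r"^>(>*From )", r"\1", body, flags=re.MULTILINE)


def full_source_id(body: str) -> str:
    # Stand-in for an id derived from the full source (assumption: sha224).
    return hashlib.sha224(body.encode("utf-8")).hexdigest()


original = "From here on it works\n>From the sender's own text\n"
roundtrip_o = unmangle_mboxo(mangle_mboxo(original))
roundtrip_rd = unmangle_mboxrd(mangle_mboxrd(original))

print(roundtrip_o == original)   # mboxo loses the sender's ">From " line
print(roundtrip_rd == original)  # mboxrd restores the original text
print(full_source_id(roundtrip_o) == full_source_id(original))
```

An id built only from parsed header fields would not see this difference,
which is why only a full-source hash should be affected.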

> Likewise once the importer is fixed to do the unmangling,
> re-importation of previously imported messages will change the MID
> (and the Permalink)
> There will still be the issue of >From_ lines in original messages,
> but that should hopefully only affect a few messages (quoted lines
> have a space after the >)
> 
> 
> On 8 December 2016 at 13:25, sebb <seb...@gmail.com> wrote:
>> I have just discovered that there are some duplicate ezmlm message
>> index numbers in the announce@a.o archives.
>> This means we cannot rely on the number being unique, though the From
>> line is almost certainly unique if the date is included.
>> But unless the same data is present in all copies of the message,
>> hashes derived from it won't be stable.
>>
>> 200103.mbox:From
>> announce-return-32-apmail-announce-archive=apache....@apache.org Fri
>> Mar 09 20:26:49 2001
>> 200103.mbox:From
>> announce-return-33-apmail-announce-archive=apache....@apache.org Mon
>> Mar 12 12:19:04 2001
>> 200103.mbox:From
>> announce-return-34-apmail-announce-archive=apache....@apache.org Mon
>> Mar 12 19:09:39 2001
>> 200103.mbox:From
>> announce-return-35-apmail-announce-archive=apache....@apache.org Wed
>> Mar 28 17:46:40 2001
>> 200104.mbox:From
>> announce-return-36-apmail-announce-archive=apache....@apache.org Mon
>> Apr 09 19:25:39 2001
>> 200105.mbox:From
>> announce-return-37-apmail-announce-archive=apache....@apache.org Tue
>> May 01 21:14:21 2001
>> 200105.mbox:From
>> announce-return-38-apmail-announce-archive=apache....@apache.org Tue
>> May 01 21:59:08 2001
>> 200105.mbox:From
>> announce-return-39-apmail-announce-archive=apache....@apache.org Sat
>> May 12 22:11:46 2001
>> 200105.mbox:From
>> announce-return-40-apmail-announce-archive=apache....@apache.org Mon
>> May 21 22:51:04 2001
>> 200105.mbox:From
>> announce-return-41-apmail-announce-archive=apache....@apache.org Tue
>> May 22 17:12:40 2001
>> 200105.mbox:From
>> announce-return-42-apmail-announce-archive=apache....@apache.org Thu
>> May 31 06:14:01 2001
>> 200106.mbox:From
>> announce-return-43-apmail-announce-archive=apache....@apache.org Fri
>> Jun 01 14:31:52 2001
>> 200107.mbox:From
>> announce-return-44-apmail-announce-archive=apache....@apache.org Tue
>> Jul 17 18:56:30 2001
>> 200110.mbox:From
>> announce-return-45-apmail-announce-archive=apache....@apache.org Sun
>> Oct 14 03:23:08 2001
>> 200111.mbox:From
>> announce-return-46-apmail-announce-archive=apache....@apache.org Fri
>> Nov 16 00:13:49 2001
>>
>> 200210.mbox:From
>> announce-return-32-apmail-announce-archive=apache....@apache.org Thu
>> Oct 03 19:20:20 2002
>> 200210.mbox:From
>> announce-return-33-apmail-announce-archive=apache....@apache.org Fri
>> Oct 04 19:01:01 2002
>> 200211.mbox:From
>> announce-return-34-apmail-announce-archive=apache....@apache.org Tue
>> Nov 05 20:29:51 2002
>> 200211.mbox:From
>> announce-return-35-apmail-announce-archive=apache....@apache.org Wed
>> Nov 27 19:39:07 2002
>> 200301.mbox:From
>> announce-return-36-apmail-announce-archive=apache....@apache.org Mon
>> Jan 20 23:34:23 2003
>> 200301.mbox:From
>> announce-return-37-apmail-announce-archive=apache....@apache.org Sun
>> Jan 26 08:21:21 2003
>> 200302.mbox:From
>> announce-return-38-apmail-announce-archive=apache....@apache.org Mon
>> Feb 24 16:35:58 2003
>> 200303.mbox:From
>> announce-return-39-apmail-announce-archive=apache....@apache.org Mon
>> Mar 03 15:44:33 2003
>> 200303.mbox:From
>> announce-return-40-apmail-announce-archive=apache....@apache.org Mon
>> Mar 17 15:55:44 2003
>> 200305.mbox:From
>> announce-return-41-apmail-announce-archive=apache....@apache.org Wed
>> May 28 16:40:34 2003
>> 200307.mbox:From
>> announce-return-42-apmail-announce-archive=apache....@apache.org Wed
>> Jul 09 12:06:06 2003
>> 200307.mbox:From
>> announce-return-43-apmail-announce-archive=apache....@apache.org Fri
>> Jul 18 13:48:09 2003
>> 200308.mbox:From
>> announce-return-44-apmail-announce-archive=apache....@apache.org Tue
>> Aug 05 16:10:28 2003
>> 200308.mbox:From
>> announce-return-45-apmail-announce-archive=apache....@apache.org Wed
>> Aug 13 10:26:40 2003
>> 200308.mbox:From
>> announce-return-46-apmail-announce-archive=apache....@apache.org Fri
>> Aug 15 10:21:03 2003
>>
>> I don't know what happened to create duplicates partway through the sequence.
>> The sequence proper started here:
>>
>> 200201.mbox:From
>> announce-return-1-apmail-announce-archive=apache....@apache.org Fri
>> Jan 11 20:30:21 2002
>>
>> Prior to Jan 2002 AFAICT there are only numbers 32-46 as shown above
>> in the 2001 archives.
>>
>>
>> On 16 November 2016 at 21:50, sebb <seb...@gmail.com> wrote:
>>> Just discovered an issue which affects the Permalinks.
>>>
>>> If the archiver msgbody() function fails to detect a text message
>>> body, it will return null.
>>> If the message has an attachment, then the archiver will generate an
>>> entry for it.
>>> However the short and medium id generators will fail when accessing the body.
>>>
>>> The mid will then revert to the previously calculated value.
>>> This is derived from the list id and the 'archived-at' header.
>>> However the archived-at header is added by the archiver itself if necessary.
>>> So if the message ever needs to be reloaded from elsewhere, a new
>>> mid/Permalink value will likely be generated.
>>> The Permalink will stop working unless the original entry is kept.
>>>
>>> Unfortunately the fallback mids have exactly the same format as the
>>> medium generator (sha224@lid).
>>> This makes it impossible to determine which Permalinks are affected
>>> from the link alone.
>>> However any existing mbox entries which have body: null will be using
>>> the fallback mid.
>>>
>>>
>>> On 25 October 2016 at 19:50, sebb <seb...@gmail.com> wrote:
>>>> Unfortunately it appears that Message-Ids are not always generated
>>>> (*), so there needs to be an equivalent that can be used.
>>>> Furthermore, some emails may have more than one Message-Id [+]
>>>>
>>>> The original message fields (including full body) are not enough to
>>>> uniquely id a message in the mailing list stream.
>>>> For that one needs something like a sequence number.
>>>> However some early mailboxes don't include the ezmlm sequence number.
>>>>
>>>> However AFAIK every message that reaches a mailing list must have been
>>>> delivered to the mailserver mailbox.
>>>> So the mailing list message *instance* should be identifiable using
>>>> either the sequence number, or failing that, some or all of the
>>>> routing information by which it arrived at the mailbox.
>>>>
>>>> And one can identify the mail *content* using the Message-Id if
>>>> present (failing that, a hash of the full e-mail).
>>>>
>>>> This needs some more work and some worked example messages.
>>>>
>>>> Sorry to keep going on about this, but AFAICT currently Pony Mail does
>>>> not have an algorithm that satisfies all the requirements for use as
>>>> an ES id.
>>>>
>>>> (*) Automated services and older e-mails don't always have them.
>>>> Such mails don't appear in mod_mbox listings though they are present
>>>> in the raw mbox files.
>>>> [+] mod_mbox appears to use the first Message-Id it finds.
>>>>
>>>> On 22 October 2016 at 23:02, sebb <seb...@gmail.com> wrote:
>>>>> On 18 October 2016 at 10:49, Daniel Gruno <humbed...@apache.org> wrote:
>>>>>> On 10/18/2016 11:43 AM, sebb wrote:
>>>>>>> It just occurs to me that there is another aspect to be considered:
>>>>>>> list renaming.
>>>>>>> This might affect both the unique id and Permalinks.
>>>>>>>
>>>>>>> If the list id is embedded in the Permalink hash, I think it would be
>>>>>>> very difficult to honour existing links.
>>>>>>>
>>>>>>> This is in contrast with the link format as used by mod_mbox, where
>>>>>>> the list id is exposed in the URL, for example:
>>>>>>>
>>>>>>> http://mail-archives.apache.org/mod_mbox/incubator-any23-dev/201209.mbox/<message-id>
>>>>>>> http://mail-archives.apache.org/mod_mbox/any23-dev/201209.mbox/<message-id>
>>>>>>>
>>>>>>> It would be easy enough to redirect one URL to the other - if necessary.
>>>>>>> (In this case, the original lists have been retained)
>>>>>>
>>>>>> Generally not an issue.
>>>>>> When you rename a list in Pony Mail, you keep the original ID and
>>>>>> source; you only touch the list value in Elasticsearch. Permalinks will
>>>>>> remain the same, but point to the new list name (and keep the original
>>>>>> in the source).
>>>>>
>>>>> I see, that's probably OK.
>>>>>
>>>>>> There are of course edge cases where you have to reimport and you force
>>>>>> a list ID on the command line, which may change the generated ID.
>>>>>
>>>>> That could be a problem both for the ES id and the Permalink.
>>>>> If the id uses the new LID, it will result in a new MID.
>>>>> If the old message is still present, there will be two copies.
>>>>> If the old message is deleted, the Permalink will break.
>>>>>
>>>>> I think we do need to allow for re-imports.
>>>>> This makes generation of the MID tricky.
>>>>>
>>>>>> If we scrap the list id for a moment, we are left with, at least, the
>>>>>> following constants:
>>>>>>
>>>>>> Sender
>>>>>> Date
>>>>>> Subject
>>>>>> Body
>>>>>
>>>>> (I assume this includes the attachments)
>>>>>
>>>>>> Message ID
>>>>>>
>>>>>> The problem with the above is that an email with the same values in all
>>>>>> fields may have been sent to multiple lists, so ideally we'd keep
>>>>>> separate copies for each list... which gets tricky without adding the
>>>>>> list ID to this. A workaround would be to force the ID generator to
>>>>>> always use the original list ID in the source for generating the ID, and
>>>>>> only use the forced list ID if no list was found.
>>>>>
>>>>> The identical message may be sent to the same list twice.
>>>>> I've seen this already on some of the lists (I think I mentioned this already).
>>>>> It looks like a mail client sometimes duplicates the message.
>>>>> Maybe it can also occur if the destination list has an alias and the
>>>>> message is sent to the alias as well.
>>>>>
>>>>> So the list id is not sufficient to identify the specific message.
>>>>>
>>>>> However mailing list software has to identify bounces.
>>>>> This is usually done by adding a unique id to the Return-Path.
>>>>> In the case of ezmlm this is a sequence number.
>>>>>
>>>>> That will be present in mails sent to subscribers, and is present in
>>>>> the mbox files (apart from some very early ones).
>>>>>
>>>>> If present, it should be a constant for a particular message on a
>>>>> specific mailing list.
>>>>>
>>>>>> Thoughts?
>>>>>
>>>>> The MID needs to be unique to ensure that all messages can be stored OK in ES.
>>>>> Unwanted duplicates cause message loss.
>>>>>
>>>>> The Permalink needs to be persistent.
>>>>> If there are multiple Permalinks for the same message that is not a problem.
>>>>> If a single Permalink applies to multiple messages, that is not ideal,
>>>>> but at least there is a chance of recovering the message.
>>>>> But if a Permalink disappears, then it is a big problem.
>>>>>
>>>>> Although the mod_mbox solution is not perfect, its Permalinks have the
>>>>> advantage that the Message-ID and an idea of the List-name can be got
>>>>> from the link.
>>>>> This means that there is a good chance of being able to find the
>>>>> matching message(s) in almost any mail archive, should the originals
>>>>> be unavailable.
>>>>>
>>>>> With the current Pony Mail links, if a message is lost from the
>>>>> database, AFAICT the only way to recover it is to re-import all the
>>>>> messages and hope the same ids are re-generated.
>>>>> Since there have been at least two different generator algorithms
>>>>> used, and the generator was/is sensitive to the host time zone, this
>>>>> will likely be a lengthy process.
>>>>> The long links do at least include the list-id, which would reduce the
>>>>> work somewhat.
>>>>>
>>>>> So I think the Permalink needs to be similar to the current mod_mbox solution.
>>>>> The MID does not need to be directly related, so long as it is unique
>>>>> and stable.
>>>>>
>>>>>>>
>>>>>>> How can list renaming be managed in Pony Mail?
>>>>>>> Or is it not allowed?
>>>>>>>
>>>>>>>
>>>>>>> On 14 October 2016 at 00:21, sebb <seb...@gmail.com> wrote:
>>>>>>>> On 13 October 2016 at 23:14, Daniel Gruno <humbed...@apache.org> wrote:
>>>>>>>>> On 10/14/2016 12:12 AM, sebb wrote:
>>>>>>>>>> On 13 October 2016 at 21:28, sebb <seb...@gmail.com> wrote:
>>>>>>>>>>> On 7 October 2016 at 00:44, sebb <seb...@gmail.com> wrote:
>>>>>>>>>>>> The id generator is used to create a key for the message database, and
>>>>>>>>>>>> also to create a Permalink.
>>>>>>>>>>>>
>>>>>>>>>>>> Therefore, an id generator needs to fulfil the following design goals
>>>>>>>>>>>> as a minimum:
>>>>>>>>>>>> A) different messages have different IDs
>>>>>>>>>>>> B) the same id is generated if the same message is re-processed
>>>>>>>>>>>> C) equivalent messages have the same ID
>>>>>>>>>>>>
>>>>>>>>>>>> Goal A is needed to ensure that the database can contain every
>>>>>>>>>>>> different message.
>>>>>>>>>>>> Goal B is needed to ensure that the database can be reloaded from the
>>>>>>>>>>>> original source if necessary.
>>>>>>>>>>>> Goal C is needed to ensure that the database can be reloaded from an
>>>>>>>>>>>> equivalent source, and to ensure that Permalinks are stable.
>>>>>>>>>>>>
>>>>>>>>>>>> None of the current id generator algorithms meet all of the above goals.
>>>>>>>>>>>>
>>>>>>>>>>>> The original and medium generators fail to meet goal A.
>>>>>>>>>>>> The full generator fails to meet goal B (and therefore C).
>>>>>>>>>>>>
>>>>>>>>>>>> A sender can easily generate two messages with identical content; it
>>>>>>>>>>>> is important to distinguish these.
>>>>>>>>>>>>
>>>>>>>>>>>> The Message-Id should help here.
>>>>>>>>>>>>
>>>>>>>>>>>> Message-ID is supposed to be unique, but in practice it may not be, so
>>>>>>>>>>>> some additional fields need to be used to create the database id.
>>>>>>>>>>>>
>>>>>>>>>>>> For mailing lists the Return-Path will normally contain a unique id
>>>>>>>>>>>> which is used to identify bounces.
>>>>>>>>>>>> In theory this might be sufficient on its own. Indeed the path might
>>>>>>>>>>>> be usable without hashing. However the early ASF mailing list software
>>>>>>>>>>>> did not use unique Return-Paths.
>>>>>>>>>>>>
>>>>>>>>>>>> The existing mod_mbox solution uses a combination of Message-Id plus
>>>>>>>>>>>> YYYYMM plus a list identifier. Have there ever been any collisions?
>>>>>>>>>>>
>>>>>>>>>>> It's certainly possible for mod_mbox to contain multiple messages with
>>>>>>>>>>> the same id.
>>>>>>>>>>>
>>>>>>>>>>> For example:
>>>>>>>>>>>
>>>>>>>>>>> From dev-return-15575-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org Tue Jun  7 20:08:04 2016
>>>>>>>>>>> and
>>>>>>>>>>> From dev-return-15577-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org Tue Jun  7 20:24:24 2016
>>>>>>>>>>>
>>>>>>>>>>> both have the same id:
>>>>>>>>>>>
>>>>>>>>>>> Message-Id: <bcd87d1a-bfb6-4f2a-92cc-af8f1d16b...@apache.org>
>>>>>>>>>>>
>>>>>>>>>>> It's exactly the same message which arrived twice in the same mailing
>>>>>>>>>>> list a few seconds apart.
>>>>>>>>>>> Maybe it was sent using Bcc as well as To: ?
>>>>>>>>>>>
>>>>>>>>>>> It's not exactly a collision, but at present mod_mbox is able to store
>>>>>>>>>>> both whereas Pony Mail cannot (except if using the full generator,
>>>>>>>>>>> which has other problems)
>>>>>>>>>>>
>>>>>>>>>>> I think it's important to store the full message history in the database.
>>>>>>>>>>> For example, if one of the messages bounces, it would be odd if the
>>>>>>>>>>> source of the bounce were not in the database.
>>>>>>>>>>> Also the message sequences will be incomplete.
>>>>>>>>>>> This is the case for lists.a.o, the mbox
>>>>>>>>>>>
>>>>>>>>>>> https://lists.apache.org/api/mbox.lua?list=d...@airavata.apache.org&date=2016-6
>>>>>>>>>>>
>>>>>>>>>>> does not have the message sequence number 15575
>>>>>>>>>>>
>>>>>>>>>>>> How about using the following:
>>>>>>>>>>>>
>>>>>>>>>>>> Message-Id
>>>>>>>>>>>> Date
>>>>>>>>>>>> Return-Path
>>>>>>>>>>>
>>>>>>>>>>> In this case the return-path can be used to distinguish the messages.
>>>>>>>>>>
>>>>>>>>>> Note that the path depends on where the message is stored:
>>>>>>>>>>
>>>>>>>>>> Return-Path: <dev-return-15577-archive-asf-public=cust-asf.ponee...@airavata.apache.org>
>>>>>>>>>> as against
>>>>>>>>>> Return-Path: <dev-return-15577-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org>
>>>>>>>>>>
>>>>>>>>>> So it will need adjusting to extract the parts that are the same for
>>>>>>>>>> all message sources, whether that is mod_mbox or Pony Mail or sent to
>>>>>>>>>> another subscriber.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> How about we:
>>>>>>>>> 1) select a larger list of headers that we know to be the same across
>>>>>>>>> the same email sent to different MTAs, and combine them for the ID
>>>>>>>>> generation.
>>>>>>>>
>>>>>>>> Sort of.
>>>>>>>>
>>>>>>>> AFAICT the *only* value that can distinguish multiple postings (of
>>>>>>>> which there seem to be quite a lot) is the mailing list sequence
>>>>>>>> number, which in ezmlm is only available in the Return-Path.
>>>>>>>>
>>>>>>>> However the sequence could perhaps be reset - so it seems risky to
>>>>>>>> rely on that plus the list id alone.
>>>>>>>> Also the earliest messages did not have sequence numbers.
>>>>>>>>
>>>>>>>> So there needs to be another way to identify distinct messages.
>>>>>>>> In theory, that is the Message-Id.
>>>>>>>> Even if it is not completely unique, it should be OK in combination
>>>>>>>> with the sequence number.
>>>>>>>>
>>>>>>>> If the Message-Id does not exist or is very poor, then I think the
>>>>>>>> only solution is to look at most - if not all - of the parts of a message
>>>>>>>> that a user can vary.
>>>>>>>> This is what the full id generation does, except that it also includes
>>>>>>>> the MTA-specific headers, causing problems with repeatability.
>>>>>>>>
>>>>>>>> There is another aspect to this. At present the generation is done
>>>>>>>> using the parsed mail, rather than the original.
>>>>>>>> If there is any chance that the format can change between releases of
>>>>>>>> the library, then this may destroy repeatability.
>>>>>>>> Likewise if a different library is ever used.
>>>>>>>> We may need to additionally ensure that all headers are in a canonical
>>>>>>>> form before use, in case MTAs decide to vary the layout.
>>>>>>>>
>>>>>>>>> 2) Store some more headers in the doc :)
>>>>>>>>
>>>>>>>> Those might be useful for searching, but each posting needs its own
>>>>>>>> unique id otherwise it cannot be stored in the first place.
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The format will presumably depend on the mailing list software that is used.
>>>>>>>>>>
>>>>>>>>>>>> List-Id
>>>>>>>>>>>
>>>>>>>>>>> I think this must be the original List-Id, not any override.
>>>>>>>>>>> Otherwise there may be problems with permalinks if a list name is
>>>>>>>>>>> updated - the old permalink will no longer work.
>>>>>>>>>>>
>>>>>>>>>>>> Whatever new algorithm is chosen, I think it's important that the
>>>>>>>>>>>> format looks different from the existing ones. e.g. one could drop 
>>>>>>>>>>>> the
>>>>>>>>>>>> <> around the list id.
>>>>>>>>>
>>>>>>
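
For the Return-Path point quoted further up (the path differs per recipient,
so only the parts common to all copies can be used), a rough sketch of the
extraction could look like the following. It assumes the ezmlm layout seen
in the thread, <list>-return-<seq>-<recipient part>@<domain>. The function
name, the tuple type and the example addresses are all invented for
illustration; none of this is existing Pony Mail code.

```python
import re
from typing import NamedTuple, Optional


class ListInstance(NamedTuple):
    list_name: str
    sequence: int
    domain: str


# Assumed ezmlm layout: <list>-return-<seq>-<recipient-specific>@<domain>
_RETURN_PATH = re.compile(
    r"^<?(?P<list>.+?)-return-(?P<seq>\d+)-(?P<rest>[^@]*)@(?P<domain>[^>]+)>?$"
)


def list_instance(return_path: str) -> Optional[ListInstance]:
    """Return the recipient-independent parts of a Return-Path, or None."""
    m = _RETURN_PATH.match(return_path.strip())
    if m is None:
        return None  # e.g. early messages without a sequence number
    return ListInstance(m["list"].lower(), int(m["seq"]), m["domain"].lower())


# Two made-up copies of the same posting, delivered to different subscribers.
a = list_instance("<dev-return-15577-archive=example.org@lists.example.org>")
b = list_instance("<dev-return-15577-someone=example.net@lists.example.org>")
print(a == b)
print(list_instance("<someone@example.org>"))
```

As noted above for announce@a.o, sequence numbers can repeat and are missing
from the earliest mail, so a value like this would need to be combined with
the Message-Id or the From_ date rather than used on its own.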
