Re: Better Id generator

sebb Wed, 16 Nov 2016 13:51:16 -0800

Just discovered an issue which affects the Permalinks.

If the archiver msgbody() function fails to detect a text message
body, it will return null.
If the message has an attachment, then the archiver will generate an
entry for it.
However the short and medium id generators will fail when accessing the body.


The mid will then revert to the previously calculated value.
This is derived from the list id and the 'archived-at' header.
However the archived-at header is added by the archiver itself if necessary.
So if the message ever needs to be reloaded from elsewhere, a new
mid/Permalink value will likely be generated.
The Permalink will stop working unless the original entry is kept.
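
To illustrate the instability, here is a minimal sketch. It is not the actual
archiver code: the function name fallback_mid, the way the two inputs are
concatenated, and the sample header values are all assumptions made for the
example. It only shows that an id hashed from the list id plus an
archiver-supplied archived-at value changes as soon as that value is
regenerated.

```python
import hashlib


def fallback_mid(lid: str, archived_at: str) -> str:
    """Hypothetical fallback id: SHA-224 of list id + archived-at, then @lid."""
    digest = hashlib.sha224(f"{lid}{archived_at}".encode("utf-8")).hexdigest()
    return f"{digest}@{lid}"


lid = "<dev.example.org>"  # made-up list id

# First import: the archiver adds its own archived-at value (placeholder).
first = fallback_mid(lid, "archived-at-value-from-first-import")

# Reload of the same message from elsewhere: a fresh value is added.
second = fallback_mid(lid, "archived-at-value-from-second-import")

# The two ids differ, so a Permalink built from the first one no longer
# resolves unless the original entry is kept.
print(first == second)
```

If the archived-at value came from the original message rather than from the
archiver, both calls would see the same input and the id would not change.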

Unfortunately the fallback mids have exactly the same format as the
medium generator (sha224@lid).
This makes it impossible to determine which Permalinks are affected
from the link alone.
However any existing mbox entries which have body: null will be using
the fallback mid.
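
A rough sketch of that check, assuming the entries have been exported as JSON
documents with "mid" and "body" fields (the field names, the sample documents
and the helper name affected_mids are assumptions for the example, not the
real index layout):

```python
import json


def affected_mids(entries):
    """Return the mids of entries whose body is explicitly null."""
    return [e["mid"] for e in entries if "body" in e and e["body"] is None]


# Made-up sample documents; JSON null is decoded to Python None.
sample = json.loads("""
[
  {"mid": "aaa@<dev.example.org>", "body": "Hello list"},
  {"mid": "bbb@<dev.example.org>", "body": null},
  {"mid": "ccc@<dev.example.org>"}
]
""")

# Only the entry with "body": null is reported.
print(affected_mids(sample))
```

Entries with no body field at all are deliberately not reported here, since
only body: null is known to indicate the fallback.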


On 25 October 2016 at 19:50, sebb <[email protected]> wrote:
> Unfortunately it appears that Message-Ids are not always generated
> (*), so there needs to be an equivalent that can be used.
> Furthermore, some emails may have more than one Message-Id [+]
>
> The original message fields (including full body) are not enough to
> uniquely id a message in the mailing list stream.
> For that one needs something like a sequence number.
> However some early mailboxes don't include the ezmlm sequence number.
>
> However AFAIK every message that reaches a mailing list must have been
> delivered to the mailserver mailbox.
> So the mailing list message *instance* should be identifiable using
> either the sequence number, or failing that, some or all of the
> routing information by which it arrived at the mailbox.
>
> And one can identify the mail *content* using the Message-Id if
> present (failing that, a hash of the full e-mail).
>
> This needs some more work and some worked example messages.
>
> Sorry to keep going on about this, but AFAICT currently Pony Mail does
> not have an algorithm that satisfies all the requirements for use as
> an ES id.
>
> (*) Automated services and older e-mails don't always have them.
> Such mails don't appear in mod_mbox listings though they are present
> in the raw mbox files.
> [+] mod_mbox appears to use the first Message-Id it finds.
>
> On 22 October 2016 at 23:02, sebb <[email protected]> wrote:
>> On 18 October 2016 at 10:49, Daniel Gruno <[email protected]> wrote:
>>> On 10/18/2016 11:43 AM, sebb wrote:
>>>> It just occurs to me that there is another aspect to be considered:
>>>> list renaming.
>>>> This might affect both the unique id and Permalinks.
>>>>
>>>> If the list id is embedded in the Permalink hash, I think it would be
>>>> very difficult to honour existing links.
>>>>
>>>> This is in contrast with the link format as used by mod_mbox, where
>>>> the list id is exposed in the URL, for example:
>>>>
>>>> http://mail-archives.apache.org/mod_mbox/incubator-any23-dev/201209.mbox/<message-id>
>>>> http://mail-archives.apache.org/mod_mbox/any23-dev/201209.mbox/<message-id>
>>>>
>>>> It would be easy enough to redirect one URL to the other - if necessary.
>>>> (In this case, the original lists have been retained)
>>>
>>> Generally not an issue.
>>> When you rename a list in Pony Mail, you keep the original ID and
>>> source; you only touch the list value in Elasticsearch. Permalinks will
>>> remain the same, but point to the new list name (and keep the original
>>> in the source).
>>
>> I see, that's probably OK.
>>
>>> There are of course edge cases where you have to reimport and you force
>>> a list ID on the command line, which may change the generated ID.
>>
>> That could be a problem both for the ES id and the Permalink.
>> If the id uses the new LID, it will result in a new MID.
>> If the old message is still present, there will be two copies.
>> If the old message is deleted, the Permalink will break.
>>
>> I think we do need to allow for re-imports.
>> This makes generation of the MID tricky.
>>
>>> If we scrap the list id for a moment, we are left with, at least, the
>>> following constants:
>>>
>>> Sender
>>> Date
>>> Subject
>>> Body
>>
>> (I assume this includes the attachments)
>>
>>> Message ID
>>>
>>> The problem with the above is that an email with the same values in all
>>> fields may have been sent to multiple lists, so ideally we'd keep
>>> separate copies for each list....which gets tricky without adding the
>>> list ID to this. A workaround would be to force the ID generator to
>>> always use the original list ID in the source for generating the ID, and
>>> only use the forced list ID if no list was found.
>>
>> The identical message may be sent to the same list twice.
>> I've seen this already on some of the lists (I think I mentioned this
>> already).
>> It looks like a mail client sometimes duplicates the message.
>> Maybe it can also occur if the destination list has an alias and the
>> message is sent to the alias as well.
>>
>> So the list id is not sufficient to identify the specific message.
>>
>> However mailing list software has to identify bounces.
>> This is usually done by adding a unique id to the Return-Path.
>> In the case of ezmlm this is a sequence number.
>>
>> That will be present in mails sent to subscribers, and is present in
>> the mbox files (apart from some very early ones).
>>
>> If present, it should be a constant for a particular message on a
>> specific mailing list.
>>
>>> Thoughts?
>>
>> The MID needs to be unique to ensure that all messages can be stored OK
>> in ES.
>> Unwanted duplicates cause message loss.
>>
>> The Permalink needs to be persistent.
>> If there are multiple Permalinks for the same message that is not a problem.
>> If a single Permalink applies to multiple messages, that is not ideal,
>> but at least there is a chance of recovering the message.
>> But if a Permalink disappears, then it is a big problem.
>>
>> Although the mod_mbox solution is not perfect, its Permalinks have the
>> advantage that the Message-ID and an idea of the List-name can be got
>> from the link.
>> This means that there is good chance of being able to find the
>> matching message(s) in almost any mail archive, should the originals
>> be unavailable.
>>
>> With the current Pony Mail links, if a message is lost from the
>> database, AFAICT the only way to recover it is to re-import all the
>> messages and hope the same ids are re-generated.
>> Since there have been at least two different generator algorithms
>> used, and the generator was/is sensitive to the host time zone, this
>> will likely be a lengthy process.
>> The long links do at least include the list-id, which would reduce the
>> work somewhat.
>>
>> So I think the Permalink needs to be similar to the current mod_mbox
>> solution.
>> The MID does not need to be directly related, so long as it is unique
>> and stable.
>>
>>>>
>>>> How can list renaming be managed in Pony Mail?
>>>> Or is it not allowed?
>>>>
>>>>
>>>> On 14 October 2016 at 00:21, sebb <[email protected]> wrote:
>>>>> On 13 October 2016 at 23:14, Daniel Gruno <[email protected]> wrote:
>>>>>> On 10/14/2016 12:12 AM, sebb wrote:
>>>>>>> On 13 October 2016 at 21:28, sebb <[email protected]> wrote:
>>>>>>>> On 7 October 2016 at 00:44, sebb <[email protected]> wrote:
>>>>>>>>> The id generator is used to create a key for the message database, and
>>>>>>>>> also to create a Permalink.
>>>>>>>>>
>>>>>>>>> Therefore, an id generator needs to fulfil the following design goals
>>>>>>>>> as a minimum:
>>>>>>>>> A) different messages have different IDs
>>>>>>>>> B) the same id is generated if the same message is re-processed
>>>>>>>>> C) equivalent messages have the same ID
>>>>>>>>>
>>>>>>>>> Goal A is needed to ensure that the database can contain every
>>>>>>>>> different message
>>>>>>>>> Goal B is needed to ensure that the database can be reloaded from the
>>>>>>>>> original source if necessary
>>>>>>>>> Goal C is needed to ensure that the database can be reloaded from an
>>>>>>>>> equivalent source, and to ensure that Permalinks are stable.
>>>>>>>>>
>>>>>>>>> None of the current id generator algorithms meet all of the above
>>>>>>>>> goals.
>>>>>>>>>
>>>>>>>>> The original and medium generators fail to meet goal A.
>>>>>>>>> The full generator fails to meet goal B (and therefore C).
>>>>>>>>>
>>>>>>>>> A sender can easily generate two messages with identical content; it
>>>>>>>>> is important to distinguish these.
>>>>>>>>>
>>>>>>>>> The Message-Id should help here.
>>>>>>>>>
>>>>>>>>> Message-ID is supposed to be unique, but in practice it may not be, so
>>>>>>>>> some additional fields need to be used to create the database id.
>>>>>>>>>
>>>>>>>>> For mailing lists the Return-Path will normally contain a unique id
>>>>>>>>> which is used to identify bounces.
>>>>>>>>> In theory this might be sufficient on its own. Indeed the path might
>>>>>>>>> be usable without hashing. However the early ASF mailing list software
>>>>>>>>> did not use unique Return-Paths.
>>>>>>>>>
>>>>>>>>> The existing mod_mbox solution uses a combination of Message-Id plus
>>>>>>>>> YYYYMM plus a list identifier. Have there ever been any collisions?
>>>>>>>>
>>>>>>>> It's certainly possible for mod_mbox to contain multiple messages with
>>>>>>>> the same id.
>>>>>>>>
>>>>>>>> For example:
>>>>>>>>
>>>>>>>> From dev-return-15575-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org Tue Jun  7 20:08:04 2016
>>>>>>>> and
>>>>>>>> From dev-return-15577-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org Tue Jun  7 20:24:24 2016
>>>>>>>>
>>>>>>>> both have the same id:
>>>>>>>>
>>>>>>>> Message-Id: <[email protected]>
>>>>>>>>
>>>>>>>> It's exactly the same message which arrived twice in the same mailing
>>>>>>>> list a few seconds apart.
>>>>>>>> Maybe it was sent using Bcc as well as To: ?
>>>>>>>>
>>>>>>>> It's not exactly a collision, but at present mod_mbox is able to store
>>>>>>>> both whereas Pony Mail cannot (except if using the full generator,
>>>>>>>> which has other problems)
>>>>>>>>
>>>>>>>> I think it's important to store the full message history in the
>>>>>>>> database.
>>>>>>>> For example, if one of the messages bounces, it would be odd if the
>>>>>>>> source of the bounce were not in the database.
>>>>>>>> Also the message sequences will be incomplete.
>>>>>>>> This is the case for lists.a.o: the mbox
>>>>>>>>
>>>>>>>> https://lists.apache.org/api/[email protected]&date=2016-6
>>>>>>>>
>>>>>>>> does not have the message sequence number 15575
>>>>>>>>
>>>>>>>>> How about using the following:
>>>>>>>>>
>>>>>>>>> Message-Id
>>>>>>>>> Date
>>>>>>>>> Return-Path
>>>>>>>>
>>>>>>>> In this case the return-path can be used to distinguish the messages.
>>>>>>>
>>>>>>> Note that the path depends on where the message is stored:
>>>>>>>
>>>>>>> Return-Path: <dev-return-15577-archive-asf-public=cust-asf.ponee...@airavata.apache.org>
>>>>>>> as against
>>>>>>> Return-Path: <dev-return-15577-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org>
>>>>>>>
>>>>>>> So it will need adjusting to extract the parts that are the same for
>>>>>>> all message sources, whether that is mod_mbox or Pony Mail or sent to
>>>>>>> another subscriber.
>>>>>>
>>>>>>
>>>>>> How about we:
>>>>>> 1) select a larger list of headers that we know to be the same across
>>>>>> the same email sent to different MTAs, and combine them for the ID
>>>>>> generation.
>>>>>
>>>>> Sort of.
>>>>>
>>>>> AFAICT the *only* value that can distinguish multiple postings (of
>>>>> which there seem to be quite a lot) is the mailing list sequence
>>>>> number, which in ezmlm is only available in the Return-Path.
>>>>>
>>>>> However the sequence could perhaps be reset - so it seems risky to
>>>>> rely on that plus the list id alone.
>>>>> Also the earliest messages did not have sequence numbers.
>>>>>
>>>>> So there needs to be another way to identify distinct messages.
>>>>> In theory, that is the Message-Id.
>>>>> Even if it is not completely unique, it should be OK in combination
>>>>> with the sequence number.
>>>>>
>>>>> If the Message-Id does not exist or is very poor, then I think the
>>>>> only solution is to look at most - if not all - the parts of a message
>>>>> that a user can vary.
>>>>> This is what the full id generation does, except that it also includes
>>>>> the MTA-specific headers, causing problems with repeatability.
>>>>>
>>>>> There is another aspect to this. At present the generation is done
>>>>> using the parsed mail, rather than the original.
>>>>> If there is any chance that the format can change between releases of
>>>>> the library, then this may destroy repeatability.
>>>>> Likewise if a different library is ever used.
>>>>> We may need to additionally ensure that all headers are in a canonical
>>>>> form before use, in case MTAs decide to vary the layout.
>>>>>
>>>>>> 2) Store some more headers in the doc :)
>>>>>
>>>>> Those might be useful for searching, but each posting needs its own
>>>>> unique id otherwise it cannot be stored in the first place.
>>>>>
>>>>>>>
>>>>>>> The format will presumably depend on the mailing list software that
>>>>>>> is used.
>>>>>>>
>>>>>>>>> List-Id
>>>>>>>>
>>>>>>>> I think this must be the original List-Id, not any override.
>>>>>>>> Otherwise there may be problems with permalinks if a list name is
>>>>>>>> updated - the old permalink will no longer work.
>>>>>>>>
>>>>>>>>> Whatever new algorithm is chosen, I think it's important that the
>>>>>>>>> format looks different from the existing ones. e.g. one could drop the
>>>>>>>>> <> around the list id.
>>>>>>
>>>
