Re: Better Id generator

sebb Tue, 18 Oct 2016 02:44:16 -0700

It just occurs to me that there is another aspect to be considered:
list renaming.
This might affect both the unique id and Permalinks.


If the list id is embedded in the Permalink hash, I think it would be
very difficult to honour existing links.

This is in contrast with the link format as used by mod_mbox, where
the list id is exposed in the URL, for example:

http://mail-archives.apache.org/mod_mbox/incubator-any23-dev/201209.mbox/<message-id>
http://mail-archives.apache.org/mod_mbox/any23-dev/201209.mbox/<message-id>

It would be easy enough to redirect one URL to the other - if necessary.
(In this case, the original lists have been retained)
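
As a rough illustration of such a redirect, here is a minimal sketch. The
rename table and the function name are hypothetical, not part of mod_mbox or
Pony Mail, and the path layout is assumed to be
/mod_mbox/<list>/<YYYYMM>.mbox/<message-id> as in the examples above.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical rename table: old list name -> new list name.
RENAMED_LISTS = {"incubator-any23-dev": "any23-dev"}


def redirect_target(url, renamed=RENAMED_LISTS):
    """Return the URL for the renamed list, or None if no redirect applies."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    # Assumed path shape: /mod_mbox/<list>/<YYYYMM>.mbox/<message-id>
    if len(segments) < 3 or segments[1] != "mod_mbox":
        return None
    new_name = renamed.get(segments[2])
    if new_name is None:
        return None
    # Only the list segment changes; month and message-id are kept as-is.
    segments[2] = new_name
    return urlunsplit(parts._replace(path="/".join(segments)))
```

This only works because the list name is visible in the URL; nothing
comparable is possible if the list id is folded into an opaque hash.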

How can list renaming be managed in Pony Mail?
Or is it not allowed?


On 14 October 2016 at 00:21, sebb <[email protected]> wrote:
> On 13 October 2016 at 23:14, Daniel Gruno <[email protected]> wrote:
>> On 10/14/2016 12:12 AM, sebb wrote:
>>> On 13 October 2016 at 21:28, sebb <[email protected]> wrote:
>>>> On 7 October 2016 at 00:44, sebb <[email protected]> wrote:
>>>>> The id generator is used to create a key for the message database, and
>>>>> also to create a Permalink.
>>>>>
>>>>> Therefore, an id generator needs to fulfil the following design goals
>>>>> as a minimum:
>>>>> A) different messages have different IDs
>>>>> B) the same id is generated if the same message is re-processed
>>>>> C) equivalent messages have the same ID
>>>>>
>>>>> Goal A is needed to ensure that the database can contain every
>>>>> different message
>>>>> Goal B is needed to ensure that the database can be reloaded from the
>>>>> original source if necessary
>>>>> Goal C is needed to ensure that the database can be reloaded from an
>>>>> equivalent source, and to ensure that Permalinks are stable.
>>>>>
>>>>> None of the current id generator algorithms meet all of the above goals.
>>>>>
>>>>> The original and medium generators fail to meet goal A.
>>>>> The full generator fails to meet goal B (and therefore C).
>>>>>
>>>>> A sender can easily generate two messages with identical content; it
>>>>> is important to distinguish these.
>>>>>
>>>>> The Message-Id should help here.
>>>>>
>>>>> Message-ID is supposed to be unique, but in practice it may not be, so
>>>>> some additional fields need to be used to create the database id.
>>>>>
>>>>> For mailing lists the Return-Path will normally contain a unique id
>>>>> which is used to identify bounces.
>>>>> In theory this might be sufficient on its own. Indeed the path might
>>>>> be usable without hashing. However the early ASF mailing list software
>>>>> did not use unique Return-Paths.
>>>>>
>>>>> The existing mod_mbox solution uses a combination of Message-Id plus
>>>>> YYYYMM plus a list identifier. Have there ever been any collisions?
>>>>
>>>> It's certainly possible for mod_mbox to contain multiple messages with
>>>> the same id.
>>>>
>>>> For example:
>>>>
>>>> From dev-return-15575-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org Tue Jun  7 20:08:04 2016
>>>> and
>>>> From dev-return-15577-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org Tue Jun  7 20:24:24 2016
>>>>
>>>> both have the same id:
>>>>
>>>> Message-Id: <[email protected]>
>>>>
>>>> It's exactly the same message, which arrived twice in the same mailing
>>>> list about 16 minutes apart (going by the From lines above).
>>>> Maybe it was sent using Bcc as well as To: ?
>>>>
>>>> It's not exactly a collision, but at present mod_mbox is able to store
>>>> both whereas Pony Mail cannot (except if using the full generator,
>>>> which has other problems)
>>>>
>>>> I think it's important to store the full message history in the database.
>>>> For example, if one of the messages bounces, it would be odd if the
>>>> source of the bounce were not in the database.
>>>> Also the message sequences will be incomplete.
>>>> This is the case for lists.a.o: the mbox
>>>>
>>>> https://lists.apache.org/api/[email protected]&date=2016-6
>>>>
>>>> does not have the message sequence number 15575
>>>>
>>>>> How about using the following:
>>>>>
>>>>> Message-Id
>>>>> Date
>>>>> Return-Path
>>>>
>>>> In this case the return-path can be used to distinguish the messages.
>>>
>>> Note that the path depends on where the message is stored:
>>>
>>> Return-Path: <dev-return-15577-archive-asf-public=cust-asf.ponee...@airavata.apache.org>
>>> as against
>>> Return-Path: <dev-return-15577-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org>
>>>
>>> So it will need adjusting to extract the parts that are the same for
>>> all message sources, whether that is mod_mbox or Pony Mail or sent to
>>> another subscriber.
>>
>>
>> How about we:
>> 1) select a larger list of headers that we know to be the same across
>> the same email sent to different MTAs, and combine them for the ID
>> generation.
>
> Sort of.
>
> AFAICT the *only* value that can distinguish multiple postings (of
> which there seem to be quite a lot) is the mailing list sequence
> number, which in ezmlm is only available in the Return-Path.
>
> However the sequence could perhaps be reset - so it seems risky to
> rely on that plus the list id alone.
> Also the earliest messages did not have sequence numbers.
>
> So there needs to be another way to identify distinct messages.
> In theory, that is the Message-Id.
> Even if it is not completely unique, it should be OK in combination
> with the sequence number.
>
> If the Message-Id does not exist or is very poor, then I think the
> only solution is to look at most - if not all - of the parts of a message
> that a user can vary.
> This is what the full id generation does, except that it also includes
> the MTA-specific headers, causing problems with repeatability.
>
> There is another aspect to this. At present the generation is done
> using the parsed mail, rather than the original.
> If there is any chance that the format can change between releases of
> the library, then this may destroy repeatability.
> Likewise if a different library is ever used.
> We may need to additionally ensure that all headers are in a canonical
> form before use, in case MTAs decide to vary the layout.
>
>> 2) Store some more headers in the doc :)
>
> Those might be useful for searching, but each posting needs its own
> unique id otherwise it cannot be stored in the first place.
>
>>>
>>> The format will presumably depend on the mailing list software that is used.
>>>
>>>>> List-Id
>>>>
>>>> I think this must be the original List-Id, not any override.
>>>> Otherwise there may be problems with permalinks if a list name is
>>>> updated - the old permalink will no longer work.
>>>>
>>>>> Whatever new algorithm is chosen, I think it's important that the
>>>>> format looks different from the existing ones. e.g. one could drop the
>>>>> <> around the list id.
>>
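
To make the combination discussed in the quoted thread concrete, here is a
minimal sketch of one possible generator. It is not an existing Pony Mail
generator: the function names, the choice of SHA-256 and the field order are
all assumptions. It hashes the Message-Id, the ezmlm sequence number taken
from the Return-Path, and the original List-Id without the surrounding <>.

```python
import hashlib
import re

# Assumed ezmlm-style local part: <list>-return-<sequence>-<subscriber>
SEQ_RE = re.compile(r"-return-(\d+)-")


def sequence_number(return_path):
    """Extract the list sequence number, or None if the path has none."""
    local_part = return_path.strip().strip("<>").split("@", 1)[0]
    match = SEQ_RE.search(local_part)
    return match.group(1) if match else None


def sketch_id(message_id, return_path, list_id):
    """Hash only the fields that should be the same for every copy of a posting."""
    fields = [
        " ".join(message_id.split()),         # collapse folded whitespace
        sequence_number(return_path) or "",   # empty for early messages
        list_id.strip().strip("<>").lower(),  # original List-Id, no <>
    ]
    return hashlib.sha256("\n".join(fields).encode("utf-8")).hexdigest()
```

With this, the archive and subscriber copies of one posting share the sequence
number and so get the same id, while a repeat posting with the same Message-Id
but a different sequence number gets a different one. Messages with no
sequence number and a poor Message-Id would still need the wider set of
user-controlled fields mentioned above.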
