On 10/14/2016 12:12 AM, sebb wrote:
> On 13 October 2016 at 21:28, sebb <seb...@gmail.com> wrote:
>> On 7 October 2016 at 00:44, sebb <seb...@gmail.com> wrote:
>>> The id generator is used to create a key for the message database, and
>>> also to create a Permalink.
>>> Therefore, an id generator needs to fulfil the following design goals
>>> as a minimum:
>>> A) different messages have different IDs
>>> B) the same id is generated if the same message is re-processed
>>> C) equivalent messages have the same ID
>>> Goal A is needed to ensure that the database can contain every
>>> different message
>>> Goal B is needed to ensure that the database can be reloaded from the
>>> original source if necessary
>>> Goal C is needed to ensure that the database can be reloaded from an
>>> equivalent source, and to ensure that Permalinks are stable.
>>> None of the current id generator algorithms meet all of the above goals.
>>> The original and medium generators fail to meet goal A.
>>> The full generator fails to meet goal B (and therefore C).
>>> A sender can easily generate two messages with identical content; it
>>> is important to distinguish these.
>>> The Message-Id should help here.
>>> Message-Id is supposed to be unique, but in practice it may not be, so
>>> some additional fields need to be used to create the database id.
>>> For mailing lists the Return-Path will normally contain a unique id
>>> which is used to identify bounces.
>>> In theory this might be sufficient on its own. Indeed the path might
>>> be usable without hashing. However the early ASF mailing list software
>>> did not use unique Return-Paths.
>>> The existing mod_mbox solution uses a combination of Message-Id plus
>>> YYYYMM plus a list identifier. Have there ever been any collisions?
>> It's certainly possible for mod_mbox to contain multiple messages with
>> the same id.
>> For example:
>> From dev-return-15575-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org Tue Jun  7 20:08:04 2016
>> and
>> From dev-return-15577-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org Tue Jun  7 20:24:24 2016
>> both have the same id:
>> Message-Id: <bcd87d1a-bfb6-4f2a-92cc-af8f1d16b...@apache.org>
>> It's exactly the same message, which arrived twice in the same mailing
>> list about 16 minutes apart (going by the From lines above).
>> Maybe it was sent using Bcc as well as To:?
>> It's not exactly a collision, but at present mod_mbox is able to store
>> both copies whereas Pony Mail cannot (except when using the full
>> generator, which has other problems).
>> I think it's important to store the full message history in the database.
>> For example, if one of the messages bounces, it would be odd if the
>> source of the bounce were not in the database.
>> Also the message sequences will be incomplete.
>> This is the case for lists.a.o: the mbox at
>> https://lists.apache.org/api/mbox.lua?list=d...@airavata.apache.org&date=2016-6
>> does not contain message sequence number 15575.
>>> How about using the following:
>>> Message-Id
>>> Date
>>> Return-Path
>> In this case the return-path can be used to distinguish the messages.
> Note that the path depends on where the message is stored:
> Return-Path: <dev-return-15577-archive-asf-public=cust-asf.ponee...@airavata.apache.org>
> as against
> Return-Path: <dev-return-15577-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org>
> So it will need adjusting to extract the parts that are the same for
> all message sources, whether that is mod_mbox or Pony Mail or sent to
> another subscriber.
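
To illustrate the adjustment described above, here is a minimal sketch that keeps only the list name, sequence number and domain of a Return-Path. It assumes the `<list>-return-<seq>-<subscriber>@<domain>` layout seen in the two examples; the regex and the name `stable_return_path` are hypothetical, and other mailing list software may well need a different pattern.

```python
import re

# Assumed layout (taken from the two examples quoted above):
#   <list>-return-<seq>-<subscriber part>@<domain>
# Only <list>, <seq> and <domain> are expected to be the same for every copy.
VERP_RE = re.compile(
    r"^<?(?P<list>[^@<>]+?)-return-(?P<seq>\d+)-[^@]*@(?P<domain>[^<>]+?)>?$"
)


def stable_return_path(return_path):
    """Return the subscriber-independent part of a Return-Path, or None
    if the value does not follow the assumed layout."""
    m = VERP_RE.match(return_path.strip())
    if m is None:
        return None
    return "{list}-return-{seq}@{domain}".format(**m.groupdict()).lower()
```

With this, both Return-Path values quoted above reduce to the same string, while sequence numbers 15575 and 15577 still give different results.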

How about we:
1) Select a larger list of headers that we know are identical for every
copy of the same email, whichever MTA delivered it, and combine them
for the ID

2) Store some more headers in the doc :)
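
A rough sketch of what 1) could look like, just to make the idea concrete. The header list in `STABLE_HEADERS` is an assumption (which headers really survive unchanged would need checking against real archives), and `generate_id` is a hypothetical name, not an existing Pony Mail generator.

```python
import hashlib
from email import message_from_string

# Assumed set of headers that do not change between copies of a message.
STABLE_HEADERS = ("Message-Id", "Date", "List-Id")


def generate_id(msg, headers=STABLE_HEADERS):
    """Hash the chosen headers of an email.message.Message into a hex id."""
    parts = []
    for name in headers:
        value = msg.get(name, "")  # header lookup is case-insensitive
        # Collapse whitespace so header folding differences do not matter.
        parts.append(name.lower() + ":" + " ".join(str(value).split()))
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()
```

Because only the listed headers go into the hash, re-processing the same message or loading an equivalent copy from another source gives the same id, while a different Message-Id, Date or List-Id gives a different one.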

> The format will presumably depend on the mailing list software that is used.
>>> List-Id
>> I think this must be the original List-Id, not any override.
>> Otherwise there may be problems with permalinks if a list name is
>> updated - the old permalink will no longer work.
>>> Whatever new algorithm is chosen, I think it's important that the
>>> format looks different from the existing ones. e.g. one could drop the
>>> <> around the list id.
