On 13 October 2016 at 23:14, Daniel Gruno <humbed...@apache.org> wrote:
> On 10/14/2016 12:12 AM, sebb wrote:
>> On 13 October 2016 at 21:28, sebb <seb...@gmail.com> wrote:
>>> On 7 October 2016 at 00:44, sebb <seb...@gmail.com> wrote:
>>>> The id generator is used to create a key for the message database, and
>>>> also to create a Permalink.
>>>> Therefore, an id generator needs to fulfil the following design goals
>>>> as a minimum:
>>>> A) different messages have different IDs
>>>> B) the same id is generated if the same message is re-processed
>>>> C) equivalent messages have the same ID
>>>> Goal A is needed to ensure that the database can contain every
>>>> different message
>>>> Goal B is needed to ensure that the database can be reloaded from the
>>>> original source if necessary
>>>> Goal C is needed to ensure that the database can be reloaded from an
>>>> equivalent source, and to ensure that Permalinks are stable.
>>>> None of the current id generator algorithms meet all of the above goals.
>>>> The original and medium generators fail to meet goal A.
>>>> The full generator fails to meet goal B (and therefore C).
>>>> A sender can easily generate two messages with identical content; it
>>>> is important to distinguish these.
>>>> The Message-Id should help here.
>>>> Message-ID is supposed to be unique, but in practice it may not be,
>>>> so some additional fields need to be used to create the database id.
>>>> For mailing lists the Return-Path will normally contain a unique id
>>>> which is used to identify bounces.
>>>> In theory this might be sufficient on its own. Indeed the path might
>>>> be usable without hashing. However the early ASF mailing list software
>>>> did not use unique Return-Paths.
>>>> The existing mod_mbox solution uses a combination of Message-Id plus
>>>> YYYYMM plus a list identifier. Have there ever been any collisions?
>>> It's certainly possible for mod_mbox to contain multiple messages with
>>> the same id.
>>> For example:
>>> From dev-return-15575-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org Tue Jun  7 20:08:04 2016
>>> and
>>> From dev-return-15577-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org Tue Jun  7 20:24:24 2016
>>> both have the same id:
>>> Message-Id: <bcd87d1a-bfb6-4f2a-92cc-af8f1d16b...@apache.org>
>>> It's exactly the same message which arrived twice in the same mailing
>>> list, about 16 minutes apart according to the timestamps above.
>>> Maybe it was sent using Bcc as well as To: ?
>>> It's not exactly a collision, but at present mod_mbox is able to store
>>> both whereas Pony Mail cannot (except if using the full generator,
>>> which has other problems)
>>> I think it's important to store the full message history in the database.
>>> For example, if one of the messages bounces, it would be odd if the
>>> source of the bounce were not in the database.
>>> Also the message sequences will be incomplete.
>>> This is the case for lists.a.o: the mbox
>>> https://lists.apache.org/api/mbox.lua?list=d...@airavata.apache.org&date=2016-6
>>> does not contain message sequence number 15575.
>>>> How about using the following:
>>>> Message-Id
>>>> Date
>>>> Return-Path
>>> In this case the return-path can be used to distinguish the messages.
>> Note that the path depends on where the message is stored:
>> Return-Path: <dev-return-15577-archive-asf-public=cust-asf.ponee...@airavata.apache.org>
>> as against
>> Return-Path: <dev-return-15577-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org>
>> So it will need adjusting to extract the parts that are the same for
>> all message sources, whether that is mod_mbox or Pony Mail or sent to
>> another subscriber.
> How about we:
> 1) select a larger list of headers that we know to be the same across
> the same email sent to different MTAs, and combine them for the ID
> generation.

Sort of.

AFAICT the *only* value that can distinguish multiple postings (of
which there seem to be quite a lot) is the mailing list sequence
number, which in ezmlm is only available in the Return-Path.

However the sequence could perhaps be reset - so it seems risky to
rely on that plus the list id alone.
Also the earliest messages did not have sequence numbers.

So there needs to be another way to identify distinct messages.
In theory, that is the Message-Id.
Even if it is not completely unique, it should be OK in combination
with the sequence number.
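
As a rough sketch of that combination (not a proposal for the final
algorithm): the code below pulls the sequence number out of an
ezmlm-style Return-Path and hashes it together with the Message-Id and
List-Id. The names RETURN_PATH_RE, sequence_number and generate_id, the
example addresses, and the choice of SHA-256 are all assumptions made
for illustration; the pattern only covers the
"LIST-return-SEQ-SUBSCRIBER@DOMAIN" layout seen in the examples above.

```python
import hashlib
import re

# Assumed ezmlm-style layout: <LIST-return-SEQ-SUBSCRIBER=HOST@LISTDOMAIN>
# Only the sequence number is captured; the subscriber part is ignored
# because it differs between archives receiving the same posting.
RETURN_PATH_RE = re.compile(r"^<?[^@<>\s]+?-return-(\d+)-[^@<>\s]*@[^<>\s]+>?$")


def sequence_number(return_path):
    """Return the list sequence number as an int, or None if absent."""
    match = RETURN_PATH_RE.match(return_path.strip())
    return int(match.group(1)) if match else None


def generate_id(message_id, list_id, return_path):
    """Hash Message-Id, List-Id and (if present) the sequence number."""
    seq = sequence_number(return_path)
    # Early messages had no sequence number, so fall back to an empty field.
    fields = [message_id.strip(), list_id.strip(), "" if seq is None else str(seq)]
    # NUL separator keeps adjacent fields from running into each other.
    return hashlib.sha256("\x00".join(fields).encode("utf-8")).hexdigest()
```

With this, the same posting delivered to two different archive
subscribers gets the same id, while a duplicate posting with a new
sequence number gets a different one.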

If the Message-Id does not exist or is very poor, then I think the
only solution is to look at most - if not all - of the parts of a
message that a user can vary.
This is what the full id generation does, except that it also includes
the MTA-specific headers, causing problems with repeatability.
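
A minimal sketch of such a fallback, assuming a fixed allowlist of
sender-controlled headers plus the raw body. The USER_HEADERS list and
the content_digest name are hypothetical; which headers really belong in
the list would need to be agreed. Headers added in transit (Received,
Return-Path and so on) are simply never read.

```python
import hashlib
from email import message_from_bytes

# Assumed allowlist of headers set by the sender rather than by an MTA.
USER_HEADERS = ("From", "To", "Cc", "Subject", "Date", "In-Reply-To", "References")


def content_digest(raw):
    """Digest the sender-controlled headers and the body of a raw message."""
    msg = message_from_bytes(raw)
    digest = hashlib.sha256()
    for name in USER_HEADERS:
        for value in msg.get_all(name, []):
            # Collapse folding and runs of whitespace so layout does not matter.
            line = name.lower() + ":" + " ".join(str(value).split()) + "\n"
            digest.update(line.encode("utf-8", "replace"))
    # Use the raw body (everything after the first blank line), with line
    # endings normalised, rather than a parsed or decoded payload.
    _, _, body = raw.replace(b"\r\n", b"\n").partition(b"\n\n")
    digest.update(b"\n" + body)
    return digest.hexdigest()
```

Two copies of a message that differ only in their Received or
Return-Path headers then produce the same digest.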

There is another aspect to this. At present the generation is done
using the parsed mail, rather than the original.
If there is any chance that the format can change between releases of
the library, then this may destroy repeatability.
Likewise if a different library is ever used.
We may need to additionally ensure that all headers are in a canonical
form before use, in case MTAs decide to vary the layout.
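
For example, a canonical form could be as simple as the sketch below,
which is loosely modelled on the "relaxed" header canonicalization used
by DKIM: lower-case the header name, unfold continuation lines and
collapse runs of whitespace. The canonical_header name is hypothetical,
and whether these rules are sufficient is an open question.

```python
import re


def canonical_header(name, value):
    """Return 'name:value' with folding and whitespace layout removed."""
    # Unfold: a line break followed by a space or tab continues the header.
    unfolded = re.sub(r"\r?\n[ \t]+", " ", value)
    # Collapse all remaining whitespace runs and trim both ends.
    collapsed = " ".join(unfolded.split())
    return name.strip().lower() + ":" + collapsed
```

Applied to the raw header text rather than to the parser's output, this
would also reduce the dependency on a particular library release.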

> 2) Store some more headers in the doc :)

Those might be useful for searching, but each posting needs its own
unique id otherwise it cannot be stored in the first place.

>> The format will presumably depend on the mailing list software that is used.
>>>> List-Id
>>> I think this must be the original List-Id, not any override.
>>> Otherwise there may be problems with permalinks if a list name is
>>> updated - the old permalink will no longer work.
>>>> Whatever new algorithm is chosen, I think it's important that the
>>>> format looks different from the existing ones. e.g. one could drop the
>>>> <> around the list id.
