On 10/18/2016 11:43 AM, sebb wrote:
> It just occurs to me that there is another aspect to be considered:
> list renaming.
> This might affect both the unique id and Permalinks.
> If the list id is embedded in the Permalink hash, I think it would be
> very difficult to honour existing links.
> This is in contrast with the link format as used by mod_mbox, where
> the list id is exposed in the URL, for example:
> http://mail-archives.apache.org/mod_mbox/incubator-any23-dev/201209.mbox/<message-id>
> http://mail-archives.apache.org/mod_mbox/any23-dev/201209.mbox/<message-id>
> It would be easy enough to redirect one URL to the other - if necessary.
> (In this case, the original lists have been retained)

Generally not an issue.
When you rename a list in Pony Mail, you keep the original ID and
source; you only touch the list value in Elasticsearch. Permalinks
remain the same but point to the new list name (and the source keeps
the original).

There are of course edge cases where you have to reimport and you force
a list ID on the command line, which may change the generated ID.

If we scrap the list id for a moment, we are left with, at least, the
following constants:

Message ID

The problem with the above is that an email with the same values in all
fields may have been sent to multiple lists, so ideally we'd keep
separate copies for each list, which gets tricky without adding the
list ID to this. A workaround would be to force the ID generator to
always use the original list ID in the source for generating the ID, and
only use the forced list ID if no list was found.
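
A rough sketch of that workaround, just to make the fallback order
concrete. This is not one of the existing Pony Mail generators: the
function names, the forced_list_id parameter and the choice of SHA-256
over List-Id plus Message-ID are all assumptions for illustration.

```python
import hashlib
from email import message_from_string
from typing import Optional


def original_list_id(msg) -> Optional[str]:
    """Return the List-Id found in the message source, without the <>."""
    raw = msg.get("List-Id")
    if not raw:
        return None
    # A List-Id header may carry a description before the <...> part.
    start, end = raw.rfind("<"), raw.rfind(">")
    if start != -1 and end > start:
        raw = raw[start + 1:end]
    return raw.strip().lower() or None


def generate_id(source: str, forced_list_id: Optional[str] = None) -> str:
    """Hypothetical generator: the original list ID wins over a forced one."""
    msg = message_from_string(source)
    # Only fall back to the command-line override if the source has no List-Id.
    list_id = original_list_id(msg) or forced_list_id or ""
    message_id = (msg.get("Message-ID") or "").strip()
    digest = hashlib.sha256(f"{list_id}\n{message_id}".encode("utf-8"))
    return digest.hexdigest()
```

With this ordering, reimporting with a forced list ID only changes the
generated ID for messages that carry no List-Id of their own.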


> How can list renaming be managed in Pony Mail?
> Or is it not allowed?
> On 14 October 2016 at 00:21, sebb <seb...@gmail.com> wrote:
>> On 13 October 2016 at 23:14, Daniel Gruno <humbed...@apache.org> wrote:
>>> On 10/14/2016 12:12 AM, sebb wrote:
>>>> On 13 October 2016 at 21:28, sebb <seb...@gmail.com> wrote:
>>>>> On 7 October 2016 at 00:44, sebb <seb...@gmail.com> wrote:
>>>>>> The id generator is used to create a key for the message database, and
>>>>>> also to create a Permalink.
>>>>>> Therefore, an id generator needs to fulfil the following design goals
>>>>>> as a minimum:
>>>>>> A) different messages have different IDs
>>>>>> B) the same id is generated if the same message is re-processed
>>>>>> C) equivalent messages have the same ID
>>>>>> Goal A is needed to ensure that the database can contain every
>>>>>> different message
>>>>>> Goal B is needed to ensure that the database can be reloaded from the
>>>>>> original source if necessary
>>>>>> Goal C is needed to ensure that the database can be reloaded from an
>>>>>> equivalent source, and to ensure that Permalinks are stable.
>>>>>> None of the current id generator algorithms meet all of the above goals.
>>>>>> The original and medium generators fail to meet goal A.
>>>>>> The full generator fails to meet goal B (and therefore C).
>>>>>> A sender can easily generate two messages with identical content; it
>>>>>> is important to distinguish these.
>>>>>> The Message-Id should help here.
>>>>>> Message-ID is supposed to be unique, but in practice it may not be, so
>>>>>> some additional fields need to be used to create the database id.
>>>>>> For mailing lists the Return-Path will normally contain a unique id
>>>>>> which is used to identify bounces.
>>>>>> In theory this might be sufficient on its own. Indeed the path might
>>>>>> be usable without hashing. However the early ASF mailing list software
>>>>>> did not use unique Return-Paths.
>>>>>> The existing mod_mbox solution uses a combination of Message-Id plus
>>>>>> YYYYMM plus a list identifier. Have there ever been any collisions?
>>>>> It's certainly possible for mod_mbox to contain multiple messages with
>>>>> the same id.
>>>>> For example:
>>>>> From dev-return-15575-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org Tue Jun  7 20:08:04 2016
>>>>> and
>>>>> From dev-return-15577-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org Tue Jun  7 20:24:24 2016
>>>>> both have the same id:
>>>>> Message-Id: <bcd87d1a-bfb6-4f2a-92cc-af8f1d16b...@apache.org>
>>>>> It's exactly the same message which arrived twice in the same mailing
>>>>> list a few seconds apart.
>>>>> Maybe it was sent using Bcc as well as To: ?
>>>>> It's not exactly a collision, but at present mod_mbox is able to store
>>>>> both whereas Pony Mail cannot (except if using the full generator,
>>>>> which has other problems)
>>>>> I think it's important to store the full message history in the database.
>>>>> For example, if one of the messages bounces, it would be odd if the
>>>>> source of the bounce were not in the database.
>>>>> Also the message sequences will be incomplete.
>>>>> This is the case for lists.a.o, the mbox
>>>>> https://lists.apache.org/api/mbox.lua?list=d...@airavata.apache.org&date=2016-6
>>>>> does not have the message sequence number 15575
>>>>>> How about using the following:
>>>>>> Message-Id
>>>>>> Date
>>>>>> Return-Path
>>>>> In this case the return-path can be used to distinguish the messages.
>>>> Note that the path depends on where the message is stored:
>>>> Return-Path: <dev-return-15577-archive-asf-public=cust-asf.ponee...@airavata.apache.org>
>>>> as against
>>>> Return-Path: <dev-return-15577-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org>
>>>> So it will need adjusting to extract the parts that are the same for
>>>> all message sources, whether that is mod_mbox or Pony Mail or sent to
>>>> another subscriber.
>>> How about we:
>>> 1) select a larger list of headers that we know to be the same across
>>> the same email sent to different MTAs, and combine them for the ID
>>> generation.
>> Sort of.
>> AFAICT the *only* value that can distinguish multiple postings (of
>> which there seem to be quite a lot) is the mailing list sequence
>> number, which in ezmlm is only available in the Return-Path.
>> However the sequence could perhaps be reset - so it seems risky to
>> rely on that plus the list id alone.
>> Also the earliest messages did not have sequence numbers.
>> So there needs to be another way to identify distinct messages.
>> In theory, that is the Message-Id.
>> Even if it is not completely unique, it should be OK in combination
>> with the sequence number.
>> If the Message-Id does not exist or is very poor, then I think the
>> only solution is to look at most - if not all - the parts of a message
>> that a user can vary.
>> This is what the full id generation does, except that it also includes
>> the MTA-specific headers, causing problems with repeatability.
>> There is another aspect to this. At present the generation is done
>> using the parsed mail, rather than the original.
>> If there is any chance that the format can change between releases of
>> the library, then this may destroy repeatability.
>> Likewise if a different library is ever used.
>> We may need to additionally ensure that all headers are in a canonical
>> form before use, in case MTAs decide to vary the layout.
>>> 2) Store some more headers in the doc :)
>> Those might be useful for searching, but each posting needs its own
>> unique id otherwise it cannot be stored in the first place.
>>>> The format will presumably depend on the mailing list software that
>>>> is used.
>>>>>> List-Id
>>>>> I think this must be the original List-Id, not any override.
>>>>> Otherwise there may be problems with permalinks if a list name is
>>>>> updated - the old permalink will no longer work.
>>>>>> Whatever new algorithm is chosen, I think it's important that the
>>>>>> format looks different from the existing ones. e.g. one could drop the
>>>>>> <> around the list id.
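
For the Return-Path adjustment discussed in the quoted thread (keeping
only the parts that are the same for every recipient), something like
the following might work for ezmlm-style paths. The regex and the
function name are assumptions based on the two example paths quoted
above, not a tested parser for every list manager.

```python
import re
from typing import Optional, Tuple

# Assumed ezmlm layout: <list>-return-<seq>-<subscriber part>@<list domain>
_RETURN_PATH = re.compile(
    r"^<?(?P<list>[^@<>]+?)-return-(?P<seq>\d+)-[^@]*@(?P<domain>[^<>\s]+?)>?$"
)


def normalise_return_path(value: str) -> Optional[Tuple[str, int, str]]:
    """Return (list, sequence number, domain), or None if no sequence is found."""
    match = _RETURN_PATH.match(value.strip())
    if match is None:
        # E.g. early messages that have no sequence number in the path.
        return None
    return (
        match.group("list").lower(),
        int(match.group("seq")),
        match.group("domain").lower(),
    )
```

Two copies of the same posting delivered to different subscribers should
then normalise to the same tuple, while a message without a sequence
number yields None and needs another way to be told apart.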
