Re: Better Id generator

sebb Tue, 18 Oct 2016 03:02:19 -0700

On 18 October 2016 at 10:49, Daniel Gruno <[email protected]> wrote:
> On 10/18/2016 11:43 AM, sebb wrote:
>> It just occurs to me that there is another aspect to be considered:
>> list renaming.
>> This might affect both the unique id and Permalinks.
>>
>> If the list id is embedded in the Permalink hash, I think it would be
>> very difficult to honour existing links.
>>
>> This is in contrast with the link format as used by mod_mbox, where
>> the list id is exposed in the URL, for example:
>>
>> http://mail-archives.apache.org/mod_mbox/incubator-any23-dev/201209.mbox/<message-id>
>> http://mail-archives.apache.org/mod_mbox/any23-dev/201209.mbox/<message-id>
>>
>> It would be easy enough to redirect one URL to the other - if necessary.
>> (In this case, the original lists have been retained)
>
> Generally not an issue.
> When you rename a list in Pony Mail, you keep the original ID and
> source and only touch the list value in Elasticsearch. Permalinks
> remain the same but point to the new list name (the original name is
> kept in the source).
>
> There are of course edge cases where you have to reimport and you force
> a list ID on the command line, which may change the generated ID.
>
> If we scrap the list id for a moment, we are left with, at least, the
> following constants:
>
> Sender
> Date
> Subject
> Body
> Message ID
>
> The problem with the above is that an email with the same values in all
> fields may have been sent to multiple lists, so ideally we'd keep
> separate copies for each list, which gets tricky without adding the
> list ID to this. A workaround would be to force the ID generator to
> always use the original list ID in the source for generating the ID, and
> only use the forced list ID if no list was found.
>
> Thoughts?


Need to think more about this, but it occurs to me now that the ES id
and Permalink have similar, but not identical, requirements.
I think this is where some of the issues originate.
One might need to consider using separate algorithms.

>>
>> How can list renaming be managed in Pony Mail?
>> Or is it not allowed?
>>
>>
>> On 14 October 2016 at 00:21, sebb <[email protected]> wrote:
>>> On 13 October 2016 at 23:14, Daniel Gruno <[email protected]> wrote:
>>>> On 10/14/2016 12:12 AM, sebb wrote:
>>>>> On 13 October 2016 at 21:28, sebb <[email protected]> wrote:
>>>>>> On 7 October 2016 at 00:44, sebb <[email protected]> wrote:
>>>>>>> The id generator is used to create a key for the message database, and
>>>>>>> also to create a Permalink.
>>>>>>>
>>>>>>> Therefore, an id generator needs to fulfil the following design goals
>>>>>>> as a minimum:
>>>>>>> A) different messages have different IDs
>>>>>>> B) the same id is generated if the same message is re-processed
>>>>>>> C) equivalent messages have the same ID
>>>>>>>
>>>>>>> Goal A is needed to ensure that the database can contain every 
>>>>>>> different message
>>>>>>> Goal B is needed to ensure that the database can be reloaded from the
>>>>>>> original source if necessary
>>>>>>> Goal C is needed to ensure that the database can be reloaded from an
>>>>>>> equivalent source, and to ensure that Permalinks are stable.
>>>>>>>
>>>>>>> None of the current id generator algorithms meet all of the above goals.
>>>>>>>
>>>>>>> The original and medium generators fail to meet goal A.
>>>>>>> The full generator fails to meet goal B (and therefore C).
>>>>>>>
>>>>>>> A sender can easily generate two messages with identical content; it
>>>>>>> is important to distinguish these.
>>>>>>>
>>>>>>> The Message-Id should help here.
>>>>>>>
>>>>>>> Message-ID is supposed to be unique, but in practice it may not be, so
>>>>>>> some additional fields need to be used to create the database id.
>>>>>>>
>>>>>>> For mailing lists the Return-Path will normally contain a unique id
>>>>>>> which is used to identify bounces.
>>>>>>> In theory this might be sufficient on its own. Indeed the path might
>>>>>>> be usable without hashing. However the early ASF mailing list software
>>>>>>> did not use unique Return-Paths.
>>>>>>>
>>>>>>> The existing mod_mbox solution uses a combination of Message-Id plus
>>>>>>> YYYYMM plus a list identifier. Have there ever been any collisions?
>>>>>>
>>>>>> It's certainly possible for mod_mbox to contain multiple messages with
>>>>>> the same id.
>>>>>>
>>>>>> For example:
>>>>>>
>>>>>> From dev-return-15575-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org Tue Jun  7 20:08:04 2016
>>>>>> and
>>>>>> From dev-return-15577-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org Tue Jun  7 20:24:24 2016
>>>>>>
>>>>>> both have the same id:
>>>>>>
>>>>>> Message-Id: <[email protected]>
>>>>>>
>>>>>> It's exactly the same message which arrived twice in the same mailing
>>>>>> list a few seconds apart.
>>>>>> Maybe it was sent using Bcc as well as To: ?
>>>>>>
>>>>>> It's not exactly a collision, but at present mod_mbox is able to store
>>>>>> both whereas Pony Mail cannot (except if using the full generator,
>>>>>> which has other problems)
>>>>>>
>>>>>> I think it's important to store the full message history in the database.
>>>>>> For example, if one of the messages bounces, it would be odd if the
>>>>>> source of the bounce were not in the database.
>>>>>> Also the message sequences will be incomplete.
>>>>>> This is the case for lists.a.o, the mbox
>>>>>>
>>>>>> https://lists.apache.org/api/[email protected]&date=2016-6
>>>>>>
>>>>>> does not have the message sequence number 15575
>>>>>>
>>>>>>> How about using the following:
>>>>>>>
>>>>>>> Message-Id
>>>>>>> Date
>>>>>>> Return-Path
>>>>>>
>>>>>> In this case the return-path can be used to distinguish the messages.
>>>>>
>>>>> Note that the path depends on where the message is stored:
>>>>>
>>>>> Return-Path: <dev-return-15577-archive-asf-public=cust-asf.ponee...@airavata.apache.org>
>>>>> as against
>>>>> Return-Path: <dev-return-15577-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org>
>>>>>
>>>>> So it will need adjusting to extract the parts that are the same for
>>>>> all message sources, whether that is mod_mbox or Pony Mail or sent to
>>>>> another subscriber.
>>>>
>>>>
>>>> How about we:
>>>> 1) select a larger list of headers that we know to be the same across
>>>> the same email sent to different MTAs, and combine them for the ID
>>>> generation.
>>>
>>> Sort of.
>>>
>>> AFAICT the *only* value that can distinguish multiple postings (of
>>> which there seem to be quite a lot) is the mailing list sequence
>>> number, which in ezmlm is only available in the Return-Path.
>>>
>>> However the sequence could perhaps be reset - so it seems risky to
>>> rely on that plus the list id alone.
>>> Also the earliest messages did not have sequence numbers.
>>>
>>> So there needs to be another way to identify distinct messages.
>>> In theory, that is the Message-Id.
>>> Even if it is not completely unique, it should be OK in combination
>>> with the sequence number.
>>>
>>> If the Message-Id does not exist or is very poor, then I think the
>>> only solution is to look at most, if not all, of the parts of a
>>> message that a user can vary.
>>> This is what the full id generation does, except that it also includes
>>> the MTA-specific headers, causing problems with repeatability.
>>>
>>> There is another aspect to this. At present the generation is done
>>> using the parsed mail, rather than the original.
>>> If there is any chance that the format can change between releases of
>>> the library, then this may destroy repeatability.
>>> Likewise if a different library is ever used.
>>> We may need to additionally ensure that all headers are in a canonical
>>> form before use, in case MTAs decide to vary the layout.
>>>
>>>> 2) Store some more headers in the doc :)
>>>
>>> Those might be useful for searching, but each posting needs its own
>>> unique id otherwise it cannot be stored in the first place.
>>>
>>>>>
>>>>> The format will presumably depend on the mailing list software that
>>>>> is used.
>>>>>
>>>>>>> List-Id
>>>>>>
>>>>>> I think this must be the original List-Id, not any override.
>>>>>> Otherwise there may be problems with permalinks if a list name is
>>>>>> updated - the old permalink will no longer work.
>>>>>>
>>>>>>> Whatever new algorithm is chosen, I think it's important that the
>>>>>>> format looks different from the existing ones. e.g. one could drop the
>>>>>>> <> around the list id.
>>>>
>
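
The Return-Path adjustment discussed in the quoted thread (keeping only
the parts that are the same for every message source) might be sketched
as follows. This is a hedged example, not existing Pony Mail code: the
function name parse_return_path is hypothetical, and the pattern is
inferred solely from the two ezmlm Return-Path values quoted above, so
other mailing list software may well need a different pattern.

```python
import re

# Assumed ezmlm-style layout, inferred only from the two examples quoted
# in this thread: <list>-return-<sequence>-<subscriber part>@<domain>
_RETURN_PATH = re.compile(
    r"<?(?P<list>[^@<>\s]+?)-return-(?P<seq>\d+)-[^@\s]*@(?P<domain>[^>\s]+)>?$"
)


def parse_return_path(value):
    """Return (list, domain, sequence), or None if no sequence is found."""
    match = _RETURN_PATH.match(value.strip())
    if match is None:
        # e.g. early messages that were sent without sequence numbers
        return None
    return (
        match.group("list"),
        match.group("domain").lower(),
        int(match.group("seq")),
    )


# The two Return-Path values from the thread, copied as archived
# (including the elided parts).
pony = "<dev-return-15577-archive-asf-public=cust-asf.ponee...@airavata.apache.org>"
mbox = "<dev-return-15577-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org>"
```

Both values parse to the same list, domain and sequence number, because
the subscriber-specific part is discarded. A None result signals that
the caller has to fall back to other fields, such as the Message-Id.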
