On 18 October 2016 at 10:49, Daniel Gruno <humbed...@apache.org> wrote:
> On 10/18/2016 11:43 AM, sebb wrote:
>> It just occurs to me that there is another aspect to be considered:
>> list renaming.
>> This might affect both the unique id and Permalinks.
>> If the list id is embedded in the Permalink hash, I think it would be
>> very difficult to honour existing links.
>> This is in contrast with the link format as used by mod_mbox, where
>> the list id is exposed in the URL, for example:
>> http://mail-archives.apache.org/mod_mbox/incubator-any23-dev/201209.mbox/<message-id>
>> http://mail-archives.apache.org/mod_mbox/any23-dev/201209.mbox/<message-id>
>> It would be easy enough to redirect one URL to the other - if necessary.
>> (In this case, the original lists have been retained)
> Generally not an issue.
> When you rename a list in Pony Mail, you keep the original ID and
> source; you only touch the list value in Elasticsearch. Permalinks will
> remain the same, but point to the new list name (and keep the original
> in the source).
> There are of course edge cases where you have to reimport and you force
> a list ID on the command line, which may change the generated ID.
> If we scrap the list id for a moment, we are left with, at least, the
> following constants:
> Sender
> Date
> Subject
> Body
> Message ID
> The problem with the above is that an email with the same values in all
> fields may have been sent to multiple lists, so ideally we'd keep
> separate copies for each list, which gets tricky without adding the
> list ID to this. A workaround would be to force the ID generator to
> always use the original list ID in the source for generating the ID, and
> only use the forced list ID if no list was found.
> Thoughts?
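
The workaround described above (prefer the list ID found in the source, and use the forced one only when the source has none) could be sketched roughly as follows. This is an illustration only, not Pony Mail's actual generator: the function names `pick_list_id` and `sketch_id`, the choice of SHA-256, and the field order are all assumptions.

```python
import hashlib
from email import message_from_string


def pick_list_id(msg, forced_list_id=None):
    """Hypothetical helper: prefer the List-Id in the source, else the forced one."""
    original = msg.get("List-Id")
    if original and original.strip():
        return original.strip()
    return forced_list_id or ""


def sketch_id(raw_source, forced_list_id=None):
    """Hypothetical ID sketch over list ID, sender, date, subject, Message-ID and body."""
    msg = message_from_string(raw_source)
    payload = msg.get_payload()
    # Simplification: multipart messages return a list here and are ignored.
    body = payload if isinstance(payload, str) else ""
    fields = [
        pick_list_id(msg, forced_list_id),
        msg.get("From", ""),
        msg.get("Date", ""),
        msg.get("Subject", ""),
        msg.get("Message-ID", ""),
        body,
    ]
    return hashlib.sha256("\n".join(fields).encode("utf-8")).hexdigest()
```

With this arrangement, a forced list ID on the command line changes the result only for messages whose source carries no List-Id header.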

Need to think more about this, but it occurs to me now that the ES id
and Permalink have similar, but not identical, requirements.
I think this is where some of the issues originate.
One might need to consider using separate algorithms.

>> How can list renaming be managed in Pony Mail?
>> Or is it not allowed?
>> On 14 October 2016 at 00:21, sebb <seb...@gmail.com> wrote:
>>> On 13 October 2016 at 23:14, Daniel Gruno <humbed...@apache.org> wrote:
>>>> On 10/14/2016 12:12 AM, sebb wrote:
>>>>> On 13 October 2016 at 21:28, sebb <seb...@gmail.com> wrote:
>>>>>> On 7 October 2016 at 00:44, sebb <seb...@gmail.com> wrote:
>>>>>>> The id generator is used to create a key for the message database, and
>>>>>>> also to create a Permalink.
>>>>>>> Therefore, an id generator needs to fulfil the following design goals
>>>>>>> as a minimum:
>>>>>>> A) different messages have different IDs
>>>>>>> B) the same id is generated if the same message is re-processed
>>>>>>> C) equivalent messages have the same ID
>>>>>>> Goal A is needed to ensure that the database can contain every
>>>>>>> different message
>>>>>>> Goal B is needed to ensure that the database can be reloaded from the
>>>>>>> original source if necessary
>>>>>>> Goal C is needed to ensure that the database can be reloaded from an
>>>>>>> equivalent source, and to ensure that Permalinks are stable.
>>>>>>> None of the current id generator algorithms meet all of the above goals.
>>>>>>> The original and medium generators fail to meet goal A.
>>>>>>> The full generator fails to meet goal B (and therefore C).
>>>>>>> A sender can easily generate two messages with identical content; it
>>>>>>> is important to distinguish these.
>>>>>>> The Message-Id should help here.
>>>>>>> Message-ID is supposed to be unique, but in practice it may not be, so
>>>>>>> some additional fields need to be used to create the database id.
>>>>>>> For mailing lists the Return-Path will normally contain a unique id
>>>>>>> which is used to identify bounces.
>>>>>>> In theory this might be sufficient on its own. Indeed the path might
>>>>>>> be usable without hashing. However the early ASF mailing list software
>>>>>>> did not use unique Return-Paths.
>>>>>>> The existing mod_mbox solution uses a combination of Message-Id plus
>>>>>>> YYYYMM plus a list identifier. Have there ever been any collisions?
>>>>>> It's certainly possible for mod_mbox to contain multiple messages with
>>>>>> the same id.
>>>>>> For example:
>>>>>> From dev-return-15575-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org Tue Jun  7 20:08:04 2016
>>>>>> and
>>>>>> From dev-return-15577-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org Tue Jun  7 20:24:24 2016
>>>>>> both have the same id:
>>>>>> Message-Id: <bcd87d1a-bfb6-4f2a-92cc-af8f1d16b...@apache.org>
>>>>>> It's exactly the same message which arrived twice in the same mailing
>>>>>> list a few seconds apart.
>>>>>> Maybe it was sent using Bcc as well as To: ?
>>>>>> It's not exactly a collision, but at present mod_mbox is able to store
>>>>>> both whereas Pony Mail cannot (except if using the full generator,
>>>>>> which has other problems)
>>>>>> I think it's important to store the full message history in the database.
>>>>>> For example, if one of the messages bounces, it would be odd if the
>>>>>> source of the bounce were not in the database.
>>>>>> Also the message sequences will be incomplete.
>>>>>> This is the case for lists.a.o, the mbox
>>>>>> https://lists.apache.org/api/mbox.lua?list=d...@airavata.apache.org&date=2016-6
>>>>>> does not have the message sequence number 15575
>>>>>>> How about using the following:
>>>>>>> Message-Id
>>>>>>> Date
>>>>>>> Return-Path
>>>>>> In this case the return-path can be used to distinguish the messages.
>>>>> Note that the path depends on where the message is stored:
>>>>> Return-Path: <dev-return-15577-archive-asf-public=cust-asf.ponee...@airavata.apache.org>
>>>>> as against
>>>>> Return-Path: <dev-return-15577-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org>
>>>>> So it will need adjusting to extract the parts that are the same for
>>>>> all message sources, whether that is mod_mbox or Pony Mail or sent to
>>>>> another subscriber.
>>>> How about we:
>>>> 1) select a larger list of headers that we know to be the same across
>>>> the same email sent to different MTAs, and combine them for the ID
>>>> generation.
>>> Sort of.
>>> AFAICT the *only* value that can distinguish multiple postings (of
>>> which there seem to be quite a lot) is the mailing list sequence
>>> number, which in ezmlm is only available in the Return-Path.
>>> However the sequence could perhaps be reset - so it seems risky to
>>> rely on that plus the list id alone.
>>> Also the earliest messages did not have sequence numbers.
>>> So there needs to be another way to identify distinct messages.
>>> In theory, that is the Message-Id.
>>> Even if it is not completely unique, it should be OK in combination
>>> with the sequence number.
>>> If the Message-Id does not exist or is very poor, then I think the
>>> only solution is to look at most, if not all, of the parts of a message
>>> that a user can vary.
>>> This is what the full id generation does, except that it also includes
>>> the MTA-specific headers, causing problems with repeatability.
>>> There is another aspect to this. At present the generation is done
>>> using the parsed mail, rather than the original.
>>> If there is any chance that the format can change between releases of
>>> the library, then this may destroy repeatability.
>>> Likewise if a different library is ever used.
>>> We may need to additionally ensure that all headers are in a canonical
>>> form before use, in case MTAs decide to vary the layout.
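
A canonical form for headers could look something like the sketch below: unfold continuation lines, collapse runs of whitespace, and lowercase the header name before hashing. This is similar in spirit to the "relaxed" header canonicalization used by DKIM, but it is not an implementation of it. The names `canonical_header` and `canonical_digest`, the sorting step, and the use of SHA-256 are assumptions made for illustration.

```python
import hashlib
import re


def canonical_header(name, value):
    """Hypothetical canonical form: lowercase name, unfolded and whitespace-collapsed value."""
    unfolded = re.sub(r"\r?\n[ \t]+", " ", value)       # undo header folding
    collapsed = re.sub(r"[ \t]+", " ", unfolded).strip()  # collapse whitespace runs
    return f"{name.strip().lower()}:{collapsed}"


def canonical_digest(headers):
    """Hash (name, value) pairs in a way that ignores layout and header order."""
    lines = sorted(canonical_header(name, value) for name, value in headers)
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()
```

Sorting makes the digest independent of the order in which headers appear, which may or may not be desirable; dropping the sort keeps the original order significant.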
>>>> 2) Store some more headers in the doc :)
>>> Those might be useful for searching, but each posting needs its own
>>> unique id otherwise it cannot be stored in the first place.
>>>>> The format will presumably depend on the mailing list software that is
>>>>> used.
>>>>>>> List-Id
>>>>>> I think this must be the original List-Id, not any override.
>>>>>> Otherwise there may be problems with permalinks if a list name is
>>>>>> updated - the old permalink will no longer work.
>>>>>>> Whatever new algorithm is chosen, I think it's important that the
>>>>>>> format looks different from the existing ones. e.g. one could drop the
>>>>>>> <> around the list id.
