Re: Dealing with 'unparseable'/undisplayable mail

sebb Sat, 19 Nov 2016 02:38:27 -0800

On 18 November 2016 at 15:15, sebb <[email protected]> wrote:
> On 18 November 2016 at 12:23, Shane Curcuru <[email protected]> wrote:
>> +1 in general to the idea, although I don't have specific ideas on
>> implementation.
>>
>> Question for the eventual implementation: would the unique URL/ID for
>> the message change if it were later re-parsed to understand the content?
>> I.e. will these unparseable messages have stable URLs?
>
> I think that's a tangential issue.
>
> Any ID generation algorithm needs to depend on the source mail
> only, i.e. basically what's stored in the 'mbox_source' doc type.
> Otherwise even re-importing mails that are currently understood OK may
> not be stable either.
>
> However I do agree that the same UID needs to be generated regardless
> of whether the mail is read from an mbox file or reparsed from the
> mbox_source.
> The existing database copy may have arrived directly at the archiver
> subscriber, which will be slightly different from the contents which
> are derived from an mbox file.
> Ensuring a stable ID is the main subject of the thread "Better id generation".


The existing 'medium' and 'original' id generation schemes use the parsed body.
That is fragile. (The 'full' scheme uses the message source but has other issues.)

Any changes to the extracted body content will change the id.
For example, if html2text is ever updated it might generate slightly
different output.
(Though at present it is not in use).

However, this change is not about changing the body content; it is
about giving a meaning to a null body.

It would mean that some mails which did not previously have a MID
would now be given one.

But I agree that if the parsing was later improved, re-importing would
create a non-null body that would change the ID. But then any change
to the body output will cause a change in ID with the current scheme.
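
To illustrate the fragility, here is a minimal Python sketch. It is not
the actual 'medium' or 'original' generator code: the function names,
the choice of SHA-224, and hashing a null body as an empty string are
all assumptions made for the example. It only shows that an id derived
from the parsed body moves whenever the parser output moves, whereas an
id derived from the raw source does not.

```python
import hashlib
from typing import Optional


def id_from_body(body: Optional[str]) -> str:
    """Hypothetical body-based id: hashes the *parsed* body text.

    Assumption: a null body is hashed as the empty string.
    """
    text = "" if body is None else body
    return hashlib.sha224(text.encode("utf-8")).hexdigest()


def id_from_source(raw: bytes) -> str:
    """Hypothetical source-based id: hashes the raw message bytes only."""
    return hashlib.sha224(raw).hexdigest()


# One raw message, as it might be stored in 'mbox_source'.
raw = b"From: a@example.org\r\nSubject: hi\r\n\r\n<p>Hello</p>\r\n"

# Three possible parser results for that same source:
body_unparsed = None       # parser could not extract a body
body_v1 = "Hello"          # parser improved later
body_v2 = "Hello\n"        # converter updated, output differs slightly

# The body-based id changes with each parser result...
body_ids = {id_from_body(b) for b in (body_unparsed, body_v1, body_v2)}
print(len(body_ids))  # number of distinct body-based ids

# ...while the source-based id depends only on the stored source.
print(id_from_source(raw) == id_from_source(raw))
```

With a source-based id, re-importing after a parser fix would leave the
id (and hence the URL) unchanged, which is the property Shane asked about.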

> The change I am proposing here is to do with the parsed e-mail in the
> 'mbox' doc type.
> This is used for searching and display.
>
>> - Shane
>>
>> sebb wrote on 11/18/16 12:28 PM:
>>> At present if the archiver cannot extract what it considers
>>> displayable body text from the input, it may drop the message
>>> entirely.
>>>
>>> For example, this currently happens for valid messages such as
>>> HTML-only (Nexus) and for signature-only messages [1], [2]. (*)
>>>
>>> This does not make sense for an archiver.
>>>
>>> At the very least it should index the raw source, and put some kind of
>>> marker in the summary record to show that it could not understand the
>>> message structure.
>>>
>>> I don't think it would be good to store the message in the body itself.
>>> That would mess up searches and statistics.
>>>
>>> Rather than add a separate flag, it occurs to me that this would be a
>>> good use for storing the body as 'null'.
>>>
>>> It's not possible for a real message to have a null body - it may be
>>> empty, but it cannot be null.
>>>
>>> The GUI could then be fixed to display a standard message explaining
>>> that the message cannot be displayed, as is done by mod_mbox. At least
>>> then readers can look at the source itself.
>>>
>>> Also, when the parser is improved to deal with more message layouts,
>>> it would be easy to find such emails and re-index them.
>>>
>>> Does that make sense?
>>>
>>> [1] 
>>> http://mail-archives.apache.org/mod_mbox/httpd-users/201212.mbox/%3Ce9bd8c2b31947867d3bf1174d27b8e01%40mail.gmail.com%3E
>>>
>>> [2] 
>>> http://mail-archives.apache.org/mod_mbox/ofbiz-dev/201505.mbox/%3C26E553E8-7C7F-4CB3-A553-EE7487E2BC9C%40ecomify.de%3E
>>>
>>> (*) HTML-only messages are currently dropped if html2text is not available.
>>> Sig-only messages are dropped because the code only checks multipart
>>> messages for attachments.
>>> Both of these can be fixed, now that they are known. But other issues
>>> may arise in future.
>>>
>>
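
To make the null-body proposal quoted above concrete, here is a rough
Python sketch. It does not reflect the real archiver or GUI code: the
document layout, the function names and the placeholder wording are all
hypothetical. It shows the three points from the proposal: a null body
is distinct from an empty one, the display side can substitute a
standard notice, and the affected mails are easy to find for
re-indexing.

```python
from typing import List, Optional

# Hypothetical wording for the notice shown instead of the body.
PLACEHOLDER = "[This message could not be displayed; please view the source.]"


def make_doc(mid: str, body: Optional[str]) -> dict:
    """Build a simplified 'mbox'-style document (assumed layout).

    body=None means the parser could not extract displayable text;
    body="" means the mail really had an empty body.
    """
    return {"mid": mid, "body": body}


def display_body(doc: dict) -> str:
    """Return the text to show: the body, or the notice if it is null."""
    body = doc["body"]
    return PLACEHOLDER if body is None else body


def needs_reindex(docs: List[dict]) -> List[str]:
    """List the ids of mails whose body is null, for a later re-parse."""
    return [d["mid"] for d in docs if d["body"] is None]


docs = [
    make_doc("a", None),     # not understood by the parser
    make_doc("b", ""),       # genuinely empty mail
    make_doc("c", "Hello"),  # parsed normally
]

for d in docs:
    print(d["mid"], repr(display_body(d)))

print(needs_reindex(docs))
```

Because the notice is generated at display time and never stored in the
body field, it does not affect searches or statistics.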
