Re: Dealing with 'unparseable'/undisplayable mail

sebb Fri, 18 Nov 2016 07:16:13 -0800

On 18 November 2016 at 12:23, Shane Curcuru <[email protected]> wrote:
> +1 in general to the idea, although I don't have specific ideas on
> implementation.
>
> Question for the eventual implementation: would the unique URL/ID for
> the message change if it were later re-parsed to understand the content?
>  I.e. will these unparseable messages have stable URLs?


I think that's a tangential issue.

Any ID generation algorithm needs to depend on the source mail
only, i.e. basically what's stored in the 'mbox_source' doc type.
Otherwise even re-importing mails that are currently understood OK may
not be stable either.

However I do agree that the same UID needs to be generated regardless
of whether the mail is read from an mbox file or reparsed from the
mbox_source.
The existing database copy may have arrived directly at the archiver
subscriber, in which case it will differ slightly from the contents
derived from an mbox file.
Ensuring a stable ID is the main subject of the thread "Better id generation".
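
As a rough illustration of an ID that depends on the source mail only,
here is a minimal sketch. It is not the archiver's actual algorithm:
the function name generate_id and the choice of SHA-256 are assumptions
made for the example, and it does not attempt to normalise the small
differences between an mbox-file copy and a directly received copy.

```python
import hashlib


def generate_id(raw_source: bytes) -> str:
    """Hypothetical ID: a digest of the raw message bytes only.

    Nothing derived from parsing (body text, decoded headers) is used,
    so improving the parser later cannot change the ID.
    """
    return hashlib.sha256(raw_source).hexdigest()


# Example input (made up for illustration): an HTML-only message.
raw = (
    b"From: someone@example.org\r\n"
    b"Subject: test\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html><body>hi</body></html>\r\n"
)

first_id = generate_id(raw)
# Re-reading the same stored source yields the same ID.
second_id = generate_id(bytes(raw))
```

Because only the stored source feeds the digest, a message that is
unparseable today keeps its ID after it is re-parsed.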

The change I am proposing here is to do with the parsed e-mail in the
'mbox' doc type.
This is used for searching and display.
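
A minimal sketch of the null-body idea from the proposal quoted below,
assuming the parsed record is a plain dict. The field names, helper
names and notice text here are hypothetical and only illustrate the
distinction between a null body and an empty one.

```python
from typing import Iterable, List, Optional

# Hypothetical wording; the real GUI text would be chosen separately.
UNPARSEABLE_NOTICE = (
    "This message could not be displayed. Please view the source."
)


def make_mbox_doc(mid: str, subject: str, body: Optional[str]) -> dict:
    """Build a parsed-mail record (field names are assumptions).

    body=None means the parser could not extract displayable text;
    body="" means the message really has an empty body.
    """
    return {"mid": mid, "subject": subject, "body": body}


def display_body(doc: dict) -> str:
    """Return what the GUI would show for this record."""
    if doc["body"] is None:
        return UNPARSEABLE_NOTICE
    return doc["body"]


def needs_reindex(docs: Iterable[dict]) -> List[str]:
    """List the IDs of records to re-parse once the parser improves."""
    return [d["mid"] for d in docs if d["body"] is None]


docs = [
    make_mbox_doc("a1", "HTML-only mail", None),
    make_mbox_doc("b2", "Empty mail", ""),
    make_mbox_doc("c3", "Normal mail", "Hello"),
]
```

Since the body stays null rather than holding the raw source, searches
and statistics over body text are unaffected by these records.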

> - Shane
>
> sebb wrote on 11/18/16 12:28 PM:
>> At present if the archiver cannot extract what it considers
>> displayable body text from the input, it may drop the message
>> entirely.
>>
>> For example, this currently happens for valid messages such as
>> HTML-only (Nexus) and for signature-only messages [1], [2]. (*)
>>
>> This does not make sense for an archiver.
>>
>> At the very least it should index the raw source, and put some kind of
>> marker in the summary record to show that it could not understand the
>> message structure.
>>
>> I don't think it would be good to store the message in the body itself.
>> That would mess up searches and statistics.
>>
>> Rather than add a separate flag, it occurs to me that this would be a
>> good use for storing the body as 'null'.
>>
>> It's not possible for a real message to have a null body - it may be
>> empty, but it cannot be null.
>>
>> The GUI could then be fixed to display a standard message explaining
>> that the message cannot be displayed, as is done by mod_mbox. At least
>> then readers can look at the source itself.
>>
>> Also, when the parser is improved to deal with more message layouts,
>> it would be easy to find such emails and re-index them.
>>
>> Does that make sense?
>>
>> [1] 
>> http://mail-archives.apache.org/mod_mbox/httpd-users/201212.mbox/%3Ce9bd8c2b31947867d3bf1174d27b8e01%40mail.gmail.com%3E
>>
>> [2] 
>> http://mail-archives.apache.org/mod_mbox/ofbiz-dev/201505.mbox/%3C26E553E8-7C7F-4CB3-A553-EE7487E2BC9C%40ecomify.de%3E
>>
>> (*) HTML-only messages are currently dropped if html2text is not available.
>> Sig-only messages are dropped because the code only checks multipart
>> messages for attachments.
>> Both of these can be fixed, now that they are known. But other issues
>> may arise in future.
>>
>
