+1 in general to the idea, although I don't have specific ideas on implementation.
Question for the eventual implementation: would the unique URL/ID for
the message change if it were later re-parsed to understand the content?
I.e. will these unparseable messages have stable URLs?

- Shane

sebb wrote on 11/18/16 12:28 PM:
> At present if the archiver cannot extract what it considers
> displayable body text from the input, it may drop the message
> entirely.
>
> For example, this currently happens for valid messages such as
> HTML-only (Nexus) and for signature-only messages [1], [2]. (*)
>
> This does not make sense for an archiver.
>
> At the very least it should index the raw source, and put some kind of
> marker in the summary record to show that it could not understand the
> message structure.
>
> I don't think it would be good to store the message in the body itself.
> That would mess up searches and statistics.
>
> Rather than add a separate flag, it occurs to me that this would be a
> good use for storing the body as 'null'.
>
> It's not possible for a real message to have a null body - it may be
> empty, but it cannot be null.
>
> The GUI could then be fixed to display a standard message explaining
> that the message cannot be displayed, as is done by mod_mbox. At least
> then readers can look at the source itself.
>
> Also, when the parser is improved to deal with more message layouts,
> it would be easy to find such emails and re-index them.
>
> Does that make sense?
>
> [1]
> http://mail-archives.apache.org/mod_mbox/httpd-users/201212.mbox/%3Ce9bd8c2b31947867d3bf1174d27b8e01%40mail.gmail.com%3E
>
> [2]
> http://mail-archives.apache.org/mod_mbox/ofbiz-dev/201505.mbox/%3C26E553E8-7C7F-4CB3-A553-EE7487E2BC9C%40ecomify.de%3E
>
> (*) HTML-only messages are currently dropped if html2text is not available
> Sig-only messages are dropped because the code only checks multipart
> messages for attachments.
> Both of these can be fixed, now that they are known. But other issues
> may arise in future.
>
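
To make the null-body idea concrete, here is a rough Python sketch. It is not the archiver's actual code: Record, extract_body, archive, display and needs_reindex are all hypothetical names, and the toy parser only understands text/plain parts. Deriving the ID from the raw source is likewise just an assumption, shown as one way the URL could stay stable across a later re-parse.

```python
import hashlib
from dataclasses import dataclass
from email import message_from_string
from typing import List, Optional

# Hypothetical placeholder text the GUI could show for unparsed messages.
NOTICE = "This message cannot be displayed; please view the raw source."


@dataclass
class Record:
    mid: str              # assumed: ID derived from the raw source only
    source: str           # raw message source, always stored
    body: Optional[str]   # None = could not be parsed; "" = genuinely empty


def extract_body(source: str) -> Optional[str]:
    """Toy parser: return the first text/plain part, or None if there is none."""
    msg = message_from_string(source)
    for part in msg.walk():
        if not part.is_multipart() and part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    return None  # nothing displayable found: keep the message, mark body as null


def archive(source: str) -> Record:
    # The ID depends only on the raw source, so re-parsing cannot change it.
    mid = hashlib.sha256(source.encode("utf-8")).hexdigest()[:16]
    return Record(mid=mid, source=source, body=extract_body(source))


def display(rec: Record) -> str:
    """What a GUI might show: the body, or a standard notice for null bodies."""
    return NOTICE if rec.body is None else rec.body


def needs_reindex(records: List[Record]) -> List[Record]:
    """Find messages to retry once the parser has been improved."""
    return [r for r in records if r.body is None]
```

With this shape an HTML-only message is kept with body None rather than dropped, an empty message keeps body "", and the two stay distinguishable in searches and statistics.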
