On 18 November 2016 at 12:23, Shane Curcuru <[email protected]> wrote:
> +1 in general to the idea, although I don't have specific ideas on
> implementation.
>
> Question for the eventual implementation: would the unique URL/ID for
> the message change if it were later re-parsed to understand the content?
> I.e. will these unparseable messages have stable URLs?

I think that's a tangential issue.

Any ID generation algorithm needs to depend on the source mail only,
i.e. basically what's stored in the 'mbox_source' doc type.
Otherwise even re-importing mails that are currently understood OK may
not produce stable IDs either.

However I do agree that the same UID needs to be generated regardless
of whether the mail is read from an mbox file or reparsed from the
mbox_source. The existing database copy may have arrived directly at
the archiver subscriber, in which case it will differ slightly from
the contents which are derived from an mbox file.

Ensuring a stable ID is the main subject of the thread "Better id
generation".

The change I am proposing here concerns the parsed e-mail in the
'mbox' doc type. This is used for searching and display.

> - Shane
>
> sebb wrote on 11/18/16 12:28 PM:
>> At present if the archiver cannot extract what it considers
>> displayable body text from the input, it may drop the message
>> entirely.
>>
>> For example, this currently happens for valid messages such as
>> HTML-only (Nexus) and for signature-only messages [1], [2]. (*)
>>
>> This does not make sense for an archiver.
>>
>> At the very least it should index the raw source, and put some kind of
>> marker in the summary record to show that it could not understand the
>> message structure.
>>
>> I don't think it would be good to store the message in the body itself.
>> That would mess up searches and statistics.
>>
>> Rather than add a separate flag, it occurs to me that this would be a
>> good use for storing the body as 'null'.
>>
>> It's not possible for a real message to have a null body - it may be
>> empty, but it cannot be null.
>>
>> The GUI could then be fixed to display a standard message explaining
>> that the message cannot be displayed, as is done by mod_mbox. At least
>> then readers can look at the source itself.
>>
>> Also, when the parser is improved to deal with more message layouts,
>> it would be easy to find such emails and re-index them.
>>
>> Does that make sense?
>>
>> [1]
>> http://mail-archives.apache.org/mod_mbox/httpd-users/201212.mbox/%3Ce9bd8c2b31947867d3bf1174d27b8e01%40mail.gmail.com%3E
>>
>> [2]
>> http://mail-archives.apache.org/mod_mbox/ofbiz-dev/201505.mbox/%3C26E553E8-7C7F-4CB3-A553-EE7487E2BC9C%40ecomify.de%3E
>>
>> (*) HTML-only messages are currently dropped if html2text is not available
>> Sig-only messages are dropped because the code only checks multipart
>> messages for attachments.
>> Both of these can be fixed, now that they are known. But other issues
>> may arise in future.
>>
>
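
To make the null-body idea concrete, here is a minimal sketch using only
the Python standard library. It is not the archiver's actual code: the
function names (extract_body, build_doc, needs_reindex) and the document
fields ('subject', 'body') are hypothetical, and the parser only looks
for a text/plain part. The point is the distinction between an empty
body ("") and an unparseable one (None, stored as null).

```python
from email import message_from_bytes
from email.message import Message
from typing import Optional


def extract_body(msg: Message) -> Optional[str]:
    """Return displayable text, or None if no text/plain part is found.

    Hypothetical helper: a real parser would handle more layouts.
    """
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        if part.get_content_disposition() == "attachment":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            return payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset name: fall back rather than drop the mail.
            return payload.decode("utf-8", errors="replace")
    return None


def build_doc(raw: bytes) -> dict:
    """Build the parsed record; the message is never dropped.

    'body' is None (null) when the structure was not understood.
    """
    msg = message_from_bytes(raw)
    return {"subject": msg.get("Subject", ""), "body": extract_body(msg)}


def needs_reindex(doc: dict) -> bool:
    """True for records worth re-parsing once the parser improves."""
    return doc["body"] is None
```

With this shape, an HTML-only mail yields a record whose body is None,
while a mail with no text at all yields "". The GUI can test for None to
show the "cannot be displayed" notice, and a later re-index pass only
has to select the records for which needs_reindex() is true.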
