[ https://issues.apache.org/jira/browse/TIKA-667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mark Butler updated TIKA-667:
-----------------------------
Attachment: mailparser.diff
Diff for RFC822Parser.java and MailContentHandler.java
> Changes to RFC822Parser to support turning off strict parsing
> -------------------------------------------------------------
>
> Key: TIKA-667
> URL: https://issues.apache.org/jira/browse/TIKA-667
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.0
> Reporter: Mark Butler
> Priority: Minor
> Fix For: 1.0
>
> Attachments: mailparser.diff
>
>
> Currently in RFC822Parser, if Apache Mime4J fails while parsing any field,
> then parsing of the whole document fails. This causes problems on the Enron
> Corpus - see https://issues.apache.org/jira/browse/TIKA-657
> RFC822Parser is configured from a MimeEntityConfig object. MimeEntityConfig
> contains an option for "strict parsing". Currently MailContentHandler only
> performs strict parsing, i.e. if a MimeException is encountered while
> processing any field in MailContentHandler.field() then processing the
> document fails. However, we may prefer non-strict parsing, i.e. to
> continue even if processing one or more fields fails. This can be achieved by
> placing a try / catch block around the logic inside
> MailContentHandler.field(), and rethrowing the error only if strictParsing is
> enabled; otherwise we log the error.
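The try / catch arrangement described above could look roughly like the following. This is only a minimal sketch, not the attached diff: FieldParseException stands in for Mime4J's MimeException, LenientFieldHandler and parseField are hypothetical names, and the "skipped" list takes the place of a real logger so the example runs on its own.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Mime4J's MimeException, so the sketch compiles alone.
class FieldParseException extends Exception {
    FieldParseException(String message) {
        super(message);
    }
}

// Hypothetical, simplified handler; it is not the real MailContentHandler.
class LenientFieldHandler {
    private final boolean strictParsing;
    private final List<String> parsed = new ArrayList<>();
    private final List<String> skipped = new ArrayList<>();

    LenientFieldHandler(boolean strictParsing) {
        this.strictParsing = strictParsing;
    }

    // Wraps the per-field logic: rethrow when strict, otherwise record and continue.
    void field(String rawField) throws FieldParseException {
        try {
            parsed.add(parseField(rawField));
        } catch (FieldParseException e) {
            if (strictParsing) {
                throw e;
            }
            // A real handler would log the error here; the sketch just records the field.
            skipped.add(rawField);
        }
    }

    // Toy parser (assumption): a header field must contain a name followed by ':'.
    private static String parseField(String rawField) throws FieldParseException {
        int colon = rawField.indexOf(':');
        if (colon <= 0) {
            throw new FieldParseException("Malformed header field: " + rawField);
        }
        return rawField.substring(0, colon).trim() + "=" + rawField.substring(colon + 1).trim();
    }

    List<String> parsed() {
        return parsed;
    }

    List<String> skipped() {
        return skipped;
    }
}
```

With strictParsing set to false, a malformed field ends up in skipped() and later fields are still processed; with it set to true, the first malformed field aborts processing, which matches the current behaviour.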
> I enclose a diff for RFC822Parser and MailContentHandler that does this. I
> have also made some other minor changes to MailContentHandler: there was some
> repeated code for handling the To:, Cc: and Bcc: fields, so I have replaced
> that with a single private method, and I have rewritten stripOutFieldPrefix
> to avoid manipulating the String through re-assignment.
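The two smaller refactorings mentioned above might be sketched as follows. Everything here is an assumption made for illustration rather than the content of mailparser.diff: the class name, the processAddressField helper, the String-based signatures and the metadata keys are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; it is not the real MailContentHandler.
class AddressFieldSketch {
    private final Map<String, String> metadata = new LinkedHashMap<>();

    // Computes one index and takes a single substring, instead of
    // repeatedly re-assigning a shrinking String.
    static String stripOutFieldPrefix(String rawField, String prefix) {
        if (rawField.length() < prefix.length()) {
            return "";
        }
        int loc = prefix.length();
        while (loc < rawField.length() && rawField.charAt(loc) == ' ') {
            loc++;
        }
        return rawField.substring(loc);
    }

    // To:, Cc: and Bcc: all go through the same private helper.
    void field(String rawField) {
        String lower = rawField.toLowerCase(Locale.ROOT);
        if (lower.startsWith("to:")) {
            processAddressField(rawField, "To:", "message-to");
        } else if (lower.startsWith("cc:")) {
            processAddressField(rawField, "Cc:", "message-cc");
        } else if (lower.startsWith("bcc:")) {
            processAddressField(rawField, "Bcc:", "message-bcc");
        }
    }

    // Single shared method replacing three near-identical code paths.
    // The metadata keys are placeholders, not Tika's actual property names.
    private void processAddressField(String rawField, String prefix, String key) {
        metadata.put(key, stripOutFieldPrefix(rawField, prefix));
    }

    Map<String, String> metadata() {
        return metadata;
    }
}
```

Only the prefix length is used to find where the value starts, so the helper works regardless of the case in which the header name was written.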
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira