[ https://issues.apache.org/jira/browse/TIKA-667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-667.
--------------------------------

    Resolution: Fixed
      Assignee: Jukka Zitting

Thanks! Patch committed in revision 1160018.

Note that I removed the log message that was emitted when a problem with a 
header field is encountered. In that situation I think it's fine to just 
silently ignore the field, just as Mime4J silently skips parse issues when 
strict parsing is not enabled.

PS. I also changed some tab indentation to spaces.

> Changes to RFC822Parser to support turning off strict parsing
> -------------------------------------------------------------
>
>                 Key: TIKA-667
>                 URL: https://issues.apache.org/jira/browse/TIKA-667
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.0
>            Reporter: Mark Butler
>            Assignee: Jukka Zitting
>            Priority: Minor
>             Fix For: 1.0
>
>         Attachments: mailparser.diff
>
>
> Currently in RFC822Parser, if Apache-Mime4J fails while parsing any field, 
> parsing of the whole document fails. This causes problems on the Enron 
> Corpus; see https://issues.apache.org/jira/browse/TIKA-657
>
> RFC822Parser is configured from a MimeEntityConfig object, which contains 
> an option for "strict parsing". Currently MailContentHandler only performs 
> strict parsing, i.e. if a MimeException is encountered while processing any 
> field in MailContentHandler.field(), processing of the document fails. 
> However, we may prefer not to have strict parsing, i.e. to continue even if 
> processing one or more fields fails. This can be achieved by placing a 
> try/catch block around the logic inside MailContentHandler.field() and 
> rethrowing the error only if strictParsing is enabled; otherwise we log the 
> error.
>
> I enclose a diff for RFC822Parser and MailContentHandler that does this. I 
> have also made some other minor changes to MailContentHandler: there was some 
> repeated code for handling the To:, Cc: and Bcc: fields, so I have replaced 
> it with a single private method, and I have rewritten stripOutFieldPrefix to 
> avoid manipulating the String through repeated reassignment.
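
For readers of the archive, the try/catch approach described in the issue can 
be sketched roughly as follows. This is a minimal, self-contained illustration 
and not the committed patch: LenientFieldHandler and FieldParseException are 
hypothetical stand-ins for MailContentHandler and Mime4J's MimeException, and 
the "Name: value" splitting is a placeholder for the real field parsing. In 
lenient mode the sketch skips the bad field silently, as the resolution note 
describes, rather than logging it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a checked parse failure such as MimeException.
class FieldParseException extends Exception {
    FieldParseException(String message) {
        super(message);
    }
}

// Hypothetical handler illustrating the pattern; not Tika's MailContentHandler.
class LenientFieldHandler {
    private final boolean strictParsing;
    private final Map<String, String> metadata = new LinkedHashMap<>();

    LenientFieldHandler(boolean strictParsing) {
        this.strictParsing = strictParsing;
    }

    // Handles one raw header line of the form "Name: value".
    void field(String rawField) throws FieldParseException {
        try {
            int colon = rawField.indexOf(':');
            if (colon <= 0) {
                throw new FieldParseException("Malformed header field: " + rawField);
            }
            String name = rawField.substring(0, colon).trim();
            String value = rawField.substring(colon + 1).trim();
            metadata.put(name, value);
        } catch (FieldParseException e) {
            if (strictParsing) {
                // Strict mode: a single bad field fails the whole document.
                throw e;
            }
            // Lenient mode: ignore this field and continue with the next one.
        }
    }

    Map<String, String> getMetadata() {
        return metadata;
    }
}
```

With strictParsing set to false, a malformed line is dropped and the 
remaining fields are still recorded; with strictParsing set to true, the same 
line propagates the exception to the caller.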

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
