[
https://issues.apache.org/jira/browse/TIKA-657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13040149#comment-13040149
]
Mark Butler commented on TIKA-657:
----------------------------------
I took the Enron dataset and processed it using Tika and Behemoth. It contains
517,424 documents.
Using Tika 0.9, I encountered runtime errors on 27,224 documents. Sorting the
exceptions showed four distinct stack traces. I enclose a summary of these
exceptions below. However, I did not see the problems with Tagsoup parsing
that Benson reports.
I then took the version of Tika in head. Here I encountered runtime errors on
only 1,218 documents. I also enclose a summary of these exceptions below. There
were two sources of error. First, the Enron corpus contains emails with lines
longer than the default limit of 10,000 characters used by RFC822Parser.
Second, the corpus contains malformed dates, which cause apache-mime4j to throw
a MimeException.
The first problem is easily fixed: RFC822Parser is configured from a
MimeEntityConfig object, so passing in an object with a higher MaxLineLen
(e.g. 60,000) avoids these exceptions. I noticed that MimeEntityConfig also
contains an option for "strict parsing". Currently MailContentHandler behaves
as if strict parsing were always on: if a MimeException is thrown while
processing any field in MailContentHandler.field(), it is passed back up and
processing of the document fails. However, we may prefer non-strict parsing,
i.e. continuing even if processing one or more fields fails. This can be
achieved by placing a try / catch block around the logic inside
MailContentHandler.field() and rethrowing the exception only if strictParsing
is enabled; otherwise we log the error.
With both changes in place, I re-ran Tika on the entire corpus and it parsed
successfully.
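The try / catch change described above can be sketched roughly as follows.
This is a minimal, self-contained illustration only: LenientFieldSketch,
FieldParseException, Config, parseField, lineFits and documentParses are all
hypothetical stand-ins, not the real Tika or apache-mime4j classes, and the
"malformed date" check is deliberately simplistic.

```java
import java.util.ArrayList;
import java.util.List;

// Self-contained sketch: stand-ins only, not the real Tika or mime4j classes.
class LenientFieldSketch {

    // Stand-in for apache-mime4j's checked MimeException.
    static class FieldParseException extends Exception {
        FieldParseException(String message) {
            super(message);
        }
    }

    // Stand-in for the two MimeEntityConfig settings discussed above.
    static class Config {
        int maxLineLen = 10000;         // default limit mentioned in the comment
        boolean strictParsing = false;  // assumed default for this sketch
    }

    final Config config;
    final List<String> parsed = new ArrayList<String>();
    final List<String> logged = new ArrayList<String>();

    LenientFieldSketch(Config config) {
        this.config = config;
    }

    // Hypothetical line-length check: true if the line is within the limit.
    boolean lineFits(String line) {
        return line.length() <= config.maxLineLen;
    }

    // Hypothetical per-field logic: a Date value without any digit is
    // treated as malformed and rejected.
    void parseField(String name, String value) throws FieldParseException {
        if ("Date".equalsIgnoreCase(name) && !value.matches(".*\\d.*")) {
            throw new FieldParseException("Malformed date: " + value);
        }
        parsed.add(name + ": " + value);
    }

    // The proposed pattern: wrap the field logic, rethrow only when strict,
    // otherwise record the error and carry on with the next field.
    void field(String name, String value) throws FieldParseException {
        try {
            parseField(name, value);
        } catch (FieldParseException e) {
            if (config.strictParsing) {
                throw e;
            }
            logged.add(e.getMessage());
        }
    }

    // Returns true if every field is handled without the exception escaping.
    static boolean documentParses(boolean strict, String[][] fields) {
        Config config = new Config();
        config.strictParsing = strict;
        LenientFieldSketch handler = new LenientFieldSketch(config);
        try {
            for (String[] f : fields) {
                handler.field(f[0], f[1]);
            }
            return true;
        } catch (FieldParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[][] fields = { {"Date", "not a date"}, {"Subject", "hello"} };
        System.out.println("strict:  " + documentParses(true, fields));
        System.out.println("lenient: " + documentParses(false, fields));
    }
}
```

In the real code the same shape would presumably sit inside
MailContentHandler.field(), with the log call going to whatever logging
facility Tika already uses, and the larger line limit would be set on the
MimeEntityConfig instance handed to RFC822Parser.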
> Email parser gets into trouble on malformed html in enron corpus
> ----------------------------------------------------------------
>
> Key: TIKA-657
> URL: https://issues.apache.org/jira/browse/TIKA-657
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.9
> Reporter: Benson Margulies
> Assignee: Julien Nioche
>
> There is a very large corpus of email addresses available:
> http://www.cs.cmu.edu/~enron/.
> In processing even a subset of this corpus, I see numerous 'unexpected
> RuntimeException' errors resulting from tagsoup throwing on truly awful html.
> It seems to me that being able to do something with this entire stack would
> make a good '1.0' criteria for tika's email parser.