[ 
https://issues.apache.org/jira/browse/TIKA-657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13040149#comment-13040149
 ] 

Mark Butler commented on TIKA-657:
----------------------------------

I took the Enron dataset and processed it using Tika and Behemoth. It contains 
517,424 documents.

Using Tika 0.9, I encountered runtime errors on 27,224 documents. Sorting the 
exceptions showed four distinct stack traces; I enclose a summary of them 
below. However, I did not see the problems with Tagsoup parsing that Benson 
reports. 

I then took the version of Tika in head. Here I encountered runtime errors on 
1,218 documents; I enclose a summary of these exceptions below as well. There 
were two sources of error. First, the Enron corpus contains emails with lines 
longer than the default limit of 10,000 characters used in RFC822Parser. 
Second, the corpus contains malformed dates, which cause apache-mime4j to 
throw a MimeException. 

The first problem is easily fixed: RFC822Parser is configured from a 
MimeEntityConfig object, so passing in an object with a higher MaxLineLen 
(e.g. 60,000) avoids these exceptions. I noticed that MimeEntityConfig also 
has a "strict parsing" option. Currently MailContentHandler always behaves 
strictly: if a MimeException is thrown while processing any field in 
MailContentHandler.field(), it is passed back up and processing of the 
document fails. However, we may prefer lenient parsing, i.e. continuing even 
if processing one or more fields fails. This can be achieved by placing a 
try / catch block around the logic inside MailContentHandler.field() and 
rethrowing the exception only if strictParsing is enabled; otherwise we just 
log it.
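
To make the proposed change concrete, here is a minimal, self-contained sketch 
of the try / catch pattern. It does not use the real Tika or mime4j classes: 
FieldHandler, parseField() and the nested MimeException are hypothetical 
stand-ins for MailContentHandler, its field logic and mime4j's MimeException, 
and the date check is a toy rule chosen only to trigger a failure.

```java
import java.util.ArrayList;
import java.util.List;

public class LenientFieldHandling {

    /** Hypothetical stand-in for mime4j's MimeException. */
    static class MimeException extends Exception {
        MimeException(String message) {
            super(message);
        }
    }

    /** Hypothetical stand-in for MailContentHandler. */
    static class FieldHandler {
        private final boolean strictParsing;
        private final List<String> parsed = new ArrayList<>();
        private final List<String> skipped = new ArrayList<>();

        FieldHandler(boolean strictParsing) {
            this.strictParsing = strictParsing;
        }

        /** Processes one header field; a bad field aborts only in strict mode. */
        void field(String name, String value) throws MimeException {
            try {
                parseField(name, value);
                parsed.add(name);
            } catch (MimeException e) {
                if (strictParsing) {
                    throw e; // strict: fail the whole document
                }
                // lenient: record and log the problem, then carry on
                skipped.add(name);
                System.err.println("Skipping field " + name + ": " + e.getMessage());
            }
        }

        /** Toy parser: rejects a Date field that has no four-digit year. */
        private void parseField(String name, String value) throws MimeException {
            if (name.equalsIgnoreCase("Date") && !value.matches(".*\\d{4}.*")) {
                throw new MimeException("malformed date: " + value);
            }
        }

        List<String> parsed() {
            return parsed;
        }

        List<String> skipped() {
            return skipped;
        }
    }

    public static void main(String[] args) throws MimeException {
        FieldHandler lenient = new FieldHandler(false);
        lenient.field("Subject", "Gas prices");
        lenient.field("Date", "not a date");
        lenient.field("From", "someone@example.com");
        System.out.println("parsed=" + lenient.parsed() + " skipped=" + lenient.skipped());
    }
}
```

With strictParsing set to false, the malformed Date field is logged and 
skipped while the remaining fields are still processed; with it set to true, 
the same input makes field() throw, which matches the current behaviour.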

With these changes in place, I re-ran Tika on the entire corpus and it parsed 
successfully.

> Email parser gets into trouble on malformed html in enron corpus
> ----------------------------------------------------------------
>
>                 Key: TIKA-657
>                 URL: https://issues.apache.org/jira/browse/TIKA-657
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Benson Margulies
>            Assignee: Julien Nioche
>
> There is a very large corpus of email addresses available: 
> http://www.cs.cmu.edu/~enron/.
> In processing even a subset of this corpus, I see numerous 'unexpected 
> RuntimeException' errors resulting from tagsoup throwing on truly awful html. 
> It seems to me that being able to do something with this entire stack would 
> make a good '1.0' criteria for tika's email parser.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
