[ https://issues.apache.org/jira/browse/TIKA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benjamin Douglas updated TIKA-461:
----------------------------------
Attachment: TIKA-461-config.patch
Running the current trunk against the Enron email set revealed one weakness of
the mime4j default configuration: by default, no individual header may exceed
1000 characters. That limit is easy to exceed when an email is sent to a large
group of people. This latest patch raises the limit to 10,000 characters, which
should be reasonable for most valid emails.
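
To illustrate how quickly a recipient list outgrows the old limit, here is a
minimal, self-contained sketch. It does not use the mime4j API: the class
HeaderLimitSketch, its methods, and the example.com addresses are all
hypothetical, and only the two limit values (1000 and 10,000) come from the
comment above.

```java
import java.util.ArrayList;
import java.util.List;

public class HeaderLimitSketch {
    // Limits quoted in the comment above; the constant names are made up here.
    static final int OLD_LIMIT = 1000;
    static final int NEW_LIMIT = 10_000;

    /** Builds an unfolded "To:" header listing the given number of made-up recipients. */
    static String buildToHeader(int recipients) {
        List<String> addresses = new ArrayList<>();
        for (int i = 0; i < recipients; i++) {
            addresses.add("user" + i + "@example.com");
        }
        return "To: " + String.join(", ", addresses);
    }

    /** Returns true when the header is no longer than maxLen characters. */
    static boolean fits(String header, int maxLen) {
        return header.length() <= maxLen;
    }

    public static void main(String[] args) {
        // 100 short addresses already produce a header of roughly 2000 characters.
        String header = buildToHeader(100);
        System.out.println("header length:  " + header.length());
        System.out.println("fits old limit: " + fits(header, OLD_LIMIT));
        System.out.println("fits new limit: " + fits(header, NEW_LIMIT));
    }
}
```

With these short synthetic addresses, about 100 recipients are enough to pass
the 1000-character limit, while the 10,000-character limit leaves room for
several hundred.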
> RFC822 messages not parsed
> --------------------------
>
> Key: TIKA-461
> URL: https://issues.apache.org/jira/browse/TIKA-461
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Affects Versions: 0.7
> Reporter: Joshua Turner
> Assignee: Julien Nioche
> Attachments: testRFC822-multipart, TIKA-461-config.patch,
> TIKA-461-parse.patch, TIKA-461-plus-tests-1.patch, TIKA-461.patch
>
>
> Presented with an RFC822 message exported from Thunderbird, AutodetectParser
> produces an empty body, and a Metadata containing only one key-value pair:
> "Content-Type=message/rfc822". Directly calling MboxParser likewise gives an
> empty body, but with two metadata pairs: "Content-Encoding=us-ascii
> Content-Type=application/mbox".
> A quick peek at the source of MboxParser shows that the implementation is
> pretty naive. If the wiring can be sorted out, something like Apache James'
> mime4j might be a better bet.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.