[
https://issues.apache.org/jira/browse/TIKA-640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022793#comment-13022793
]
Benjamin Douglas commented on TIKA-640:
---------------------------------------
I believe the real difference between these two scenarios is that the
IOException definitely comes from the current document being read while an
OutOfMemoryError might be caused by any number of other things, many of which
are not recoverable. Catching unchecked OutOfMemoryErrors, skipping the file,
and moving on is not something we want as part of our normal program flow.
Catching checked IOExceptions, skipping the file, and moving on seems more
reasonable.
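
To make the distinction concrete, here is a minimal sketch of the "skip the
file and move on" loop, written against the plain JDK only. DocumentParser and
parseAll are hypothetical names used for illustration; they are not Tika or
Mime4j API. Only the checked IOException is caught, so an OutOfMemoryError
would still propagate to the caller.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SkipOnIOException {

    /** Hypothetical stand-in for a per-document parser (not a Tika interface). */
    public interface DocumentParser {
        void parse(String document) throws IOException;
    }

    /**
     * Parses each document in turn and returns the names of those skipped.
     * Only IOException is caught: it is tied to the document being read.
     * Errors such as OutOfMemoryError are deliberately not caught here.
     */
    public static List<String> parseAll(List<String> documents, DocumentParser parser) {
        List<String> skipped = new ArrayList<>();
        for (String document : documents) {
            try {
                parser.parse(document);
            } catch (IOException e) {
                // The failure belongs to this document, so skip it and continue.
                skipped.add(document);
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        // Toy parser: any name starting with "bad" fails with an IOException.
        DocumentParser parser = doc -> {
            if (doc.startsWith("bad")) {
                throw new IOException("cannot read " + doc);
            }
        };
        List<String> skipped = parseAll(Arrays.asList("a.eml", "bad.eml", "c.eml"), parser);
        System.out.println("skipped: " + skipped);
    }
}
```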
That said, I will concede that having an email with a header so large that it
noticeably imposes on someone's heap should be quite rare. And, because of the
way that Tika handles metadata, it all needs to end up in memory anyway --
there is no streaming of metadata; it is represented as a bag of Strings. I
think the hard-coded limit has some value as a stopgap in cases where a bogus
file gets mis-detected, but that is a general problem not limited to RFC822
messages. We could certainly decide to err on the side of letting all
documents through, even bogus ones with potential memory problems, instead of
being too conservative and rejecting some valid documents.
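
As an illustration of the trade-off, the following sketch shows one way a
length limit can act as a stopgap while still reporting the failure as a
checked IOException. It is not how Mime4j implements its limits;
BoundedHeaderReader and readLine are hypothetical names, and treating a
negative limit as "no limit" is an assumption borrowed from the -1 values in
the snippet quoted from the issue.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class BoundedHeaderReader {

    /**
     * Reads characters up to the first '\n' or end of input.
     * Throws IOException once the line would exceed maxLen characters.
     * A negative maxLen disables the check (assumed convention).
     */
    public static String readLine(Reader in, int maxLen) throws IOException {
        StringBuilder line = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && c != '\n') {
            if (maxLen >= 0 && line.length() >= maxLen) {
                // Fail before buffering more of a possibly bogus input.
                throw new IOException("line exceeds " + maxLen + " characters");
            }
            line.append((char) c);
        }
        return line.toString();
    }

    public static void main(String[] args) throws IOException {
        // No limit: the whole line is returned.
        System.out.println(readLine(new StringReader("Subject: hello\n"), -1));
        try {
            // Limit of 5 characters: the same input is rejected.
            readLine(new StringReader("Subject: hello\n"), 5);
        } catch (IOException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With the limit disabled the caller accepts the memory cost of whatever the
input contains; with it enabled, an over-long but valid header is rejected
along with the bogus ones.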
> RFC822Parser should configure Mime4j not to fail reading mails containing
> more than 1000 chars in one headers text (even if folded)
> -----------------------------------------------------------------------------------------------------------------------------------
>
> Key: TIKA-640
> URL: https://issues.apache.org/jira/browse/TIKA-640
> Project: Tika
> Issue Type: Wish
> Components: parser
> Affects Versions: 0.9
> Environment: All
> Reporter: Jens Wilmer
> Labels: mail, rfc822parser
> Original Estimate: 5m
> Remaining Estimate: 5m
>
> The standard configuration of Mime4j accepts only 1000 characters per line
> and 1000 characters per header. The streaming approach of Tika should not
> need these limitations; when a limit is exceeded, an exception is thrown and
> none of the data read so far is available.
> Solution:
> Replace all occurrences of:
> Parser parser = new RFC822Parser();
> by:
> MimeEntityConfig config = new MimeEntityConfig();
> config.setMaxLineLen(-1);
> config.setMaxContentLen(-1);
> Parser parser = new RFC822Parser(config);
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira