[
https://issues.apache.org/jira/browse/TIKA-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213150#comment-16213150
]
Ken Krugler commented on TIKA-2471:
-----------------------------------
Hi [[email protected]] - I don't think using MBoxIterator is the issue. The
problem is the regex logic used to find headers in the text that's inside of
one email message.
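To make the regex problem concrete, here is a minimal sketch. It is not the actual MboxParser code: the class name, method names, and both patterns are hypothetical, chosen only to show how an unanchored header pattern picks up a tab-prefixed body line, while a pattern anchored at the start of the line does not.

```java
import java.util.regex.Pattern;

public class HeaderMatchSketch {
    // Hypothetical loose pattern: "name: value" found anywhere in the line.
    static final Pattern LOOSE = Pattern.compile("([A-Za-z-]+):\\s*(.*)");

    // Hypothetical strict pattern: the field name must start at column 0 and
    // consist of printable ASCII other than ':' (per the RFC 822 field-name rule).
    static final Pattern STRICT = Pattern.compile("^([!-9;-~]+):[ \\t]*(.*)$");

    // find() scans for a match anywhere in the input, so leading
    // whitespace or other text before the "header" is ignored.
    static boolean looseIsHeader(String line) {
        return LOOSE.matcher(line).find();
    }

    // matches() requires the whole line to fit the pattern.
    static boolean strictIsHeader(String line) {
        return STRICT.matcher(line).matches();
    }

    public static void main(String[] args) {
        String bodyLine = "\tSubject: this is quoted text in the body";
        System.out.println(looseIsHeader(bodyLine));          // true  (false positive)
        System.out.println(strictIsHeader(bodyLine));         // false
        System.out.println(strictIsHeader("Subject: hello")); // true
    }
}
```

Anchoring alone would not be a complete fix: a body line such as "Note: see below" still starts at column 0. In RFC 822 the header section ends at the first empty line, and a line starting with a tab or space inside the header section is a continuation of the previous header, so a parser would also have to stop looking for headers once it reaches that empty line.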
I think we first need to hear back from [~thaichat04] about why headers are
being extracted in the mbox parser code, rather than just relying on the RFC822 parser.
> Tab-prefixed message body lines in Mbox interpreted as headers
> --------------------------------------------------------------
>
> Key: TIKA-2471
> URL: https://issues.apache.org/jira/browse/TIKA-2471
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.16
> Reporter: Matthew Caruana Galizia
> Labels: message, rfc822
> Attachments: mbox
>
>
> The mbox parser code is overly optimistic. It parses the entire message
> looking for anything that matches a header pattern, wherever it occurs in a
> line!
> It looks to me like the parsing logic is in desperate need of a refactor. But
> more to the point, what is the idea behind setting the headers in the
> MboxParser if they're going to be set by the RFC822Parser in any case?
> Also, out of curiosity, why does the parser force Windows-1252 as the charset?