Matthew Caruana Galizia created TIKA-2471:
---------------------------------------------

             Summary: Tab-prefixed message body lines in Mbox interpreted as headers
                 Key: TIKA-2471
                 URL: https://issues.apache.org/jira/browse/TIKA-2471
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.16
            Reporter: Matthew Caruana Galizia


The mbox parser is overly optimistic: it scans the entire message for anything 
that matches a header pattern, wherever it occurs in a line. As a result, 
tab-prefixed lines in the message body are interpreted as headers.
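
To illustrate the kind of failure described above, here is a minimal sketch. It is not the actual MboxParser code: the class name, both patterns, and the sample lines are hypothetical and chosen only to show the difference between searching for a header-like pattern anywhere in a line and requiring the field name to start at the first column. In RFC 5322, a line that begins with whitespace is at most a folded continuation of the previous header, never a new header field.

```java
import java.util.regex.Pattern;

public class HeaderLineSketch {
    // Hypothetical loose pattern: "name: value" found anywhere in the line.
    private static final Pattern LOOSE =
            Pattern.compile("([^\\s:]+):[ \\t]*(.*)");

    // Hypothetical strict pattern: the field name must start at column 0 and
    // consist of printable ASCII (0x21-0x7E) other than the colon (0x3A).
    private static final Pattern STRICT =
            Pattern.compile("^([\\x21-\\x39\\x3B-\\x7E]+):[ \\t]*(.*)$");

    // Uses find(), so a match may begin in the middle of the line.
    public static boolean looseIsHeader(String line) {
        return LOOSE.matcher(line).find();
    }

    // Uses matches(), so the whole line must fit the pattern from column 0.
    public static boolean strictIsHeader(String line) {
        return STRICT.matcher(line).matches();
    }

    public static void main(String[] args) {
        String bodyLine = "\tNote: this is body text, not a header";
        System.out.println(looseIsHeader(bodyLine));          // true
        System.out.println(strictIsHeader(bodyLine));         // false
        System.out.println(strictIsHeader("Subject: hello")); // true
    }
}
```

Anchoring the pattern only addresses the "wherever it occurs in a line" part. A parser would presumably also need to stop looking for headers once it reaches the blank line that separates the header block from the body.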

It looks to me like the parsing logic is in desperate need of a refactor. But 
more to the point, what is the idea behind setting the headers in the 
MboxParser if they're going to be set by the RFC822Parser in any case?

Also, out of curiosity, why does the parser force Windows-1252 as the charset?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
