Eli Trucco created TIKA-2037:
--------------------------------
Summary: Problems with email attachments
Key: TIKA-2037
URL: https://issues.apache.org/jira/browse/TIKA-2037
Project: Tika
Issue Type: Bug
Components: detector, parser
Affects Versions: 1.13
Environment: Eclipse, Java 8
Reporter: Eli Trucco
Priority: Minor
I ran into a couple of problems while parsing .eml files saved from Thunderbird
and extracting their attachments. Some files are wrongly identified (as
text/html or application/xhtml+xml), and in many of them the attachments are
not detected. I tried to parse 20 random .eml files with attachments
(pdf, txt, html, etc.), and at least 10 of them are either identified as HTML,
or correctly identified as message/rfc822 but without their attachments being
extracted. Running the same files through TikaCLI with the -z option gives the
same result.
What I did: I extended the ParsingEmbeddedDocumentExtractor class to extract
the attachments and store them elsewhere, exactly as shown in this example code:
https://github.com/apache/tika/blob/master/tika-example/src/main/java/org/apache/tika/example/ExtractEmbeddedFiles.java
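To cross-check which of the .eml files actually declare attachments,
independently of Tika, a rough count can be taken straight from the raw
message text. The following is only a minimal sketch using the Java standard
library: the class name EmlAttachmentCounter and its method are hypothetical
helpers, not part of Tika, and the sketch assumes attachments are marked with
a "Content-Disposition: attachment" header. It does not parse the MIME
structure, so inline parts, parts without that header, and matching text
inside a message body are not handled correctly.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Tika class): counts the attachment headers
// that a raw RFC 822 message declares.
final class EmlAttachmentCounter {

    // A line beginning with "Content-Disposition:" whose value starts with
    // "attachment". Case-insensitive; \s* also tolerates a folded header
    // where the value sits on the continuation line.
    private static final Pattern ATTACHMENT_HEADER = Pattern.compile(
            "^Content-Disposition:\\s*attachment\\b",
            Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    // Returns the number of matching header lines in the raw message text.
    static int countDeclaredAttachments(String rawMessage) {
        Matcher matcher = ATTACHMENT_HEADER.matcher(rawMessage);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    // Prints "<path><TAB><count>" for every .eml path given as an argument.
    public static void main(String[] args) throws IOException {
        for (String path : args) {
            // ISO-8859-1 maps every byte to one char, so decoding never fails.
            String raw = new String(
                    Files.readAllBytes(Paths.get(path)),
                    StandardCharsets.ISO_8859_1);
            System.out.println(path + "\t" + countDeclaredAttachments(raw));
        }
    }
}
```

Running it over the same 20 files and comparing the counts with the number of
files written by the extractor would show which messages declare attachments
that were not extracted.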
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)