[
https://issues.apache.org/jira/browse/TIKA-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16906286#comment-16906286
]
Tim Allison commented on TIKA-2921:
-----------------------------------
Oh...ok. TikaCLI uses the BoilerpipeContentHandler under the hood, and that
strips out anything it judges to be, well, boilerplate.
So, when I run this:
{noformat}
TikaConfig tikaConfig;
try (InputStream is = getStream("org/apache/tika/parser/mail/tika-2921.xml")) {
    tikaConfig = new TikaConfig(is);
}
ContentHandler inner = new ToXMLContentHandler();
ContentHandler handler = new BoilerpipeContentHandler(inner);
try (InputStream tis = getStream("test-documents/TIKA-2921.eml")) {
    new AutoDetectParser(tikaConfig).parse(tis, handler, new Metadata(), new ParseContext());
}
System.out.println(inner);
{noformat}
I get this:
{noformat}
<head><metadata.../><title>Re: website issue?</title></head><body><blockquote /></html>
{noformat}
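To make the wrapping relationship concrete, here is a rough, stdlib-only sketch of a handler that buffers text blocks and forwards only the ones a classifier keeps. None of these classes are Tika or Boilerpipe types; the names are hypothetical, and the word-count rule is a toy stand-in for Boilerpipe's real heuristics, so treat it as an illustration of why the inner handler can end up with an empty body rather than as how Boilerpipe actually decides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-ins for illustration only; not Tika or Boilerpipe classes.
final class FilteringHandlerSketch {

    /** Plays the role of the inner handler: it only sees what the wrapper forwards. */
    static final class CollectingSink {
        final List<String> blocks = new ArrayList<>();

        void accept(String block) {
            blocks.add(block);
        }
    }

    /** Buffers text blocks and, at end of document, forwards only those kept by the classifier. */
    static final class FilteringWrapper {
        private final CollectingSink inner;
        private final Predicate<String> isContent;
        private final List<String> buffered = new ArrayList<>();

        FilteringWrapper(CollectingSink inner, Predicate<String> isContent) {
            this.inner = inner;
            this.isContent = isContent;
        }

        void block(String text) {
            buffered.add(text);
        }

        void endDocument() {
            for (String b : buffered) {
                if (isContent.test(b)) {
                    inner.accept(b);
                }
            }
        }
    }

    // Toy rule (an assumption, not Boilerpipe's logic): fewer than 10 words counts as boilerplate.
    static boolean looksLikeContent(String block) {
        String trimmed = block.trim();
        return !trimmed.isEmpty() && trimmed.split("\\s+").length >= 10;
    }

    public static void main(String[] args) {
        CollectingSink inner = new CollectingSink();
        FilteringWrapper wrapper = new FilteringWrapper(inner, FilteringHandlerSketch::looksLikeContent);
        wrapper.block("Thanks, that fixed it."); // short reply, dropped by the toy rule
        wrapper.endDocument();
        System.out.println(inner.blocks); // prints []
    }
}
```

Under a rule like this, a short email body never reaches the inner handler at all, which would look the same from the outside as the parser discarding it.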
> Tika discarding bodies of inline MIME elements in RFC822 email
> --------------------------------------------------------------
>
> Key: TIKA-2921
> URL: https://issues.apache.org/jira/browse/TIKA-2921
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.22
> Environment: Reproducible on Java 8 and 11 on both Linux and Win 10.
> Reporter: Joshua Turner
> Priority: Major
> Attachments: tika-2921.xml
>
>
> Given an rfc822 email that has two inline body parts (such as the one
> attached), MailContentHandler's handleInlineBodyPart() method correctly
> identifies the body part that should be emitted as the principal content of
> the mail item, but then uses
> EmbeddedDocumentUtil.tryToFindExistingLeafParser() to find a parser for that
> part. If no existing leaf parser is found, it simply gives up and treats the
> given part as an attachment.
> IMHO, the correct behaviour would be to create the necessary parser if none
> is found, insert it into the parsing context, and use it to extract the
> content of the selected body part.
> In the meantime, I'm working around the issue by creating and registering a
> custom EmbeddedDocumentExtractor to guess whether it's been called by the
> RFC822Parser by looking at the "X-Parsed-By" metadata value. When triggered,
> it looks at the Content-Type of the passed-in metadata, and if it's plain
> text or HTML, it creates a new TXTParser or HTMLParser and a new context,
> and has them parse into the passed-in ContentHandler. It works, but it's
> pretty hacky. It'd be far better to have the change in behaviour suggested
> above.
> [^test.eml]
> ^I've attached the email inline because using the attachment field yields an
> error: "JIRA could not attach the file as there was a missing token. Please
> try attaching the file again." I tried twice with the same error returned.^
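The workaround in the quoted report boils down to a routing decision on two metadata values, which can be sketched roughly as below. This is not the reporter's code and does not use Tika: a plain Map stands in for Tika's Metadata, the class and enum names are hypothetical, and it assumes that "X-Parsed-By" carries the parser class name and that the HTMLParser case corresponds to a text/html Content-Type.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Tika: models only the routing decision,
// with a plain Map standing in for Tika's Metadata object.
final class InlinePartRouting {

    enum Route { TEXT_PARSER, HTML_PARSER, DEFAULT_EXTRACTOR }

    // Assumption: the mail parser's class name appears in the "X-Parsed-By" value.
    static final String RFC822_PARSER = "org.apache.tika.parser.mail.RFC822Parser";

    static Route route(Map<String, String> metadata) {
        String parsedBy = metadata.getOrDefault("X-Parsed-By", "");
        if (!parsedBy.contains(RFC822_PARSER)) {
            // Not called on behalf of the mail parser: leave the part alone.
            return Route.DEFAULT_EXTRACTOR;
        }
        String contentType = metadata.getOrDefault("Content-Type", "")
                .trim()
                .toLowerCase(Locale.ROOT);
        if (contentType.startsWith("text/plain")) {
            return Route.TEXT_PARSER;
        }
        if (contentType.startsWith("text/html")) {
            return Route.HTML_PARSER;
        }
        // Anything else keeps being treated as an attachment.
        return Route.DEFAULT_EXTRACTOR;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("X-Parsed-By", RFC822_PARSER);
        metadata.put("Content-Type", "text/html; charset=UTF-8");
        System.out.println(route(metadata)); // prints HTML_PARSER
    }
}
```

In a real extractor, the TEXT_PARSER and HTML_PARSER branches would be where the new parser and context get created and pointed at the passed-in ContentHandler.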
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)