[jira] [Commented] (TIKA-2921) Tika discarding bodies of inline MIME elements in RFC822 email

Joshua Turner (JIRA) Wed, 14 Aug 2019 12:01:11 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907543#comment-16907543
 ]


Joshua Turner commented on TIKA-2921:
-------------------------------------

Made some good headway today. I've confirmed the behaviour of the 
BoilerpipeContentHandler, and that it's not the root cause of my issue. (I'll 
leave it to the wisdom of the kind folks steering Tika whether the 
BoilerpipeContentHandler *should* be used by default in the UI... principle of 
least astonishment and all. ;) )

What *did* turn out to be the root cause of my troubles was that my application 
has a custom EmbeddedDocumentExtractor that overrides the default behaviour of 
recursively parsing the files extracted by 
org.apache.tika.parser.pkg.PackageParser, replacing the extracted content with 
a content listing.

In looking at EmbeddedDocumentUtil.getEmbeddedDocumentExtractor(), I noticed 
that there's some initialization work done on the instances it returns, beyond 
what's done by the constructor for ParsingEmbeddedDocumentExtractor.

My subclass didn't do that initialization; adding it to the subclass fixed the 
behaviour. I may simply not understand why, but is there a reason that the 
ParsingEmbeddedDocumentExtractor doesn't put an AutoDetectParser into the 
context as part of its constructor?

For reference:

{code:java}
package com.handshape.tika2921test;

import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.exception.TikaException;
import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.WriteOutContentHandler;
import static org.apache.tika.sax.XHTMLContentHandler.XHTML;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

/**
 * @author jturner
 */
public class Tika2921Main {

    public static void main(String[] args) {
        try {
            TikaConfig tikaConfig;
            try (InputStream is = Tika2921Main.class.getResourceAsStream("tika.xml")) {
                tikaConfig = new TikaConfig(is);
            }

            ContentHandler bodyContentHandler =
                    new BodyContentHandler(new WriteOutContentHandler(System.out));
            final AutoDetectParser autoDetectParser = new AutoDetectParser(tikaConfig);

            ParseContext context = new ParseContext();

            System.out.println("Case 1: rfc822 email");

            try (InputStream tis = Tika2921Main.class.getResourceAsStream("test.eml")) {
                Metadata metadata = new Metadata();
                context.set(EmbeddedDocumentExtractor.class,
                        new Tika2921EmbeddedDocumentExtractor(context, metadata));
                autoDetectParser.parse(tis, bodyContentHandler, metadata, context);
            }

            System.out.println("Case 2: zip archive");

            try (InputStream tis = Tika2921Main.class.getResourceAsStream("zip.zip")) {
                Metadata metadata = new Metadata();
                context.set(EmbeddedDocumentExtractor.class,
                        new Tika2921EmbeddedDocumentExtractor(context, metadata));
                autoDetectParser.parse(tis, bodyContentHandler, metadata, context);
            }

        } catch (IOException | SAXException | TikaException ex) {
            ex.printStackTrace();
        }
    }

    static class Tika2921EmbeddedDocumentExtractor extends ParsingEmbeddedDocumentExtractor {

        private final Metadata outerMetadata;

        public Tika2921EmbeddedDocumentExtractor(ParseContext context, Metadata outerMetadata) {
            super(context);
            this.outerMetadata = outerMetadata;

            // From here to the end of this method is what was necessary to fix
            // my issue.
            Parser embeddedParser = context.get(Parser.class);
            if (embeddedParser == null) {
                TikaConfig tikaConfig = context.get(TikaConfig.class);
                if (tikaConfig == null) {
                    context.set(Parser.class, new AutoDetectParser());
                } else {
                    context.set(Parser.class, new AutoDetectParser(tikaConfig));
                }
            }
        }

        @Override
        public boolean shouldParseEmbedded(Metadata innermetadata) {
            return super.shouldParseEmbedded(innermetadata);
        }

        @Override
        public void parseEmbedded(InputStream stream, ContentHandler handler,
                Metadata metadata, boolean outputHtml) throws SAXException, IOException {

            // This is the bit that decides whether to provide the archive 
            // directory listing, or defer to the behaviour of the parent class.
            boolean isInPackage = false;
            for (String value : outerMetadata.getValues("X-Parsed-By")) {
                if (value.equals("org.apache.tika.parser.pkg.PackageParser")) {
                    isInPackage = true;
                }
            }
            if (isInPackage) {
                // Read the name directly: String.valueOf on a null String
                // yields the text "null", which would defeat the null check below.
                String name = metadata.get(Metadata.RESOURCE_NAME_KEY);
                String size = String.valueOf(metadata.get(Metadata.CONTENT_LENGTH));
                String date = String.valueOf(metadata.get("date"));
                if (name != null && name.length() > 0 && outputHtml) {
                    handler.startElement(XHTML, "pre", "pre", new AttributesImpl());
                    char[] separator = " - ".toCharArray();
                    handler.characters(date.toCharArray(), 0, date.length());
                    handler.characters(separator, 0, separator.length);
                    handler.characters(size.toCharArray(), 0, size.length());
                    handler.characters(separator, 0, separator.length);
                    handler.characters(name.toCharArray(), 0, name.length());
                    handler.endElement(XHTML, "pre", "pre");
                }
            } else {
                super.parseEmbedded(stream, handler, metadata, outputHtml);
            }
        }
    }
}

{code}
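
The register-if-absent step in the constructor above can be isolated as a small 
pattern. What follows is only a minimal sketch under stated assumptions: 
ContextSketch is a hypothetical, simplified stand-in for a type-keyed context 
and is not Tika's ParseContext, and getOrRegister is an illustrative name, not 
a Tika method. It shows a lookup that installs a fallback only when the caller 
has not already registered something under the same key.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for a type-keyed context; NOT Tika's ParseContext.
final class ContextSketch {

    private final Map<Class<?>, Object> entries = new HashMap<>();

    // Registers a value under its class key, replacing any earlier value.
    public <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    // Returns the registered value, or null when nothing is registered.
    public <T> T get(Class<T> key) {
        return key.cast(entries.get(key));
    }

    // Illustrative helper: keeps an existing entry, otherwise creates and
    // registers the fallback so later lookups see the same instance.
    public <T> T getOrRegister(Class<T> key, Supplier<? extends T> fallback) {
        T existing = get(key);
        if (existing == null) {
            existing = fallback.get();
            set(key, existing);
        }
        return existing;
    }

    public static void main(String[] args) {
        ContextSketch context = new ContextSketch();
        // Nothing is registered yet, so a plain lookup finds no entry.
        System.out.println(context.get(StringBuilder.class) == null);
        // The first call installs the fallback; the second call reuses it.
        StringBuilder first = context.getOrRegister(StringBuilder.class, StringBuilder::new);
        StringBuilder second = context.getOrRegister(StringBuilder.class, StringBuilder::new);
        System.out.println(first == second);
    }
}
```

In the listing above, the same idea is applied with Parser.class as the key and 
an AutoDetectParser as the fallback, so a parser that the calling code has 
already placed in the context is left untouched.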


> Tika discarding bodies of inline MIME elements in RFC822 email
> --------------------------------------------------------------
>
>                 Key: TIKA-2921
>                 URL: https://issues.apache.org/jira/browse/TIKA-2921
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.22
>         Environment: Reproducible on Java 8 and 11 on both Linux and Win 10.
>            Reporter: Joshua Turner
>            Priority: Major
>         Attachments: tika-2921.xml
>
>
> Given an rfc822 email that has two inline body parts (such as the one 
> attached), MailContentHandler's handleInlineBodyPart() method correctly 
> identifies the body part that should be emitted as the principal content of 
> the mail item, but then uses 
> EmbeddedDocumentUtil.tryToFindExistingLeafParser() to find a parser for that 
> part. If no existing leaf parser is found, it simply gives up and treats the 
> given part as an attachment.
> IMHO, the correct behaviour would be to create the necessary parser if none 
> is found, insert it into the parsing context, and use it to extract the 
> content of the selected body part.
> In the meantime, I'm working around the issue by creating and registering a 
> custom EmbeddedDocumentExtractor to guess whether it's been called by the 
> RFC822Parser by looking at the "X-Parsed-By" metadata value. When triggered, 
> it looks at the Content-Type of the passed-in metadata, and if it's plain 
> text or HTML, it creates a new TXTParser or HTMLParser and a new context, 
> and has them parse into the passed-in ContentHandler. It works, but it's 
> pretty hacky. It'd be far better to have the change in behaviour suggested 
> above. 
> [^test.eml]
> ^I've attached the email inline because using the attachment field yields an 
> error: "JIRA could not attach the file as there was a missing token. Please 
> try attaching the file again." I tried twice with the same error returned.^



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
