[ 
https://issues.apache.org/jira/browse/TIKA-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16905419#comment-16905419
 ] 

Tim Allison commented on TIKA-2921:
-----------------------------------

Sorry for my delay!

I agree that I don't like {{tryToFindExistingLeafParser}}...even though I'm 
responsible for it. :P  The initial goal was to make sure that we used our 
parsers for, e.g., {{application/rtf}}, rather than whatever the user 
specified for that type.  I'm not sure I agree with this decision now.
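
To make the trade-off concrete, here is a toy model in plain Java of the two 
lookup strategies under discussion: find-only (give up on a miss, so the part 
is treated as an attachment) versus create-on-miss (what the issue suggests). 
None of these types are Tika classes; {{ToyParser}}, {{ToyContext}}, 
{{findExisting}} and {{findOrCreate}} are hypothetical names for illustration 
only, and this is a sketch of the idea, not how 
{{tryToFindExistingLeafParser}} is actually implemented.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Supplier;

/** Toy model only: none of these types are Tika classes. */
public class LeafParserLookupSketch {

    /** Hypothetical stand-in for a parser: raw body in, extracted text out. */
    interface ToyParser {
        String parse(String body);
    }

    /** Hypothetical stand-in for the parsers reachable from the parse context. */
    static final class ToyContext {
        // Parsers the caller has already registered, keyed by media type.
        private final Map<String, ToyParser> parsers = new HashMap<>();
        // Factories used only by the create-on-miss variant (an assumption).
        private final Map<String, Supplier<ToyParser>> defaults = new HashMap<>();

        void register(String mediaType, ToyParser parser) {
            parsers.put(mediaType, parser);
        }

        void registerDefault(String mediaType, Supplier<ToyParser> factory) {
            defaults.put(mediaType, factory);
        }

        /** Find-only: returns empty when nothing is registered for the type. */
        Optional<ToyParser> findExisting(String mediaType) {
            return Optional.ofNullable(parsers.get(mediaType));
        }

        /** Create-on-miss: build a default parser, register it, then return it. */
        Optional<ToyParser> findOrCreate(String mediaType) {
            ToyParser existing = parsers.get(mediaType);
            if (existing != null) {
                return Optional.of(existing);
            }
            Supplier<ToyParser> factory = defaults.get(mediaType);
            if (factory == null) {
                return Optional.empty();
            }
            ToyParser created = factory.get();
            parsers.put(mediaType, created);
            return Optional.of(created);
        }
    }

    /** Inline the part when a parser is available; otherwise call it an attachment. */
    static String handleInlinePart(Optional<ToyParser> parser, String body) {
        return parser.map(p -> "INLINE:" + p.parse(body)).orElse("ATTACHMENT");
    }

    public static void main(String[] args) {
        ToyContext ctx = new ToyContext();
        ctx.registerDefault("text/plain", () -> body -> body.trim());

        // Nothing registered yet, so find-only falls back to "attachment".
        System.out.println(handleInlinePart(ctx.findExisting("text/plain"), " hello "));
        // Create-on-miss builds the default parser and inlines the content.
        System.out.println(handleInlinePart(ctx.findOrCreate("text/plain"), " hello "));
        // The created parser is now registered, so find-only succeeds too.
        System.out.println(handleInlinePart(ctx.findExisting("text/plain"), " hello "));
    }
}
```

The catch with create-on-miss in this model is the same one I'm wary of: the 
defaults are chosen by the library, not by whoever configured the parsers.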

I'm hesitant to hardcode the creation of the RTFParser, the TXTParser, etc.  
Is there a reason that you aren't including these parsers in your main parser? 
If you're using the AutoDetectParser, that main parser should automatically 
get added to the ParseContext, IIRC.

How are you calling Tika?  Are you using the AutoDetectParser?  How are you 
filling the ParseContext?

> Tika discarding bodies of inline MIME elements in RFC822 email
> --------------------------------------------------------------
>
>                 Key: TIKA-2921
>                 URL: https://issues.apache.org/jira/browse/TIKA-2921
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.22
>         Environment: Reproducible on Java 8 and 11 on both Linux and Win 10.
>            Reporter: Joshua Turner
>            Priority: Major
>
> Given an rfc822 email that has two inline body parts (such as the one 
> attached), MailContentHandler's handleInlineBodyPart() method correctly 
> identifies the body part that should be emitted as the principal content of 
> the mail item, but then uses 
> EmbeddedDocumentUtil.tryToFindExistingLeafParser() to find a parser for that 
> part. If no existing leaf parser is found, it simply gives up and treats the 
> given part as an attachment.
> IMHO, the correct behaviour would be to create the necessary parser if none 
> is found, insert it into the parsing context, and use it to extract the 
> content of the selected body part.
> In the meantime, I'm working around the issue by creating and registering a 
> custom EmbeddedDocumentExtractor to guess whether it's been called by the 
> RFC822Parser by looking at the "X-Parsed-By" metadata value. When triggered, 
> it looks at the Content-Type of the passed-in metadata, and if it's plain 
> text or email, it creates a new TXTParser or HTMLParser and a new context, 
> and has them parse into the passed-in ContentHandler. It works, but it's 
> pretty hacky. It'd be far better to have the change in behaviour suggested 
> above. 
> [^test.eml]
> ^I've attached the email inline because using the attachment field yields an 
> error: "JIRA could not attach the file as there was a missing token. Please 
> try attaching the file again." I tried twice with the same error returned.^



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
