[ 
https://issues.apache.org/jira/browse/TIKA-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16906221#comment-16906221
 ] 

Tim Allison commented on TIKA-2921:
-----------------------------------


This is what I'm getting both in a unit test and when I run {{java -jar tika-app.jar 
--config=config.xml file.eml}}.

Is this what you're seeing?  How, exactly, are you calling Tika and/or 
including dependencies?



{noformat}
<?xml version="1.0" encoding="UTF-8"?><html 
xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Message:Raw-Header:X-Spam-Status" content="No, score=-2.099 
tagged_above=-999 required=5&#9;tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, 
DKIM_VALID=-0.1,&#9;DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, 
FREEMAIL_FROM=0.001,&#9;HTML_MESSAGE=0.001, SPF_PASS=-0.001] autolearn=ham 
autolearn_force=no"/>
<meta name="subject" content="Re: website issue?"/>
<meta name="dc:creator" content="Norman Dimock &lt;[email protected]&gt;"/>
<meta name="Message:Raw-Header:X-Received" content="by 2002:a1f:8b48:: with 
SMTP id n69mr3322403vkd.12.1547641463203; Wed, 16 Jan 2019 04:24:23 -0800 
(PST)"/>
<meta name="Message:From-Email" content="[email protected]"/>
<meta name="dcterms:created" content="2019-01-16T12:24:10Z"/>
<meta name="Message-To" content="Josh Turner &lt;[email protected]&gt;"/>
<meta name="Message:Raw-Header:Authentication-Results" 
content="mail.handshape.com (amavisd-new);&#9;dkim=pass (2048-bit key) 
header.d=gmail.com"/>
<meta name="Message:Raw-Header:X-Google-DKIM-Signature" content="v=1; 
a=rsa-sha256; c=relaxed/relaxed;        d=1e100.net; s=20161025;        
h=x-gm-message-state:mime-version:references:in-reply-to:from:date         
:message-id:subject:to;        bh=ImRnwxGjgAUe17miKW5RkSb+P41jBHp5BWiDMnxmb+8=; 
       b=n0Ql87INTq9Mjgp8dmEhGP8wE9MCZX/a0WQ876dzW++ic5nCMlnhw9j0c09oXIS5hA     
    VQ6QqeS384BEDtY6oROMn63O8GsQncbpXyamUhg0LMWzOhKhY3iWWawd2h6i+EeYoJEg        
 8k+vAFVJU70vtGNLu3GHU477Shw1nFQGhEWccZu68lxkMX9joFEGGUtyJLnH4GqKzYbC         
vfhpVgr1pxeOiaU+4Cdth9e+4WLnR9T983q3F5D36NS9tnkcH4LMhhkfEca8raF2MTzX         
g+f8idp3OgiIuGMAOd99Go/nK4vTASix8hCSpnEsbzYKcH5bv0o3dFLN64RJQeIkPUte         
G4nA=="/>
<meta name="Message:Raw-Header:X-Gm-Message-State" 
content="AJcUukcWtkn1r1vSPsnQJF/GJiB2lFaDUgfyVAbbsih6aQt1qbyiN4EW&#9;fJEZFoU2CuQvQn82Lhd0aknLAeFMZ6xkngJtpYU4rA=="/>
<meta name="Message:Raw-Header:X-Virus-Scanned" content="Debian amavisd-new at 
handshape.com"/>
<meta name="Message:Raw-Header:MIME-Version" content="1.0"/>
<meta name="Multipart-Boundary" content="000000000000a76ce1057f925b48"/>
<meta name="Message:Raw-Header:Message-ID" 
content="&lt;CAMpLFpCimic+dGB4-zpNRBizbP1uNpFTw=3dvrzawopeui5...@mail.gmail.com&gt;"/>
<meta name="dc:title" content="Re: website issue?"/>
<meta name="Message:Raw-Header:X-Spam-Flag" content="NO"/>
<meta name="Message:Raw-Header:In-Reply-To" 
content="&lt;CAMpLFpCVygEwb+t=FmD6TqiDLrQHkREvh=_2=zinf8wh1-y...@mail.gmail.com&gt;"/>
<meta name="Content-Length" content="4107"/>
<meta name="Message:Raw-Header:X-Spam-Level" content=""/>
<meta name="Content-Type" content="message/rfc822"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.mail.RFC822Parser"/>
<meta name="creator" content="Norman Dimock &lt;[email protected]&gt;"/>
<meta name="Message:Raw-Header:X-Original-To" content="[email protected]"/>
<meta name="meta:author" content="Norman Dimock &lt;[email protected]&gt;"/>
<meta name="Message:Raw-Header:X-Google-Smtp-Source" 
content="ALg8bN7D3XZNSh8tgBFuosEPt01e12Ue8kk4R9OVClU5OHsa+NcWnqcd1JrII4+rSJNSjwaNu8oppTqZiSi1OMUCNfQ="/>
<meta name="meta:creation-date" content="2019-01-16T12:24:10Z"/>
<meta name="Message:Raw-Header:References" 
content="&lt;CAMpLFpB=uu_mqgwf5rtowqcfkd9cmzkboj782872ydgfp1d...@mail.gmail.com&gt;
 &lt;CAMpLFpCVygEwb+t=FmD6TqiDLrQHkREvh=_2=zinf8wh1-y...@mail.gmail.com&gt;"/>
<meta name="Creation-Date" content="2019-01-16T12:24:10Z"/>
<meta name="resourceName" content="TIKA-2921.eml"/>
<meta name="Message:Raw-Header:Return-Path" 
content="&lt;[email protected]&gt;"/>
<meta name="Message:Raw-Header:X-Spam-Score" content="-2.099"/>
<meta name="Message:Raw-Header:DKIM-Signature" content="v=1; a=rsa-sha256; 
c=relaxed/relaxed;        d=gmail.com; s=20161025;        
h=mime-version:references:in-reply-to:from:date:message-id:subject:to;        
bh=ImRnwxGjgAUe17miKW5RkSb+P41jBHp5BWiDMnxmb+8=;        
b=GA7HxxV7NFyCliid7O5w68Pyl+El9pLalsedSV28GjdrjXjAABu12zB+OWjB2lVGBr         
+gNyuAM0zcvHiwVQdlqa6ddq5D+UGT7ppzKDSh8ZTctt89tdmHFMuTECMB93xD8lOFVD         
tXoRJjD+bkd9NX18/8whrcweh/WeK7hai+02ZYLrtIxwsrCbfGdm/pY+KgDcHjs3OB/p         
lQJzFJHCgNCZ7oVR+T63RE+YMWfGs1sKIkjB2iIXByZseLR10afCxnBAfkg9Y/Cyjoep         
UE6B/4GngonMFO1Qwp55Ym5LcWMNORlIv6hrLwGglz+Rvs84EsFI0EY0hVVpQnB2H5UF         
7/dg=="/>
<meta name="Message:Raw-Header:Delivered-To" content="[email protected]"/>
<meta name="Message:From-Name" content="Norman Dimock"/>
<meta name="Author" content="Norman Dimock &lt;[email protected]&gt;"/>
<meta name="Multipart-Subtype" content="alternative"/>
<meta name="Message:Raw-Header:Received" content="from localhost (localhost 
[127.0.0.1])&#9;by handshape.com (Postfix) with ESMTP id 3E3A334690E&#9;for 
&lt;[email protected]&gt;; Wed, 16 Jan 2019 07:24:26 -0500 (EST)"/>
<meta name="Message:Raw-Header:Received" content="from handshape.com 
([127.0.0.1])&#9;by localhost (mail.handshape.com [127.0.0.1]) (amavisd-new, 
port 10024)&#9;with ESMTP id 1iIzkulfL3MZ for 
&lt;[email protected]&gt;;&#9;Wed, 16 Jan 2019 07:24:24 -0500 (EST)"/>
<meta name="Message:Raw-Header:Received" content="from mail-vk1-f175.google.com 
(mail-vk1-f175.google.com [209.85.221.175])&#9;(using TLSv1.2 with cipher 
ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))&#9;(No client certificate 
requested)&#9;by handshape.com (Postfix) with ESMTPS id 5771734690D&#9;for 
&lt;[email protected]&gt;; Wed, 16 Jan 2019 07:24:24 -0500 (EST)"/>
<meta name="Message:Raw-Header:Received" content="by mail-vk1-f175.google.com 
with SMTP id 197so1371013vkf.4        for &lt;[email protected]&gt;; Wed, 
16 Jan 2019 04:24:24 -0800 (PST)"/>
<meta name="Message-From" content="Norman Dimock &lt;[email protected]&gt;"/>
<title>Re: website issue?</title>
</head>
<body><blockquote>.. twice, I've done that!</blockquote>




</body></html>
{noformat}

> Tika discarding bodies of inline MIME elements in RFC822 email
> --------------------------------------------------------------
>
>                 Key: TIKA-2921
>                 URL: https://issues.apache.org/jira/browse/TIKA-2921
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.22
>         Environment: Reproducible on Java 8 and 11 on both Linux and Win 10.
>            Reporter: Joshua Turner
>            Priority: Major
>         Attachments: tika-2921.xml
>
>
> Given an rfc822 email that has two inline body parts (such as the one 
> attached), MailContentHandler's handleInlineBodyPart() method correctly 
> identifies the body part that should be emitted as the principal content of 
> the mail item, but then uses 
> EmbeddedDocumentUtil.tryToFindExistingLeafParser() to find a parser for that 
> part. If no existing leaf parser is found, it simply gives up and treats the 
> given part as an attachment.
> IMHO, the correct behaviour would be to create the necessary parser if none 
> is found, insert it into the parsing context, and use it to extract the 
> content of the selected body part.
> In the meantime, I'm working around the issue by creating and registering a 
> custom EmbeddedDocumentExtractor to guess whether it's been called by the 
> RFC822Parser by looking at the "X-Parsed-By" metadata value. When triggered, 
> it looks at the Content-Type of the passed-in metadata, and if it's plain 
> text or email, it creates a new TXTParser or HTMLParser and a new context, 
> and has them parse into the passed-in ContentHandler. It works, but it's 
> pretty hacky. It'd be far better to have the change in behaviour suggested 
> above. 
> [^test.eml]
> ^I've attached the email inline because using the attachment field yields an 
> error: "JIRA could not attach the file as there was a missing token. Please 
> try attaching the file again." I tried twice with the same error returned.^
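
For readers following the workaround described in the issue, its routing decision can be modelled roughly as below. This is only a sketch under stated assumptions, not the reporter's code: it uses the Java standard library only, the names InlinePartRouter, Choice and choose are hypothetical, and it assumes the two handled content types are text/plain and text/html (the description says "plain text or email" but mentions an HTMLParser). Wiring this decision into an actual EmbeddedDocumentExtractor is deliberately left out.

```java
import java.util.List;
import java.util.Locale;

/** Stdlib-only model of the workaround's routing decision; no Tika classes are used. */
final class InlinePartRouter {

    /** Hypothetical labels for the parser the workaround would create. */
    enum Choice { TEXT_PARSER, HTML_PARSER, NOT_HANDLED }

    // Parser class name as it appears in the "X-Parsed-By" metadata of the output above.
    private static final String RFC822_PARSER =
            "org.apache.tika.parser.mail.RFC822Parser";

    /**
     * Decides how an embedded part would be handled.
     *
     * @param parsedBy    all "X-Parsed-By" values of the passed-in metadata
     * @param contentType the "Content-Type" value, possibly with parameters
     */
    static Choice choose(List<String> parsedBy, String contentType) {
        // Only step in when the caller appears to be the RFC822 parser.
        if (parsedBy == null || !parsedBy.contains(RFC822_PARSER)) {
            return Choice.NOT_HANDLED;
        }
        if (contentType == null) {
            return Choice.NOT_HANDLED;
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        String base = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        switch (base) {
            case "text/plain":
                return Choice.TEXT_PARSER;   // assumed to correspond to a TXTParser
            case "text/html":
                return Choice.HTML_PARSER;   // assumed to correspond to an HTMLParser
            default:
                return Choice.NOT_HANDLED;   // left to the default attachment handling
        }
    }
}
```

In a real extractor, the result of choose(...) would presumably select which parser to instantiate before parsing into the passed-in ContentHandler; any other part would fall through to the existing attachment handling.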



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
