[ 
https://issues.apache.org/jira/browse/TIKA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16216178#comment-16216178
 ] 

Tim Allison commented on TIKA-2478:
-----------------------------------

bq. The important thing to note here is that, in multipart MIME messages, it is 
perfectly valid to have parts within parts. In theory, that nesting can extend 
to any depth. Any reasonably capable email client should then be able to 
recursively process all of the message parts.

https://stackoverflow.com/questions/3902455/mail-multipart-alternative-vs-multipart-mixed

Yikes!
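
To make the recursion concrete, here is a minimal sketch using Python's standard `email` package. It is not Tika's parser code; `walk_parts`, `build_sample`, and the `/part-N` path scheme are hypothetical names invented for this illustration. It builds a multipart/mixed message that wraps a multipart/alternative body plus an attachment, then walks the tree depth-first.

```python
from email.message import EmailMessage, Message


def walk_parts(part: Message, path: str = "/") -> list:
    """Return (path, content type) for a part and every nested part, depth-first."""
    found = [(path, part.get_content_type())]
    if part.is_multipart():
        # For multipart messages, get_payload() returns the list of child parts.
        for index, child in enumerate(part.get_payload(), start=1):
            child_path = path.rstrip("/") + f"/part-{index}"
            found.extend(walk_parts(child, child_path))
    return found


def build_sample() -> EmailMessage:
    """Hypothetical sample: multipart/mixed > (multipart/alternative, PDF attachment)."""
    msg = EmailMessage()
    msg["Subject"] = "nested example"
    msg.set_content("plain body")                            # text/plain
    msg.add_alternative("<p>html body</p>", subtype="html")  # becomes multipart/alternative
    msg.add_attachment(b"%PDF-1.4", maintype="application",
                       subtype="pdf", filename="a.pdf")      # becomes multipart/mixed
    return msg


if __name__ == "__main__":
    for part_path, content_type in walk_parts(build_sample()):
        print(part_path, content_type)
```

The same function handles deeper nesting without changes, since each multipart child is simply walked again.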

> MBOX import includes redundant copies of the text
> -------------------------------------------------
>
>                 Key: TIKA-2478
>                 URL: https://issues.apache.org/jira/browse/TIKA-2478
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.16
>            Reporter: Robert Letzler
>            Assignee: Tim Allison
>            Priority: Minor
>         Attachments: UET6KCXR5FYIEJYKUCK2AKF3FLXTRNAT.eml, mixed-simple, 
> mixed-with-pdf-inline
>
>
> MBOX messages often get parsed into four documents:
> a. The mbox file, the outer container: "/"
> b. The actual email: "/embedded-1"
> c. The UTF-8 text content of the email: "/embedded-1/embedded-2"
> d. The UTF-8 HTML content of the email: "/embedded-1/embedded-3"
> Entries c and d are redundant and distracting. The MSG parser parses the 
> first non-null email body and then skips the rest. Please modify the MBOX 
> parser so that it does not emit separate "attached" documents for the HTML 
> body and the text body.
> The attachment to https://issues.apache.org/jira/browse/TIKA-2471 is an 
> example of input sufficient to generate this behavior.
> Thanks!
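
The "first non-null body" behavior requested above can be sketched with Python's standard `email` package. This is an illustration only, not the Tika MSG or MBOX parser; `single_body_text` and `build_sample` are hypothetical helpers, and the plain-before-HTML preference order is an assumption made for the example.

```python
from email.message import EmailMessage


def single_body_text(msg: EmailMessage, preference=("plain", "html")) -> str:
    """Return the text of one body part, chosen by preference, instead of all alternatives."""
    # get_body() searches nested multipart containers and skips attachments.
    body = msg.get_body(preferencelist=preference)
    return "" if body is None else body.get_content()


def build_sample() -> EmailMessage:
    """Hypothetical message carrying both a text/plain and a text/html body."""
    msg = EmailMessage()
    msg["Subject"] = "alternative example"
    msg.set_content("plain body")
    msg.add_alternative("<p>html body</p>", subtype="html")
    return msg


if __name__ == "__main__":
    sample = build_sample()
    print(single_body_text(sample).strip())
    print(single_body_text(sample, preference=("html",)).strip())
```

Choosing a single alternative yields one body per email rather than a separate embedded document for each rendering of the same content.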


