[
https://issues.apache.org/jira/browse/TIKA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207543#comment-16207543
]
Tim Allison edited comment on TIKA-2478 at 10/17/17 12:07 PM:
--------------------------------------------------------------
Thank you [~letzlerr] for opening this and pointing to a triggering document.
Fellow devs, as Rob points out, in the OutlookParser, we select the first of
the non-null bodies in this order: html, rtf, text. We do not include all of
the bodies, which would be duplicative. Also, we "inline" the body; we don't
treat it as a separate attachment. Should we try to modify the RFC822 parser
to do the same thing as the OutlookParser?
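For discussion, here is a minimal sketch of the selection rule described
above: take the first non-null body in the order html, rtf, text, and ignore
the rest. This is not the actual OutlookParser code. The class BodySelector,
the nested Body type, and the method selectBody are hypothetical names used
only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical illustration only; not part of Tika.
final class BodySelector {

    // Pairs a body's kind ("html", "rtf", "text") with its content, which may be null.
    static final class Body {
        final String type;
        final String content;

        Body(String type, String content) {
            this.type = type;
            this.content = content;
        }
    }

    // Returns the first non-null body in priority order html, rtf, text.
    // The remaining bodies are skipped so the same text is not emitted twice.
    static Optional<Body> selectBody(String html, String rtf, String text) {
        List<Body> candidates = Arrays.asList(
                new Body("html", html),
                new Body("rtf", rtf),
                new Body("text", text));
        return candidates.stream()
                .filter(b -> b.content != null)
                .findFirst();
    }
}
```

With a null html body and non-null rtf and text bodies, selectBody returns the
rtf body. If all three are null, it returns an empty Optional. An RFC822
equivalent would presumably apply the same kind of rule to the alternative
parts of a message, but that is an assumption about a possible fix, not a
description of existing behavior.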
was (Author: [email protected]):
Thank you [~letzlerr] for opening this and pointing to a triggering document.
Fellow devs, in the OutlookParser, we select the first of the non-null bodies
in this order: html, rtf, text. We do not include all of the bodies, which
would be duplicative. Also, we "inline" the body, we don't treat it as a
separate attachment. Should we try to modify the RFC822 parser to do the same
thing as the OutlookParser?
> MBOX import includes redundant copies of the text
> -------------------------------------------------
>
> Key: TIKA-2478
> URL: https://issues.apache.org/jira/browse/TIKA-2478
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.16
> Reporter: Robert Letzler
> Priority: Minor
>
> MBOX messages often get parsed into four documents:
> a. The mbox file - outer container "/"
> b. The actual email-- "/embedded-1"
> c. The utf-8 text content of the email "/embedded-1/embedded-2"
> d. The utf-8 html content of the email "/embedded-1/embedded-3"
> Entries c and d are redundant and distracting. The MSG parser parses the
> first non-null email body and then skips the rest. Please modify MBOX to
> not have separate "attached" documents for the html body and the text body.
> The attachment to https://issues.apache.org/jira/browse/TIKA-2471 is an
> example of input sufficient to generate this behavior.
> Thanks!
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)