Robert Letzler created TIKA-2478:
------------------------------------

             Summary: MBOX import includes redundant copies of the text
                 Key: TIKA-2478
                 URL: https://issues.apache.org/jira/browse/TIKA-2478
             Project: Tika
          Issue Type: Bug
    Affects Versions: 1.16
            Reporter: Robert Letzler
            Priority: Minor


MBOX messages often get parsed into four documents:
a.      The mbox file, the outer container: "/"
b.      The actual email: "/embedded-1"
c.      The UTF-8 text content of the email: "/embedded-1/embedded-2"
d.      The UTF-8 HTML content of the email: "/embedded-1/embedded-3"

Entries c and d are redundant and distracting.  The MSG parser parses the first
non-null email body and then skips the rest.  Please modify the MBOX parser so
that it does not emit separate "attached" documents for the HTML body and the
text body.
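
To make the requested behavior concrete, here is a minimal sketch of the
"first non-null body, skip the rest" selection described above. It is plain
Java with no Tika dependency; the class name FirstBodySketch and the method
firstNonNullBody are hypothetical and only illustrate the idea. It is not how
the MSG parser is actually implemented.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.Optional;

class FirstBodySketch {
    // Hypothetical helper, not part of the Tika API: return the first
    // non-null body candidate and ignore every candidate after it.
    static Optional<String> firstNonNullBody(List<String> candidates) {
        return candidates.stream()
                .filter(Objects::nonNull)
                .findFirst();
    }

    public static void main(String[] args) {
        // Assumed candidate order for illustration: a missing body,
        // then the plain-text body, then the HTML body.
        List<String> bodies = Arrays.asList(
                null,
                "plain text body",
                "<html>html body</html>");

        // Only one body is reported; the HTML alternative is skipped.
        System.out.println(firstNonNullBody(bodies).orElse("(no body)"));
    }
}
```

With this kind of selection an email would contribute a single body to its own
document ("/embedded-1") rather than two further embedded documents.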

The attachment to https://issues.apache.org/jira/browse/TIKA-2471 is an example 
of input sufficient to generate this behavior.

Thanks!





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
