Robert Letzler created TIKA-2478:
------------------------------------
Summary: MBOX import includes redundant copies of the text
Key: TIKA-2478
URL: https://issues.apache.org/jira/browse/TIKA-2478
Project: Tika
Issue Type: Bug
Affects Versions: 1.16
Reporter: Robert Letzler
Priority: Minor
MBOX messages often get parsed into four documents:
a. The mbox file (outer container): "/"
b. The actual email: "/embedded-1"
c. The UTF-8 text content of the email: "/embedded-1/embedded-2"
d. The UTF-8 HTML content of the email: "/embedded-1/embedded-3"
Entries c and d are redundant and distracting. The MSG parser parses the first
non-null email body and then skips the rest. Please modify the MBOX parser so
that it does not emit separate "attached" documents for the HTML body and the
text body.
The attachment to https://issues.apache.org/jira/browse/TIKA-2471 is an example
of input sufficient to generate this behavior.
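For illustration only, here is a minimal sketch of the "first non-null body, skip the rest" selection described above. It uses Python's standard `email` package and is not Tika's MSG or MBOX parser code. The sample message and the helper names `leaf_content_types` and `first_non_empty_body` are hypothetical.

```python
from email import message_from_string
from email.message import Message
from typing import List, Optional

# Hypothetical sample: one email whose body is offered as text and as HTML.
SAMPLE = """\
From: alice@example.com
To: bob@example.com
Subject: demo
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="XYZ"

--XYZ
Content-Type: text/plain; charset="utf-8"

Hello in plain text.
--XYZ
Content-Type: text/html; charset="utf-8"

<html><body><p>Hello in HTML.</p></body></html>
--XYZ--
"""


def leaf_content_types(msg: Message) -> List[str]:
    """List the content types of all non-container parts of the message."""
    return [p.get_content_type() for p in msg.walk() if not p.is_multipart()]


def first_non_empty_body(msg: Message) -> Optional[str]:
    """Return the first non-empty text/* body and ignore later alternatives."""
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skips multipart containers and binary parts
        if part.get_filename():
            continue  # a named part is a real attachment, not a body
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        text = payload.decode(charset, errors="replace").strip()
        if text:
            return text
    return None


if __name__ == "__main__":
    message = message_from_string(SAMPLE)
    print(leaf_content_types(message))    # both alternatives are present
    print(first_non_empty_body(message))  # only the first one is used
```

With this approach the text and HTML alternatives are treated as two renderings of one body, so only one of them contributes content to the email's document.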
Thanks!
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)