Robert Letzler created TIKA-2478:
------------------------------------

             Summary: MBOX import includes redundant copies of the text
                 Key: TIKA-2478
                 URL: https://issues.apache.org/jira/browse/TIKA-2478
             Project: Tika
          Issue Type: Bug
    Affects Versions: 1.16
            Reporter: Robert Letzler
            Priority: Minor

MBOX messages often get parsed into four documents:

a. The mbox file, the outer container: "/"
b. The actual email: "/embedded-1"
c. The UTF-8 text content of the email: "/embedded-1/embedded-2"
d. The UTF-8 HTML content of the email: "/embedded-1/embedded-3"

Entries c and d are redundant and distracting. The MSG parser parses the first non-null email body and then skips the rest. Please modify the MBOX parser so that it does not emit separate "attached" documents for the HTML body and the text body.

The attachment to https://issues.apache.org/jira/browse/TIKA-2471 is an example of input sufficient to generate this behavior.

Thanks!

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
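
For illustration only, the "first non-null body, skip the rest" behavior requested above could be sketched roughly as follows. This is not Tika code: `BodySelector` and `firstNonNullBody` are hypothetical names, and the order in which candidate bodies are passed (for example HTML before plain text) is an assumption left to the caller.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical helper, NOT part of Tika: it only illustrates the idea of
// using the first non-null body and ignoring the remaining alternatives.
final class BodySelector {
    private BodySelector() {}

    /**
     * Returns the first candidate body that is non-null, or an empty
     * Optional if every candidate is null. Later candidates are skipped,
     * so they would not be emitted as separate embedded documents.
     */
    static Optional<String> firstNonNullBody(List<String> candidates) {
        for (String body : candidates) {
            if (body != null) {
                return Optional.of(body);
            }
        }
        return Optional.empty();
    }
}
```

With a rule like this, a message carrying both an HTML and a text body would contribute a single body to "/embedded-1" instead of the two extra children listed in c and d.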