[ https://issues.apache.org/jira/browse/TIKA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236049#comment-16236049 ]
Hudson commented on TIKA-2478:
------------------------------
SUCCESS: Integrated in Jenkins build Tika-trunk #1382 (See
[https://builds.apache.org/job/Tika-trunk/1382/])
TIKA-2478 -- rfc822 parser should handle alternative parts as the (tallison:
[https://github.com/apache/tika/commit/ff481b25dd7f141f55907ce194b9bc2c77fc7069])
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OfficeParserConfig.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OutlookExtractor.java
* (edit) tika-core/src/test/java/org/apache/tika/TikaTest.java
* (add) tika-parsers/src/test/resources/test-documents/testRFC822-mixed-with-pdf-inline
* (edit) CHANGES.txt
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/mail/MailContentHandler.java
* (add) tika-parsers/src/test/resources/test-documents/testRFC822-mixed-simple
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/mbox/MboxParserTest.java
* (add) tika-parsers/src/test/resources/org/apache/tika/parser/mail/tika-config-extract-all-alternatives.xml
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/mail/RFC822Parser.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/OutlookParserTest.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/AbstractOfficeParser.java
* (add) tika-parsers/src/test/resources/org/apache/tika/parser/microsoft/tika-config-extract-all-alternatives-msg.xml
* (add) tika-parsers/src/test/resources/test-documents/testMBOX_complex.mbox
> RFC822 includes redundant copies of the text
> --------------------------------------------
>
> Key: TIKA-2478
> URL: https://issues.apache.org/jira/browse/TIKA-2478
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.16
> Reporter: Robert Letzler
> Assignee: Tim Allison
> Priority: Minor
> Fix For: 1.17
>
> Attachments: TIKA-2478.patch, UET6KCXR5FYIEJYKUCK2AKF3FLXTRNAT.eml,
> mixed-simple, mixed-with-pdf-inline
>
>
> MBOX messages often get parsed into four documents:
> a. The mbox file - outer container "/"
> b. The actual email - "/embedded-1"
> c. The UTF-8 text content of the email - "/embedded-1/embedded-2"
> d. The UTF-8 HTML content of the email - "/embedded-1/embedded-3"
> Entries c and d are redundant and distracting. The MSG parser parses the
> first non-null email body and then skips the rest. Please modify the MBOX
> parser so that it does not create separate "attached" documents for the
> HTML body and the text body.
> The attachment to https://issues.apache.org/jira/browse/TIKA-2471 is an
> example of input sufficient to generate this behavior.
> Thanks!
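
For readers unfamiliar with why an email yields two copies of its body, the
sketch below may help. It is not Tika code: it uses only Python's standard
`email` package, and the helper names (`build_message`, `alternatives`,
`preferred_body`) and the sample addresses are hypothetical. It builds a
multipart/alternative message with a plain-text and an HTML rendering of the
same content, lists both parts, and then selects a single body. The
preference order shown (HTML first) is an assumption for illustration and is
not necessarily the order the fix uses.

```python
from email import policy
from email.message import EmailMessage
from email.parser import BytesParser


def build_message():
    """Build a multipart/alternative email with plain-text and HTML bodies."""
    msg = EmailMessage()
    msg["Subject"] = "demo"
    msg["From"] = "a@example.com"   # hypothetical address
    msg["To"] = "b@example.com"     # hypothetical address
    msg.set_content("Hello, plain text body.")
    # Adding an alternative converts the message to multipart/alternative.
    msg.add_alternative(
        "<html><body><p>Hello, HTML body.</p></body></html>", subtype="html"
    )
    return msg.as_bytes()


def alternatives(raw):
    """Return the content type of every direct sub-part of the message."""
    msg = BytesParser(policy=policy.default).parsebytes(raw)
    return [part.get_content_type() for part in msg.iter_parts()]


def preferred_body(raw):
    """Pick one body part instead of treating each alternative separately."""
    msg = BytesParser(policy=policy.default).parsebytes(raw)
    # Assumed preference order for this sketch: HTML first, then plain text.
    body = msg.get_body(preferencelist=("html", "plain"))
    return body.get_content_type(), body.get_content()


if __name__ == "__main__":
    raw = build_message()
    print(alternatives(raw))       # both renderings of the same body
    print(preferred_body(raw)[0])  # the single part chosen as the body
```

Treating each entry returned by `alternatives` as its own document gives the
redundant c and d entries described in the issue; choosing one part, as
`preferred_body` does, yields a single body per message.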
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)