[
https://issues.apache.org/jira/browse/TIKA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16215024#comment-16215024
]
Tim Allison edited comment on TIKA-2478 at 10/23/17 12:39 PM:
--------------------------------------------------------------
[~kkrugler], thank you for these notes. I think that {{testRFC822-multipart}}
shows some of what you describe. A {{multipart/mixed}} can contain a nested
part, so a boolean "inPart" isn't enough; we have to use a stack to remember
which part we're in and to figure out which part we're exiting. My current
draft patch looks for {{multipart/alternative}}, buffers the contents of each
alternative and then, at the end of the {{multipart/alternative}}, looks for
html, then rtf, then text. The first non-null is the one that gets processed,
and for {{multipart/alternative}}s the content is now "inlined," not treated
as an embedded document. Any other part is processed as it was before. Does
this sound about right?
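To make the approach above concrete, here is a minimal sketch of the stack plus
buffering idea. It is not the actual patch: the class {{MultipartSketch}}, its
method names, and the use of {{application/rtf}} as the rtf type are all
assumptions for illustration, and it does not handle a multipart nested inside
a {{multipart/alternative}}.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch only; this is not Tika's handler API. */
class MultipartSketch {
    // Preference order from the comment: html, then rtf, then plain text.
    // The exact MIME strings are assumptions.
    private static final List<String> PREFERRED =
            Arrays.asList("text/html", "application/rtf", "text/plain");

    /** One open multipart container and the bodies buffered inside it. */
    private static final class Frame {
        final String subtype;
        final Map<String, String> bodies = new LinkedHashMap<>();

        Frame(String subtype) {
            this.subtype = subtype;
        }
    }

    // A stack rather than a boolean, because multiparts can nest.
    private final Deque<Frame> stack = new ArrayDeque<>();

    /** Content chosen from each multipart/alternative, in document order. */
    final List<String> inlined = new ArrayList<>();
    /** MIME types of parts that would still be handled as embedded documents. */
    final List<String> embedded = new ArrayList<>();

    void startMultipart(String subtype) {
        stack.push(new Frame(subtype));
    }

    void body(String mimeType, String content) {
        Frame top = stack.peek();
        if (top != null && "alternative".equals(top.subtype)) {
            // Buffer until the enclosing multipart/alternative closes.
            top.bodies.put(mimeType, content);
        } else {
            // Any other part is processed as before.
            embedded.add(mimeType);
        }
    }

    void endMultipart() {
        Frame done = stack.pop();
        if (!"alternative".equals(done.subtype)) {
            return;
        }
        // Inline the first non-null alternative in preference order.
        for (String type : PREFERRED) {
            String content = done.bodies.get(type);
            if (content != null) {
                inlined.add(content);
                return;
            }
        }
    }
}
```

Feeding it a {{multipart/mixed}} that wraps a {{multipart/alternative}} (text and
html) followed by a gif would inline only the html and leave the gif as an
embedded part.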
{{testRFC822-multipart}} doesn't have plain text (i.e., text that is not in a
{{multipart/alternative}}) before and after the .gif. If you'd be willing to
share an example, or if I've missed one in our existing unit tests, it would
be helpful to have. Thank you!
> MBOX import includes redundant copies of the text
> -------------------------------------------------
>
> Key: TIKA-2478
> URL: https://issues.apache.org/jira/browse/TIKA-2478
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.16
> Reporter: Robert Letzler
> Priority: Minor
>
> MBOX messages often get parsed into four documents:
> a. The mbox file - outer container "/"
> b. The actual email-- "/embedded-1"
> c. The utf-8 text content of the email "/embedded-1/embedded-2"
> d. The utf-8 html content of the email "/embedded-1/embedded-3"
> Entries c and d are redundant and distracting. The MSG parser parses the
> first non-null email body and then skips the rest. Please modify MBOX to
> not have separate "attached" documents for the html body and the text body.
> The attachment to https://issues.apache.org/jira/browse/TIKA-2471 is an
> example of input sufficient to generate this behavior.
> Thanks!