[jira] [Commented] (TIKA-2478) MBOX import includes redundant copies of the text

Tim Allison (JIRA) Mon, 23 Oct 2017 05:12:15 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16215024#comment-16215024
 ]


Tim Allison commented on TIKA-2478:
-----------------------------------

[~kkrugler], thank you for these notes.  I think that {{testRFC822-multipart}} 
shows some of what you describe.  A {{multipart/mixed}} can contain another 
part, so we can't simply use a boolean "inPart"; we have to use a stack to 
remember which part we're in and to figure out which part we're exiting.  My 
current draft patch looks for {{multipart/alternative}}, buffers the contents 
of each alternative and then, at the end of the {{multipart/alternative}}, 
looks for html, then rtf, then text.  The first non-null is the one that gets 
processed, and the content is "inlined," not treated as an embedded 
document for {{multipart/alternative}}s.  Any other part is processed as it was 
before.  Does this sound about right?
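
To make the question concrete, here is a minimal sketch of the stack-plus-buffer 
idea described above.  It is not the patch itself and does not use Tika's or 
mime4j's actual handler API: the class {{MultipartSketch}}, its method names and 
the exact MIME strings in the preference list are all hypothetical, and bodies 
are plain strings to keep it short.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of stack-based multipart tracking; not Tika's real handler. */
class MultipartSketch {
    // Preference order from the comment above: html, then rtf, then text.
    // The concrete MIME strings are an assumption for illustration.
    private static final String[] PREFERRED = {"text/html", "text/rtf", "text/plain"};
    private static final String ALTERNATIVE = "multipart/alternative";

    /** One open multipart container; buffers bodies only when it is an alternative. */
    private static final class Part {
        final String type;
        final Map<String, String> alternatives = new LinkedHashMap<>();
        Part(String type) { this.type = type; }
    }

    // A stack rather than a boolean "inPart", because multiparts can nest.
    private final Deque<Part> stack = new ArrayDeque<>();
    private final List<String> inlined = new ArrayList<>();
    private final List<String> embedded = new ArrayList<>();

    void startMultipart(String type) {
        stack.push(new Part(type));
    }

    void body(String mimeType, String content) {
        Part current = stack.peek();
        if (current != null && ALTERNATIVE.equals(current.type)) {
            // Buffer each alternative; the choice is made when the part closes.
            current.alternatives.putIfAbsent(mimeType, content);
        } else {
            // Any other part is handled as before, i.e. as an embedded document.
            embedded.add(content);
        }
    }

    void endMultipart() {
        Part closed = stack.pop(); // the stack tells us which part we are exiting
        if (!ALTERNATIVE.equals(closed.type)) {
            return;
        }
        for (String mime : PREFERRED) {
            String content = closed.alternatives.get(mime);
            if (content != null) {
                inlined.add(content); // the first non-null alternative wins
                return;
            }
        }
    }

    List<String> inlined() { return inlined; }
    List<String> embedded() { return embedded; }
}
```

In this sketch, a {{multipart/mixed}} that wraps a {{multipart/alternative}} 
(plain text plus html) followed by a .gif would end up with the html body 
inlined and only the .gif recorded as embedded.  A multipart nested inside an 
alternative (e.g. {{multipart/related}}) is not handled specially here; its 
bodies would simply fall through to the embedded path.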

{{testRFC822-multipart}} doesn't have plain text (i.e. not a 
{{multipart/alternative}}) before and after the .gif.  If you'd be willing to 
share an example or if I've missed one in our existing unit tests, it would be 
helpful to have.  Thank you!

> MBOX import includes redundant copies of the text
> -------------------------------------------------
>
>                 Key: TIKA-2478
>                 URL: https://issues.apache.org/jira/browse/TIKA-2478
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.16
>            Reporter: Robert Letzler
>            Priority: Minor
>
> MBOX messages often get parsed into four documents:
> a.    The mbox file - outer container "/"
> b.    The actual email--  "/embedded-1"
> c.    The utf-8 text content of the email "/embedded-1/embedded-2"
> d.    The utf-8 html content of the email  "/embedded-1/embedded-3"
> Entries c and d are redundant and distracting.  The MSG parser parses the 
> first non-null email body and then skips the rest.  Please modify MBOX to 
> not have separate "attached" documents for the html body and the text body.
> The attachment to https://issues.apache.org/jira/browse/TIKA-2471 is an 
> example of input sufficient to generate this behavior.
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
