[ https://issues.apache.org/jira/browse/TIKA-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18033303#comment-18033303 ]
ASF GitHub Bot commented on TIKA-4530:
--------------------------------------
tballison merged PR #2376:
URL: https://github.com/apache/tika/pull/2376
> Don't let body content slip into headers in MboxParser
> ------------------------------------------------------
>
> Key: TIKA-4530
> URL: https://issues.apache.org/jira/browse/TIKA-4530
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Major
>
> On an mbox file that's part of the ipres2025 Digital Preservation Bakeoff, I
> noticed that we were getting content types that looked like this:
> {{message/rfc822, multipart/alternative; a {text-decoration:
> none;text-decoration:none!important;} <t...}}.
>
> The problem is that we cache lines that look like multiline (folded) header
> continuations whether or not we're inside an rfc822 header within the mbox
> file. We should stop caching such lines once we're no longer in a header.
>
> [https://www.ipres2025.nz/post/ipres-tools-demo-session-the-digital-preservation-bake-off]
>
> Pantry:
> [https://drive.google.com/drive/folders/1_BFjNw95HhH45VO-Y2gmJTbd6kfYAtuY]
> The file is from edrm: [email protected]:
> https://drive.google.com/drive/folders/1gpUbxmb8-AL2r1eqCODLzeT7QBN1LW9S
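A minimal sketch of the idea in the description above, assuming a simplified mbox reader. This is not the code from PR #2376 or Tika's actual MboxParser; the class name MboxHeaderSketch, the method parseHeaders, and the inHeaders flag are hypothetical names chosen for illustration. Continuation lines (those starting with a space or tab) are appended to the previous header only while the reader is still inside the header block; after the first blank line they are treated as body text and ignored.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not Tika's MboxParser.
public class MboxHeaderSketch {

    /** Returns one header map per message found in the given mbox lines. */
    public static List<Map<String, String>> parseHeaders(List<String> lines) {
        List<Map<String, String>> messages = new ArrayList<>();
        Map<String, String> current = null;
        String lastHeader = null;  // header that a folded line would extend
        boolean inHeaders = false; // true only between "From " and the first blank line

        for (String line : lines) {
            if (line.startsWith("From ")) {
                // Naive mbox separator check: start a new message, re-enter header mode.
                current = new LinkedHashMap<>();
                messages.add(current);
                lastHeader = null;
                inHeaders = true;
            } else if (current == null || !inHeaders) {
                // Preamble or body text: never cache it, even if it looks like a folded header.
                continue;
            } else if (line.isEmpty()) {
                // A blank line ends the header block.
                inHeaders = false;
                lastHeader = null;
            } else if (line.charAt(0) == ' ' || line.charAt(0) == '\t') {
                // Folded continuation: append to the previous header, if there is one.
                if (lastHeader != null) {
                    current.merge(lastHeader, line.trim(), (a, b) -> a + " " + b);
                }
            } else {
                int colon = line.indexOf(':');
                if (colon > 0) {
                    lastHeader = line.substring(0, colon).trim();
                    current.put(lastHeader, line.substring(colon + 1).trim());
                } else {
                    lastHeader = null;
                }
            }
        }
        return messages;
    }
}
```

With this guard, an indented CSS line in an HTML body such as " a {text-decoration: none;}" is skipped rather than appended to Content-Type. The sketch does not handle escaped ">From " lines, duplicate header names, or nested rfc822 parts.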
--
This message was sent by Atlassian Jira
(v8.20.10#820010)