Tim Allison created TIKA-4530:
---------------------------------
Summary: Don't let body content slip into headers in MboxParser
Key: TIKA-4530
URL: https://issues.apache.org/jira/browse/TIKA-4530
Project: Tika
Issue Type: Improvement
Reporter: Tim Allison
On an mbox file that's part of the ipres2025 Digital Preservation Bakeoff, I
noticed that we were getting content types that looked like this:
{{message/rfc822, multipart/alternative; a {text-decoration:
none;text-decoration:none!important;} <t...}}.
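The problem is that we cache anything that looks like a multiline header
continuation, whether or not we are actually inside an rfc822 header block
within the mbox file. As a result, indented body lines (here, inline CSS) get
appended to the last header we saw. We should stop caching continuation lines
once we are no longer in a header.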
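
To illustrate the intended behavior, here is a minimal sketch. It is not the
actual MboxParser code: the class name MboxHeaderSketch, the method
parseHeaders, and the inHeader flag are hypothetical, and it assumes a single
message whose header block starts after the "From " separator line and ends at
the first blank line. Continuation lines (those starting with a space or tab)
are folded into the previous header only while inHeader is true.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MboxHeaderSketch {

    /**
     * Collects the headers of one mbox message. Lines starting with
     * whitespace are treated as header continuations only while we are
     * still inside the header block.
     */
    static Map<String, String> parseHeaders(List<String> lines) {
        Map<String, String> headers = new LinkedHashMap<>();
        boolean inHeader = false;   // hypothetical state flag
        String lastName = null;     // header that a continuation would extend

        for (String line : lines) {
            if (line.startsWith("From ")) {
                // mbox message separator: a header block follows
                inHeader = true;
                lastName = null;
                continue;
            }
            if (!inHeader) {
                // Body line: never cache it, even if it is indented
                continue;
            }
            if (line.isEmpty()) {
                // Blank line ends the header block
                inHeader = false;
                lastName = null;
                continue;
            }
            char first = line.charAt(0);
            if (first == ' ' || first == '\t') {
                // Folded header line: append to the previous header value
                if (lastName != null) {
                    headers.merge(lastName, line.trim(), (a, b) -> a + " " + b);
                }
                continue;
            }
            int colon = line.indexOf(':');
            if (colon > 0) {
                lastName = line.substring(0, colon).trim();
                headers.put(lastName, line.substring(colon + 1).trim());
            }
        }
        return headers;
    }
}
```

With this guard, an indented CSS rule in the body is ignored rather than being
appended to Content-Type, while a genuinely folded header such as a boundary
parameter on the next line is still joined to its header.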
[https://www.ipres2025.nz/post/ipres-tools-demo-session-the-digital-preservation-bake-off]
Pantry:
[https://drive.google.com/drive/folders/1_BFjNw95HhH45VO-Y2gmJTbd6kfYAtuY]
The file is from edrm: [email protected]:
https://drive.google.com/drive/folders/1gpUbxmb8-AL2r1eqCODLzeT7QBN1LW9S