Tim Allison created TIKA-4530:
---------------------------------

             Summary: Don't let body content slip into headers in MboxParser
                 Key: TIKA-4530
                 URL: https://issues.apache.org/jira/browse/TIKA-4530
             Project: Tika
          Issue Type: Improvement
            Reporter: Tim Allison


On an mbox file that's part of the ipres2025 Digital Preservation Bakeoff, I
noticed that we were getting content types that looked like this:
{{message/rfc822, multipart/alternative; a {text-decoration:
none;text-decoration:none!important;} <t...}}.

 

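The problem is that we're caching lines that look like multiline (folded)
header continuations whether or not we're actually in an rfc822 header within
the mbox file, so body content (here, CSS from an HTML part) gets appended to
a header value. We should stop caching multiline bits once we're no longer in
a header.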
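
To illustrate the proposed behavior, here is a minimal Python sketch, not Tika's actual MboxParser code (which is Java). The function name `parse_headers`, the two regular expressions, and the `in_header` flag are all assumptions made for this example. It assumes that a line starting with "From " opens a new message, that the first blank line ends the header block, and that body lines starting with "From " have already been escaped as ">From ". Continuation lines are appended to the previous header only while `in_header` is true, so whitespace-indented body lines are never cached.

```python
import re

# Assumed patterns for this sketch; not copied from Tika's MboxParser.
MBOX_FROM = re.compile(r"^From \S+")
HEADER = re.compile(r"^([A-Za-z0-9-]+):\s*(.*)$")


def parse_headers(lines):
    """Return one dict of headers per message, ignoring all body lines."""
    messages = []
    headers = None
    last_name = None      # header that a continuation line would extend
    in_header = False     # True only between a "From " line and the first blank line
    for line in lines:
        line = line.rstrip("\r\n")
        if MBOX_FROM.match(line):
            # Start of a new message: reset all header state.
            headers = {}
            messages.append(headers)
            in_header = True
            last_name = None
            continue
        if not in_header:
            # Body content: never treat it as a header or a continuation.
            continue
        if line == "":
            # Blank line ends the header block.
            in_header = False
            last_name = None
            continue
        if line[0] in " \t":
            # Folded header line: append to the previous header, if any.
            if last_name is not None:
                headers[last_name] += " " + line.strip()
            continue
        match = HEADER.match(line)
        if match:
            last_name = match.group(1).lower()
            headers[last_name] = match.group(2).strip()
        else:
            last_name = None
    return messages
```

With input such as a Content-Type header folded over two lines, followed by a blank line and an indented CSS rule in the body, the sketch keeps the folded boundary parameter and drops the CSS. For simplicity, a repeated header name overwrites the earlier value.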

 

[https://www.ipres2025.nz/post/ipres-tools-demo-session-the-digital-preservation-bake-off]

 

Pantry: 
[https://drive.google.com/drive/folders/1_BFjNw95HhH45VO-Y2gmJTbd6kfYAtuY]

The file is from edrm: [email protected]: 
https://drive.google.com/drive/folders/1gpUbxmb8-AL2r1eqCODLzeT7QBN1LW9S



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
