[ 
https://issues.apache.org/jira/browse/TIKA-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18033319#comment-18033319
 ] 

Hudson commented on TIKA-4530:
------------------------------

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk17 #1002 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk17/1002/])
TIKA-4530 -- don't let body content slip into headers in mbox (#2376) (github: 
[https://github.com/apache/tika/commit/b7e9ed56213ba0d56d608d909935998979128732])
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-mail-module/src/test/resources/test-documents/multiline2.mbox
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-mail-module/src/main/java/org/apache/tika/parser/mbox/MboxParser.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-mail-module/src/test/java/org/apache/tika/parser/mbox/MboxParserTest.java


> Don't let body content slip into headers in MboxParser
> ------------------------------------------------------
>
>                 Key: TIKA-4530
>                 URL: https://issues.apache.org/jira/browse/TIKA-4530
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>             Fix For: 4.0.0, 3.3.0
>
>
> On an mbox file that's part of the ipres2025 Digital Preservation Bakeoff, I 
> noticed that we were getting content types that looked like this: 
> {{message/rfc822, multipart/alternative; a {text-decoration: 
> none;text-decoration:none!important;} <t...}}.
>  
> The problem is that we're caching what look like multiline header bits 
> whether or not we're in an rfc822 header within an mbox file. We should stop 
> caching multiline bits if we're not in a header.
>  
> [https://www.ipres2025.nz/post/ipres-tools-demo-session-the-digital-preservation-bake-off]
>  
> Pantry: 
> [https://drive.google.com/drive/folders/1_BFjNw95HhH45VO-Y2gmJTbd6kfYAtuY]
> The file is from edrm: [email protected]: 
> https://drive.google.com/drive/folders/1gpUbxmb8-AL2r1eqCODLzeT7QBN1LW9S
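
For illustration only, the fix described in the issue above (stop caching multiline bits once we are outside a header) might be sketched roughly as follows. This is not the code from the commit: the class name MboxHeaderSketch, the method parseHeaders, and the inHeader flag are all hypothetical, and the sketch assumes a single message whose header block starts after a "From " separator line and ends at the first blank line.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; NOT the actual MboxParser implementation.
final class MboxHeaderSketch {

    // Collects the headers of a single mbox message. Continuation lines
    // (lines starting with a space or tab) are folded into the previous
    // header only while we are still inside the header block.
    static Map<String, String> parseHeaders(List<String> lines) {
        Map<String, String> headers = new LinkedHashMap<>();
        boolean inHeader = false;   // assumed state flag, true between "From " and the first blank line
        String currentName = null;  // header that a continuation line would extend

        for (String line : lines) {
            if (line.startsWith("From ")) {
                // mbox separator line: a new header block begins
                inHeader = true;
                currentName = null;
                continue;
            }
            if (!inHeader) {
                // Body line: never treat it as a header continuation,
                // even if it happens to start with whitespace.
                continue;
            }
            if (line.isEmpty()) {
                // The first blank line ends the header block
                inHeader = false;
                currentName = null;
                continue;
            }
            char first = line.charAt(0);
            if (first == ' ' || first == '\t') {
                // Folded continuation of the previous header
                if (currentName != null) {
                    headers.merge(currentName, line.trim(), (a, b) -> a + " " + b);
                }
                continue;
            }
            int colon = line.indexOf(':');
            if (colon > 0) {
                currentName = line.substring(0, colon).trim();
                headers.put(currentName, line.substring(colon + 1).trim());
            } else {
                currentName = null;
            }
        }
        return headers;
    }
}
```

With this guard, an indented CSS line in the body such as " a {text-decoration: none;}" is skipped rather than appended to the last header seen, which is the kind of leakage into the content type that the issue reports.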



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
