[jira] [Commented] (TIKA-3765) setExtractAnnotationText setting ignored after v2.3.0

Tim Allison (Jira) Tue, 17 May 2022 07:01:27 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-3765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538221#comment-17538221
 ]


Tim Allison commented on TIKA-3765:
-----------------------------------

Very helpful.  Thank you.

In 2.2.1, I see:
{noformat}
<meta name="dc:publisher" content=""/>
<title>Test</title>
</head>
<body><div class="slide-content"><div class="embedded" id="slide1_rId2"/>
<p>Test</p>
</div>
<div class="slide-master-content"/>
<div class="embedded" id="/docProps/thumbnail.jpeg"/>
</body></html>
{noformat}

In 2.4.0, I see:
{noformat}
<meta name="dc:publisher" content=""/>
<title>Test</title>
</head>
<body><div class="slide-content"><div class="embedded" id="slide1_rId2"/>
<p>Test</p>
</div>
<div class="slide-master-content"/>
<div class="package-entry"><h1>image1.jpg</h1>
</div>
<div class="embedded" id="/docProps/thumbnail.jpeg"/>
</body></html>
{noformat}
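
To make the difference between the two outputs easy to assert on in a regression test, here is a minimal sketch in plain Java with no Tika dependency. The class name PackageEntryNames and its find method are hypothetical helpers, not part of Tika, and the regex only assumes the <div class="package-entry"><h1>...</h1> shape shown in the 2.4.0 output above. It is a quick check, not a general XHTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Tika): lists file names written as package-entry divs. */
final class PackageEntryNames {

    // Body as reported for 2.2.1 in this comment.
    static final String BODY_2_2_1 =
            "<body><div class=\"slide-content\"><div class=\"embedded\" id=\"slide1_rId2\"/>\n"
            + "<p>Test</p>\n"
            + "</div>\n"
            + "<div class=\"slide-master-content\"/>\n"
            + "<div class=\"embedded\" id=\"/docProps/thumbnail.jpeg\"/>\n"
            + "</body></html>";

    // Body as reported for 2.4.0 in this comment: same, plus one package-entry div.
    static final String BODY_2_4_0 =
            "<body><div class=\"slide-content\"><div class=\"embedded\" id=\"slide1_rId2\"/>\n"
            + "<p>Test</p>\n"
            + "</div>\n"
            + "<div class=\"slide-master-content\"/>\n"
            + "<div class=\"package-entry\"><h1>image1.jpg</h1>\n"
            + "</div>\n"
            + "<div class=\"embedded\" id=\"/docProps/thumbnail.jpeg\"/>\n"
            + "</body></html>";

    // Assumed shape: <div class="package-entry"><h1>NAME</h1>
    private static final Pattern ENTRY =
            Pattern.compile("<div class=\"package-entry\">\\s*<h1>([^<]*)</h1>");

    /** Returns the embedded file names found in package-entry divs, in document order. */
    static List<String> find(String xhtml) {
        List<String> names = new ArrayList<>();
        Matcher m = ENTRY.matcher(xhtml);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(find(BODY_2_2_1)); // prints []
        System.out.println(find(BODY_2_4_0)); // prints [image1.jpg]
    }
}
```

An empty list means no embedded file names were written to the content, which is the 2.2.1 behavior.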

This is related to TIKA-3644. The problem there was that we were writing 
embedded file names to the content stream for some file formats but not 
others.  I later realized that the change I made in TIKA-3644 was less than 
desirable, and I thought I had reverted it in TIKA-3711.

In Tika >= 2.4.0, you can turn off writing embedded file names to streams by 
configuring the EmbeddedDocumentExtractor like so:

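{noformat}
import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor;
import org.apache.tika.parser.ParseContext;

ParseContext parseContext = new ParseContext();
ParsingEmbeddedDocumentExtractor ex =
        new ParsingEmbeddedDocumentExtractor(parseContext);
// Do not write embedded file names (e.g. image1.jpg) into the content.
ex.setWriteFileNameToContent(false);
parseContext.set(EmbeddedDocumentExtractor.class, ex);
{noformat}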




> setExtractAnnotationText setting ignored after v2.3.0
> -----------------------------------------------------
>
>                 Key: TIKA-3765
>                 URL: https://issues.apache.org/jira/browse/TIKA-3765
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.3.0, 2.4.0
>            Reporter: Natalia Escalera
>            Priority: Minor
>         Attachments: TIKA-3765.pptx
>
>
> Prior to version 2.3.0, setting PDFParserConfig's setExtractAnnotationText 
> to false caused the image annotations in a PowerPoint document to be ignored.
> This is no longer the case in Tika >= 2.3.0:
>  
> {code:java}
> content == "Test Test 1 "
> |       |
> |       false
> |       11 differences (52% similarity)
> |       Test Test 1 (image1.jpg )
> |       Test Test 1 (-----------)
> Test Test 1 image1.jpg {code}
>  
> This issue can be easily reproduced by creating a pptx document with an image 
> and calling Tika with setExtractAnnotationText(false).
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
