Natalia Escalera created TIKA-3765:
--------------------------------------

             Summary: setExtractAnnotationText setting ignored after v2.3.0
                 Key: TIKA-3765
                 URL: https://issues.apache.org/jira/browse/TIKA-3765
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 2.4.0, 2.3.0
            Reporter: Natalia Escalera


Prior to version 2.3.0, setting setExtractAnnotationText to false on the
PDFParserConfig caused image annotations in a PowerPoint document to be ignored.

This is no longer the case in Tika >= 2.3.0. The image file name now appears in
the extracted text even when the setting is false:

 
{code:java}
content == "Test Test 1 "
|       |
|       false
|       11 differences (52% similarity)
|       Test Test 1 (image1.jpg )
|       Test Test 1 (-----------)
Test Test 1 image1.jpg {code}
 

This issue can be reproduced by creating a pptx document that contains an image
and parsing it with Tika after calling setExtractAnnotationText(false) on the
PDFParserConfig.
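
A regression check for this behaviour could look roughly like the sketch below. It is only an illustration: AnnotationLeakCheck and leakedImageNames are hypothetical names and not part of Tika, the list of image extensions is an assumption, and the two input strings are taken from the assertion output above. In a real test the content string would be the text Tika extracts from the pptx file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: scans extracted text for image file names.
final class AnnotationLeakCheck {

    // Whitespace-free tokens ending in a common image extension (assumed list).
    private static final Pattern IMAGE_NAME =
            Pattern.compile("\\S+\\.(?:jpe?g|png|gif)\\b", Pattern.CASE_INSENSITIVE);

    // Returns every token in the content that looks like an image file name.
    static List<String> leakedImageNames(String content) {
        List<String> names = new ArrayList<>();
        Matcher m = IMAGE_NAME.matcher(content);
        while (m.find()) {
            names.add(m.group());
        }
        return names;
    }

    public static void main(String[] args) {
        // Expected content when annotation text is suppressed.
        System.out.println(leakedImageNames("Test Test 1 "));
        // Content reported in this issue for Tika >= 2.3.0.
        System.out.println(leakedImageNames("Test Test 1 image1.jpg "));
    }
}
```

With setExtractAnnotationText(false) in effect, the returned list would be expected to be empty for a document like the one described here. A non-empty list indicates the behaviour reported in this issue.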

 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
