[
https://issues.apache.org/jira/browse/TIKA-3765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538221#comment-17538221
]
Tim Allison commented on TIKA-3765:
-----------------------------------
Very helpful. Thank you.
In 2.2.1, I see:
{noformat}
<meta name="dc:publisher" content=""/>
<title>Test</title>
</head>
<body><div class="slide-content"><div class="embedded" id="slide1_rId2"/>
<p>Test</p>
</div>
<div class="slide-master-content"/>
<div class="embedded" id="/docProps/thumbnail.jpeg"/>
</body></html>
{noformat}
In 2.4.0, I see:
{noformat}
<meta name="dc:publisher" content=""/>
<title>Test</title>
</head>
<body><div class="slide-content"><div class="embedded" id="slide1_rId2"/>
<p>Test</p>
</div>
<div class="slide-master-content"/>
<div class="package-entry"><h1>image1.jpg</h1>
</div>
<div class="embedded" id="/docProps/thumbnail.jpeg"/>
</body></html>
{noformat}
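To make the difference between the two outputs explicit, here is a minimal sketch that compares the two snippets above line by line using only the JDK. The class name XhtmlDiff, the method addedLines, and the two constants are hypothetical names for illustration and are not part of Tika. The comparison is a simple multiset difference, not a full diff algorithm.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Tika): lists the lines that appear in the
// 2.4.0 output but have no counterpart in the 2.2.1 output.
class XhtmlDiff {

    // Output fragment observed with Tika 2.2.1 (copied from the comment above).
    static final String TIKA_2_2_1 = String.join("\n",
            "<meta name=\"dc:publisher\" content=\"\"/>",
            "<title>Test</title>",
            "</head>",
            "<body><div class=\"slide-content\"><div class=\"embedded\" id=\"slide1_rId2\"/>",
            "<p>Test</p>",
            "</div>",
            "<div class=\"slide-master-content\"/>",
            "<div class=\"embedded\" id=\"/docProps/thumbnail.jpeg\"/>",
            "</body></html>");

    // Output fragment observed with Tika 2.4.0 (copied from the comment above).
    static final String TIKA_2_4_0 = String.join("\n",
            "<meta name=\"dc:publisher\" content=\"\"/>",
            "<title>Test</title>",
            "</head>",
            "<body><div class=\"slide-content\"><div class=\"embedded\" id=\"slide1_rId2\"/>",
            "<p>Test</p>",
            "</div>",
            "<div class=\"slide-master-content\"/>",
            "<div class=\"package-entry\"><h1>image1.jpg</h1>",
            "</div>",
            "<div class=\"embedded\" id=\"/docProps/thumbnail.jpeg\"/>",
            "</body></html>");

    // Returns the lines of 'after' that cannot be matched one-to-one against
    // a line of 'before' (multiset difference, order of 'after' preserved).
    static List<String> addedLines(String before, String after) {
        Map<String, Integer> remaining = new HashMap<>();
        for (String line : before.split("\n")) {
            remaining.merge(line, 1, Integer::sum);
        }
        List<String> added = new ArrayList<>();
        for (String line : after.split("\n")) {
            Integer count = remaining.get(line);
            if (count != null && count > 0) {
                remaining.put(line, count - 1); // matched an existing line
            } else {
                added.add(line);                // only present in 'after'
            }
        }
        return added;
    }

    public static void main(String[] args) {
        addedLines(TIKA_2_2_1, TIKA_2_4_0).forEach(System.out::println);
    }
}
```

Running main prints only the package-entry div and its closing tag, which is the embedded file name (image1.jpg) being written into the content stream.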
The extra package-entry div in 2.4.0 is related to TIKA-3644. The motivation
there was that we were reporting embedded file names in the stream for some
file formats but not others. I later realized that the change I made in
TIKA-3644 would be less than desirable, and I thought I had reverted it in
TIKA-3711.
In Tika >= 2.4.0, you can turn off writing embedded file names to the content
stream by configuring the EmbeddedDocumentExtractor like so:
{noformat}
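import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor;
import org.apache.tika.parser.ParseContext;

ParseContext parseContext = new ParseContext();
ParsingEmbeddedDocumentExtractor ex = new ParsingEmbeddedDocumentExtractor(parseContext);
// Do not write embedded file names (e.g. image1.jpg) into the content.
ex.setWriteFileNameToContent(false);
// Pass this parseContext to the parse call so the extractor is used.
parseContext.set(EmbeddedDocumentExtractor.class, ex);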
{noformat}
> setExtractAnnotationText setting ignored after v2.3.0
> -----------------------------------------------------
>
> Key: TIKA-3765
> URL: https://issues.apache.org/jira/browse/TIKA-3765
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.3.0, 2.4.0
> Reporter: Natalia Escalera
> Priority: Minor
> Attachments: TIKA-3765.pptx
>
>
> Prior to version 2.3.0, setting the PDFParserConfig setExtractAnnotationText
> to false ignored the image annotations in a PowerPoint document.
> This is no longer the case in Tika >= 2.3.0:
>
> {code:java}
> content == "Test Test 1 "
> | |
> | false
> | 11 differences (52% similarity)
> | Test Test 1 (image1.jpg )
> | Test Test 1 (-----------)
> Test Test 1 image1.jpg {code}
>
> This issue can be easily reproduced by creating a pptx document with an image
> and calling Tika with setExtractAnnotationText(false).
>