Kshitij created TIKA-4057:
-----------------------------
Summary: Skip Thumbnails from Metadata When Scanning PPTX files
Key: TIKA-4057
URL: https://issues.apache.org/jira/browse/TIKA-4057
Project: Tika
Issue Type: Wish
Components: metadata, mime
Affects Versions: 2.6.0
Reporter: Kshitij
I am scanning PPTX files with tika-parsers/tika-core 2.6.0 and using an
EmbeddedDocumentExtractor to check whether a PPTX file contains embedded
images. The embedded documents reported for the file include the thumbnail,
with MIME type "image/jpeg". The thumbnail's metadata has the key "dc:title"
with the value "/docProps/thumbnail.jpeg". As a result, even when the PPTX
file contains no embedded images, the check always reports "Embedded image
present" because of the thumbnail. Would it be possible to add a parameter to
OfficeParserConfig that skips thumbnails while parsing? Thanks.
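
Until such an option exists, one possible workaround is to filter thumbnails
out on the caller's side. The sketch below is only an illustration under stated
assumptions: it uses plain Java maps in place of Tika's Metadata objects, it
assumes the thumbnail can be recognised by a "dc:title" value starting with
"/docProps/thumbnail" (as observed above), and the class and method names
(ThumbnailFilter, isThumbnail, hasEmbeddedImage) are hypothetical, not part
of Tika.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not a Tika class. Each Map stands in for the
// metadata of one embedded document reported during parsing.
final class ThumbnailFilter {
    // Assumption taken from the report: the thumbnail's dc:title holds
    // its package part name, e.g. "/docProps/thumbnail.jpeg".
    static final String TITLE_KEY = "dc:title";
    static final String CONTENT_TYPE_KEY = "Content-Type";
    static final String THUMBNAIL_PREFIX = "/docprops/thumbnail";

    // True when the metadata looks like the package thumbnail.
    static boolean isThumbnail(Map<String, String> metadata) {
        String title = metadata.get(TITLE_KEY);
        return title != null
                && title.toLowerCase(Locale.ROOT).startsWith(THUMBNAIL_PREFIX);
    }

    // True when the content type is any image/* type.
    static boolean isImage(Map<String, String> metadata) {
        String type = metadata.get(CONTENT_TYPE_KEY);
        return type != null && type.startsWith("image/");
    }

    // True only if at least one embedded image is not the thumbnail.
    static boolean hasEmbeddedImage(List<Map<String, String>> embedded) {
        return embedded.stream().anyMatch(m -> isImage(m) && !isThumbnail(m));
    }
}
```

In a real setup the same check would presumably be applied to the Metadata
object passed to the EmbeddedDocumentExtractor (for example in
shouldParseEmbedded), so that the thumbnail is ignored before the "Embedded
image present" result is set.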
--
This message was sent by Atlassian Jira
(v8.20.10#820010)