Alexey Pismenskiy created TIKA-4363:
---------------------------------------
Summary: Duplicate text when OCR and extractMarkedContent
(PDFParserConfig) enabled
Key: TIKA-4363
URL: https://issues.apache.org/jira/browse/TIKA-4363
Project: Tika
Issue Type: Bug
Affects Versions: 2.9.2
Reporter: Alexey Pismenskiy
Attachments: MarkedPdfDuplicateTextWithTesseract.pdf,
tika-conf-override.xml
Extracting text from a marked PDF while OCR (Tesseract) and
extractMarkedContent are both enabled, with ocrStrategy left at its default of
"auto" in the PDFParser, produces duplicate text in the output.
Attached are an example configuration and the marked PDF file that triggered
this issue.
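
For readers without access to the attachments, the setup described above can be sketched roughly as follows. This is not the attached tika-conf-override.xml; it is a minimal illustration that assumes the usual Tika 2.x XML config layout, with Tesseract installed and picked up by the default parser. ocrStrategy is deliberately left unset so that it falls back to its default, as in the report.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch only; not the attached tika-conf-override.xml. -->
<properties>
  <parsers>
    <!-- Keep the default parsers (including OCR), but configure PDFParser separately. -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- Enable marked-content extraction; ocrStrategy is not set here,
             so the parser uses its default ("auto" per this report). -->
        <param name="extractMarkedContent" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
```

With a configuration along these lines, parsing the attached PDF would be expected to show the duplicated text if the behavior is as reported.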