Alexey Pismenskiy created TIKA-4363:
---------------------------------------

Summary: Duplicate text when OCR and extractMarkedContent (PDFParserConfig) enabled
Key: TIKA-4363
URL: https://issues.apache.org/jira/browse/TIKA-4363
Project: Tika
Issue Type: Bug
Affects Versions: 2.9.2
Reporter: Alexey Pismenskiy
Attachments: MarkedPdfDuplicateTextWithTesseract.pdf, tika-conf-override.xml

Extracting text from a marked PDF while OCR (Tesseract) and extractMarkedContent are both enabled, with the PDFParser's ocrStrategy left at its default of "auto", produces duplicated text in the output. Attached are an example configuration and the marked PDF file that trigger this issue.
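
For readers without access to the attachments, a configuration along the following lines should enable the combination described above. This is a hypothetical sketch, not the contents of the attached tika-conf-override.xml: the parameter names assume the PDFParser setters available in Tika 2.x, and ocrStrategy is written out explicitly only to make the default value visible.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical sketch; NOT the attached tika-conf-override.xml. -->
<properties>
  <parsers>
    <!-- Keep the default parsers, but configure PDFParser separately below. -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- Enable marked-content extraction, as described in the report. -->
        <param name="extractMarkedContent" type="bool">true</param>
        <!-- Assumption: the report says "auto" is the default; set here only for clarity. -->
        <param name="ocrStrategy" type="string">auto</param>
      </params>
    </parser>
  </parsers>
</properties>
```

With Tesseract installed and on the path, parsing the attached PDF with a configuration of this kind would be the expected way to reproduce the duplicated text.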