Alexey Pismenskiy created TIKA-4363:
---------------------------------------

             Summary: Duplicate text when OCR and extractMarkedContent 
(PDFParserConfig) enabled
                 Key: TIKA-4363
                 URL: https://issues.apache.org/jira/browse/TIKA-4363
             Project: Tika
          Issue Type: Bug
    Affects Versions: 2.9.2
            Reporter: Alexey Pismenskiy
         Attachments: MarkedPdfDuplicateTextWithTesseract.pdf, 
tika-conf-override.xml

Extracting a marked PDF while OCR (Tesseract) and extractMarkedContent are 
both enabled, with the ocrStrategy left at its default of "auto" in the 
PDFParser, produces duplicate text in the extracted output.

Attached are an example configuration and a marked PDF file that reproduce 
this issue.
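
For readers without the attachments, a configuration of roughly the following 
shape would match the description above. This is only a sketch that assumes 
the usual Tika 2.x tika-config.xml layout; it is not a copy of the attached 
tika-conf-override.xml, whose exact contents may differ.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch only; not the attached tika-conf-override.xml. -->
<properties>
  <parsers>
    <!-- Keep the default parsers, but configure PDFParser separately. -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- Enable marked-content extraction, as described in the report. -->
        <param name="extractMarkedContent" type="bool">true</param>
        <!-- ocrStrategy is deliberately not set, so it keeps its default,
             which the report describes as "auto". -->
      </params>
    </parser>
  </parsers>
</properties>
```

With Tesseract installed and available to Tika, this setup corresponds to the 
conditions under which the report describes the duplicated text.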

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
