[ https://issues.apache.org/jira/browse/TIKA-4363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexey Pismenskiy updated TIKA-4363:
------------------------------------
    Description:
Extracting a marked PDF while OCR (Tesseract) and extractMarkedContent are both enabled, and ocrStrategy is left at its default of "auto" in the PDFParser, produces duplicate text.

Attached are an example configuration and a marked PDF file that reproduce the issue with the following test:

{{@Test}}
{{public void testPDFDuplicate() throws Exception {}}
{{    // Load the attached Tika configuration (OCR and extractMarkedContent enabled).}}
{{    String tikaConfigFileName = "/test-documents/tika-conf-override.xml";}}
{{    TikaConfig tikaConfig = new TikaConfig(getClass().getResourceAsStream(tikaConfigFileName));}}
{{    Tika tika = new Tika(tikaConfig);}}
{{    String issueFile = "/test-documents/MarkedPdfDuplicateTextWithTesseract.pdf";}}
{{    URL resource = getClass().getResource(issueFile);}}
{{    assert resource != null;}}
{{    try (InputStream issueStream = resource.openStream()) {}}
{{        String issueContent = tika.parseToString(issueStream);}}
{{        System.out.println(issueContent);}}
{{        // The marker string must be present, and present exactly once.}}
{{        assertTrue(issueContent.contains("aabb6ba1-34ab-4af2"));}}
{{        assertEquals(1, StringUtils.countMatches(issueContent, "aabb6ba1-34ab-4af2"), "Does not contain the expected number of occurrences");}}
{{    }}}
{{}}}

PDFParser.java:214
 * This is where it checks for the extractMarkedContent flag and goes into the PDFMarkedContent2XHTML class.

AbstractPDF2XHTML.java:791 - 806
 * In this code, totalCharsPerPage is never updated by PDFMarkedContent2XHTML, so the page matches the conditions for performing OCR even though text has already been extracted.

One thing to note: if we turn off extractMarkedContent, parsing goes into PDF2XHTML at PDFParser.java:219 and totalCharsPerPage is updated properly.

> Duplicate text when OCR and extractMarkedContent (PDFParserConfig) enabled
> --------------------------------------------------------------------------
>
>                 Key: TIKA-4363
>                 URL: https://issues.apache.org/jira/browse/TIKA-4363
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 2.9.2
>            Reporter: Alexey Pismenskiy
>            Priority: Major
>         Attachments: MarkedPdfDuplicateTextWithTesseract.pdf, tika-conf-override.xml

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
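
The failure mode described in the analysis notes (a per-page character counter that one extraction path never updates, so the "auto" check still decides to run OCR) can be illustrated with a small standalone model. This is only a sketch and not Tika's actual code: the class AutoOcrSketch, the PageExtractor stand-in, the threshold value, and the way OCR output is simulated are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class AutoOcrSketch {

    /** Hypothetical stand-in for a text extractor that may or may not maintain its counter. */
    static final class PageExtractor {
        private final boolean updatesCounter;
        private int totalCharsPerPage = 0;

        PageExtractor(boolean updatesCounter) {
            this.updatesCounter = updatesCounter;
        }

        /** Returns the page text; only counts characters when updatesCounter is true. */
        String extractText(String pageText) {
            if (updatesCounter) {
                totalCharsPerPage += pageText.length();
            }
            return pageText;
        }

        int totalCharsPerPage() {
            return totalCharsPerPage;
        }
    }

    // Assumed threshold for this sketch: pages with fewer characters are treated as image-only.
    static final int MIN_CHARS_TO_SKIP_OCR = 10;

    /** Simplified "auto" decision: run OCR only when little or no text was counted. */
    static boolean shouldOcr(int totalCharsPerPage) {
        return totalCharsPerPage < MIN_CHARS_TO_SKIP_OCR;
    }

    /** Extracts the text layer, then appends simulated OCR output if the check asks for it. */
    static List<String> processPage(PageExtractor extractor, String pageText) {
        List<String> output = new ArrayList<>();
        output.add(extractor.extractText(pageText));
        if (shouldOcr(extractor.totalCharsPerPage())) {
            // Simulated OCR: the rendered page shows the same text, so it is emitted again.
            output.add(pageText);
        }
        return output;
    }

    public static void main(String[] args) {
        String page = "aabb6ba1-34ab-4af2";
        // Counter maintained: the text is emitted once.
        System.out.println(processPage(new PageExtractor(true), page).size());
        // Counter never updated: OCR also runs and the text is emitted twice.
        System.out.println(processPage(new PageExtractor(false), page).size());
    }
}
```

In this model the duplication disappears as soon as the extractor that produced the text also reports how many characters it wrote, which matches the observation that the PDF2XHTML path behaves correctly.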