[ https://issues.apache.org/jira/browse/TIKA-4363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17905503#comment-17905503 ]
Tim Allison edited comment on TIKA-4363 at 12/13/24 2:25 PM:
-------------------------------------------------------------

Thank you for opening this and explaining the problem in detail.

As I look at PDFMarkedContent2XHTML, I'm reminded that that handler builds the text from the structure tree root.

{noformat}
//TODO: figure out when we're crossing page boundaries during the recursion
// step above and do the page by page processing then...rather than dumping this
// all here.
{noformat}

The current code does not calculate which content from the structure tree root appears on which page. In short, it currently has no way of knowing the {{totalCharsPerPage}} for any given page.

The right solution is to do the {{TODO}}. Maybe we could do a minimal-effort algorithm that keeps a tally of {{totalCharsPerPage}} based on the currentPageRef?

Short of that, maybe turn off OCR if the code path goes through PDFMarkedContent2XHTML#processPages()?
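To make the "keep a tally" idea above concrete, here is a minimal sketch in plain Java. It is not Tika code: the class {{PageCharTally}}, its method names, and the {{minChars}} threshold are all hypothetical, and the page key is only a stand-in for whatever identifies the current page during the structure-tree recursion (the comment calls it currentPageRef). It shows the bookkeeping only, under the assumption that each text run can be attributed to a page at the moment it is emitted.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical helper, not part of Tika: tallies extracted characters per page
 * so that a later "should this page be OCR'd?" check has something to look at.
 */
final class PageCharTally {

    // Key is an assumption: any object that identifies the page currently
    // being visited while walking the structure tree.
    private final Map<Object, Integer> charsPerPage = new HashMap<>();

    /** Adds the length of one text run to the tally of the page it belongs to. */
    void add(Object pageRef, String text) {
        if (pageRef == null || text == null) {
            return; // nothing to attribute
        }
        charsPerPage.merge(pageRef, text.length(), Integer::sum);
    }

    /** Returns the number of characters recorded for the page, or 0 if none. */
    int totalCharsOnPage(Object pageRef) {
        return charsPerPage.getOrDefault(pageRef, 0);
    }

    /**
     * Assumed decision rule for illustration only: a page is an OCR candidate
     * when its tally is at or below minChars.
     */
    boolean looksLikeItNeedsOcr(Object pageRef, int minChars) {
        return totalCharsOnPage(pageRef) <= minChars;
    }
}
```

With a tally like this, a page that already yielded text from marked content would no longer look empty to the OCR check, while a page with no recorded text still would. Whether the real check can be fed per-page counts this way depends on how PDFMarkedContent2XHTML and AbstractPDF2XHTML share state, which this sketch does not model.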
> Duplicate text when OCR and extractMarkedContent (PDFParserConfig) enabled
> --------------------------------------------------------------------------
>
>                 Key: TIKA-4363
>                 URL: https://issues.apache.org/jira/browse/TIKA-4363
>             Project: Tika
>          Issue Type: Bug
>   Affects Versions: 2.9.2
>            Reporter: Alexey Pismenskiy
>            Assignee: Tim Allison
>            Priority: Major
>         Attachments: MarkedPdfDuplicateTextWithTesseract.pdf,
>                      tika-conf-override.xml
>
>
> Extracting a marked PDF while OCR (Tesseract) and extractMarkedContent are
> enabled, with the ocrStrategy defaulting to "auto" within the PDFParser,
> causes duplicate text extraction.
> Attached are an example configuration and a marked PDF file that
> reproduce the issue with the following test:
> {{@Test}}
> {{public void testPDFDuplicate() throws Exception {}}
> {{    String tikaConfigFileName = "/test-documents/tika-conf-override.xml";}}
> {{    TikaConfig tikaConfig = new TikaConfig(getClass().getResourceAsStream(tikaConfigFileName));}}
> {{    Tika tika = new Tika(tikaConfig);}}
> {{    String issueFile = "/test-documents/MarkedPdfDuplicateTextWithTesseract.pdf";}}
> {{    URL resource = getClass().getResource(issueFile);}}
> {{    assert resource != null;}}
> {{    try (InputStream issueStream = resource.openStream()) {}}
> {{        String issueContent = tika.parseToString(issueStream);}}
> {{        System.out.println(issueContent);}}
> {{        assertTrue(issueContent.contains("aabb6ba1-34ab-4af2"));}}
> {{        assertEquals(1, StringUtils.countMatches(issueContent, "aabb6ba1-34ab-4af2"), "Does not contain the expected number of occurrences");}}
> {{    }}}
> {{}}}
>
> PDFParser.java:214
> * This is where it checks for the extractMarkedContent flag and goes into
> the PDFMarkedContent2XHTML class.
>
> AbstractPDF2XHTML.java:791 - 806
> * In this code, totalCharsPerPage was never updated by
> PDFMarkedContent2XHTML, so the conditions to perform OCR on the PDF are met
> even though text has already been extracted.
>
> One thing to note: if we turn off extractMarkedContent, then it goes into
> PDF2XHTML on PDFParser.java:219 and the variable totalCharsPerPage gets
> updated properly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)