[ https://issues.apache.org/jira/browse/TIKA-4363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17913168#comment-17913168 ]
Tilman Hausherr edited comment on TIKA-4363 at 1/15/25 5:34 AM:
----------------------------------------------------------------

Maybe I misunderstood the question last year... here's an answer to what you may or may not have meant:
{code:java}
COSBase object = ((COSObject) pageBase).getObject();
if (object instanceof COSDictionary)
{
    int index = document.getPages().indexOf(new PDPage((COSDictionary) object)) + 1;
    System.out.println("page: " + index);
}
{code}
Also I don't understand why currentPageRef is used with a new type ObjectRef instead of just using COSObject or COSBase to have a unique key for MCID. (I made a TODO comment about that)


was (Author: tilman):
Maybe I misunderstood the question...
{code:java}
COSBase object = ((COSObject) pageBase).getObject();
if (object instanceof COSDictionary)
{
    int index = document.getPages().indexOf(new PDPage((COSDictionary) object)) + 1;
    System.out.println("page: " + index);
}
{code}
Also I don't understand why currentPageRef is used with a new type ObjectRef instead of just using COSObject or COSBase to have a unique key for MCID. (I made a TODO comment about that)

> Duplicate text when OCR and extractMarkedContent (PDFParserConfig) enabled
> --------------------------------------------------------------------------
>
>                 Key: TIKA-4363
>                 URL: https://issues.apache.org/jira/browse/TIKA-4363
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 2.9.2
>            Reporter: Alexey Pismenskiy
>            Assignee: Tim Allison
>            Priority: Major
>         Attachments: MarkedPdfDuplicateTextWithTesseract.pdf, tika-conf-override.xml
>
>
> Extracting a marked PDF with OCR (Tesseract) and extractMarkedContent enabled, and with the ocrStrategy defaulting to "auto" within the PDFParser, causes duplicate text extraction.
> Attached are an example configuration and a marked PDF file that reproduce the issue with the following test:
> {{@Test}}
> {{public void testPDFDuplicate() throws Exception {}}
> {{    String tikaConfigFileName = "/test-documents/tika-conf-override.xml";}}
> {{    TikaConfig tikaConfig = new TikaConfig(getClass().getResourceAsStream(tikaConfigFileName));}}
> {{    Tika tika = new Tika(tikaConfig);}}
> {{    String issueFile = "/test-documents/MarkedPdfDuplicateTextWithTesseract.pdf";}}
> {{    URL resource = getClass().getResource(issueFile);}}
> {{    assert resource != null;}}
> {{    try (InputStream issueStream = resource.openStream()) {}}
> {{        String issueContent = tika.parseToString(issueStream);}}
> {{        System.out.println(issueContent);}}
> {{        assertTrue(issueContent.contains("aabb6ba1-34ab-4af2"));}}
> {{        assertEquals(1, StringUtils.countMatches(issueContent, "aabb6ba1-34ab-4af2"), "Does not contain the expected number of occurrences");}}
> {{    }}}
> {{}}}
>
> PDFParser.java:214
> * This is where it checks the extractMarkedContent flag and goes into the PDFMarkedContent2XHTML class.
>
> AbstractPDF2XHTML.java:791 - 806
> * In this code, totalCharsPerPage was never updated by PDFMarkedContent2XHTML, so the page matches the conditions to perform OCR even though text has already been extracted.
> One thing to note: if we turn off extractMarkedContent, then it goes into PDF2XHTML on PDFParser.java:219 and the variable totalCharsPerPage gets updated properly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
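
The mechanism described in the report (a per-page character counter that one extraction path never updates, so the "auto" OCR check still fires) can be illustrated with a small stand-alone model. This is only a sketch under stated assumptions: `PageRenderer`, `countsChars`, `writeText`, `endPage` and `occurrences` are hypothetical names, not Tika classes or methods; the "zero characters means run OCR" rule is a simplification of whatever condition AbstractPDF2XHTML actually applies; and OCR is assumed to recognise the same text that was already extracted.

```java
import java.util.ArrayList;
import java.util.List;

public class OcrAutoSketch {

    /** Hypothetical stand-in for a page renderer; NOT a Tika class. */
    static class PageRenderer {
        // Whether the text-extraction path updates the per-page counter.
        private final boolean countsChars;
        private int totalCharsPerPage = 0;
        private final List<String> output = new ArrayList<>();

        PageRenderer(boolean countsChars) {
            this.countsChars = countsChars;
        }

        /** Emits extracted text and, optionally, counts its characters. */
        void writeText(String text) {
            output.add(text);
            if (countsChars) {
                totalCharsPerPage += text.length();
            }
        }

        /** Simplified "auto" decision: OCR only pages that appear to have no text. */
        void endPage(String ocrText) {
            if (totalCharsPerPage == 0) {
                output.add(ocrText);
            }
            totalCharsPerPage = 0; // reset for the next page
        }

        List<String> output() {
            return output;
        }
    }

    /** Counts how often pageText ends up in the output for one page. */
    public static long occurrences(boolean countsChars, String pageText) {
        PageRenderer renderer = new PageRenderer(countsChars);
        renderer.writeText(pageText);
        renderer.endPage(pageText); // assumption: OCR yields the same text
        return renderer.output().stream().filter(pageText::equals).count();
    }

    public static void main(String[] args) {
        String id = "aabb6ba1-34ab-4af2";
        System.out.println("counter updated:     " + occurrences(true, id));  // prints 1
        System.out.println("counter not updated: " + occurrences(false, id)); // prints 2
    }
}
```

In this model the duplicate disappears as soon as the text path updates the counter, which matches the observation that the PDF2XHTML path does not show the problem.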