[jira] [Comment Edited] (TIKA-4363) Duplicate text when OCR and extractMarkedContent (PDFParserConfig) enabled

Tilman Hausherr (Jira) Tue, 14 Jan 2025 21:52:21 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-4363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17913168#comment-17913168
 ]


Tilman Hausherr edited comment on TIKA-4363 at 1/15/25 5:34 AM:
----------------------------------------------------------------

Maybe I misunderstood the question last year... here's an answer to what you 
may or may not have meant:
{code:java}
COSBase object = ((COSObject) pageBase).getObject();
if (object instanceof COSDictionary) {
    int index = document.getPages().indexOf(new PDPage((COSDictionary) object)) + 1;
    System.out.println("page: " + index);
}
{code}
Also I don't understand why currentPageRef is used with a new type ObjectRef 
instead of just using COSObject or COSBase to have a unique key for MCID. (I 
made a TODO comment about that)


was (Author: tilman):
Maybe I misunderstood the question... 
{code:java}
COSBase object = ((COSObject) pageBase).getObject();
if (object instanceof COSDictionary) {
    int index = document.getPages().indexOf(new PDPage((COSDictionary) object)) + 1;
    System.out.println("page: " + index);
}
{code}
Also I don't understand why currentPageRef is used with a new type ObjectRef 
instead of just using COSObject or COSBase to have a unique key for MCID. (I 
made a TODO comment about that)

> Duplicate text when OCR and extractMarkedContent (PDFParserConfig) enabled
> --------------------------------------------------------------------------
>
>                 Key: TIKA-4363
>                 URL: https://issues.apache.org/jira/browse/TIKA-4363
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 2.9.2
>            Reporter: Alexey Pismenskiy
>            Assignee: Tim Allison
>            Priority: Major
>         Attachments: MarkedPdfDuplicateTextWithTesseract.pdf, 
> tika-conf-override.xml
>
>
> Extracting a marked PDF while OCR (Tesseract) and extractMarkedContent are 
> enabled and the ocrStrategy defaults to "auto" within the PDFParser 
> causes duplicate text extraction.
> Attached are an example configuration and a marked PDF file that 
> reproduce the issue with the following test: 
> {{@Test}}
> {{public void testPDFDuplicate() throws Exception {}}
> {{  String tikaConfigFileName = "/test-documents/tika-conf-override.xml";}}
> {{  TikaConfig tikaConfig = new TikaConfig(getClass().getResourceAsStream(tikaConfigFileName));}}
> {{  Tika tika = new Tika(tikaConfig);}}
> {{  String issueFile = "/test-documents/MarkedPdfDuplicateTextWithTesseract.pdf";}}
> {{  URL resource = getClass().getResource(issueFile);}}
> {{  assert resource != null;}}
> {{  try (InputStream issueStream = resource.openStream()) {}}
> {{    String issueContent = tika.parseToString(issueStream);}}
> {{    System.out.println(issueContent);}}
> {{    assertTrue(issueContent.contains("aabb6ba1-34ab-4af2"));}}
> {{    assertEquals(1, StringUtils.countMatches(issueContent, "aabb6ba1-34ab-4af2"), "Does not contain the expected number of occurrences");}}
> {{  }}}
> {{}}}
>  
> PDFParser.java:214
>  * This is where the extractMarkedContent flag is checked; when it is set, 
> parsing goes into the PDFMarkedContent2XHTML class.
>  
> AbstractPDF2XHTML.java:791 - 806
>  * In this code, totalCharsPerPage is never updated by 
> PDFMarkedContent2XHTML, so the conditions for performing OCR on the PDF are 
> met even though text has already been extracted.
> One thing to note: if we turn off extractMarkedContent, parsing goes into 
> PDF2XHTML at PDFParser.java:219 and the variable totalCharsPerPage gets 
> updated properly.
>  
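
The mechanism described in the issue can be illustrated with a minimal, 
self-contained sketch. This is not Tika's actual code: the class 
AutoOcrSketch, the method processPage, the flag extractorUpdatesCounter and 
the CHAR_THRESHOLD value are all hypothetical, and the "OCR output" is just a 
copy of the page text. It only shows how a per-page character counter that is 
never updated could cause an "auto" strategy to OCR a page whose text was 
already extracted.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of an "auto" OCR decision driven by a per-page character counter. */
public class AutoOcrSketch {

    // Assumed threshold: pages with at most this many extracted characters get OCR'd.
    static final int CHAR_THRESHOLD = 10;

    /** Simulates handling one page and returns every piece of text emitted for it. */
    static List<String> processPage(String pageText, boolean extractorUpdatesCounter) {
        List<String> emitted = new ArrayList<>();
        int totalCharsPerPage = 0;

        // Step 1: regular text extraction always emits the page text.
        emitted.add(pageText);
        if (extractorUpdatesCounter) {
            // The plain extraction path keeps the counter in sync.
            totalCharsPerPage += pageText.length();
        }

        // Step 2: the "auto" strategy runs OCR only if little text was counted.
        if (totalCharsPerPage <= CHAR_THRESHOLD) {
            emitted.add(pageText); // stand-in for OCR output of the same page
        }
        return emitted;
    }

    public static void main(String[] args) {
        String text = "id aabb6ba1-34ab-4af2 on a marked page";
        System.out.println(processPage(text, true).size());  // prints 1
        System.out.println(processPage(text, false).size()); // prints 2
    }
}
```

In this model, the second call corresponds to an extraction path that never 
touches the counter: the text is emitted once by extraction and once more by 
the OCR stand-in, which matches the duplicate occurrences the test above 
checks for.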



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
