[jira] [Comment Edited] (TIKA-4363) Duplicate text when OCR and extractMarkedContent (PDFParserConfig) enabled

Tim Allison (Jira) Fri, 13 Dec 2024 06:26:12 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-4363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17905503#comment-17905503
 ]


Tim Allison edited comment on TIKA-4363 at 12/13/24 2:25 PM:
-------------------------------------------------------------

Thank you for opening this and explaining the problem in detail.

As I look at PDFMarkedContent2XHTML, I'm reminded that that handler builds the 
text from the structure tree root. 

{noformat}
        //TODO: figure out when we're crossing page boundaries during the recursion
        // step above and do the page by page processing then...rather than dumping this
        // all here.
{noformat}

The current code does not calculate which content from the structure tree root 
appears on which page. In short, it currently has no way of knowing what 
{{totalCharsPerPage}} is for any given page.
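
To make the failure mode concrete, here is a simplified model of an "auto" OCR decision. It is not Tika's actual code: the class name, the method and the threshold value are all hypothetical, and the real conditions live in AbstractPDF2XHTML.

```java
/** Simplified model of an "auto" OCR decision; NOT Tika's actual code. */
final class AutoOcrModel {

    // Hypothetical threshold, chosen only for illustration.
    static final int MIN_CHARS_TO_SKIP_OCR = 10;

    /** Returns true when the page looks like it has too little text to skip OCR. */
    static boolean shouldOcr(int totalCharsPerPage) {
        return totalCharsPerPage < MIN_CHARS_TO_SKIP_OCR;
    }

    public static void main(String[] args) {
        int charsActuallyExtracted = 1200; // text emitted from the structure tree
        int totalCharsPerPage = 0;         // counter that was never updated

        // The decision only sees the stale counter, so OCR is requested
        // even though text was already extracted.
        System.out.println(shouldOcr(totalCharsPerPage));
        System.out.println(shouldOcr(charsActuallyExtracted));
    }
}
```

With the counter stuck at zero, the model always asks for OCR, and the OCR text is appended to the text that came from the structure tree.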

The right solution is to do the {{TODO}}. Maybe we could use a minimal-effort 
algorithm that keeps a tally of {{totalCharsPerPage}} based on the 
{{currentPageRef}}?
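
A minimal sketch of such a tally, assuming each piece of text emitted during the recursion can be associated with some page reference. {{PageCharTally}} and its methods are hypothetical names, and the {{Object}} key is only a stand-in for whatever type the handler uses for the current page reference.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: accumulates character counts per page reference. */
final class PageCharTally {

    // Insertion-ordered so pages come back in the order they were first seen.
    private final Map<Object, Integer> charsByPage = new LinkedHashMap<>();

    /** Adds the length of {@code text} to the running total for {@code pageRef}. */
    void add(Object pageRef, String text) {
        if (pageRef == null || text == null) {
            return; // nothing to attribute
        }
        // String.length() counts UTF-16 code units, which is enough for a rough tally.
        charsByPage.merge(pageRef, text.length(), Integer::sum);
    }

    /** Returns the tally for a page, or 0 if no text was attributed to it. */
    int totalCharsFor(Object pageRef) {
        return charsByPage.getOrDefault(pageRef, 0);
    }
}
```

The handler would call {{add}} wherever it writes text during the structure tree walk, and the per-page processing would read {{totalCharsFor}} before deciding whether to OCR that page.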

Short of that, maybe turn off OCR if the codepath goes through 
PDFMarkedContent2XHTML#processPages()?
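
The fallback could look roughly like this. The enum below is a local stand-in that mirrors the OCR strategy names, not the real {{PDFParserConfig}} type, and {{OcrFallback}} and {{effectiveStrategy}} are hypothetical.

```java
/** Local stand-in for the OCR strategy setting; not the real Tika enum. */
enum OcrStrategy { NO_OCR, OCR_ONLY, OCR_AND_TEXT_EXTRACTION, AUTO }

/** Hypothetical helper that picks the strategy to use for one parse. */
final class OcrFallback {

    /**
     * Returns NO_OCR whenever the marked-content code path is taken,
     * because that path cannot report per-page character counts.
     * Otherwise the configured strategy is returned unchanged.
     */
    static OcrStrategy effectiveStrategy(OcrStrategy configured, boolean markedContentPath) {
        if (markedContentPath) {
            return OcrStrategy.NO_OCR;
        }
        return configured;
    }
}
```

A narrower variant would downgrade only {{AUTO}}, since that is the strategy that depends on the character count, and leave an explicit OCR request alone.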






> Duplicate text when OCR and extractMarkedContent (PDFParserConfig) enabled
> --------------------------------------------------------------------------
>
>                 Key: TIKA-4363
>                 URL: https://issues.apache.org/jira/browse/TIKA-4363
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 2.9.2
>            Reporter: Alexey Pismenskiy
>            Assignee: Tim Allison
>            Priority: Major
>         Attachments: MarkedPdfDuplicateTextWithTesseract.pdf, 
> tika-conf-override.xml
>
>
> Extracting a marked PDF while OCR (Tesseract) and extractMarkedContent are 
> enabled, and the ocrStrategy defaults to "auto" within the PDFParser, 
> causes duplicate text extraction.
> Attached are an example configuration and a marked PDF file that 
> reproduce the issue with the following test: 
> {noformat}
> @Test
> public void testPDFDuplicate() throws Exception {
>   String tikaConfigFileName = "/test-documents/tika-conf-override.xml";
>   TikaConfig tikaConfig = new TikaConfig(getClass().getResourceAsStream(tikaConfigFileName));
>   Tika tika = new Tika(tikaConfig);
>   String issueFile = "/test-documents/MarkedPdfDuplicateTextWithTesseract.pdf";
>   URL resource = getClass().getResource(issueFile);
>   assert resource != null;
>   try (InputStream issueStream = resource.openStream()) {
>     String issueContent = tika.parseToString(issueStream);
>     System.out.println(issueContent);
>     assertTrue(issueContent.contains("aabb6ba1-34ab-4af2"));
>     assertEquals(1, StringUtils.countMatches(issueContent, "aabb6ba1-34ab-4af2"),
>         "Does not contain the expected number of occurrences");
>   }
> }
> {noformat}
>  
> PDFParser.java:214
>  * This is where the extractMarkedContent flag is checked and control goes 
> into the PDFMarkedContent2XHTML class.
>  
> AbstractPDF2XHTML.java:791 - 806
>  * In this code, totalCharsPerPage is never updated by 
> PDFMarkedContent2XHTML, so the conditions for performing OCR on the PDF are 
> met even though text has already been extracted.
> Note that if we turn off extractMarkedContent, the code goes into PDF2XHTML 
> at PDFParser.java:219 and the variable totalCharsPerPage is updated 
> properly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
