[
https://issues.apache.org/jira/browse/TIKA-4363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17905503#comment-17905503
]
Tim Allison edited comment on TIKA-4363 at 12/13/24 2:25 PM:
-------------------------------------------------------------
Thank you for opening this and explaining the problem in detail.
As I look at PDFMarkedContent2XHTML, I'm reminded that that handler builds the
text from the structure tree root.
{noformat}
//TODO: figure out when we're crossing page boundaries during the recursion
// step above and do the page by page processing then...rather than
// dumping this all here.
{noformat}
The current code does not calculate which content from the structure tree root
appears on which page. In short, it currently has no way of knowing what
{{totalCharsPerPage}} should be for any given page.
The right solution is to do the {{TODO}}. Maybe we could use a minimal-effort
algorithm that keeps a tally of {{totalCharsPerPage}} based on the
currentPageRef?
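A rough sketch of what such a tally could look like. This is not Tika code:
the class {{PageCharTally}}, its methods, and the use of an int as the page
key are all hypothetical, and it assumes the structure-tree walk can say which
page it is on each time it emits text.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of Tika): counts emitted characters per page. */
class PageCharTally {
    // Key is a page identifier; deriving it from currentPageRef is an assumption.
    private final Map<Integer, Integer> charsPerPage = new HashMap<>();

    /** Record text emitted while the recursion is on the given page. */
    void add(int pageRef, String text) {
        if (text == null || text.isEmpty()) {
            return; // nothing to count
        }
        charsPerPage.merge(pageRef, text.length(), Integer::sum);
    }

    /** Characters seen so far for the page; 0 if the page produced no text. */
    int totalCharsOn(int pageRef) {
        return charsPerPage.getOrDefault(pageRef, 0);
    }
}
```

The per-page value could then be used in place of the never-updated
{{totalCharsPerPage}} when deciding whether a page needs OCR.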
Short of that, maybe turn off OCR if the codepath goes through
PDFMarkedContent2XHTML#processPages()?
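The stopgap could be as small as the guard below. It is a simplified model
only: the {{OcrStrategy}} enum is a local stand-in rather than Tika's own
type, and {{OcrGuard}}, {{effectiveStrategy}} and the {{markedContentPath}}
flag are hypothetical names.

```java
/** Simplified, hypothetical model of the stopgap; not the actual Tika code. */
class OcrGuard {
    // Local stand-in for the configured OCR strategy (assumption).
    enum OcrStrategy { NO_OCR, AUTO, OCR_ONLY }

    /**
     * Returns the strategy to apply for this parse. When text comes from the
     * marked-content path, per-page character counts are unavailable, so OCR
     * is switched off instead of being triggered by a count of zero.
     */
    static OcrStrategy effectiveStrategy(OcrStrategy configured, boolean markedContentPath) {
        if (markedContentPath) {
            return OcrStrategy.NO_OCR;
        }
        return configured;
    }
}
```

This trades away OCR on marked PDFs entirely, so pages that really are
image-only would be missed until the per-page tally exists.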
> Duplicate text when OCR and extractMarkedContent (PDFParserConfig) enabled
> --------------------------------------------------------------------------
>
> Key: TIKA-4363
> URL: https://issues.apache.org/jira/browse/TIKA-4363
> Project: Tika
> Issue Type: Bug
> Affects Versions: 2.9.2
> Reporter: Alexey Pismenskiy
> Assignee: Tim Allison
> Priority: Major
> Attachments: MarkedPdfDuplicateTextWithTesseract.pdf,
> tika-conf-override.xml
>
>
> Extracting a marked PDF while OCR (Tesseract) and extractMarkedContent are
> enabled and the ocrStrategy defaults to "auto" within the PDFParser causes
> duplicate text extraction.
> Attached are an example configuration and a marked PDF file that
> reproduce the issue with the following test:
> {{@Test}}
> {{public void testPDFDuplicate() throws Exception {}}
> {{ String tikaConfigFileName = "/test-documents/tika-conf-override.xml";}}
> {{ TikaConfig tikaConfig = new TikaConfig(getClass().getResourceAsStream(tikaConfigFileName));}}
> {{ Tika tika = new Tika(tikaConfig);}}
> {{ String issueFile = "/test-documents/MarkedPdfDuplicateTextWithTesseract.pdf";}}
> {{ URL resource = getClass().getResource(issueFile);}}
> {{ assert resource != null;}}
> {{ try (InputStream issueStream = resource.openStream()) {}}
> {{ String issueContent = tika.parseToString(issueStream);}}
> {{ System.out.println(issueContent);}}
> {{ assertTrue(issueContent.contains("aabb6ba1-34ab-4af2"));}}
> {{ assertEquals(1, StringUtils.countMatches(issueContent, "aabb6ba1-34ab-4af2"), "Does not contain the expected number of occurrences");}}
> {{ }}}
> {{}}}
>
> PDFParser.java:214
> * This is where it checks for the extractMarkedContent flag and will go into
> the PDFMarkedContent2XHTML class.
>
> AbstractPDF2XHTML.java:791 - 806
> * In this code, totalCharsPerPage is never updated by
> PDFMarkedContent2XHTML, so the conditions for performing OCR on the PDF are
> met even though text has already been extracted.
> Note that if we turn off extractMarkedContent, the code goes into
> PDF2XHTML at PDFParser.java:219 and totalCharsPerPage gets updated
> properly.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)