[
https://issues.apache.org/jira/browse/TIKA-4363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17913312#comment-17913312
]
Hudson commented on TIKA-4363:
------------------------------
SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk17 #603 (See
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk17/603/])
TIKA-4363: refactor (tilman:
[https://github.com/apache/tika/commit/657e75b53b82b03d5e296c23687f2e913e0ba4ac])
* (edit)
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFMarkedContent2XHTML.java
> Duplicate text when OCR and extractMarkedContent (PDFParserConfig) enabled
> --------------------------------------------------------------------------
>
> Key: TIKA-4363
> URL: https://issues.apache.org/jira/browse/TIKA-4363
> Project: Tika
> Issue Type: Bug
> Affects Versions: 2.9.2
> Reporter: Alexey Pismenskiy
> Assignee: Tim Allison
> Priority: Major
> Attachments: MarkedPdfDuplicateTextWithTesseract.pdf,
> tika-conf-override.xml
>
>
> Extracting a marked PDF while OCR (Tesseract) and extractMarkedContent are
> enabled, with ocrStrategy left at its default of "auto" in the PDFParser,
> produces duplicate text in the output.
> Attached are an example configuration and a marked PDF file that reproduce
> the issue with the following test:
> {{@Test}}
> {{public void testPDFDuplicate() throws Exception {}}
> {{    String tikaConfigFileName = "/test-documents/tika-conf-override.xml";}}
> {{    TikaConfig tikaConfig = new TikaConfig(getClass().getResourceAsStream(tikaConfigFileName));}}
> {{    Tika tika = new Tika(tikaConfig);}}
> {{    String issueFile = "/test-documents/MarkedPdfDuplicateTextWithTesseract.pdf";}}
> {{    URL resource = getClass().getResource(issueFile);}}
> {{    assert resource != null;}}
> {{    try (InputStream issueStream = resource.openStream()) {}}
> {{        String issueContent = tika.parseToString(issueStream);}}
> {{        System.out.println(issueContent);}}
> {{        assertTrue(issueContent.contains("aabb6ba1-34ab-4af2"));}}
> {{        assertEquals(1, StringUtils.countMatches(issueContent, "aabb6ba1-34ab-4af2"), "Does not contain the expected number of occurrences");}}
> {{    }}}
> {{}}}
>
> PDFParser.java:214
> * This is where the parser checks the extractMarkedContent flag and, when it
> is set, hands off to the PDFMarkedContent2XHTML class.
>
> AbstractPDF2XHTML.java:791 - 806
> * In this code, totalCharsPerPage is never updated by
> PDFMarkedContent2XHTML, so the page still meets the conditions for OCR even
> though text has already been extracted.
> Note that if extractMarkedContent is turned off, parsing goes through
> PDF2XHTML at PDFParser.java:219 instead, and totalCharsPerPage is updated
> properly.
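The mechanism described in the issue can be illustrated with a small standalone sketch. This is not Tika's code: the class name, the `processPage` method, the `extractorUpdatesCounter` flag and the `MIN_CHARS_TO_SKIP_OCR` threshold are all hypothetical, and the real "auto" OCR heuristic in AbstractPDF2XHTML is more involved. The sketch only shows how a per-page character counter that one extraction path forgets to update can cause an OCR pass on a page whose text was already extracted, so the same text appears twice.

```java
import java.util.ArrayList;
import java.util.List;

public class OcrDecisionSketch {

    // Hypothetical threshold: pages with fewer counted characters are sent to OCR.
    static final int MIN_CHARS_TO_SKIP_OCR = 10;

    /**
     * Simulates processing one page. The flag models whether the text
     * extractor reports its character count back to the shared counter
     * (the plain-text path does, the marked-content path in the report did not).
     */
    public static List<String> processPage(String pageText, boolean extractorUpdatesCounter) {
        List<String> output = new ArrayList<>();
        int totalCharsPerPage = 0;

        // Text extraction step: the text always reaches the output.
        output.add(pageText);
        if (extractorUpdatesCounter) {
            totalCharsPerPage += pageText.length();
        }

        // OCR decision step: relies only on the counter, not on the output.
        if (totalCharsPerPage < MIN_CHARS_TO_SKIP_OCR) {
            output.add(pageText); // stand-in for the OCR result of the same page
        }
        return output;
    }

    /** Counts non-overlapping occurrences of a non-empty needle in haystack. */
    public static int countMatches(String haystack, String needle) {
        int count = 0;
        int from = 0;
        while ((from = haystack.indexOf(needle, from)) != -1) {
            count++;
            from += needle.length();
        }
        return count;
    }

    public static void main(String[] args) {
        String page = "id aabb6ba1-34ab-4af2 on a marked page";
        String marker = "aabb6ba1-34ab-4af2";

        String stale = String.join("\n", processPage(page, false));
        String updated = String.join("\n", processPage(page, true));

        System.out.println("counter not updated: " + countMatches(stale, marker) + " occurrence(s)");
        System.out.println("counter updated:     " + countMatches(updated, marker) + " occurrence(s)");
    }
}
```

Under these assumptions, the run with the stale counter reproduces the symptom the test above checks for (more than one occurrence of the marker), while the run that updates the counter yields a single occurrence.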
--
This message was sent by Atlassian Jira
(v8.20.10#820010)