[jira] [Updated] (TIKA-4363) Duplicate text when OCR and extractMarkedContent (PDFParserConfig) enabled

Alexey Pismenskiy (Jira) Wed, 11 Dec 2024 09:00:55 -0800


     [ https://issues.apache.org/jira/browse/TIKA-4363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Alexey Pismenskiy updated TIKA-4363:
------------------------------------
    Description: 
Extracting text from a marked PDF while OCR (Tesseract) and 
extractMarkedContent are both enabled, with the ocrStrategy left at its 
default of "auto" in the PDFParser, produces duplicate text.

Attached are an example configuration and a marked PDF file that reproduce 
the issue with the following test: 

{{@Test}}
{{public void testPDFDuplicate() throws Exception {}}
{{  String tikaConfigFileName = "/test-documents/tika-conf-override.xml";}}
{{  TikaConfig tikaConfig = new TikaConfig(getClass().getResourceAsStream(tikaConfigFileName));}}
{{  Tika tika = new Tika(tikaConfig);}}
{{  String issueFile = "/test-documents/MarkedPdfDuplicateTextWithTesseract.pdf";}}
{{  URL resource = getClass().getResource(issueFile);}}
{{  assert resource != null;}}
{{  try (InputStream issueStream = resource.openStream()) {}}
{{    String issueContent = tika.parseToString(issueStream);}}
{{    System.out.println(issueContent);}}
{{    assertTrue(issueContent.contains("aabb6ba1-34ab-4af2"));}}
{{    assertEquals(1, StringUtils.countMatches(issueContent, "aabb6ba1-34ab-4af2"), "Does not contain the expected number of occurrences");}}
{{  }}}
{{}}}
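
The duplicate check in the test above depends on StringUtils.countMatches from Apache Commons Lang. For anyone reproducing the issue without that dependency, the following is a minimal sketch of the same counting logic using only the JDK. The class and method names (MarkerCount, countOccurrences, isDuplicated) are hypothetical and are not part of Tika or of the attached test.

```java
// Hypothetical helper for illustration only; not part of Tika or the attached test.
final class MarkerCount {

    private MarkerCount() {
        // Static utility class, no instances.
    }

    /**
     * Counts non-overlapping occurrences of marker in text.
     * Returns 0 when either argument is null or marker is empty.
     */
    static int countOccurrences(String text, String marker) {
        if (text == null || marker == null || marker.isEmpty()) {
            return 0;
        }
        int count = 0;
        int from = 0;
        // Advance past each match so that matches never overlap.
        while ((from = text.indexOf(marker, from)) != -1) {
            count++;
            from += marker.length();
        }
        return count;
    }

    /** Returns true when marker appears more than once in text. */
    static boolean isDuplicated(String text, String marker) {
        return countOccurrences(text, marker) > 1;
    }
}
```

With this helper, the final assertion in the test could be written as assertEquals(1, MarkerCount.countOccurrences(issueContent, "aabb6ba1-34ab-4af2")), assuming the extracted text is available as a String.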

  was:
Extracting text from a marked PDF while OCR (Tesseract) and 
extractMarkedContent are both enabled, with the ocrStrategy left at its 
default of "auto" in the PDFParser, produces duplicate text.

Attached are an example configuration and a marked PDF file that caused this 
issue. 


> Duplicate text when OCR and extractMarkedContent (PDFParserConfig) enabled
> --------------------------------------------------------------------------
>
>                 Key: TIKA-4363
>                 URL: https://issues.apache.org/jira/browse/TIKA-4363
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 2.9.2
>            Reporter: Alexey Pismenskiy
>            Priority: Major
>         Attachments: MarkedPdfDuplicateTextWithTesseract.pdf, 
> tika-conf-override.xml
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
