[jira] [Commented] (TIKA-3202) Tika duplicates the ocr text

Tim Allison (Jira) Tue, 22 Sep 2020 14:33:12 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-3202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17200395#comment-17200395
 ]


Tim Allison commented on TIKA-3202:
-----------------------------------

If I understand correctly, that's how it is designed.  You can get "only text", 
"only ocr" or both.  

We aren't doing anything smart behind the scenes with page locations to figure 
out where the electronic text is vs. where the images are. We render each page 
and then run OCR on the full page, so any electronic text on the page is also 
picked up by the OCR and shows up a second time.

If you get the xhtml version of the output, we do put the OCR'd text into its 
own {{<div>}} so that you can programmatically see which text came from the 
electronic text and which from the OCR.
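
For illustration, here is a minimal Python sketch of separating the two kinds 
of text in the xhtml output. It assumes the OCR'd text sits in a 
{{<div class="ocr">}}; that class name is an assumption to check against your 
Tika version. The SAMPLE string is a hypothetical, hand-written stand-in for 
real output, and OcrSplitter and split_text are names made up for this example.

```python
from html.parser import HTMLParser


class OcrSplitter(HTMLParser):
    """Collect body text inside the OCR <div> separately from all other text."""

    def __init__(self, ocr_class="ocr"):  # class name is an assumption
        super().__init__()
        self.ocr_class = ocr_class
        self.div_stack = []   # one bool per open <div>: True if it is an OCR div
        self.in_body = False  # skip <head> content such as <title>
        self.electronic = []
        self.ocr = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            self.div_stack.append(self.ocr_class in classes)

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag == "div" and self.div_stack:
            self.div_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.in_body:
            return
        # Text nested anywhere under an OCR div counts as OCR text.
        target = self.ocr if any(self.div_stack) else self.electronic
        target.append(text)


def split_text(xhtml):
    """Return (electronic_text, ocr_text) extracted from an XHTML string."""
    parser = OcrSplitter()
    parser.feed(xhtml)
    parser.close()
    return "\n".join(parser.electronic), "\n".join(parser.ocr)


# Hypothetical sample, NOT real Tika output.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title></head>
<body><div class="page"><p>There is some text</p>
<div class="ocr">There is some text
There is an image!!</div>
</div></body></html>"""

electronic, ocr = split_text(SAMPLE)
print("electronic:", electronic)
print("ocr:", ocr)
```

With tika-server, the xhtml output is typically requested by adding an 
"Accept: text/html" header to the same PUT request. Once the two streams are 
separated, the caller can decide whether to keep only one of them or to 
de-duplicate the overlap.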

> Tika duplicates the ocr text
> ----------------------------
>
>                 Key: TIKA-3202
>                 URL: https://issues.apache.org/jira/browse/TIKA-3202
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.24.1
>            Reporter: marek kapowicki
>            Priority: Major
>         Attachments: text_and_image.pdf
>
>
> I'm using Tika 1.24.1 together with Tesseract from the Docker image 
> apache/tika:1.24-full.
> The header X-Tika-PDFocrStrategy: OCR_AND_TEXT causes the issue:
> the output from PDF processing is duplicated.
> The output from the attached PDF file is:
> {code:java}
> There is some text 
> [image: image0.jpg]
> There is some textT
> here is an image!!
> {code}
> the curl to reproduce:
> {code:java}
> curl -H "X-Tika-PDFextractInlineImages: true" \
>      -H "X-Tika-PDFocrStrategy: OCR_AND_TEXT" \
>      -T text_and_image.pdf http://localhost:9998/tika
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
