[jira] [Closed] (TIKA-3202) Tika duplicates the ocr text

marek kapowicki (Jira) Tue, 22 Sep 2020 22:55:52 -0700


     [ 
https://issues.apache.org/jira/browse/TIKA-3202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


marek kapowicki closed TIKA-3202.
---------------------------------
    Resolution: Works for Me

> Tika duplicates the ocr text
> ----------------------------
>
>                 Key: TIKA-3202
>                 URL: https://issues.apache.org/jira/browse/TIKA-3202
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.24.1
>            Reporter: marek kapowicki
>            Priority: Major
>         Attachments: text_and_image.pdf
>
>
> I'm using Tika 1.24.1 together with Tesseract from the Docker image 
> apache/tika:1.24-full.
> Setting the header X-Tika-PDFocrStrategy: OCR_AND_TEXT causes the issue:
> the text extracted from the PDF is duplicated.
> The output for the attached PDF file is:
> {code:java}
> There is some text 
> [image: image0.jpg]
> There is some textT
> here is an image!!
> {code}
> The curl command to reproduce:
> {code:java}
> curl -H "X-Tika-PDFextractInlineImages: true" \
>      -H "X-Tika-PDFocrStrategy: OCR_AND_TEXT" \
>      -T text_and_image.pdf http://localhost:9998/tika
> {code}
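
For anyone who wants to script the reproduction instead of using curl, here is a minimal Python sketch of the same request. It assumes the Tika server from the report is listening on http://localhost:9998. The function names (build_request, extract_text, count_occurrences) are made up for illustration and are not part of Tika. The explicit Content-Type header is also an assumption: it is not in the original curl command, and is set here only because urllib would otherwise label the upload as form data.

```python
import urllib.request

# Assumption: Tika server address taken from the curl command in the report.
TIKA_URL = "http://localhost:9998/tika"


def build_request(pdf_bytes, ocr_strategy="OCR_AND_TEXT", url=TIKA_URL):
    """Build the PUT request that `curl -T` would send, without sending it."""
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        method="PUT",
        headers={
            # The two headers from the report.
            "X-Tika-PDFextractInlineImages": "true",
            "X-Tika-PDFocrStrategy": ocr_strategy,
            # Assumption: not in the original command (see note above).
            "Content-Type": "application/pdf",
        },
    )


def extract_text(pdf_path, ocr_strategy="OCR_AND_TEXT"):
    """Send the PDF to the Tika server and return the extracted text."""
    with open(pdf_path, "rb") as f:
        request = build_request(f.read(), ocr_strategy)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def count_occurrences(text, phrase):
    """Count how often a phrase appears in the extracted text."""
    return text.count(phrase)
```

Calling extract_text("text_and_image.pdf") and then count_occurrences(output, "There is some text") gives a quick check for the duplication: on the output quoted in the report the phrase appears twice, although it should appear once.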



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
