[jira] [Commented] (TIKA-3202) Tika duplicates the ocr text

marek kapowicki (Jira) Tue, 22 Sep 2020 14:43:12 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-3202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17200397#comment-17200397
 ]


marek kapowicki commented on TIKA-3202:
---------------------------------------

ONLY_OCR and NO_OCR work fine, and now I can see how OCR_AND_TEXT is designed.

The goal I wanted to achieve is to use OCR for images and text extraction for 
electronic text, and I made the wrong assumption that this flag is for that 
purpose.
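
For readers with the same goal, a minimal sketch of the request that the comment 
reports as working: the header names and server URL are copied from the issue 
description, the function name is hypothetical, and it is an assumption (not 
confirmed in this thread) that inline images are OCR'd only when Tesseract is 
available to the server, as in the apache/tika:1.24-full image.

```shell
#!/bin/sh
# Hypothetical helper, not part of Tika: PUT a PDF to a local tika-server.
# Inline images are extracted as embedded documents, while the PDF OCR
# strategy stays at NO_OCR so the page as a whole is not OCR'd in addition
# to normal text extraction.
TIKA_URL="http://localhost:9998/tika"

tika_text_plus_inline_ocr() {
  curl -s \
    -H "X-Tika-PDFextractInlineImages: true" \
    -H "X-Tika-PDFocrStrategy: NO_OCR" \
    -T "$1" "$TIKA_URL"
}

# Usage (requires a running server):
#   tika_text_plus_inline_ocr text_and_image.pdf
```

The sketch only defines the function, so nothing is sent until it is called with 
a file name.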

> Tika duplicates the ocr text
> ----------------------------
>
>                 Key: TIKA-3202
>                 URL: https://issues.apache.org/jira/browse/TIKA-3202
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.24.1
>            Reporter: marek kapowicki
>            Priority: Major
>         Attachments: text_and_image.pdf
>
>
> I'm using Tika 1.24.1 together with Tesseract from the docker image 
> apache/tika:1.24-full.
> The header X-Tika-PDFocrStrategy: OCR_AND_TEXT causes the issue:
> the output from PDF processing is duplicated.
> The output from the attached PDF file is:
> {code:java}
> There is some text 
> [image: image0.jpg]
> There is some textT
> here is an image!!
> {code}
> the curl to reproduce:
> {code:java}
> curl -H "X-Tika-PDFextractInlineImages: true" \
>      -H "X-Tika-PDFocrStrategy: OCR_AND_TEXT" \
>      -T text_and_image.pdf http://localhost:9998/tika
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
