[
https://issues.apache.org/jira/browse/TIKA-3202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
marek kapowicki closed TIKA-3202.
---------------------------------
Resolution: Works for Me
> Tika duplicates the ocr text
> ----------------------------
>
> Key: TIKA-3202
> URL: https://issues.apache.org/jira/browse/TIKA-3202
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.24.1
> Reporter: marek kapowicki
> Priority: Major
> Attachments: text_and_image.pdf
>
>
> I'm using Tika 1.24.1 together with Tesseract from the Docker image
> apache/tika:1.24-full.
> The header X-Tika-PDFocrStrategy: OCR_AND_TEXT triggers the issue:
> the output from PDF processing is duplicated.
> The output from the attached PDF file is:
> {code:java}
> There is some text
> [image: image0.jpg]
> There is some textT
> here is an image!!
> {code}
> the curl to reproduce:
> {code:java}
> curl -H "X-Tika-PDFextractInlineImages: true" \
>      -H "X-Tika-PDFocrStrategy: OCR_AND_TEXT" \
>      -T text_and_image.pdf http://localhost:9998/tika
> {code}
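The duplication reported above can be checked mechanically once the response body from the curl call has been saved. The following is a minimal sketch, not part of Tika: the helper name `find_repeated_lines` is hypothetical, and the prefix-match heuristic is an assumption chosen because the second copy in the reported output runs into the following text ("There is some textT").

```python
def find_repeated_lines(text):
    """Return lines that reappear as the prefix of a later line.

    Heuristic only (an assumption, not Tika behaviour): a prefix match is
    used instead of strict equality because the repeated copy in the
    reported output is fused with the next characters.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    repeated = []
    for i, earlier in enumerate(lines):
        for later in lines[i + 1:]:
            if later.startswith(earlier) and earlier not in repeated:
                repeated.append(earlier)
    return repeated


# Output quoted in the issue description for text_and_image.pdf.
sample = """There is some text
[image: image0.jpg]
There is some textT
here is an image!!
"""

print(find_repeated_lines(sample))
```

Running the same helper on the response obtained with and without the X-Tika-PDFocrStrategy header would be one way to confirm that the header is what introduces the second copy.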
--
This message was sent by Atlassian Jira
(v8.3.4#803005)