[ https://issues.apache.org/jira/browse/TIKA-3202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17200397#comment-17200397 ]
marek kapowicki commented on TIKA-3202:
---------------------------------------

ONLY_OCR and no_ocr work fine. But now I can see how ocr_and_text is designed.
The goal I wanted to achieve is to use OCR for images and text extraction for electronic text, and I made the wrong assumption that this flag is for this purpose.

> Tika duplicates the ocr text
> ----------------------------
>
>                 Key: TIKA-3202
>                 URL: https://issues.apache.org/jira/browse/TIKA-3202
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.24.1
>            Reporter: marek kapowicki
>            Priority: Major
>         Attachments: text_and_image.pdf
>
>
> I'm using Tika 1.24.1 together with Tesseract from the Docker image apache/tika:1.24-full.
> The header X-Tika-PDFocrStrategy: OCR_AND_TEXT causes the issue:
> the output from PDF processing is duplicated.
> The output from the attached PDF file is:
> {code:java}
> There is some text
> [image: image0.jpg]
> There is some textT
> here is an image!!
> {code}
> The curl command to reproduce:
> {code:java}
> curl -H "X-Tika-PDFextractInlineImages: true" -H "X-Tika-PDFocrStrategy: OCR_AND_TEXT" -T text_and_image.pdf http://localhost:9998/tika
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
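
To compare the strategies mentioned in the comment side by side, the reproduction command can be wrapped in a small loop. This is only a sketch: the server URL, the headers and the file name are taken from the report, while the helper names `tika_cmd` and `compare_strategies` are hypothetical, and the strategy spellings are an assumption (the comment writes ONLY_OCR; OCR_ONLY is assumed here and may need adjusting for your Tika version).

```shell
#!/bin/sh
# Sketch: send one PDF to tika-server once per OCR strategy.
# Assumption: the server from the report listens on localhost:9998.
TIKA_URL="${TIKA_URL:-http://localhost:9998/tika}"

# Hypothetical helper: print the curl command for one strategy (dry run).
tika_cmd() {
  strategy="$1"
  file="$2"
  printf 'curl -H "X-Tika-PDFextractInlineImages: true" -H "X-Tika-PDFocrStrategy: %s" -T %s %s\n' \
    "$strategy" "$file" "$TIKA_URL"
}

# Hypothetical helper: run every strategy against one file and save each
# response to its own file (out_<strategy>.txt) for later comparison.
# The strategy names are assumed spellings, not confirmed by the report.
compare_strategies() {
  file="$1"
  for strategy in NO_OCR OCR_ONLY OCR_AND_TEXT; do
    curl -s \
      -H "X-Tika-PDFextractInlineImages: true" \
      -H "X-Tika-PDFocrStrategy: $strategy" \
      -T "$file" "$TIKA_URL" > "out_$strategy.txt"
  done
}
```

After sourcing the script, `compare_strategies text_and_image.pdf` followed by `diff out_NO_OCR.txt out_OCR_AND_TEXT.txt` should make any duplicated text visible, and `tika_cmd OCR_ONLY text_and_image.pdf` prints the command without contacting the server.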