[ https://issues.apache.org/jira/browse/TIKA-3202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17200395#comment-17200395 ]
Tim Allison commented on TIKA-3202:
-----------------------------------

If I understand correctly, that's how it is designed. You can get "only text", "only ocr" or both.

We aren't doing anything smart behind the scenes with page locations to figure out where the electronic text is vs. where the images are. We're rendering each page and then running OCR on the full page, so any electronic text on the page is picked up a second time by the OCR.

If you get the xhtml version of the output, we do put the OCR'd text into its own {{<div>}} so that you can programmatically see which text came from the electronic text and which from the OCR.

> Tika duplicates the ocr text
> ----------------------------
>
> Key: TIKA-3202
> URL: https://issues.apache.org/jira/browse/TIKA-3202
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.24.1
> Reporter: marek kapowicki
> Priority: Major
> Attachments: text_and_image.pdf
>
>
> I'm using Tika 1.24.1 together with Tesseract from the Docker image apache/tika:1.24-full.
> The header X-Tika-PDFocrStrategy: OCR_AND_TEXT causes the issue:
> the output from PDF processing is duplicated.
> The output from the attached PDF file is:
> {code:java}
> There is some text
> [image: image0.jpg]
> There is some textT
> here is an image!!
> {code}
> The curl command to reproduce:
> {code:java}
> curl -H "X-Tika-PDFextractInlineImages: true" -H "X-Tika-PDFocrStrategy: OCR_AND_TEXT" -T text_and_image.pdf http://localhost:9998/tika
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
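
As an illustration of the point in the comment that the XHTML output keeps OCR'd text in its own {{<div>}}, here is a minimal Python sketch that splits such output into electronic text and OCR text. It rests on assumptions not stated in this thread: that the OCR div is marked with {{class="ocr"}}, that the output is well-formed XHTML in the standard XHTML namespace, and that the sample document below (which is invented for the example) resembles what the server returns. The function name {{split_ocr_text}} is hypothetical. Check the class name against your own output before relying on it.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def split_ocr_text(xhtml: str, ocr_class: str = "ocr"):
    """Return (electronic_text, ocr_text) from an XHTML string.

    Assumption: OCR output is wrapped in <div class="ocr">; pass a
    different ocr_class if your output uses another marker.
    """
    root = ET.fromstring(xhtml)
    electronic, ocr = [], []

    def walk(elem, in_ocr):
        # A div carrying the OCR class marks everything beneath it as OCR output.
        classes = elem.get("class", "").split()
        if elem.tag == XHTML_NS + "div" and ocr_class in classes:
            in_ocr = True
        bucket = ocr if in_ocr else electronic
        if elem.text and elem.text.strip():
            bucket.append(elem.text.strip())
        for child in elem:
            walk(child, in_ocr)
            # Tail text follows the child but belongs to this element.
            if child.tail and child.tail.strip():
                bucket.append(child.tail.strip())

    # Start at <body> so that <head> content (e.g. the title) is skipped.
    body = root.find(XHTML_NS + "body")
    walk(body if body is not None else root, False)
    return "\n".join(electronic), "\n".join(ocr)


# Invented sample that mimics the assumed structure; not real server output.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title></head>
<body><div class="page"><p>There is some text</p>
<div class="ocr">There is some text
There is an image!!</div>
</div></body></html>"""

if __name__ == "__main__":
    text, ocr_text = split_ocr_text(SAMPLE)
    print("electronic:", text)
    print("ocr:", ocr_text)
```

With the two parts separated, a caller could keep the electronic text as-is and drop OCR lines that merely repeat it, which is one way to work around the duplication described in the issue.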