[ https://issues.apache.org/jira/browse/TIKA-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305682#comment-15305682 ]
Tim Allison commented on TIKA-1990:
-----------------------------------

Will take a look on Tuesday. Thank you for opening this issue and attaching an example file.

> Broken .jpg inline image from .pdf files
> ----------------------------------------
>
>                 Key: TIKA-1990
>                 URL: https://issues.apache.org/jira/browse/TIKA-1990
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Kukushkin Alexander
>            Assignee: Tim Allison
>         Attachments: cv-1.jpg, cv.pdf, image0.jpg
>
>
> Hello,
> I am using tika-server-1.13.jar. I run it like this:
> "java -jar tika-server-1.13.jar --host=localhost --port=9998".
> To be able to extract inline images from PDF files, I changed
> "org/apache/tika/parser/pdf/PDFParser.properties" inside the jar and set
> "extractInlineImages" to "true". Everything works perfectly except one thing:
> images with a .jpg extension extracted from .pdf files come out broken.
> Images with a .jpeg or .png extension are extracted fine, and .jpg images
> from .doc, .docx and .rtf files are extracted fine. The problem seems to
> appear only for .jpg images in .pdf files.
> An example PDF document is attached. To extract images I run:
> "curl -T cv.pdf -H "Accept: application/zip" http://localhost:9998/unpack > cv.zip".
> Inside cv.zip there is a broken image0.jpg.
> At the same time, if I use pdfbox-app-2.0.1.jar and run
> "java -jar pdfbox-app-2.0.1.jar ExtractImages cv.pdf", I get a correct image, cv-1.jpg.
> Why does it work like this?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
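
For anyone trying to reproduce the report, a quick first check is whether each .jpg entry in the archive returned by /unpack even has the standard JPEG start-of-image (FF D8) and end-of-image (FF D9) markers. The sketch below is not part of the original report or of Tika; the function names `jpeg_marker_check` and `check_zip` are hypothetical, and it assumes the archive was saved as cv.zip as in the curl command above. It is only a heuristic: a file can pass this check and still fail to decode, and some valid JPEGs carry trailing bytes after the end marker.

```python
import zipfile

SOI = b"\xff\xd8"  # JPEG start-of-image marker
EOI = b"\xff\xd9"  # JPEG end-of-image marker


def jpeg_marker_check(data: bytes) -> dict:
    """Report the size of the data and whether it starts with SOI and ends with EOI."""
    return {
        "size": len(data),
        "starts_with_soi": data.startswith(SOI),
        "ends_with_eoi": data.endswith(EOI),
    }


def check_zip(path: str) -> dict:
    """Run the marker check on every .jpg/.jpeg entry in a zip archive."""
    results = {}
    with zipfile.ZipFile(path) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".jpg", ".jpeg")):
                results[name] = jpeg_marker_check(archive.read(name))
    return results
```

Calling `check_zip("cv.zip")` returns one entry per extracted JPEG. Running `jpeg_marker_check` on the bytes of the PDFBox output (cv-1.jpg) as well gives a side-by-side comparison of size and markers for the two extraction paths.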