Kukushkin Alexander created TIKA-1990:
-----------------------------------------

             Summary: Broken .jpg inline image from .pdf files
                 Key: TIKA-1990
                 URL: https://issues.apache.org/jira/browse/TIKA-1990
             Project: Tika
          Issue Type: Bug
          Components: parser
            Reporter: Kukushkin Alexander


Hello,

I am using tika-server-1.13.jar, started with "java -jar 
tika-server-1.13.jar --host=localhost --port=9998". To extract inline images 
from PDF files, I edited "org/apache/tika/parser/pdf/PDFParser.properties" 
inside the jar and set "extractInlineImages" to "true". Everything works 
except one thing: images with a .jpg extension extracted from .pdf files come 
out broken. Images with .jpeg or .png extensions are extracted fine, and .jpg 
images from .doc, .docx and .rtf files are also fine. The problem seems to 
occur only for .jpg images in .pdf files.

Here is an example PDF document: https://yadi.sk/i/hUkjQg-as5LhB . To extract 
the images I run "curl -T cv.pdf -H "Accept: application/zip" 
http://localhost:9998/unpack > cv.zip". The resulting cv.zip contains a broken 
image0.jpg: https://yadi.sk/d/CUotGmHVs5LoK .
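
To narrow down what "broken" means, a coarse check of the extracted files can 
help. The sketch below assumes only the general fact that a JPEG stream starts 
with the SOI marker (FF D8) and normally ends with the EOI marker (FF D9). The 
function name check_jpegs is hypothetical, and passing this check does not 
prove an image is valid; it only shows whether the bytes look like a JPEG at 
all.

```python
import zipfile

JPEG_SOI = b"\xff\xd8"  # start-of-image marker
JPEG_EOI = b"\xff\xd9"  # end-of-image marker


def check_jpegs(zip_path):
    """Map each .jpg/.jpeg entry in the zip to (starts_with_soi, ends_with_eoi).

    Hypothetical helper for a first look at the /unpack output; it does not
    decode the image data.
    """
    results = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".jpg", ".jpeg")):
                data = archive.read(name)
                results[name] = (
                    data.startswith(JPEG_SOI),
                    data.endswith(JPEG_EOI),
                )
    return results
```

Running check_jpegs("cv.zip") on the archive produced by the curl command 
above would show whether image0.jpg at least begins and ends like a JPEG 
file, which may help tell a wrong file extension apart from damaged data.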

By contrast, if I run pdfbox-app-2.0.1.jar with "java -jar 
pdfbox-app-2.0.1.jar ExtractImages cv.pdf", I get a correct image, cv-1.jpg: 
https://yadi.sk/i/4wGTjCeXs5LvQ

Why does this happen?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
