Kukushkin Alexander created TIKA-1990:
-----------------------------------------
Summary: Broken .jpg inline image from .pdf files
Key: TIKA-1990
URL: https://issues.apache.org/jira/browse/TIKA-1990
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Kukushkin Alexander
Hello,
I am using tika-server-1.13.jar, which I start with "java -jar
tika-server-1.13.jar --host=localhost --port=9998". To be able to extract
inline images from PDF files, I edited
"org/apache/tika/parser/pdf/PDFParser.properties" inside the jar and set
"extractInlineImages" to "true". Everything works perfectly except for one
thing: extracted images that have a .jpg extension are broken when they come
from .pdf files. Images with .jpeg and .png extensions are extracted fine, and
.jpg images from .doc, .docx and .rtf files are extracted fine as well. The
problem seems to appear only with .jpg images from .pdf files.
Here is an example PDF document: https://yadi.sk/i/hUkjQg-as5LhB
To extract the images I run "curl -T cv.pdf -H "Accept: application/zip"
http://localhost:9998/unpack > cv.zip". The resulting cv.zip contains a broken
image0.jpg: https://yadi.sk/d/CUotGmHVs5LoK
By contrast, if I run pdfbox-app-2.0.1.jar on the same file with "java -jar
pdfbox-app-2.0.1.jar ExtractImages cv.pdf", I get a correct image, cv-1.jpg:
https://yadi.sk/i/4wGTjCeXs5LvQ
Why does this happen?
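
To narrow down what "broken" means for image0.jpg, one option is to look at
the leading bytes of every entry in cv.zip: a JPEG stream starts with the
bytes FF D8, and a PNG file starts with the 8-byte signature 89 50 4E 47 0D 0A
1A 0A. The following is only a minimal diagnostic sketch, not part of Tika or
PDFBox; the helper names classify_image and check_zip are made up for this
illustration, and the check says nothing about whether the rest of the image
data decodes correctly.

```python
import zipfile

# Leading "magic" bytes of the two image formats mentioned in this report.
JPEG_SOI = b"\xff\xd8"               # JPEG start-of-image marker
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # fixed 8-byte PNG signature


def classify_image(data: bytes) -> str:
    """Return a rough label based only on the leading bytes of the data.

    Hypothetical helper for illustration; it does not decode the image.
    """
    if data.startswith(JPEG_SOI):
        return "jpeg"
    if data.startswith(PNG_SIGNATURE):
        return "png"
    return "unknown"


def check_zip(zip_source) -> dict:
    """Map each entry name in a zip archive to its classify_image() label.

    zip_source may be a path such as "cv.zip" or a binary file-like object.
    """
    results = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            results[name] = classify_image(archive.read(name))
    return results
```

Calling check_zip("cv.zip") on the archive returned by /unpack, and
classify_image() on the bytes of the cv-1.jpg written by PDFBox, would show
whether the two files already differ in their first bytes or only further
into the data.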