[ https://issues.apache.org/jira/browse/TIKA-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kukushkin Alexander updated TIKA-1990:
--------------------------------------
Description:
Hello,
I am using tika-server-1.13.jar, which I start like this:
    java -jar tika-server-1.13.jar --host=localhost --port=9998
To be able to extract inline images from PDF files, I edited
"org/apache/tika/parser/pdf/PDFParser.properties" inside the jar and set
"extractInlineImages" to "true". Everything works except for one thing:
.jpg images extracted from PDF files come out broken. Images with a .jpeg or
.png extension are extracted correctly, and .jpg images from .doc, .docx and
.rtf files are also extracted correctly. The problem seems to appear only with
.jpg images in PDF files.
An example PDF document (cv.pdf) is attached. To extract the images I run:
    curl -T cv.pdf -H "Accept: application/zip" http://localhost:9998/unpack > cv.zip
The resulting cv.zip contains a broken image0.jpg.
By contrast, if I use pdfbox-app-2.0.1.jar and run:
    java -jar pdfbox-app-2.0.1.jar ExtractImages cv.pdf
I get a correct image, cv-1.jpg.
Why does this happen?
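
For anyone triaging this, a quick structural check may help narrow down what
"broken" means here. The sketch below is not part of Tika or PDFBox; it is a
hypothetical helper (the name check_jpegs_in_zip is made up for illustration)
that opens the archive returned by /unpack and reports whether each .jpg or
.jpeg entry starts with the JPEG start-of-image marker (FF D8) and ends with
the end-of-image marker (FF D9). It assumes the breakage is visible in those
markers, which may not be the case: a file can pass this check and still fail
to render.

```python
import io
import zipfile

JPEG_SOI = b"\xff\xd8"  # JPEG start-of-image marker
JPEG_EOI = b"\xff\xd9"  # JPEG end-of-image marker


def check_jpegs_in_zip(source):
    """Map each .jpg/.jpeg entry name to (starts_with_SOI, ends_with_EOI).

    `source` is a path to a zip file or a binary file-like object.
    This is only a marker check, not a full JPEG decode.
    """
    results = {}
    with zipfile.ZipFile(source) as archive:
        for name in archive.namelist():
            # Skip entries that are not named like JPEG files.
            if not name.lower().endswith((".jpg", ".jpeg")):
                continue
            data = archive.read(name)
            results[name] = (data.startswith(JPEG_SOI), data.endswith(JPEG_EOI))
    return results
```

For example, check_jpegs_in_zip("cv.zip") returns a dict keyed by entry name.
An entry such as image0.jpg mapping to (False, False) would suggest the bytes
are not a JPEG file at all, rather than a JPEG that viewers reject for some
other reason.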
was:
Hello,
I am using tika-server-1.13.jar, which I start like this:
    java -jar tika-server-1.13.jar --host=localhost --port=9998
To be able to extract inline images from PDF files, I edited
"org/apache/tika/parser/pdf/PDFParser.properties" inside the jar and set
"extractInlineImages" to "true". Everything works except for one thing:
.jpg images extracted from PDF files come out broken. Images with a .jpeg or
.png extension are extracted correctly, and .jpg images from .doc, .docx and
.rtf files are also extracted correctly. The problem seems to appear only with
.jpg images in PDF files.
Here is an example PDF document: https://yadi.sk/i/hUkjQg-as5LhB . To extract
the images I run:
    curl -T cv.pdf -H "Accept: application/zip" http://localhost:9998/unpack > cv.zip
The resulting cv.zip contains a broken image0.jpg:
https://yadi.sk/d/CUotGmHVs5LoK .
By contrast, if I use pdfbox-app-2.0.1.jar and run:
    java -jar pdfbox-app-2.0.1.jar ExtractImages cv.pdf
I get a correct image, cv-1.jpg: https://yadi.sk/i/4wGTjCeXs5LvQ
Why does this happen?
> Broken .jpg inline image from .pdf files
> ----------------------------------------
>
> Key: TIKA-1990
> URL: https://issues.apache.org/jira/browse/TIKA-1990
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Kukushkin Alexander
> Attachments: cv-1.jpg, cv.pdf, image0.jpg