[jira] [Updated] (TIKA-1990) Broken .jpg inline image from .pdf files

Kukushkin Alexander (JIRA) Sat, 28 May 2016 05:00:33 -0700

     [ 
https://issues.apache.org/jira/browse/TIKA-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Kukushkin Alexander updated TIKA-1990:
--------------------------------------
    Description: 
Hello,

I am using tika-server-1.13.jar, started with "java -jar 
tika-server-1.13.jar --host=localhost --port=9998". To extract inline images 
from PDF files, I edited "org/apache/tika/parser/pdf/PDFParser.properties" 
inside the jar and set "extractInlineImages" to "true". Everything works 
except one thing: .jpg images extracted from .pdf files come out broken. 
Images with .jpeg and .png extensions are extracted fine, and .jpg images 
from .doc, .docx and .rtf files are also extracted fine. The problem seems to 
occur only with .jpg images inside .pdf files.
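
For anyone reproducing this, the jar edit described above can be scripted. 
The following is a minimal Python sketch, not the method used in this report. 
The helper names set_property and patch_jar are hypothetical, and the sketch 
assumes the properties file holds plain key/value lines separated by 
whitespace, "=" or ":".

```python
import re
import zipfile

# Path of the properties file inside the jar, as named in this report.
PROPS_ENTRY = "org/apache/tika/parser/pdf/PDFParser.properties"


def set_property(text, key, value):
    """Return properties text with `key` set to `value` (appended if absent)."""
    # Accept "key value", "key=value" and "key: value" style lines.
    pattern = re.compile(r"^(\s*" + re.escape(key) + r")(\s*[=:]\s*|\s+)(.*)$")
    lines = text.splitlines()
    found = False
    for i, line in enumerate(lines):
        match = pattern.match(line)
        if match:
            # Keep the original key and separator, swap only the value.
            lines[i] = match.group(1) + match.group(2) + value
            found = True
    if not found:
        lines.append(key + "=" + value)
    return "\n".join(lines) + "\n"


def patch_jar(src_jar, dst_jar, key="extractInlineImages", value="true"):
    """Copy src_jar to dst_jar, rewriting one property in PROPS_ENTRY."""
    with zipfile.ZipFile(src_jar) as src, zipfile.ZipFile(dst_jar, "w") as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == PROPS_ENTRY:
                text = data.decode("iso-8859-1")
                data = set_property(text, key, value).encode("iso-8859-1")
            # Reusing the ZipInfo keeps each entry's name and compression type.
            dst.writestr(item, data)
```

A call such as patch_jar("tika-server-1.13.jar", "tika-server-patched.jar") 
would write a modified copy and leave the original jar untouched. The output 
file name here is only an example.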

An example PDF document is attached. To extract the images I run 
"curl -T cv.pdf -H "Accept: application/zip" http://localhost:9998/unpack > 
cv.zip". The resulting cv.zip contains a broken image0.jpg.
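
One rough way to check the extracted files without opening each one is to 
look at the JPEG start and end markers. The Python sketch below is only a 
heuristic and is not part of Tika. The function names looks_like_jpeg and 
check_jpegs are hypothetical. A failed check is a strong hint that the file is 
damaged, but a passed check does not prove that the image decodes.

```python
import zipfile

JPEG_SOI = b"\xff\xd8"  # start-of-image marker
JPEG_EOI = b"\xff\xd9"  # end-of-image marker


def looks_like_jpeg(data):
    """Heuristic: the data starts with SOI and ends with EOI.

    Some valid files carry trailing bytes after EOI and would be
    reported as suspicious by this strict check.
    """
    return len(data) >= 4 and data[:2] == JPEG_SOI and data[-2:] == JPEG_EOI


def check_jpegs(zip_source):
    """Map each .jpg/.jpeg entry name in the archive to the heuristic result."""
    results = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".jpg", ".jpeg")):
                results[name] = looks_like_jpeg(archive.read(name))
    return results
```

For example, check_jpegs("cv.zip") returns a dictionary keyed by entry name, 
with False for any entry that fails the marker check.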

By contrast, if I use pdfbox-app-2.0.1.jar and run "java -jar 
pdfbox-app-2.0.1.jar ExtractImages cv.pdf", I get a correct image, cv-1.jpg.

Why does Tika produce a broken image when PDFBox does not?

  was:
Hello,

I am using tika-server-1.13.jar, started with "java -jar 
tika-server-1.13.jar --host=localhost --port=9998". To extract inline images 
from PDF files, I edited "org/apache/tika/parser/pdf/PDFParser.properties" 
inside the jar and set "extractInlineImages" to "true". Everything works 
except one thing: .jpg images extracted from .pdf files come out broken. 
Images with .jpeg and .png extensions are extracted fine, and .jpg images 
from .doc, .docx and .rtf files are also extracted fine. The problem seems to 
occur only with .jpg images inside .pdf files.

Here is an example PDF document: https://yadi.sk/i/hUkjQg-as5LhB . To extract 
the images I run "curl -T cv.pdf -H "Accept: application/zip" 
http://localhost:9998/unpack > cv.zip". The resulting cv.zip contains a broken 
image0.jpg: https://yadi.sk/d/CUotGmHVs5LoK .

By contrast, if I use pdfbox-app-2.0.1.jar and run "java -jar 
pdfbox-app-2.0.1.jar ExtractImages cv.pdf", I get a correct image, cv-1.jpg: 
https://yadi.sk/i/4wGTjCeXs5LvQ

Why does this happen?


> Broken .jpg inline image from .pdf files
> ----------------------------------------
>
>                 Key: TIKA-1990
>                 URL: https://issues.apache.org/jira/browse/TIKA-1990
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Kukushkin Alexander
>         Attachments: cv-1.jpg, cv.pdf, image0.jpg



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
