[jira] [Updated] (TIKA-1990) Broken .jpg inline image from .pdf files

Kukushkin Alexander (JIRA) Sat, 28 May 2016 04:59:19 -0700

     [ 
https://issues.apache.org/jira/browse/TIKA-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Kukushkin Alexander updated TIKA-1990:
--------------------------------------
    Attachment: image0.jpg
                cv.pdf
                cv-1.jpg

> Broken .jpg inline image from .pdf files
> ----------------------------------------
>
>                 Key: TIKA-1990
>                 URL: https://issues.apache.org/jira/browse/TIKA-1990
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Kukushkin Alexander
>         Attachments: cv-1.jpg, cv.pdf, image0.jpg
>
>
> Hello,
> I am using tika-server-1.13.jar, which I run with "java -jar 
> tika-server-1.13.jar --host=localhost --port=9998". To extract inline 
> images from PDF files, I edited 
> "org/apache/tika/parser/pdf/PDFParser.properties" inside the jar and set 
> "extractInlineImages" to "true". Everything works except for one thing: 
> images with a .jpg extension extracted from .pdf files come out broken. 
> Images with a .jpeg or .png extension are extracted fine, and .jpg images 
> from .doc, .docx and .rtf files are extracted fine. The problem seems to 
> appear only for .jpg images in .pdf files.
> Here is an example PDF document: https://yadi.sk/i/hUkjQg-as5LhB . To 
> extract the images I run "curl -T cv.pdf -H "Accept: application/zip" 
> http://localhost:9998/unpack > cv.zip". Inside cv.zip there is a broken 
> image0.jpg: https://yadi.sk/d/CUotGmHVs5LoK .
> By contrast, if I use pdfbox-app-2.0.1.jar and run "java 
> -jar pdfbox-app-2.0.1.jar ExtractImages cv.pdf", I get a correct image, 
> cv-1.jpg: https://yadi.sk/i/4wGTjCeXs5LvQ
> Why does it happen?
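
One way to narrow down what "broken" means for the extracted file is to check
whether each .jpg in the archive returned by /unpack is framed like a JPEG
stream at all. The sketch below is only a coarse diagnostic, not part of Tika
or PDFBox: the function names check_jpeg_bytes and check_zip are hypothetical,
and it only looks at the standard JPEG start-of-image (FF D8) and end-of-image
(FF D9) markers. A file that passes may still render incorrectly (for example
with wrong colors), so treat "ok" as "plausibly a JPEG", nothing more.

```python
import io
import zipfile

JPEG_SOI = b"\xff\xd8"  # JPEG start-of-image marker
JPEG_EOI = b"\xff\xd9"  # JPEG end-of-image marker


def check_jpeg_bytes(data: bytes) -> str:
    """Classify raw bytes by their JPEG framing markers only (hypothetical helper)."""
    if not data.startswith(JPEG_SOI):
        return "missing SOI marker"
    if not data.endswith(JPEG_EOI):
        return "missing EOI marker"
    return "ok"


def check_zip(path_or_file) -> dict:
    """Map each .jpg/.jpeg entry of a zip archive to its framing status."""
    results = {}
    with zipfile.ZipFile(path_or_file) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".jpg", ".jpeg")):
                results[name] = check_jpeg_bytes(archive.read(name))
    return results
```

Assuming the archive from the curl command above was saved as cv.zip, calling
check_zip("cv.zip") would return a dictionary keyed by entry name (such as
image0.jpg). Running check_jpeg_bytes on the bytes of the cv-1.jpg produced by
PDFBox gives a reference result to compare against.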



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
