[ https://issues.apache.org/jira/browse/TIKA-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15307782#comment-15307782 ]
Hudson commented on TIKA-1990:
------------------------------
FAILURE: Integrated in tika-2.x-windows #9 (See [https://builds.apache.org/job/tika-2.x-windows/9/])
TIKA-1990 -- need to add JPEG filters to embedded stream when handling
(tallison: rev e05dd5bf4145c0e8bbfd585d05a8a4c26d83e2ce)
* tika-app/src/test/java/org/apache/tika/parser/AutoDetectParserTest.java
* tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
> Broken .jpg inline image from .pdf files
> ----------------------------------------
>
> Key: TIKA-1990
> URL: https://issues.apache.org/jira/browse/TIKA-1990
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Kukushkin Alexander
> Assignee: Tim Allison
> Fix For: 2.0, 1.14
>
> Attachments: cv-1.jpg, cv.pdf, image0.jpg
>
>
> Hello,
> I am using tika-server-1.13.jar, which I run with "java -jar
> tika-server-1.13.jar --host=localhost --port=9998". To be able to extract
> inline images from PDF files, I edited
> "org/apache/tika/parser/pdf/PDFParser.properties" inside the jar and set
> "extractInlineImages" to "true". Everything works perfectly except for one
> thing: images with a .jpg extension are extracted broken from .pdf files.
> Images with .jpeg and .png extensions are extracted fine, and .jpg images
> from .doc, .docx, and .rtf files are extracted fine. The problem seems to
> appear only with .jpg images in .pdf files.
> An example PDF document is attached. To extract the images, I run
> 'curl -T cv.pdf -H "Accept: application/zip" http://localhost:9998/unpack >
> cv.zip'. Inside cv.zip there is a broken image0.jpg.
> At the same time, if I use pdfbox-app-2.0.1.jar and run "java -jar
> pdfbox-app-2.0.1.jar ExtractImages cv.pdf", I get the correct image, cv-1.jpg.
> Why does it work like this?
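
For anyone comparing the two extracted files (image0.jpg from cv.zip and cv-1.jpg from PDFBox), a rough first check is whether each file carries the standard JPEG start-of-image (FF D8) and end-of-image (FF D9) markers. The sketch below is only a heuristic, not a diagnosis of this issue and not part of Tika or PDFBox: the class name `JpegMarkerCheck` and its method are hypothetical, it assumes nothing trails the end-of-image marker, and a file that passes may still fail to decode.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tika or PDFBox.
final class JpegMarkerCheck {

    // Returns true if the data starts with the JPEG SOI marker (FF D8)
    // and ends with the EOI marker (FF D9). This is a heuristic only:
    // it does not decode the image, and it assumes no trailing bytes.
    static boolean looksLikeJpeg(byte[] data) {
        if (data == null || data.length < 4) {
            return false;
        }
        int last = data.length - 1;
        boolean startsWithSoi = (data[0] & 0xFF) == 0xFF && (data[1] & 0xFF) == 0xD8;
        boolean endsWithEoi = (data[last - 1] & 0xFF) == 0xFF && (data[last] & 0xFF) == 0xD9;
        return startsWithSoi && endsWithEoi;
    }

    // Prints one line per file named on the command line.
    public static void main(String[] args) throws IOException {
        for (String name : args) {
            byte[] data = Files.readAllBytes(Paths.get(name));
            String verdict = looksLikeJpeg(data) ? "JPEG markers present" : "JPEG markers missing";
            System.out.println(name + ": " + verdict);
        }
    }
}
```

Running it as "java JpegMarkerCheck image0.jpg cv-1.jpg" after compiling would report, per file, whether both markers were found.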
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)