[ https://issues.apache.org/jira/browse/PDFBOX-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr updated PDFBOX-3037:
------------------------------------
Attachment: 001131.pdf
> Text extraction decodes image files
> -----------------------------------
>
> Key: PDFBOX-3037
> URL: https://issues.apache.org/jira/browse/PDFBOX-3037
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.0
> Reporter: Tilman Hausherr
> Assignee: Tilman Hausherr
> Fix For: 2.0.0
>
> Attachments: 001131.pdf
>
>
> I get this with text extraction of file 001131.pdf:
> {code}
> java.io.IOException: Could not read JPEG 2000 (JPX) image
> at org.apache.pdfbox.filter.JPXFilter.readJPX(JPXFilter.java:90)
> at org.apache.pdfbox.filter.JPXFilter.decode(JPXFilter.java:59)
> at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:69)
> at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:163)
> at org.apache.pdfbox.pdmodel.common.PDStream.createInputStream(PDStream.java:234)
> at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.<init>(PDImageXObject.java:145)
> at org.apache.pdfbox.pdmodel.graphics.PDXObject.createXObject(PDXObject.java:69)
> at org.apache.pdfbox.pdmodel.PDResources.getXObject(PDResources.java:342)
> at org.apache.pdfbox.contentstream.operator.DrawObject.process(DrawObject.java:50)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:819)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:476)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:448)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:155)
> {code}
> This shouldn't happen: we shouldn't even try to decode images when
> extracting text, because doing so wastes time and memory.
> The cause is this line in DrawObject:
> {code}
> PDXObject xobject = context.getResources().getXObject(name);
> {code}
> This call results in the object being created and its contents being decoded.
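One possible direction, shown as a minimal self-contained sketch rather than the actual PDFBox change: check the XObject's subtype before creating it, so image streams are never decoded during text extraction. Every name below (SkipImageSketch, XObjectEntry, extractText, decodeCount) is hypothetical and only models the idea; none of it is PDFBox API.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SkipImageSketch {

    /** Hypothetical stand-in for one entry of a page's XObject resources. */
    public static final class XObjectEntry {
        public final String subtype; // "Image" or "Form", mirroring the /Subtype key
        public final String text;    // text a form would contribute; null for images
        public int decodeCount = 0;  // how often the expensive decode step ran

        public XObjectEntry(String subtype, String text) {
            this.subtype = subtype;
            this.text = text;
        }

        /** Simulates the expensive step: decoding the stream contents. */
        public String decode() {
            decodeCount++;
            if ("Image".equals(subtype)) {
                // Models the failure from the report: a broken JPX stream.
                throw new IllegalStateException("Could not read JPEG 2000 (JPX) image");
            }
            return text;
        }
    }

    /**
     * Models a text-extraction "draw object" step: the subtype is inspected
     * first, and image entries are skipped without ever being decoded.
     */
    public static String extractText(Map<String, XObjectEntry> resources,
                                     List<String> drawnNames) {
        StringBuilder sb = new StringBuilder();
        for (String name : drawnNames) {
            XObjectEntry entry = resources.get(name);
            if (entry == null || "Image".equals(entry.subtype)) {
                continue; // cheap check; no decoding for images
            }
            sb.append(entry.decode());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, XObjectEntry> resources = new LinkedHashMap<>();
        XObjectEntry image = new XObjectEntry("Image", null);
        resources.put("Im0", image);
        resources.put("Fm0", new XObjectEntry("Form", "Hello from a form"));

        String text = extractText(resources, Arrays.asList("Im0", "Fm0"));
        System.out.println(text);
        System.out.println("image decodes: " + image.decodeCount);
    }
}
```

In this model the broken image never throws, because decode() is only reached for non-image entries. Whether the real fix belongs in DrawObject, in PDResources, or in making PDImageXObject decode lazily is left open here.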