Tilman Hausherr created PDFBOX-3037:
---------------------------------------
Summary: Text extraction decodes image files
Key: PDFBOX-3037
URL: https://issues.apache.org/jira/browse/PDFBOX-3037
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 2.0.0
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr
Fix For: 2.0.0
I get this with text extraction of file 001131.pdf:
{code}
java.io.IOException: Could not read JPEG 2000 (JPX) image
at org.apache.pdfbox.filter.JPXFilter.readJPX(JPXFilter.java:90)
at org.apache.pdfbox.filter.JPXFilter.decode(JPXFilter.java:59)
at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:69)
at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:163)
at org.apache.pdfbox.pdmodel.common.PDStream.createInputStream(PDStream.java:234)
at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.<init>(PDImageXObject.java:145)
at org.apache.pdfbox.pdmodel.graphics.PDXObject.createXObject(PDXObject.java:69)
at org.apache.pdfbox.pdmodel.PDResources.getXObject(PDResources.java:342)
at org.apache.pdfbox.contentstream.operator.DrawObject.process(DrawObject.java:50)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:819)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:476)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:448)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:155)
{code}
This shouldn't happen: we shouldn't even try to decode images when
extracting text, because it wastes time and memory.
The cause is this line in DrawObject:
{code}
PDXObject xobject = context.getResources().getXObject(name);
{code}
This call creates the object, and for an image XObject the constructor
(PDImageXObject.<init> in the trace above) decodes its contents.
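
To illustrate the idea, here is a minimal, self-contained sketch of the difference between an eager lookup and one that checks the subtype before decoding. It does not use the PDFBox classes and is not the actual fix: XObjectLookupSketch, RawXObject, decode, eagerLookup and processForText are all hypothetical names made up for this example, and an empty byte array stands in for an image that cannot be decoded.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-ins only; none of these types are PDFBox classes.
final class XObjectLookupSketch {

    /** Minimal stand-in for an XObject stream: a subtype plus its encoded bytes. */
    static final class RawXObject {
        final String subtype;   // "Image" or "Form"
        final byte[] encoded;

        RawXObject(String subtype, byte[] encoded) {
            this.subtype = subtype;
            this.encoded = encoded;
        }
    }

    /** Counts how often the expensive (and possibly failing) decode step runs. */
    static final AtomicInteger DECODE_CALLS = new AtomicInteger();

    /** Pretend decoder: fails on empty input, otherwise returns a copy of the bytes. */
    static byte[] decode(RawXObject raw) {
        DECODE_CALLS.incrementAndGet();
        if (raw.encoded.length == 0) {
            throw new UncheckedIOException(new IOException("Could not read image"));
        }
        return raw.encoded.clone();
    }

    /** Eager lookup: decodes whatever the name resolves to, images included. */
    static byte[] eagerLookup(Map<String, RawXObject> resources, String name) {
        RawXObject raw = resources.get(name);
        return raw == null ? null : decode(raw);
    }

    /**
     * Text-extraction-style lookup: inspects the subtype first and skips images,
     * so their contents are never decoded. Returns true if the object was processed.
     */
    static boolean processForText(Map<String, RawXObject> resources, String name) {
        RawXObject raw = resources.get(name);
        if (raw == null || "Image".equals(raw.subtype)) {
            return false; // nothing to extract from an image, so no decoding
        }
        decode(raw);
        return true;
    }

    public static void main(String[] args) {
        Map<String, RawXObject> resources = new HashMap<>();
        resources.put("Im0", new RawXObject("Image", new byte[0]));        // undecodable image
        resources.put("Fm0", new RawXObject("Form", new byte[] {1, 2, 3}));

        System.out.println("image handled: " + processForText(resources, "Im0"));
        System.out.println("form handled: " + processForText(resources, "Fm0"));
        System.out.println("decode calls: " + DECODE_CALLS.get());

        try {
            eagerLookup(resources, "Im0");
        } catch (UncheckedIOException e) {
            System.out.println("eager lookup failed: " + e.getCause().getMessage());
        }
    }
}
```

In this sketch the image entry is skipped without touching the decoder, while the eager path hits the same kind of failure as in the stack trace. How the real code should find out the subtype without constructing the object is left open here.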