Tilman Hausherr created PDFBOX-3037:
---------------------------------------

             Summary: Text extraction decodes image files
                 Key: PDFBOX-3037
                 URL: https://issues.apache.org/jira/browse/PDFBOX-3037
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 2.0.0
            Reporter: Tilman Hausherr
            Assignee: Tilman Hausherr
             Fix For: 2.0.0


I get this with text extraction of file 001131.pdf:
{code}
java.io.IOException: Could not read JPEG 2000 (JPX) image
        at org.apache.pdfbox.filter.JPXFilter.readJPX(JPXFilter.java:90)
        at org.apache.pdfbox.filter.JPXFilter.decode(JPXFilter.java:59)
        at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:69)
        at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:163)
        at org.apache.pdfbox.pdmodel.common.PDStream.createInputStream(PDStream.java:234)
        at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.<init>(PDImageXObject.java:145)
        at org.apache.pdfbox.pdmodel.graphics.PDXObject.createXObject(PDXObject.java:69)
        at org.apache.pdfbox.pdmodel.PDResources.getXObject(PDResources.java:342)
        at org.apache.pdfbox.contentstream.operator.DrawObject.process(DrawObject.java:50)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:819)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:476)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:448)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:155)
{code}
This shouldn't happen: we shouldn't even try to decode images when
extracting text, because doing so wastes time and memory.

The cause is this in DrawObject:
{code}
PDXObject xobject = context.getResources().getXObject(name);
{code}
This call creates the object, and creating an image XObject decodes its
contents, as the stack trace above shows.
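
To illustrate the idea, here is a minimal, self-contained sketch of a
text-extraction handler that checks the XObject subtype before creating
the object, so image streams are never decoded. This is a toy model and
not PDFBox code: LazyXObjectSketch, XObjectEntry, Resources, isImage and
drawObjectForText are all hypothetical names made up for the example,
and the "decoder" only counts how often it is called.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Toy model, not PDFBox code: every name in this file is hypothetical. */
public class LazyXObjectSketch {

    /** Stand-in for an XObject stream: a cheap subtype entry plus a costly decoder. */
    static final class XObjectEntry {
        final String subtype;            // "Image" or "Form", readable without decoding
        final Supplier<byte[]> decoder;  // simulates running the stream's filter chain

        XObjectEntry(String subtype, Supplier<byte[]> decoder) {
            this.subtype = subtype;
            this.decoder = decoder;
        }
    }

    /** Stand-in for a page's resource dictionary. */
    static final class Resources {
        private final Map<String, XObjectEntry> xobjects = new HashMap<>();
        int decodeCount = 0; // how many times a stream was decoded

        void put(String name, String subtype) {
            xobjects.put(name, new XObjectEntry(subtype, () -> {
                decodeCount++;
                return new byte[0];
            }));
        }

        /** Cheap check: looks only at the subtype entry, decodes nothing. */
        boolean isImage(String name) {
            XObjectEntry entry = xobjects.get(name);
            return entry != null && "Image".equals(entry.subtype);
        }

        /** Expensive path: creating the object decodes its stream. */
        byte[] getXObject(String name) {
            XObjectEntry entry = xobjects.get(name);
            return entry == null ? null : entry.decoder.get();
        }
    }

    /**
     * Handler for the "Do" operator during text extraction.
     * Returns true if the object was created and processed, false if skipped.
     */
    static boolean drawObjectForText(Resources resources, String name) {
        if (resources.isImage(name)) {
            return false; // images carry no text, so skip before anything is decoded
        }
        return resources.getXObject(name) != null; // forms may contain text
    }
}
```

The design point is only the ordering: the subtype test comes first, and
the expensive object creation happens only for objects that can contain
text.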


