[ 
https://issues.apache.org/jira/browse/PDFBOX-1299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated PDFBOX-1299:
---------------------------------------

    Attachment: TX0819_2009-07-27_Windstream-TCG_Agreement.pdf

Test PDF showing the problem.  I got the PDF from
http://acrobatusers.com/gallery/pdf_portfolio_gallery, specifically
http://acrobatusers.com/assets/uploads/gallery/Tracey_Prather_31-Dec-2010_211843_2011Portfolio.pdf

In this PDF, at offset=446726, we have a "4 0 obj" stream, with
Length=368286.

If you skip ahead by that length, the next object is "5 0 obj".

But, unfortunately, within those bytes is an "endstream" on its own
line, just before offset=714247 (this "belongs" to the embedded PDF),
and that causes readUntilEndOfStream to stop too early, leading to
this IOException when running ExtractText (on current trunk):

{noformat}
Exception in thread "main" java.io.IOException: Unknown dir object c=']' cInt=93 peek=']' peekInt=93 757109
        at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1215)
        at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:216)
        at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:342)
        at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1117)
        at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:557)
        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1090)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1055)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:980)
        at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:196)
        at org.apache.pdfbox.ExtractText.main(ExtractText.java:76)
{noformat}
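
To make the failure mode concrete, here is a minimal Java sketch of a scan that stops at the first "endstream" it sees. It is not PDFBox's actual code: the class and method names are hypothetical and the bytes are synthetic. When the outer stream holds an embedded PDF fragment, the scan reports fewer bytes than the declared Length:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; this is not PDFBox's BaseParser.
class NaiveEndstreamScan {
    private static final byte[] ENDSTREAM =
            "endstream".getBytes(StandardCharsets.ISO_8859_1);

    /** Index of the first "endstream" at or after 'from', or -1 if absent. */
    static int indexOfEndstream(byte[] data, int from) {
        for (int i = Math.max(from, 0); i + ENDSTREAM.length <= data.length; i++) {
            int j = 0;
            while (j < ENDSTREAM.length && data[i + j] == ENDSTREAM[j]) {
                j++;
            }
            if (j == ENDSTREAM.length) {
                return i;
            }
        }
        return -1;
    }

    /** Number of payload bytes a scan-only reader would copy, or -1 if no keyword is found. */
    static int scannedLength(byte[] data, int payloadStart) {
        int end = indexOfEndstream(data, payloadStart);
        return end < 0 ? -1 : end - payloadStart;
    }

    public static void main(String[] args) {
        // Synthetic embedded PDF fragment that contains its own "endstream".
        String inner = "1 0 obj\n<< /Length 3 >>\nstream\nabc\nendstream\nendobj\n";
        String header = "4 0 obj\n<< /Length " + inner.length() + " >>\nstream\n";
        byte[] data = (header + inner + "\nendstream\nendobj\n")
                .getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("declared Length: " + inner.length());
        System.out.println("scanned length:  " + scannedLength(data, header.length()));
    }
}
```

The parser then resumes in the middle of the outer stream's data, which is consistent with the "Unknown dir object" error above.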

                
> BaseParser.readUntilEndOfStream can stop too early, causing IOException on valid PDFs
> -------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1299
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1299
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 1.6.0
>            Reporter: Michael McCandless
>         Attachments: TX0819_2009-07-27_Windstream-TCG_Agreement.pdf
>
>
> The purpose of BaseParser.readUntilEndOfStream is to scan ahead,
> copying bytes to the output, and stop once it sees "endstream".
>
> The problem with this approach is that the stream data itself sometimes
> contains "endstream", causing readUntilEndOfStream to stop too early.
> This can legitimately happen when the stream is an embedded PDF; I'll
> attach a test PDF showing this.
>
> However, the stream dict declares the stream length (in bytes), so
> it seems we should respect that length (if present) and simply copy
> that many bytes, instead of scanning the stream bytes for "endstream".
> This should be a lot faster, too.
>
> I imagine we always scan so that we are more robust if the length is
> missing or invalid?  Is that why this approach was used?  (I don't know
> the history here.)  If so, maybe we can add an option to use the
> declared stream length if present.
>
> I have a patch that uses the declared stream length (if present), and
> it enables at least this test PDF to parse correctly.
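
The proposal in the description could be sketched roughly as follows. This is a hedged illustration, not the patch attached to the issue and not PDFBox's API: the class name, method names, and the convention of passing a negative length for "no usable Length" are all assumptions. It copies the declared number of bytes when "endstream" actually follows them (after an optional end-of-line), and otherwise falls back to scanning:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch; not the patch attached to this issue and not PDFBox's API.
class LengthAwareStreamReader {
    private static final byte[] ENDSTREAM =
            "endstream".getBytes(StandardCharsets.ISO_8859_1);

    /**
     * Returns the stream payload starting at 'start'. Pass a negative
     * declaredLength when the stream dictionary has no usable Length entry.
     */
    static byte[] readStream(byte[] data, int start, long declaredLength) {
        if (declaredLength >= 0 && start + declaredLength <= data.length) {
            int end = (int) (start + declaredLength);
            // Trust the declared length only if "endstream" really follows it.
            if (endstreamFollows(data, end)) {
                return Arrays.copyOfRange(data, start, end);
            }
        }
        // Fallback: scan for the first "endstream" (may include a trailing EOL).
        int end = indexOf(data, ENDSTREAM, start);
        return Arrays.copyOfRange(data, start, end < 0 ? data.length : end);
    }

    /** True if "endstream" appears at 'pos' after skipping CR/LF bytes. */
    static boolean endstreamFollows(byte[] data, int pos) {
        while (pos < data.length && (data[pos] == '\r' || data[pos] == '\n')) {
            pos++;
        }
        return matchesAt(data, ENDSTREAM, pos);
    }

    /** True if 'pattern' occurs in 'data' exactly at index 'pos'. */
    static boolean matchesAt(byte[] data, byte[] pattern, int pos) {
        if (pos < 0 || pos + pattern.length > data.length) {
            return false;
        }
        for (int j = 0; j < pattern.length; j++) {
            if (data[pos + j] != pattern[j]) {
                return false;
            }
        }
        return true;
    }

    /** Index of the first occurrence of 'pattern' at or after 'from', or -1. */
    static int indexOf(byte[] data, byte[] pattern, int from) {
        for (int i = Math.max(from, 0); i + pattern.length <= data.length; i++) {
            if (matchesAt(data, pattern, i)) {
                return i;
            }
        }
        return -1;
    }
}
```

Checking that "endstream" follows the declared length keeps the scan as a safety net for files whose Length is missing or wrong, which is the robustness concern raised in the description.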


        
