[ https://issues.apache.org/jira/browse/PDFBOX-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009197#comment-13009197 ]

Adam Nichols commented on PDFBOX-978:
-------------------------------------

Fixed in revision 1083858.  Thanks again.

> unreading of trailing content after 'endobj' is missing new line byte (fix included)
> ------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-978
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-978
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.6.0
>            Reporter: Timo Boehme
>            Assignee: Adam Nichols
>             Fix For: 1.6.0
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> I have several journal PDFs where the last xref section starts like this:
> endobj xref
> 0 92
> 0000000000 65535 f
> 0000000044 00000 n
> In these cases the PDF parser reads the endobj line completely and unreads " xref".
> However, the newline (in this case ^D) is lost. This is already documented in the
> method readline() within PDFParser:
> "Note: if you later unread the results of this function, you'll
> need to add a newline character to the end of the string."
> Currently I get an error like "expected='obj' actual='655'" because the 'xref' is read as 'xref0'.
> The fix:
> In PDFParser, insert the following lines before line 579 (the unreading of trailing characters after 'endobj'):
> // add a space first in place of the newline consumed by readline()
> pdfSource.unread( SPACE_BYTE );
> thus we get:
>                 if (endObjectKey.startsWith( "endobj" ) ) 
>                 {
>                     /*
>                      * Some PDF files don't contain a new line after endobj so we
>                      * need to make sure that the next object number is getting read separately
>                      * and not part of the endobj keyword. Ex. Some files would have "endobj28"
>                      * instead of "endobj"
>                      */
>                     // add a space first in place of the newline consumed by readline()
>                     pdfSource.unread( SPACE_BYTE );
>                     pdfSource.unread( endObjectKey.substring( 6 ).getBytes("ISO-8859-1") );
>                 } 
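
The effect described in the report can be reproduced outside PDFBox with a plain java.io.PushbackInputStream. The following is only a minimal sketch, not the PDFBox code: the class UnreadDemo and the helpers readLine, readToken and tokenAfterEndobj are hypothetical names made up for this illustration, and the sample input assumes a '\n' line ending. It shows that unreading the text after "endobj" without a separator glues it to the first byte of the next line, while unreading a space first keeps the tokens apart.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

public class UnreadDemo {

    // Hypothetical stand-in for the parser's line reader:
    // returns the line content and consumes the trailing '\n'.
    static String readLine(PushbackInputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && c != '\n') {
            sb.append((char) c);
        }
        return sb.toString();
    }

    // Skips leading whitespace, then reads one whitespace-delimited token.
    static String readToken(PushbackInputStream in) throws IOException {
        int c = in.read();
        while (c == ' ' || c == '\n') {
            c = in.read();
        }
        StringBuilder sb = new StringBuilder();
        while (c != -1 && c != ' ' && c != '\n') {
            sb.append((char) c);
            c = in.read();
        }
        return sb.toString();
    }

    // Reads the "endobj xref" line, unreads everything after "endobj",
    // and returns the next token the stream then yields.
    static String tokenAfterEndobj(boolean restoreSeparator) throws IOException {
        byte[] data = "endobj xref\n0 92\n".getBytes(StandardCharsets.ISO_8859_1);
        PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(data), 64);

        String line = readLine(in); // "endobj xref"; the newline is gone
        if (restoreSeparator) {
            // Pushed back first, so it ends up AFTER the bytes unread below.
            in.unread(' ');
        }
        in.unread(line.substring(6).getBytes(StandardCharsets.ISO_8859_1));
        return readToken(in);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(tokenAfterEndobj(false)); // prints "xref0"
        System.out.println(tokenAfterEndobj(true));  // prints "xref"
    }
}
```

Because bytes pushed back later are read earlier, the separator has to be unread before the trailing text, which is the same order used in the fix quoted above.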

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
