[ 
https://issues.apache.org/jira/browse/PDFBOX-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009055#comment-13009055
 ] 

Timo Boehme commented on PDFBOX-978:
------------------------------------

The patch was applied to a different code block than the one I intended. The
patched region is fine, but the problem described in my report persists.
The patch should therefore also be applied two blocks above, within the block
starting with
if (endObjectKey.startsWith( "endobj" ) ) 

> unreading of trailing content after 'endobj' is missing new line byte (fix 
> included)
> ------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-978
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-978
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.6.0
>            Reporter: Timo Boehme
>            Assignee: Adam Nichols
>             Fix For: 1.6.0
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> I have several journal PDFs where the last xref section starts like
> endobj xref
> 0 92
> 0000000000 65535 f
> 0000000044 00000 n
> In these cases the PDF parser reads the endobj line completely and unreads
> " xref". However, the newline (in this case ^D) is lost. This is already
> documented in the method readline() within PDFParser:
> "Note: if you later unread the results of this function, you'll
> need to add a newline character to the end of the string."
> Currently I get an error like: "expected='obj' actual='655'" because the
> 'xref' is read as 'xref0'.
> The fix:
> In PDFParser, insert before line 579 (the unreading of trailing characters
> after 'endobj') the lines:
> // add a space first in place of the newline consumed by readline()
> pdfSource.unread( SPACE_BYTE );
> thus we get:
>                 if (endObjectKey.startsWith( "endobj" ) )
>                 {
>                     /*
>                      * Some PDF files don't contain a new line after endobj so we
>                      * need to make sure that the next object number is getting read separately
>                      * and not part of the endobj keyword. Ex. Some files would have "endobj28"
>                      * instead of "endobj"
>                      */
>                     // add a space first in place of the newline consumed by readline()
>                     pdfSource.unread( SPACE_BYTE );
>                     pdfSource.unread( endObjectKey.substring( 6 ).getBytes("ISO-8859-1") );
>                 }
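
The effect described above can be reproduced outside PDFBox with a plain
java.io.PushbackInputStream. The following is only a minimal sketch: the class
UnreadDemo and its helpers readLine, readToken and nextTokenAfterEndobj are
hypothetical stand-ins written for illustration, not the actual PDFParser code.
It assumes a line reader that consumes the end-of-line byte, as the quoted
note in PDFParser says.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it only mimics the behaviour described in the report.
public class UnreadDemo {

    // Reads up to the next CR or LF and consumes that end-of-line byte.
    static String readLine(PushbackInputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && c != '\n' && c != '\r') {
            sb.append((char) c);
        }
        return sb.toString();
    }

    // Skips leading whitespace, then reads one whitespace-delimited token.
    static String readToken(PushbackInputStream in) throws IOException {
        int c;
        while ((c = in.read()) != -1 && Character.isWhitespace(c)) {
            // skip separator bytes
        }
        StringBuilder sb = new StringBuilder();
        while (c != -1 && !Character.isWhitespace(c)) {
            sb.append((char) c);
            c = in.read();
        }
        return sb.toString();
    }

    // Reads the "endobj ..." line, unreads whatever follows "endobj",
    // and returns the next token the parser would then see.
    static String nextTokenAfterEndobj(String fragment, boolean restoreSeparator)
            throws IOException {
        byte[] data = fragment.getBytes(StandardCharsets.ISO_8859_1);
        PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(data), 64);
        String line = readLine(in);
        if (line.startsWith("endobj")) {
            if (restoreSeparator) {
                // Pushed back first, so it is read last: it ends up after
                // the trailing content, where the newline used to be.
                in.unread(' ');
            }
            in.unread(line.substring(6).getBytes(StandardCharsets.ISO_8859_1));
        }
        return readToken(in);
    }

    public static void main(String[] args) throws IOException {
        String fragment = "endobj xref\n0 92\n";
        System.out.println(nextTokenAfterEndobj(fragment, false)); // prints "xref0"
        System.out.println(nextTokenAfterEndobj(fragment, true));  // prints "xref"
    }
}
```

Because bytes pushed back later are read earlier, the separator has to be
unread before the trailing content, which matches the order used in the fix.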

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
