unreading of trailing content after 'endobj' is missing new line byte (fix
included)
------------------------------------------------------------------------------------
Key: PDFBOX-978
URL: https://issues.apache.org/jira/browse/PDFBOX-978
Project: PDFBox
Issue Type: Bug
Components: Parsing
Affects Versions: 1.6.0
Reporter: Timo Boehme
I have several journal PDFs where the last xref section starts like
endobj xref
0 92
0000000000 65535 f
0000000044 00000 n
in this cases the PDF parser reads the endobj line completely and unreads "
xref".
However the newline (in this case ^D) is lost. This is already documented in the
method readline() within PDFParser:
"Note: if you later unread the results of this function, you'll
need to add a newline character to the end of the string."
Currently I get an error like: "expected='obj' actual='655'" because the 'xref'
is read as 'xref0'.
The fix:
in PDFParser insert before line 579 (the unreading of trailing characters after
'endobj') the lines:
// add a space first in place of the newline consumed by readline()
pdfSource.unread( SPACE_BYTE );
thus we get:
if (endObjectKey.startsWith( "endobj" ) )
{
/*
* Some PDF files don't contain a new line after endobj so
we
* need to make sure that the next object number is getting
read separately
* and not part of the endobj keyword. Ex. Some files would
have "endobj28"
* instead of "endobj"
*/
// add a space first in place of the newline consumed by
readline()
pdfSource.unread( SPACE_BYTE );
pdfSource.unread( endObjectKey.substring( 6
).getBytes("ISO-8859-1") );
}
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira