[ https://issues.apache.org/jira/browse/PDFBOX-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006855#comment-13006855 ]
Timo Boehme commented on PDFBOX-979:
------------------------------------
I have some bogus PDF files where content starts immediately after '%%EOF':
startxref
302041
%%EOF333 0 obj<</Length 15/Root
To handle this the same way as the 'endobj' case, I test whether the line
starts with '%%EOF' and, if so, unread everything that follows the marker.
New fixed version:
String eof = "";
if(!pdfSource.isEOF())
    eof = readLine(); // if there's more data to read, get the EOF flag
// verify that EOF exists
if(!"%%EOF".equals(eof)) {
    if( eof.startsWith( "%%EOF" ) ) {
        // content after marker -> unread it; the space byte is pushed
        // back first so that it replaces the newline we already read
        pdfSource.unread( SPACE_BYTE ); // we read a whole line; add space as newline replacement
        pdfSource.unread( eof.substring( 5 ).getBytes("ISO-8859-1") );
    } else {
        // PDF does not conform to spec, we should warn someone
        log.warn("expected='%%EOF' actual='" + eof + "'");
        // if we're not at the end of a file, just put it back and move on
        if(!pdfSource.isEOF()) {
            pdfSource.unread( SPACE_BYTE ); // we read a whole line; add space as newline replacement
            pdfSource.unread(eof.getBytes("ISO-8859-1"));
        }
    }
}
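For anyone who wants to try the idea outside of PDFBox, here is a minimal standalone sketch of the same logic. It is an illustration under assumptions, not the actual patch: it uses java.io.PushbackInputStream in place of pdfSource, and the class EofMarkerSketch and its helpers readLine, isEOF, skipEofMarker and remainingAfterEof are hypothetical names with simplified behaviour (readLine here only treats '\n' as a line end).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; none of these names are PDFBox API.
class EofMarkerSketch {
    private static final String EOF_MARKER = "%%EOF";
    private static final int SPACE_BYTE = ' ';

    // Simplified stand-in for readLine(): returns the bytes up to the
    // next '\n' and consumes that '\n' as well.
    static String readLine(PushbackInputStream in) throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int c;
        while ((c = in.read()) != -1 && c != '\n') {
            line.write(c);
        }
        return new String(line.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    // Peeks one byte to find out whether any input is left.
    static boolean isEOF(PushbackInputStream in) throws IOException {
        int c = in.read();
        if (c != -1) {
            in.unread(c);
        }
        return c == -1;
    }

    // Mirrors the control flow of the fixed version above.
    static void skipEofMarker(PushbackInputStream in) throws IOException {
        String eof = isEOF(in) ? "" : readLine(in);
        if (EOF_MARKER.equals(eof)) {
            return; // well-formed marker, nothing to restore
        }
        if (eof.startsWith(EOF_MARKER)) {
            // Content follows the marker on the same line. Whatever is
            // unread last is read first, so the space goes in first and
            // ends up behind the restored content.
            in.unread(SPACE_BYTE);
            in.unread(eof.substring(EOF_MARKER.length())
                    .getBytes(StandardCharsets.ISO_8859_1));
        } else {
            System.err.println("expected='%%EOF' actual='" + eof + "'");
            if (!isEOF(in)) {
                in.unread(SPACE_BYTE);
                in.unread(eof.getBytes(StandardCharsets.ISO_8859_1));
            }
        }
    }

    // Runs skipEofMarker on the input and returns what is left to read.
    static String remainingAfterEof(String input) {
        byte[] data = input.getBytes(StandardCharsets.ISO_8859_1);
        // The pushback buffer must hold one full line plus the space byte.
        try (PushbackInputStream in = new PushbackInputStream(
                new ByteArrayInputStream(data), data.length + 2)) {
            skipEofMarker(in);
            ByteArrayOutputStream rest = new ByteArrayOutputStream();
            int c;
            while ((c = in.read()) != -1) {
                rest.write(c);
            }
            return new String(rest.toByteArray(), StandardCharsets.ISO_8859_1);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The line from the bogus file above, followed by a made-up next line.
        System.out.print(remainingAfterEof("%%EOF333 0 obj<</Length 15/Root\nendobj\n"));
    }
}
```

With that input the object data comes back as "333 0 obj<</Length 15/Root endobj": the marker is dropped and the space keeps the restored line from running into the next one.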
> errors in %%EOF handling (fix included)
> ---------------------------------------
>
> Key: PDFBOX-979
> URL: https://issues.apache.org/jira/browse/PDFBOX-979
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 1.6.0
> Reporter: Timo Boehme
>
> The '%%EOF' handling in PDFParser has several errors. The current
> implementation (starting at line 467):
> String eof = "";
> if(!pdfSource.isEOF())
>     readLine(); // if there's more data to read, get the EOF flag
>
> // verify that EOF exists
> if("%%EOF".equals(eof)) {
>     // PDF does not conform to spec, we should warn someone
>     log.warn("expected='%%EOF' actual='" + eof + "'");
>     // if we're not at the end of a file, just put it back and move on
>     if(!pdfSource.isEOF())
>         pdfSource.unread(eof.getBytes("ISO-8859-1"));
> }
> The problems:
> - the eof variable is never assigned the result of readLine()
> - the comparison if("%%EOF".equals(eof)) must be negated
> - unreading must first push back a newline or space byte, because
>   readLine() has already consumed the line ending (as in bug PDFBOX-978)
> Corrected version:
> String eof = "";
> if(!pdfSource.isEOF())
>     eof = readLine(); // if there's more data to read, get the EOF flag
>
> // verify that EOF exists
> if(!"%%EOF".equals(eof)) {
>     // PDF does not conform to spec, we should warn someone
>     log.warn("expected='%%EOF' actual='" + eof + "'");
>     // if we're not at the end of a file, just put it back and move on
>     if(!pdfSource.isEOF()) {
>         pdfSource.unread( SPACE_BYTE ); // we read a whole line; add space as newline replacement
>         pdfSource.unread(eof.getBytes("ISO-8859-1"));
>     }
> }