[ https://issues.apache.org/jira/browse/PDFBOX-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006855#comment-13006855 ]

Timo Boehme commented on PDFBOX-979:
------------------------------------

I have some malformed PDF files in which content starts immediately after '%%EOF':

startxref
302041
%%EOF333 0 obj<</Length 15/Root

To handle this like the 'endobj' case, I test whether the line starts with
'%%EOF' and, if so, unread everything that follows the marker.
New fixed version:

                String eof = "";
                if(!pdfSource.isEOF())
                    eof = readLine(); // if there's more data to read, get the EOF flag

                // verify that EOF exists
                if(!"%%EOF".equals(eof)) {
                    if( eof.startsWith( "%%EOF" ) ) {
                        // content after marker -> unread with first space byte for read newline
                        pdfSource.unread( SPACE_BYTE ); // we read a whole line; add space as newline replacement
                        pdfSource.unread( eof.substring( 5 ).getBytes("ISO-8859-1") );
                    } else {
                        // PDF does not conform to spec, we should warn someone
                        log.warn("expected='%%EOF' actual='" + eof + "'");
                        // if we're not at the end of a file, just put it back and move on
                        if(!pdfSource.isEOF()) {
                            pdfSource.unread( SPACE_BYTE ); // we read a whole line; add space as newline replacement
                            pdfSource.unread(eof.getBytes("ISO-8859-1"));
                        }
                    }
                }
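
The unread order above can be tried outside the parser. The following is only
a minimal sketch, assuming pdfSource behaves like java.io.PushbackInputStream;
the class EofUnreadSketch, its simplified readLine() and the helper names are
hypothetical and not PDFBox code. Bytes pushed back last are read first, so
the space is pushed before the trailing content and ends up after it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone demo; not PDFBox code.
class EofUnreadSketch {
    static final byte SPACE_BYTE = 32;

    // Simplified stand-in for the parser's readLine(): consumes bytes up to and
    // including '\n' and returns the line without the terminator.
    static String readLine(PushbackInputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && c != '\n') {
            sb.append((char) c);
        }
        return sb.toString();
    }

    // Consumes the '%%EOF' line and pushes back anything glued onto the marker.
    static void skipEofMarker(PushbackInputStream in) throws IOException {
        String eof = readLine(in);
        if (!"%%EOF".equals(eof) && eof.startsWith("%%EOF")) {
            in.unread(SPACE_BYTE); // pushed first, so it is read last
            in.unread(eof.substring(5).getBytes(StandardCharsets.ISO_8859_1));
        }
    }

    // Reads everything that is still available, pushed-back bytes first.
    static String drain(PushbackInputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    static String remainingAfterMarker(String data) throws IOException {
        byte[] bytes = data.getBytes(StandardCharsets.ISO_8859_1);
        // Pushback buffer large enough for the whole line plus the space byte.
        PushbackInputStream in =
            new PushbackInputStream(new ByteArrayInputStream(bytes), bytes.length + 1);
        skipEofMarker(in);
        return drain(in);
    }

    public static void main(String[] args) throws IOException {
        // prints "[333 0 obj<</Length 15/Root stream]"
        System.out.println("["
            + remainingAfterMarker("%%EOF333 0 obj<</Length 15/Root\nstream") + "]");
    }
}
```

In this sketch the consumed newline comes back as a single space, so the
object header that followed the marker stays separated from the next line.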


> errors in %%EOF handling (fix included)
> ---------------------------------------
>
>                 Key: PDFBOX-979
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-979
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.6.0
>            Reporter: Timo Boehme
>
> The '%%EOF' handling in PDFParser has several errors. The current
> implementation (starting at line 467):
>                 String eof = "";
>                 if(!pdfSource.isEOF())
>                     readLine(); // if there's more data to read, get the EOF flag
>                 
>                 // verify that EOF exists
>                 if("%%EOF".equals(eof)) {
>                     // PDF does not conform to spec, we should warn someone
>                     log.warn("expected='%%EOF' actual='" + eof + "'");
>                     // if we're not at the end of a file, just put it back and move on
>                     if(!pdfSource.isEOF())
>                         pdfSource.unread(eof.getBytes("ISO-8859-1"));
>                 }
> The problems:
> - the eof variable is never assigned the result of readLine()
> - the comparison if("%%EOF".equals(eof)) must be negated
> - unreading must first add a newline or space byte, because readLine()
>   consumed the line terminator (like in bug PDFBOX-978)
> Corrected version:
>                 String eof = "";
>                 if(!pdfSource.isEOF())
>                     eof = readLine(); // if there's more data to read, get the EOF flag
>                 
>                 // verify that EOF exists
>                 if(!"%%EOF".equals(eof)) {
>                     // PDF does not conform to spec, we should warn someone
>                     log.warn("expected='%%EOF' actual='" + eof + "'");
>                     // if we're not at the end of a file, just put it back and move on
>                     if(!pdfSource.isEOF()) {
>                         pdfSource.unread( SPACE_BYTE ); // we read a whole line; add space as newline replacement
>                         pdfSource.unread(eof.getBytes("ISO-8859-1"));
>                     }
>                 }
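
The third problem in the quoted list (a separator must be pushed back together
with the line) can be illustrated with a small stand-alone sketch. It assumes
the source stream behaves like java.io.PushbackInputStream; the class
UnreadSeparatorSketch, its simplified readLine() and the sample input are
hypothetical and only serve the illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone demo; not PDFBox code.
class UnreadSeparatorSketch {
    // Simplified stand-in for readLine(): the '\n' is consumed but not returned.
    static String readLine(PushbackInputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && c != '\n') {
            sb.append((char) c);
        }
        return sb.toString();
    }

    // Reads one line, pushes it back (optionally with a space standing in for
    // the consumed newline) and returns what the stream yields afterwards.
    static String reread(String data, boolean addSeparator) throws IOException {
        byte[] bytes = data.getBytes(StandardCharsets.ISO_8859_1);
        PushbackInputStream in =
            new PushbackInputStream(new ByteArrayInputStream(bytes), bytes.length + 1);
        String line = readLine(in);
        if (addSeparator) {
            in.unread(' '); // pushed first, so it is read after the line
        }
        in.unread(line.getBytes(StandardCharsets.ISO_8859_1));
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(reread("xref\n0 1", false)); // prints "xref0 1"
        System.out.println(reread("xref\n0 1", true));  // prints "xref 0 1"
    }
}
```

Without the separator the pushed-back line runs into the following token,
which is the situation the space byte in the corrected version avoids.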

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
