[ 
https://issues.apache.org/jira/browse/PDFBOX-276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860470#action_12860470
 ] 

[email protected] commented on PDFBOX-276:
----------------------------------------------

This describes a fix to the file org.apache.pdfbox.pdfparser.BaseParser.java.
I did not debug this against the trunk version; my code is from a few months ago.


I debugged this problem because I was hitting the same issue with a document I 
had.
The issue is not that PDFBox parses the document incorrectly. 
The real problem is that the document creator 
        Amyuni PDF Converter Version 1.58 - Developer Licence N° 9B7449F2-8245
generated the Title string incorrectly. 
The string was written as  (c:\)
However, the PDF specification states that a backslash followed by a right 
parenthesis, "\)", produces the literal character ")" within a string literal.
A string literal must have an opening parenthesis and a matching closing 
parenthesis.
Because the \ consumes the closing parenthesis, PDFBox cannot find the 
correct closing character;
it keeps consuming lines until it reaches the end of the file.
I opened this document in Adobe Reader and looked in 
       File -> Properties
Adobe Reader cannot identify the title or the other attributes either; however, 
it does not crash when reading the document.


This behavior is documented in section 7.3.4.2, Literal Strings, of the PDF 
specification:
http://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf


A literal string looks like this:
        (This is a string)
The specification's list of escape sequences shows how to escape a parenthesis 
within a string:
        \)    RIGHT PARENTHESIS (29h)

The document contains the syntax " /Title (c:\) ", which fails to escape the 
backslash character.
The correct encoding would be   " /Title (c:\\) " 
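
For illustration only, here is a minimal sketch of how a producer could write 
such a title correctly. The class and method names are hypothetical and are not 
part of PDFBox or of the Amyuni converter. It escapes every backslash and 
parenthesis, which is always valid even though the specification also allows 
balanced parentheses to appear unescaped.

```java
// Hypothetical helper, not part of PDFBox: writes text as a PDF literal string.
final class PdfLiteralEscaper {

    /** Wraps text in parentheses, prefixing each \ ( ) with a backslash. */
    static String toLiteral(String text) {
        StringBuilder out = new StringBuilder("(");
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\\' || c == '(' || c == ')') {
                out.append('\\');
            }
            out.append(c);
        }
        return out.append(')').toString();
    }

    public static void main(String[] args) {
        // The Java literal "c:\\" is the three characters c:\
        System.out.println("/Title " + toLiteral("c:\\")); // prints /Title (c:\\)
    }
}
```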


Here is the text from the PDF file that was attached to this bug.

/Title (c:\)
/Producer (Amyuni PDF Converter)
/Version (Version 1.58 - Developer Licence N° 9B7449F2-8245)
/CreationDate (1/8/2003 12:18:53)

I don't think this can be fixed without compromising the content of the 
document.
We could simply discard the information, the way Adobe Reader does, when we 
reach the >> or "endobj" line.
However, I came up with a workaround.

<<
/Title (c:\)
/Producer (Amyuni PDF Converter)
/Version (Version 1.58 - Developer Licence N° 9B7449F2-8245)
/CreationDate (1/8/2003 12:18:53)
>>
endobj


Looking at the code in more depth, there seems to be another patch for a 
similar issue.
In this case another vendor made a similar mistake in the title generation.

In the file org.apache.pdfbox.pdfparser.BaseParser.java:

                //lets handle the special case seen in Bull  River Rules and Regulations.pdf
                //The dictionary looks like this
                //    2 0 obj
                //    <<
                //        /Type /Info
                //        /Creator (PaperPort http://www.scansoft.com)
                //        /Producer (sspdflib 1.0 http://www.scansoft.com)
                //        /Title ( (5)
                //        /Author ()
                //        /Subject ()


I noticed this a little later and realized that I needed the same code in a 
second place,
so I extracted it into a method, which is now called from two places.


//================================ Change 1

    /**
     * This is really a bug in the document creator's code, but it caused a
     * crash in PDFBox. The first bug was in this format:
     *     /Title ( (5)
     *     /Creator
     * which was patched in one place. However, that patch missed the case
     * where the closing parenthesis was escaped.
     *
     * The second bug was in this format:
     *     /Title (c:\)
     *     /Producer
     *
     * This patch moves the check out of the parseCOSString method, so it can
     * be used twice.
     *
     * @param bracesParameter the number of braces currently open.
     *
     * @return the corrected value of the brace counter
     * @throws IOException if reading from or unreading to the source fails
     */
    private int checkForMissingCloseParen(final int bracesParameter) throws IOException
    {
        int braces = bracesParameter - 1;
        byte[] nextThreeBytes = new byte[3];
        int amountRead = pdfSource.read(nextThreeBytes);

        //lets handle the special case seen in Bull  River Rules and Regulations.pdf
        //The dictionary looks like this
        //    2 0 obj
        //    <<
        //        /Type /Info
        //        /Creator (PaperPort http://www.scansoft.com)
        //        /Producer (sspdflib 1.0 http://www.scansoft.com)
        //        /Title ( (5)
        //        /Author ()
        //        /Subject ()
        //
        // Notice the /Title, the braces are not even but they should
        // be.  So lets assume that if we encounter this scenario
        //   <end_brace><new_line><opening_slash> then that
        // means that there is an error in the pdf and assume that
        // was the end of the string.
        if( amountRead == 3 )
        {
            if( nextThreeBytes[0] == 0x0d &&   // carriage return
                nextThreeBytes[1] == 0x0a &&   // line feed
                nextThreeBytes[2] == 0x2f )    // slash '/'
            {
                braces = 0;
            }
        }
        // read() returns -1 at end of stream, so only push back bytes
        // that were actually read
        if( amountRead > 0 )
        {
            pdfSource.unread( nextThreeBytes, 0, amountRead );
        }
        return braces;
    }
// =================================End of Change 1
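
To try the look-ahead outside of PDFBox, the peek-and-unread step can be 
sketched on its own as below. This is only an approximation under stated 
assumptions: pdfSource is replaced by a plain java.io.PushbackInputStream, and 
the class and method names are hypothetical. The pushback buffer has to be at 
least three bytes, otherwise unread() fails.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone sketch; not the PDFBox implementation.
final class LookAheadSketch {

    /** Reports whether CR, LF, '/' come next, leaving the stream position unchanged. */
    static boolean newlineThenSlashFollows(PushbackInputStream in) throws IOException {
        byte[] next = new byte[3];
        int read = in.read(next);
        boolean match = read == 3
                && next[0] == 0x0d   // carriage return
                && next[1] == 0x0a   // line feed
                && next[2] == 0x2f;  // '/'
        if (read > 0) {
            in.unread(next, 0, read); // push back only what was actually read
        }
        return match;
    }

    /** Convenience wrapper: peeks at the start of the given text. */
    static boolean peek(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        // The pushback buffer must hold the three bytes we may unread.
        try (PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(bytes), 3)) {
            return newlineThenSlashFollows(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(peek("\r\n/Producer (Amyuni PDF Converter)")); // true
        System.out.println(peek(" more text)"));                          // false
    }
}
```

Note that the check only recognizes a CR LF line ending; a file that uses a 
bare LF or CR after the broken string would not be caught by it.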



Now, in the method where the code was originally defined, I removed it and 
called the new method instead.
=============================== Change 2 

            if(ch == closeBrace)
            {
                braces=checkForMissingCloseParen(braces);
                if( braces != 0 )
                {
                    retval.append( ch );
                }

==============================End of Change 2

Then, where the parser handles an escaped parenthesis (\( or \)), I added 
another call to the new method to check for the same case.


============================== Change 3

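                    case ')':
                        // PDFBox 276 /Title (c:\)
                        // An escaped ')' does not close a nesting level, so
                        // pass braces + 1 to offset the decrement inside
                        // checkForMissingCloseParen.
                        braces=checkForMissingCloseParen(braces + 1);
                        if( braces != 0 )
                        {
                            // normal escape: keep the parenthesis itself
                            retval.append( next );
                        }
                        else
                        {
                            // broken producer: the backslash was meant literally
                            retval.append('\\');
                        }
                        break;
                    case '(':
                    case '\\':
                        retval.append( next );
                        break;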


================================ End of Change 3.
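
Putting the three changes together, the intended behavior can be sketched as a 
small stand-alone reader. This is a simplified illustration, not the 
BaseParser code: the class and method names are hypothetical, it uses 
java.io.PushbackInputStream in place of pdfSource, and escape sequences other 
than the parentheses and backslash are simply copied through.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical, simplified sketch of the lenient literal-string handling.
final class LenientLiteralSketch {

    /** Peeks for CR, LF, '/' without consuming anything. */
    private static boolean newlineThenSlashFollows(PushbackInputStream in) throws IOException {
        byte[] next = new byte[3];
        int read = in.read(next);
        boolean match = read == 3 && next[0] == 0x0d && next[1] == 0x0a && next[2] == 0x2f;
        if (read > 0) {
            in.unread(next, 0, read);
        }
        return match;
    }

    /** Reads one literal string; the stream must be positioned just after '('. */
    static String readLiteral(PushbackInputStream in) throws IOException {
        StringBuilder out = new StringBuilder();
        int depth = 1;
        int c;
        while (depth > 0 && (c = in.read()) != -1) {
            if (c == ')') {
                depth--;
                // "/Title ( (5)" case: unbalanced, but a new key follows.
                if (depth > 0 && newlineThenSlashFollows(in)) {
                    depth = 0;
                }
                if (depth > 0) {
                    out.append(')');
                }
            } else if (c == '(') {
                depth++;
                out.append('(');
            } else if (c == '\\') {
                int next = in.read();
                if (next == ')' && newlineThenSlashFollows(in)) {
                    // "/Title (c:\)" case: keep the backslash, end the string.
                    out.append('\\');
                    depth = 0;
                } else if (next != -1) {
                    // Simplification: \n, octal codes etc. are not decoded here.
                    out.append((char) next);
                }
            } else {
                out.append((char) c);
            }
        }
        return out.toString();
    }

    /** Convenience wrapper taking the text that follows the opening '('. */
    static String parse(String afterOpenParen) {
        byte[] bytes = afterOpenParen.getBytes(StandardCharsets.ISO_8859_1);
        try (PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(bytes), 3)) {
            return readLiteral(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("c:\\)\r\n/Producer (Amyuni PDF Converter)")); // c:\
        System.out.println(parse(" (5)\r\n/Author ()"));                        //  (5
        System.out.println(parse("a\\)b) tail"));                               // a)b
    }
}
```

The third call shows that a correctly escaped parenthesis inside a well-formed 
string is still read as an ordinary character.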


Peter Lenahan


> IOException on parsing a PDF file
> ---------------------------------
>
>                 Key: PDFBOX-276
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-276
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>            Priority: Minor
>         Attachments: PDFBOX276-NotIndexedDocument.pdf
>
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1722594
> Originally submitted by doublep-enw on 2007-05-21 05:10.
> When parsing the attached file, PDFBox throws the following exception:
> java.io.IOException: expected='/' actual='?'--1 
> org.pdfbox.io.pushbackinputstr...@159f498
>     at org.pdfbox.pdfparser.BaseParser.parseCOSName(BaseParser.java:774)
>     at org.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:217)
>     at org.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:910)
>     at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:432)
>     at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:176)
> The file does look strange inside, but PDF viewers don't seem to care.
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1722594&file_id=229983
> NotIndexedDocument.pdf (application/pdf), 8728 bytes
> unparseable file
