[
https://issues.apache.org/jira/browse/PDFBOX-276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860470#action_12860470
]
[email protected] commented on PDFBOX-276:
----------------------------------------------
This describes a fix to the file org.apache.pdfbox.pdfparser.BaseParser.java.
I did not debug this against the trunk version; the code I have is from a few
months ago.
I debugged this problem because I was hitting the same issue with a document I
had.
The issue is not that PDFBox parses the document incorrectly.
The real problem is that the document creator,
Amyuni PDF Converter Version 1.58 - Developer Licence N° 9B7449F2-8245,
generated the name in the Title incorrectly.
The name was generated as (c:\)
However, the PDF specification states that a backslash followed by a right
parenthesis, "\)", is used to produce the literal character ")" within a
string literal.
A string literal is required to have an opening parenthesis and a closing
parenthesis.
Because the \ eats the closing parenthesis, PDFBox cannot find the correct
closing character;
it goes on and eats several lines until it reaches the end of the file.
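As a rough illustration of the failure, here is a minimal sketch of a scanner that applies the escaping rule strictly. It is not PDFBox code; the class StrictLiteralScan and the method findEnd are names made up for this example, and it ignores every other escape sequence.

```java
/** Hypothetical, simplified scanner for illustration; not the PDFBox implementation. */
final class StrictLiteralScan {

    /**
     * Returns the index just past the ')' that closes the literal string
     * starting at data[start] == '(', or -1 if the input ends first.
     */
    static int findEnd(String data, int start) {
        int depth = 0;
        for (int i = start; i < data.length(); i++) {
            char c = data.charAt(i);
            if (c == '\\') {
                i++;                 // the character after a backslash is escaped: skip it
            } else if (c == '(') {
                depth++;             // unescaped '(' opens a nesting level
            } else if (c == ')') {
                depth--;             // unescaped ')' closes a nesting level
                if (depth == 0) {
                    return i + 1;
                }
            }
        }
        return -1;                   // ran off the end without finding the closing ')'
    }
}
```

With the input (c:\\) the scanner stops right after the closing parenthesis. With (c:\) followed by the rest of the Info dictionary, the "\)" is skipped as an escape, the later balanced strings never bring the depth back to zero, and the method returns -1.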
I opened this document with Adobe Reader and looked in
File -> Properties
Adobe Reader cannot identify the title or the other attributes either; however,
it does not crash when reading the document.
The behavior is documented in section 7.3.4.2, Literal Strings, of the PDF
specification:
http://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf
A literal string looks like this:
(This is a string)
This is how to escape a parenthesis within a string: " \) RIGHT PARENTHESIS
(29h) "
The document contains the syntax " /Title (c:\) ", which fails to encode the
backslash character.
The correct encoding would be " /Title (c:\\) "
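For comparison, a producer could avoid the problem by escaping the three special characters before writing the string. The following is a minimal sketch under that assumption; PdfLiteralEscaper and toLiteral are hypothetical names, not part of PDFBox or of the Amyuni product, and character encoding of non-ASCII text is not handled.

```java
/** Hypothetical helper for illustration; not part of PDFBox. */
final class PdfLiteralEscaper {

    /** Wraps text in parentheses, escaping backslash and both parentheses. */
    static String toLiteral(String text) {
        StringBuilder out = new StringBuilder("(");
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\\' || c == '(' || c == ')') {
                out.append('\\');    // prefix the special character with a backslash
            }
            out.append(c);
        }
        return out.append(')').toString();
    }
}
```

Calling toLiteral with the three characters c:\ produces (c:\\), which is the encoding the Title entry should have had. Escaping every parenthesis is more than the specification requires for balanced pairs, but it is always safe.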
Here is the text from the PDF file that was attached to this bug.
/Title (c:\)
/Producer (Amyuni PDF Converter)
/Version (Version 1.58 - Developer Licence N° 9B7449F2-8245)
/CreationDate (1/8/2003 12:18:53)
I don't think this can be fixed without compromising the content of the
document.
We could just discard the information, the way Adobe Reader does, when we
reach the >> or "endobj" line.
However, I came up with a workaround.
<<
/Title (c:\)
/Producer (Amyuni PDF Converter)
/Version (Version 1.58 - Developer Licence N° 9B7449F2-8245)
/CreationDate (1/8/2003 12:18:53)
>>
endobj
Looking at the code in more depth, there already seems to be a patch for a
similar issue.
In that case another vendor made a similar mistake when generating the title.
In the file org.apache.pdfbox.pdfparser.BaseParser.java:
//lets handle the special case seen in Bull River Rules and Regulations.pdf
//The dictionary looks like this
// 2 0 obj
// <<
// /Type /Info
// /Creator (PaperPort http://www.scansoft.com)
// /Producer (sspdflib 1.0 http://www.scansoft.com)
// /Title ( (5)
// /Author ()
// /Subject ()
I noticed this a little later and realized that I needed the same code in a
different place,
so I extracted it into a method, which is now called from two places.
//================================ Change 1
/**
* This is really a bug in the document creator's code, but it caused a crash
* in PDFBox. The first bug was in this format:
* /Title ( (5)
* /Creator
* which was patched in one place.
* However, that patch missed the case where the close paren was escaped.
*
* The second bug was in this format:
* /Title (c:\)
* /Producer
*
* This patch moves this code out of the parseCOSString method, so it can
* be used twice.
*
*
* @param bracesParameter the number of braces currently open.
*
* @return the corrected value of the brace counter
* @throws IOException
*/
private int checkForMissingCloseParen(final int bracesParameter) throws IOException {
int braces=bracesParameter-1;
byte[] nextThreeBytes = new byte[3];
int amountRead = pdfSource.read(nextThreeBytes);
//lets handle the special case seen in Bull River Rules and Regulations.pdf
//The dictionary looks like this
// 2 0 obj
// <<
// /Type /Info
// /Creator (PaperPort http://www.scansoft.com)
// /Producer (sspdflib 1.0 http://www.scansoft.com)
// /Title ( (5)
// /Author ()
// /Subject ()
//
// Notice the /Title, the braces are not even but they should
// be. So let's assume that if we encounter this scenario
// <end_brace><new_line><opening_slash> then that
// means that there is an error in the pdf and assume that
// was the end of the string.
if( amountRead == 3 )
{
if( nextThreeBytes[0] == 0x0d &&
nextThreeBytes[1] == 0x0a &&
nextThreeBytes[2] == 0x2f )
{
braces = 0;
}
}
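// read() returns -1 at end of file; only push back bytes that were actually read
if( amountRead > 0 )
{
pdfSource.unread( nextThreeBytes, 0, amountRead );
}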
return braces;
}
// =================================End of Change 1
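The lookahead in Change 1 depends on being able to read a few bytes and then put them back. Below is a self-contained sketch of that technique using java.io.PushbackInputStream from the JDK. LookaheadDemo, nextIsCrLfSlash and peek are hypothetical names for this example, and the sketch assumes a single read() call delivers all three bytes when they are available, which holds for the in-memory stream used here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo class; only the JDK stream classes are real APIs. */
final class LookaheadDemo {

    /** True if the next three bytes are CR, LF, '/'. The stream position is left unchanged. */
    static boolean nextIsCrLfSlash(PushbackInputStream in) throws IOException {
        byte[] next = new byte[3];
        int n = in.read(next);               // -1 at end of stream
        boolean match = n == 3 && next[0] == 0x0d && next[1] == 0x0a && next[2] == 0x2f;
        if (n > 0) {
            in.unread(next, 0, n);           // push back exactly what was read
        }
        return match;
    }

    /** Convenience wrapper: peeks at the given text and verifies nothing was consumed. */
    static boolean peek(String upcoming) {
        try {
            // The pushback buffer must hold at least the three bytes we unread.
            PushbackInputStream in = new PushbackInputStream(
                    new ByteArrayInputStream(upcoming.getBytes(StandardCharsets.ISO_8859_1)), 3);
            boolean match = nextIsCrLfSlash(in);
            int first = in.read();
            int expected = upcoming.isEmpty() ? -1 : upcoming.charAt(0);
            if (first != expected) {
                throw new IllegalStateException("lookahead consumed input");
            }
            return match;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The guard around unread() matters at end of input, where read() returns -1 and there is nothing to push back.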
Now in the method where it was originally defined, I removed the code and
called the new method.
=============================== Change 2
if(ch == closeBrace)
{
braces=checkForMissingCloseParen(braces);
if( braces != 0 )
{
retval.append( ch );
}
==============================End of Change 2
Then, where escape sequences such as \( are handled, I added another call to
the method in the \) case to check for the same condition.
============================== Change 3
case ')':
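// PDFBox 276 /Title (c:\)
// An escaped ')' does not close a nesting level, so pass braces+1 to
// cancel the decrement inside checkForMissingCloseParen.
braces=checkForMissingCloseParen(braces+1);
if( braces != 0 )
{
// in this branch ch is the backslash; the escaped character is next
retval.append( next );
}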
else {
retval.append('\\');
}
break;
case '(':
case '\\':
retval.append( next );
break;
================================ End of Change 3.
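To try the idea outside of PDFBox, the three changes can be combined into one small, self-contained sketch. This is not the BaseParser code: TolerantLiteralParser, followedByCrLfSlash and parse are hypothetical names, only the \( \) and \\ escapes are treated specially, and the lookahead helper here only reports what it sees, so the nesting counter is changed in the loop itself. That last point is a deliberate simplification so that an escaped ")" inside a well-formed string leaves the counter alone.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-alone sketch of the workaround; not the PDFBox implementation. */
final class TolerantLiteralParser {

    /** Peeks at the next three bytes: CR, LF, '/' suggests a new dictionary key follows. */
    private static boolean followedByCrLfSlash(PushbackInputStream in) throws IOException {
        byte[] next = new byte[3];
        int n = in.read(next);
        if (n > 0) {
            in.unread(next, 0, n);           // leave the stream position unchanged
        }
        return n == 3 && next[0] == 0x0d && next[1] == 0x0a && next[2] == 0x2f;
    }

    /** Parses one literal string; the stream needs a pushback buffer of at least 3 bytes. */
    static String parse(PushbackInputStream in) throws IOException {
        if (in.read() != '(') {
            throw new IOException("expected '('");
        }
        StringBuilder retval = new StringBuilder();
        int braces = 1;
        int c = -1;
        while (braces > 0 && (c = in.read()) != -1) {
            char ch = (char) c;
            if (ch == ')') {
                braces--;
                if (followedByCrLfSlash(in)) {
                    braces = 0;              // "/Title ( (5)" case: treat as end of string
                }
                if (braces != 0) {
                    retval.append(ch);
                }
            } else if (ch == '(') {
                braces++;
                retval.append(ch);
            } else if (ch == '\\') {
                int n = in.read();
                if (n == -1) {
                    break;
                }
                char next = (char) n;
                if (next == ')' && followedByCrLfSlash(in)) {
                    braces = 0;              // "/Title (c:\)" case: the backslash was data
                    retval.append('\\');
                } else {
                    retval.append(next);     // simplified: other escapes are copied as-is
                }
            } else {
                retval.append(ch);
            }
        }
        return retval.toString();
    }

    /** Convenience wrapper for experiments with in-memory input. */
    static String parse(String raw) {
        try {
            return parse(new PushbackInputStream(
                    new ByteArrayInputStream(raw.getBytes(StandardCharsets.ISO_8859_1)), 3));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The heuristic remains a guess: a well-formed string whose escaped ")" happens to be followed by CR, LF and "/" would be cut short, which is the trade-off the workaround accepts.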
Peter Lenahan
> IOException on parsing a PDF file
> ---------------------------------
>
> Key: PDFBOX-276
> URL: https://issues.apache.org/jira/browse/PDFBOX-276
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Priority: Minor
> Attachments: PDFBOX276-NotIndexedDocument.pdf
>
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1722594
> Originally submitted by doublep-enw on 2007-05-21 05:10.
> When parsing the attached file, PDFBox throws the following exception:
> java.io.IOException: expected='/' actual='?'--1
> org.pdfbox.io.pushbackinputstr...@159f498
> at org.pdfbox.pdfparser.BaseParser.parseCOSName(BaseParser.java:774)
> at org.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:217)
> at org.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:910)
> at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:432)
> at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:176)
> The file does look strange inside, but PDF viewers don't seem to care.
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1722594&file_id=229983
> NotIndexedDocument.pdf (application/pdf), 8728 bytes
> unparseable file