[ 
https://issues.apache.org/jira/browse/PDFBOX-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13806720#comment-13806720
 ] 

Arjohn Kampman edited comment on PDFBOX-1607 at 10/28/13 4:27 PM:
------------------------------------------------------------------

Unfortunately, using the non-sequential parser is not an option for us yet 
since it has other parsing problems, so I investigated this parsing problem 
today.

First of all, this problem was introduced in svn revision 1451638 as part of 
PDFBOX-1513. Looking at the changes made to {{BaseParser}} in that revision, 
I fail to see how {{sBuf}} is related to the length of {{strmBuf}} in this 
line:

{{sBuf.deleteCharAt(strmBuf.length-1);}}

This looks like a genuine bug to me. The intention of this line was probably to 
discard the last character if the buffer contains an odd number of hexadecimal 
digits. If this line is fixed, the problematic documents parse successfully, 
albeit with an error being logged. That error is the result of the 
{{wasLastParsedObjectEOF}} flag in {{PDFParser.parse()}} being reset to false.
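
To illustrate the mismatch described above, here is a minimal standalone 
sketch. It is not the PDFBox code and not the attached patch: the class and 
method names are hypothetical, and the 2048-byte size of {{strmBuf}} is an 
assumption chosen because it would match the index 2047 in the reported 
exception.

```java
// Hypothetical sketch; names are illustrative and not taken from PDFBox.
final class HexBufferSketch {

    // Mirrors the suspected bug: the index is derived from an unrelated array.
    static String buggyTrim(StringBuilder sBuf, byte[] strmBuf) {
        // Throws when strmBuf.length - 1 is not a valid index into sBuf.
        sBuf.deleteCharAt(strmBuf.length - 1);
        return sBuf.toString();
    }

    // Assumed intent: drop the last character when the digit count is odd.
    static String trimOddHexDigit(StringBuilder sBuf) {
        if (sBuf.length() % 2 != 0) {
            sBuf.deleteCharAt(sBuf.length() - 1);
        }
        return sBuf.toString();
    }

    public static void main(String[] args) {
        byte[] strmBuf = new byte[2048]; // assumed size of the stream buffer
        try {
            buggyTrim(new StringBuilder("48656C6C6F7"), strmBuf);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("buggy variant threw " + e.getClass().getSimpleName());
        }
        // The index is now taken from the buffer that is actually modified.
        System.out.println(trimOddHexDigit(new StringBuilder("48656C6C6F7")));
    }
}
```

The second method takes the index from {{sBuf}} itself, so it cannot go out 
of range for a non-empty buffer, whatever the size of {{strmBuf}}.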

The attached patch fixes both issues and is based on today's trunk code. 
Please consider applying it.


> StringIndexOutOfBoundsException in PDFParser
> --------------------------------------------
>
>                 Key: PDFBOX-1607
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1607
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.8.1
>         Environment: Windows 7, JRE 1.7.0_15-b03
>            Reporter: Alex Alishevskikh
>         Attachments: pdfbox-1607-fix.patch, pdf-govdocs-036902.pdf, 
> pdf-govdocs-107566.pdf
>
>
> I have a few test files that parse fine in PDFBox 1.7.1 but not in 1.8.1:
> java.lang.StringIndexOutOfBoundsException: String index out of range: 2047
>      at java.lang.AbstractStringBuilder.deleteCharAt(AbstractStringBuilder.java:762)
>      at java.lang.StringBuilder.deleteCharAt(StringBuilder.java:258)
>      at org.apache.pdfbox.pdfparser.BaseParser.parseCOSHexString(BaseParser.java:1000)
>      at org.apache.pdfbox.pdfparser.BaseParser.parseCOSString(BaseParser.java:808)
>      at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1241)
>      at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:558)
>      at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:188)


