[
https://issues.apache.org/jira/browse/PDFBOX-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13806720#comment-13806720
]
Arjohn Kampman edited comment on PDFBOX-1607 at 10/28/13 4:27 PM:
------------------------------------------------------------------
Unfortunately, using the non-sequential parser is not an option for us yet
since it has other parsing problems, so I investigated this parsing
problem today.
First of all, this problem was introduced in svn revision 1451638 as part
of PDFBOX-1513. Looking at the changes made to {{BaseParser}} in that
revision, I fail to see how {{sBuf}} is related to the length of
{{strmBuf}} in this line:
{{sBuf.deleteCharAt(strmBuf.length-1);}}
This looks like a genuine bug to me. The intention of this line was probably to
discard the last character if the buffer contains an odd number of
hexadecimal digits. If this line is fixed, the problematic documents parse
successfully, albeit with an error being logged. That error is the result of
the {{wasLastParsedObjectEOF}} flag in {{PDFParser.parse()}} being reset to false.
The attached patch fixes both issues. That patch is based on today's trunk
code. Please consider applying this patch.
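To illustrate the suspected bug outside of PDFBox, here is a minimal, self-contained sketch. It is not the attached patch and not the actual {{BaseParser}} code: the class name, the sample hex strings, and the 2048-byte buffer size (guessed from the index 2047 in the stack trace) are all assumptions made for illustration. It only shows why indexing {{sBuf}} by the length of an unrelated array fails, and what the probable intent looks like.

```java
public class HexStringTrimSketch {
    // Hypothetical stand-in for BaseParser's stream buffer. The size is an
    // assumption inferred from "String index out of range: 2047".
    static final byte[] strmBuf = new byte[2048];

    // Mirrors the reported line: the index comes from an unrelated array,
    // so it is out of range whenever sBuf is shorter than strmBuf.
    static void buggyTrim(StringBuilder sBuf) {
        sBuf.deleteCharAt(strmBuf.length - 1);
    }

    // Probable intent (an assumption): drop the last character of sBuf
    // itself when it holds an odd number of hexadecimal digits.
    static void fixedTrim(StringBuilder sBuf) {
        if (sBuf.length() % 2 != 0) {
            sBuf.deleteCharAt(sBuf.length() - 1);
        }
    }

    public static void main(String[] args) {
        StringBuilder odd = new StringBuilder("48656C6C6"); // 9 hex digits

        try {
            buggyTrim(new StringBuilder(odd)); // work on a copy
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("buggy variant threw: " + e.getMessage());
        }

        fixedTrim(odd);
        System.out.println("fixed variant left: " + odd);
    }
}
```

With a short hex string the buggy variant throws, while the fixed variant simply trims the dangling digit and leaves even-length input untouched.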
> StringIndexOutOfBoundsException in PDFParser
> --------------------------------------------
>
> Key: PDFBOX-1607
> URL: https://issues.apache.org/jira/browse/PDFBOX-1607
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 1.8.1
> Environment: Windows 7, JRE 1.7.0_15-b03
> Reporter: Alex Alishevskikh
> Attachments: pdfbox-1607-fix.patch, pdf-govdocs-036902.pdf,
> pdf-govdocs-107566.pdf
>
>
> I have a few test files that parse fine in PDFBox 1.7.1 but not in 1.8.1:
> java.lang.StringIndexOutOfBoundsException: String index out of range: 2047
> at java.lang.AbstractStringBuilder.deleteCharAt(AbstractStringBuilder.java:762)
> at java.lang.StringBuilder.deleteCharAt(StringBuilder.java:258)
> at org.apache.pdfbox.pdfparser.BaseParser.parseCOSHexString(BaseParser.java:1000)
> at org.apache.pdfbox.pdfparser.BaseParser.parseCOSString(BaseParser.java:808)
> at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1241)
> at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:558)
> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:188)