[
https://issues.apache.org/jira/browse/PDFBOX-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13562602#comment-13562602
]
Thomas Chojecki commented on PDFBOX-1171:
-----------------------------------------
I'm running into some problems with this patch. We have some documents with
manipulated COSStrings, and the parser runs into the IOException, which breaks
the whole parsing process. I know that these are malformed documents, but
letting the parser crash IMO isn't the best solution. Maybe returning an empty
string or parsing as many bytes as possible would be a better way.
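
A minimal sketch of the "parse as many bytes as possible" idea, for
illustration only. It assumes the opening '<' has already been consumed and
works on a plain String instead of the parser's stream. The class and method
names (LenientHexString, parseLenient) are hypothetical and are not part of
PDFBox or of the attached patch. On the first invalid character it stops and
returns what was decoded so far, which yields an empty result if the very
first character is already invalid.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper, not PDFBox API: lenient decoding of a hexadecimal string.
final class LenientHexString {

    /** Returns the value of a hex digit, or -1 if c is not a hex digit. */
    private static int hexValue(int c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        return -1;
    }

    /** PDF white-space characters: NUL, TAB, LF, FF, CR, SPACE. */
    private static boolean isPdfWhitespace(int c) {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }

    /**
     * Decodes the text following an opening '<'. Stops at '>' or at the first
     * invalid character and returns the bytes decoded up to that point.
     */
    static byte[] parseLenient(String afterOpeningBracket) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int high = -1; // pending high nibble, -1 if none
        for (int i = 0; i < afterOpeningBracket.length(); i++) {
            char c = afterOpeningBracket.charAt(i);
            if (c == '>') {
                // Odd number of digits: the missing final digit is treated as 0.
                if (high >= 0) out.write(high << 4);
                return out.toByteArray();
            }
            if (isPdfWhitespace(c)) continue;
            int digit = hexValue(c);
            if (digit < 0) break; // malformed: keep what we have, drop a dangling nibble
            if (high < 0) {
                high = digit;
            } else {
                out.write((high << 4) | digit);
                high = -1;
            }
        }
        return out.toByteArray();
    }
}
```

Whether a dangling half byte should be dropped or padded in the malformed
case is a design choice. The sketch drops it so that only complete bytes read
before the error are returned.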
> Parsing hexadecimal strings is not strict enough + FIX
> ------------------------------------------------------
>
> Key: PDFBOX-1171
> URL: https://issues.apache.org/jira/browse/PDFBOX-1171
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 1.7.0
> Reporter: Timo Boehme
> Assignee: Andreas Lehmkühler
> Priority: Minor
> Fix For: 1.7.0
>
> Attachments: BaseParser.diff
>
>
> Hexadecimal strings (strings in '<','>') are parsed in BaseParser with the
> same method that parses literal strings (strings in '(',')'). Since in literal
> strings the parsing of escape sequences and parentheses is quite tricky,
> there are a number of rules to capture problematic cases. However, for
> hexadecimal strings this is not needed. Here we know the restricted set of
> allowed characters and we don't have to count opening and closing
> brackets.
> The problem with the relaxed parsing (and the reason this is marked as a bug)
> is with parsing documents containing trash data between objects (I have a
> number of them - however confidential ones - produced by verypdf.com, which
> seem to have been updated a lot and after an endobj contain e.g. <PrY... -
> simply some remnants of old objects). This trash would be no problem when
> parsing with an ISO conforming parser since these ranges would be ignored,
> but with the current sequential parser it is parsed, and the best one can
> hope is that the trash is found to be not parseable and the parser searches
> for a new starting point via PDFParser#skipToNextObj. This is where the
> problem with the relaxed hexadecimal parsing lies: as in the example, the
> opening '<' triggers hexadecimal string parsing, which because of later '<'
> characters continues until the end of the document, reading all valid objects
> as string content. With stricter parsing we would find at the second
> character that it is not a hexadecimal string.
> I have therefore added a method for parsing hexadecimal strings (see attached
> diff) which fails (IOException) if an invalid character is read (this method
> also runs faster than the previous parser since there are only a small number
> of cases to test).
> With this change I can now parse the mentioned (correct) documents (with
> forced parsing), which wasn't possible before.
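
For readers without access to the attached diff, the strict behaviour described
in the issue could look roughly like the sketch below. This is not the code
from BaseParser.diff. The class and method names (StrictHexString,
parseStrict) are hypothetical, and the sketch assumes the opening '<' has
already been consumed from the stream. It accepts only hex digits, PDF
white-space and the closing '>', and fails with an IOException on anything
else, so trash such as <PrY... is rejected at the first character after '<'.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not the patch itself: strict hexadecimal string parsing.
final class StrictHexString {

    /** Returns the value of a hex digit, or -1 if c is not a hex digit. */
    private static int hexValue(int c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        return -1;
    }

    /** PDF white-space characters: NUL, TAB, LF, FF, CR, SPACE. */
    private static boolean isPdfWhitespace(int c) {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }

    /**
     * Reads a hexadecimal string body up to and including the closing '>'.
     * Throws IOException on the first invalid character or on end of input.
     */
    static byte[] parseStrict(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int high = -1; // pending high nibble, -1 if none
        while (true) {
            int c = in.read();
            if (c == -1) {
                throw new IOException("End of input before closing '>'");
            }
            if (c == '>') {
                break;
            }
            if (isPdfWhitespace(c)) {
                continue;
            }
            int digit = hexValue(c);
            if (digit < 0) {
                throw new IOException("Invalid character in hexadecimal string: " + c);
            }
            if (high < 0) {
                high = digit;
            } else {
                out.write((high << 4) | digit);
                high = -1;
            }
        }
        // Odd number of digits: the missing final digit is treated as 0.
        if (high >= 0) {
            out.write(high << 4);
        }
        return out.toByteArray();
    }
}
```

Because the method gives up on the first invalid byte, only a few bytes of
trash are consumed before the caller can fall back to searching for the next
object, instead of the rest of the document being swallowed as string content.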