Parsing hexadecimal strings is not strict enough + FIX
------------------------------------------------------
Key: PDFBOX-1171
URL: https://issues.apache.org/jira/browse/PDFBOX-1171
Project: PDFBox
Issue Type: Bug
Components: Parsing
Affects Versions: 1.7.0
Reporter: Timo Boehme
Priority: Minor
Hexadecimal strings (strings in '<','>') are parsed in BaseParser with the same
method that parses literal strings (strings in '(',')'). Since parsing escape
sequences and parentheses in literal strings is quite tricky, there are a
number of rules to capture problematic cases. For hexadecimal strings, however,
this is not needed: we know the restricted set of allowed characters and
we don't have to count opening and closing brackets.
The problem with the relaxed parsing (and the reason this is marked as a bug)
shows up when parsing documents containing trash data between objects. I have a
number of them (confidential ones, however) produced by verypdf.com, which seem
to have been updated a lot; after an endobj they contain e.g. <PrY..., simply
some remnants of old objects. This trash would be no problem when parsing with
an ISO-conforming parser, since these ranges would be ignored. The current
sequential parser does parse it, though, and the best one can hope is that the
trash is found to be unparseable and the parser searches for a new starting
point via PDFParser#skipToNextObj. This is where the relaxed hexadecimal
parsing causes trouble: in the example, the opening '<' triggers hexadecimal
string parsing, and because of a later '<' the parser continues until the end
of the document, reading all valid objects as string content. With stricter
parsing we would find at the second character that this is not a hexadecimal
string.
I have therefore added a method for parsing hexadecimal strings (see attached
diff) which fails (IOException) if an invalid character is read. This method
also runs faster than the previous parser, since there are only a small number
of cases to test.
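
For readers without the attached diff, the idea can be illustrated with a
minimal sketch. This is not the patch itself: the class and method names are
hypothetical, and the handling of white space and of an odd number of digits
follows the PDF specification rather than anything stated in this report. The
opening '<' is assumed to have been consumed by the caller.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration only; not the code from the attached diff.
final class StrictHexStringSketch {

    /** Returns the value of a hex digit, or -1 if c is not a hex digit. */
    private static int hexValue(int c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /** PDF white-space characters: NUL, TAB, LF, FF, CR, SPACE. */
    private static boolean isPdfWhitespace(int c) {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }

    /**
     * Reads the body of a hexadecimal string up to and including '>'.
     * Fails with an IOException on the first character that is neither
     * a hex digit, white space, nor the closing '>'.
     */
    static byte[] parseHexString(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int high = -1; // pending high nibble, -1 if none
        while (true) {
            int c = in.read();
            if (c == -1) {
                throw new IOException("End of stream inside hexadecimal string");
            }
            if (c == '>') {
                break;
            }
            if (isPdfWhitespace(c)) {
                continue;
            }
            int digit = hexValue(c);
            if (digit < 0) {
                // Strict behaviour: give up at once instead of reading on.
                throw new IOException("Invalid character in hexadecimal string: " + c);
            }
            if (high < 0) {
                high = digit;
            } else {
                out.write((high << 4) | digit);
                high = -1;
            }
        }
        if (high >= 0) {
            // Odd number of digits: the missing last digit is taken as 0.
            out.write(high << 4);
        }
        return out.toByteArray();
    }
}
```

With trash such as <PrY..., a parser along these lines would throw at 'P',
the first character after '<', so the caller can fall back to searching for
the next object instead of consuming the rest of the document.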
With this change I can now parse the mentioned (correct) documents (with forced
parsing), which wasn't possible before.