[jira] [Created] (PDFBOX-1171) Parsing hexadecimal strings is not strict enough + FIX

Timo Boehme (Created) (JIRA) Tue, 15 Nov 2011 07:31:16 -0800

Parsing hexadecimal strings is not strict enough + FIX
------------------------------------------------------


                 Key: PDFBOX-1171
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1171
             Project: PDFBox
          Issue Type: Bug
          Components: Parsing
    Affects Versions: 1.7.0
            Reporter: Timo Boehme
            Priority: Minor


Hexadecimal strings (strings in '<','>') are parsed in BaseParser with the same 
method that parses literal strings (strings in '(',')'). Since parsing escape 
sequences and parentheses in literal strings is quite tricky, there are a 
number of rules to capture problematic cases. For hexadecimal strings, however, 
this is not needed: here we know the restricted set of allowed characters and 
we don't have to count opening and closing brackets.

The problem with the relaxed parsing (and the reason this is marked as a bug) 
arises when parsing documents containing trash data between objects. (I have a 
number of them, though confidential ones, produced by verypdf.com; they seem to 
have been updated a lot, and after an endobj they contain e.g. <PrY... - simply 
some remnants of old objects.) This trash would be no problem when parsing with 
an ISO conforming parser, since these ranges would be ignored. The current 
sequential parser, however, parses it, and the best one can hope for is that 
the trash is found to be unparseable and the parser searches for a new starting 
point via PDFParser#skipToNextObj. This is where the relaxed hexadecimal 
parsing causes trouble: as in the example, the opening '<' triggers hexadecimal 
string parsing, and because of a later '<' the parser continues until the end 
of the document, reading all valid objects as string content. With stricter 
parsing we would find at the second character that it is not a hexadecimal 
string.

I have therefore added a method for parsing hexadecimal strings (see attached 
diff) which fails (IOException) if an invalid character is read (this method 
also runs faster than the previous parser since there are only a small number 
of cases to test).
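
For readers without the attachment, the idea can be sketched roughly as 
follows. This is not the attached patch and not PDFBox code: the class name 
StrictHexStringSketch and the methods parseHexString and hexValue are 
hypothetical, and the sketch assumes the opening '<' has already been consumed 
and that the usual PDF whitespace characters are allowed between digits.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; names do not come from PDFBox or the patch.
class StrictHexStringSketch {

    /** Returns 0..15 for a hex digit, or -1 for any other character. */
    static int hexValue(int c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /** PDF whitespace: NUL, TAB, LF, FF, CR, SPACE. */
    static boolean isPdfWhitespace(int c) {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }

    /**
     * Reads the body of a hexadecimal string up to the closing '>'.
     * Assumes the opening '<' was already consumed by the caller.
     * Fails with an IOException on the first invalid character, so trash
     * such as "<PrY..." is rejected immediately instead of being read
     * until the end of the document.
     */
    public static byte[] parseHexString(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int highNibble = -1; // first digit of a pair, or -1 if none pending
        while (true) {
            int c = in.read();
            if (c == -1) {
                throw new IOException("End of stream inside hexadecimal string");
            }
            if (c == '>') {
                break;
            }
            if (isPdfWhitespace(c)) {
                continue;
            }
            int digit = hexValue(c);
            if (digit < 0) {
                throw new IOException("Invalid character in hexadecimal string: " + (char) c);
            }
            if (highNibble < 0) {
                highNibble = digit;
            } else {
                out.write((highNibble << 4) | digit);
                highNibble = -1;
            }
        }
        if (highNibble >= 0) {
            // Odd number of digits: the missing last digit is treated as 0.
            out.write(highNibble << 4);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] ok = parseHexString(new ByteArrayInputStream(
                "48656C6C6F>".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(new String(ok, StandardCharsets.US_ASCII));
        try {
            parseHexString(new ByteArrayInputStream(
                    "PrY>".getBytes(StandardCharsets.US_ASCII)));
        } catch (IOException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Because the only accepted inputs are hex digits, whitespace and '>', a caller 
can catch the IOException right after the '<' and fall back to searching for 
the next object, rather than consuming the rest of the file.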

With this change I can now parse the mentioned (correct) documents (with forced 
parsing) which wasn't possible before.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
