[ 
https://issues.apache.org/jira/browse/PDFBOX-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler resolved PDFBOX-1171.
----------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.7.0
         Assignee: Andreas Lehmkühler

I added the proposed patch in revision 1209127 with some minor changes. I also 
refactored the parseCOSString method.

Thaks for the controbution
                
> Parsing hexadecimal strings is not strict enough + FIX
> ------------------------------------------------------
>
>                 Key: PDFBOX-1171
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1171
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.7.0
>            Reporter: Timo Boehme
>            Assignee: Andreas Lehmkühler
>            Priority: Minor
>             Fix For: 1.7.0
>
>         Attachments: BaseParser.diff
>
>
> Hexadecimal strings (strings in '<','>') are parsed in BaseParser with the 
> same method parsing literal strings (strings in '(',')'). Since in literal 
> strings the parsing of escape sequences and parentheses is quite tricky, 
> there are a number of rules to capture problematic cases. However for 
> hexadecimal strings this is not needed. Here we known of the allowed 
> restricted character set and we don't have to count opening and closing 
> brackets.
> The problem with the relaxed parsing (and therefore this is marked as bug) is 
> with parsing documents containing trash data between objects (I have a number 
> of them - however confidential ones - produced by verypdf.com, which seems to 
> got updated a lot and after an endobj it contains e.g. <PrY... - simply some 
> remainings from old objects). This trash would be no problem when parsing 
> with an ISO conforming parser since these ranges would be ignored, but with 
> the current sequential parser it is parsed and the best one can hope is that 
> the trash is found to be not parseable and the parser searches for a new 
> starting point via PDFParser#skipToNextObj. This is now where the problem 
> with the relaxed hexadecimal parsing is: as in the example the opening '<' 
> triggers a hexadecimal string parsing it because of later '<' it goes until 
> end of document, reading all valid objects as string content. With a more 
> strict parsing we would find that it is not a hexadecimal string with the 
> second character.
> I have therefore added a method for parsing hexadecimal strings (see attached 
> diff) which fails (IOException) if an invalid character is read (this method 
> also runs faster than previous parser since there are only small number of 
> cases to test).
> With this change I can now parse the mentioned (correct) documents (with 
> forced parsing) which wasn't possible before.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira


Reply via email to