[ 
https://issues.apache.org/jira/browse/PDFBOX-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13562602#comment-13562602
 ] 

Thomas Chojecki commented on PDFBOX-1171:
-----------------------------------------

I'm running into some problems with this patch. We have some documents with 
manipulated COSStrings, and the parser hits the IOException, which breaks 
the whole parsing process. I know that these are malformed documents, but letting 
the parser crash IMO isn't the best solution. Maybe returning an empty string or 
parsing as many bytes as possible would be a better way.
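
To illustrate the "parse as many bytes as possible" idea, here is a minimal 
sketch. It is not PDFBox code: the class LenientHexStringSketch and its method 
are hypothetical names, and it assumes the opening '<' has already been 
consumed. On an invalid character or a premature end of stream it returns the 
complete bytes decoded so far instead of throwing.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of PDFBox.
final class LenientHexStringSketch {

    /** Returns the value of a hex digit, or -1 if c is not a hex digit. */
    static int hexValue(int c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /** PDF whitespace: NUL, TAB, LF, FF, CR, SPACE. */
    static boolean isWhitespace(int c) {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }

    /**
     * Decodes the body of a hex string (the opening '<' is already consumed).
     * Stops at '>', at the first invalid character, or at end of stream.
     * An IOException can only come from the underlying stream itself.
     */
    static byte[] parseHexStringLenient(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int high = -1; // pending high nibble, -1 if none
        int c;
        while ((c = in.read()) != -1) {
            if (c == '>') {
                // Regular end: an odd trailing digit is padded with 0.
                if (high >= 0) out.write(high << 4);
                return out.toByteArray();
            }
            if (isWhitespace(c)) continue;
            int digit = hexValue(c);
            if (digit < 0) break; // malformed: give up, keep what we have
            if (high < 0) {
                high = digit;
            } else {
                out.write((high << 4) | digit);
                high = -1;
            }
        }
        // Malformed or truncated: a dangling nibble is dropped.
        return out.toByteArray();
    }
}
```

With this behaviour, the body "4865ZZ6C>" yields the two bytes 0x48 0x65, and 
trash such as "PrY..." yields an empty result. The caller would still have to 
decide where parsing resumes after a malformed string.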
                
> Parsing hexadecimal strings is not strict enough + FIX
> ------------------------------------------------------
>
>                 Key: PDFBOX-1171
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1171
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.7.0
>            Reporter: Timo Boehme
>            Assignee: Andreas Lehmkühler
>            Priority: Minor
>             Fix For: 1.7.0
>
>         Attachments: BaseParser.diff
>
>
> Hexadecimal strings (strings in '<','>') are parsed in BaseParser with the 
> same method that parses literal strings (strings in '(',')'). Since in literal 
> strings the parsing of escape sequences and parentheses is quite tricky, 
> there are a number of rules to capture problematic cases. However, for 
> hexadecimal strings this is not needed. Here we know the allowed 
> restricted character set and we don't have to count opening and closing 
> brackets.
> The problem with the relaxed parsing (and the reason this is marked as a bug) 
> arises when parsing documents containing trash data between objects (I have a 
> number of them - however confidential ones - produced by verypdf.com, which 
> seem to have been updated a lot, and after an endobj they contain e.g. 
> <PrY... - simply some remains of old objects). This trash would be no problem 
> when parsing with an ISO-conforming parser since these ranges would be 
> ignored, but with the current sequential parser it is parsed, and the best 
> one can hope is that the trash is found to be unparseable and the parser 
> searches for a new starting point via PDFParser#skipToNextObj. This is where 
> the problem with the relaxed hexadecimal parsing lies: as in the example, the 
> opening '<' triggers hexadecimal string parsing, and because of a later '<' 
> it goes on until the end of the document, reading all valid objects as string 
> content. With stricter parsing we would find at the second character that it 
> is not a hexadecimal string.
> I have therefore added a method for parsing hexadecimal strings (see attached 
> diff) which fails (IOException) if an invalid character is read (this method 
> also runs faster than the previous parser since there are only a small number 
> of cases to test).
> With this change I can now parse the mentioned (correct) documents (with 
> forced parsing), which wasn't possible before.
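
For reference, the strict behaviour described in the issue could look roughly 
like the sketch below. This is not the code from BaseParser.diff: the class 
StrictHexStringSketch and its method are hypothetical names, and it assumes 
the opening '<' has already been consumed. Only hex digits, PDF whitespace and 
the closing '>' are accepted; anything else fails with an IOException.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not the method from the attached diff.
final class StrictHexStringSketch {

    /** Returns the value of a hex digit, or -1 if c is not a hex digit. */
    static int hexValue(int c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /** PDF whitespace: NUL, TAB, LF, FF, CR, SPACE. */
    static boolean isWhitespace(int c) {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }

    /**
     * Decodes the body of a hex string (the opening '<' is already consumed).
     * Throws IOException on the first character that cannot be part of a
     * hex string, or if the stream ends before the closing '>'.
     */
    static byte[] parseHexString(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int high = -1; // pending high nibble, -1 if none
        while (true) {
            int c = in.read();
            if (c == -1) {
                throw new IOException("End of stream inside hexadecimal string");
            }
            if (c == '>') break;
            if (isWhitespace(c)) continue;
            int digit = hexValue(c);
            if (digit < 0) {
                throw new IOException("Invalid character in hexadecimal string: " + c);
            }
            if (high < 0) {
                high = digit;
            } else {
                out.write((high << 4) | digit);
                high = -1;
            }
        }
        // An odd number of digits is completed with a trailing 0.
        if (high >= 0) out.write(high << 4);
        return out.toByteArray();
    }
}
```

For trash such as <PrY... the sketch fails on the first character after '<' 
because 'P' is not a hex digit, so the parser can look for a new starting 
point right away instead of reading on to the end of the document.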

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
