Andreas Lehmkühler created PDFBOX-2035:
------------------------------------------

             Summary: Ignore badly formatted toUnicode CMaps
                 Key: PDFBOX-2035
                 URL: https://issues.apache.org/jira/browse/PDFBOX-2035
             Project: PDFBox
          Issue Type: Bug
          Components: Parsing, PDModel
    Affects Versions: 1.8.4, 2.0.0
            Reporter: Cheng Leong
            Assignee: Andreas Lehmkühler
             Fix For: 1.8.5, 2.0.0


Copied from PDFBOX-399:

Submitting a patch that ignores badly formatted ToUnicode CMap instructions.
With the patch, some ToUnicode resource streams that previously threw
exceptions (which were silently consumed) can be parsed, so text extraction
gets the correctly mapped characters.

Specifically, the patch parses a token immediately followed by a <hex> string
with no whitespace separating them, skips all whitespace inside a hex value,
and returns a partially constructed CMap instead of throwing an exception.
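
To make the three behaviours concrete, here is a minimal, self-contained
sketch. It is not the patch itself and does not use PDFBox's parser classes:
the class LenientCMapSketch and its methods are hypothetical names invented
for illustration, and it only understands beginbfchar/endbfchar blocks
(no dictionaries, ranges or comments).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not PDFBox code.
class LenientCMapSketch {

    // Splits the text into tokens. A '<' always starts a new hex token, even
    // when it directly follows a keyword, and whitespace inside <...> is dropped.
    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '<') {
                flush(word, tokens);
                StringBuilder hex = new StringBuilder("<");
                i++;
                while (i < src.length() && src.charAt(i) != '>') {
                    char h = src.charAt(i++);
                    if (!Character.isWhitespace(h)) {
                        hex.append(h);
                    }
                }
                i++; // step over '>' (or past the end of the input)
                tokens.add(hex.append('>').toString());
            } else if (Character.isWhitespace(c)) {
                flush(word, tokens);
                i++;
            } else {
                word.append(c);
                i++;
            }
        }
        flush(word, tokens);
        return tokens;
    }

    private static void flush(StringBuilder word, List<String> tokens) {
        if (word.length() > 0) {
            tokens.add(word.toString());
            word.setLength(0);
        }
    }

    // Returns the hex digits between '<' and '>', or fails for non-hex tokens.
    private static String hexBody(String token) {
        if (!token.startsWith("<") || !token.endsWith(">")) {
            throw new IllegalArgumentException("expected <hex>, got " + token);
        }
        return token.substring(1, token.length() - 1);
    }

    // Decodes a hex string as UTF-16BE code units, four hex digits each.
    private static String decodeUtf16(String hex) {
        if (hex.length() % 4 != 0) {
            throw new IllegalArgumentException("bad UTF-16 hex: " + hex);
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < hex.length(); i += 4) {
            sb.append((char) Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return sb.toString();
    }

    // Collects bfchar mappings. On malformed input it stops and returns the
    // mappings gathered so far instead of propagating the exception.
    static Map<Integer, String> parseBfChars(String src) {
        Map<Integer, String> map = new LinkedHashMap<>();
        try {
            List<String> t = tokenize(src);
            for (int i = 0; i < t.size(); i++) {
                if (!t.get(i).equals("beginbfchar")) {
                    continue;
                }
                i++;
                while (i < t.size() && !t.get(i).equals("endbfchar")) {
                    int code = Integer.parseInt(hexBody(t.get(i)), 16);
                    String unicode = decodeUtf16(hexBody(t.get(i + 1)));
                    map.put(code, unicode);
                    i += 2;
                }
            }
        } catch (RuntimeException e) {
            // Badly formatted instruction: keep the partial map.
        }
        return map;
    }

    public static void main(String[] args) {
        String cmap = "2 beginbfchar<01><0041>\n<02> <00 42>\nendbfchar";
        System.out.println(parseBfChars(cmap));
    }
}
```

With the sample input in main, "beginbfchar<01>" is split into a keyword and
a hex token, and "<00 42>" is read as 0042, so codes 1 and 2 map to "A" and
"B". If a later entry is malformed, the entries parsed before it are still
returned.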

I don't see a problem with the previous test case example (BlackHat...) but 
I've modified the test case based on an example from the wild: 
http://www.itsix.com/media/experienced_java_developer.pdf

Edit: I forgot to mention that this patch was developed against 1.8.3, but it
also works on trunk.




--
This message was sent by Atlassian JIRA
(v6.2#6252)
