Andreas Lehmkühler created PDFBOX-2035:
------------------------------------------
Summary: Ignore badly formatted toUnicode CMaps
Key: PDFBOX-2035
URL: https://issues.apache.org/jira/browse/PDFBOX-2035
Project: PDFBox
Issue Type: Bug
Components: Parsing, PDModel
Affects Versions: 1.8.4, 2.0.0
Reporter: Cheng Leong
Assignee: Andreas Lehmkühler
Fix For: 1.8.5, 2.0.0
Copied from PDFBOX-399:
Submitting a patch that tolerates badly formatted ToUnicode CMap instructions.
It lets the parser handle some ToUnicode resource streams that would otherwise
throw exceptions, which were then silently swallowed, so text extraction now
gets the correctly mapped characters.
Specifically, the patch parses a token that is immediately followed by a <hex>
value with no whitespace between them, skips all whitespace inside a hex value,
and returns a partially constructed CMap instead of throwing an exception.
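
For illustration only, the lenient behaviour described above might look
roughly like the sketch below. This is not the attached patch and not the
PDFBox parser API: the class name LenientCMapTokens and the method tokenize
are hypothetical, and the sketch assumes input without "<<" dictionary
delimiters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: a minimal lenient tokenizer.
final class LenientCMapTokens {

    // Splits CMap-like text into tokens; each "<...>" hex value is its own token.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '<') {
                // A '<' ends the pending token even without whitespace before it.
                flush(current, tokens);
                int end = input.indexOf('>', i);
                if (end < 0) {
                    // Unterminated hex value: return the partial result, do not throw.
                    return tokens;
                }
                StringBuilder hex = new StringBuilder("<");
                for (int j = i + 1; j < end; j++) {
                    char h = input.charAt(j);
                    // Drop any whitespace found inside the hex value.
                    if (!Character.isWhitespace(h)) {
                        hex.append(h);
                    }
                }
                tokens.add(hex.append('>').toString());
                i = end + 1;
            } else if (Character.isWhitespace(c)) {
                flush(current, tokens);
                i++;
            } else {
                current.append(c);
                i++;
            }
        }
        flush(current, tokens);
        return tokens;
    }

    // Moves the pending token, if any, into the result list.
    private static void flush(StringBuilder current, List<String> tokens) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }
}
```

With this sketch, an input such as "1 beginbfchar<00 41> <0041>endbfchar"
yields the tokens 1, beginbfchar, <0041>, <0041>, endbfchar, and a truncated
input such as "beginbfchar<00" yields only beginbfchar rather than failing.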
I don't see a problem with the previous test case example (BlackHat...), but
I've modified the test case based on an example found in the wild:
http://www.itsix.com/media/experienced_java_developer.pdf
Edit: I forgot to mention that this patch was developed against 1.8.3 but also
worked on trunk.