[
https://issues.apache.org/jira/browse/PDFBOX-399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13875772#comment-13875772
]
Cheng Leong edited comment on PDFBOX-399 at 1/19/14 4:17 AM:
-------------------------------------------------------------
Submitting a patch that makes the CMap parser tolerate badly formatted
ToUnicode instructions. Some ToUnicode resource streams previously made the
parser throw exceptions that were silently swallowed; with the patch they
parse, so text extraction gets the correctly mapped characters.
Specifically, the patch parses a token directly followed by a hex value
(token<hex>) with no whitespace between them, skips all whitespace inside a
hex value, and returns the partially constructed CMap instead of throwing an
exception.
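To illustrate the three behaviours, here is a minimal standalone sketch. It is not the patch itself and does not use PDFBox's parser classes; the class LenientCMapSketch and its methods are hypothetical names made up for this example, and it only handles simple bfchar-style <src> <dst> pairs, assuming the destination is UTF-16BE hex.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not the PDFBox CMap parser.
final class LenientCMapSketch {
    private final String in;
    private int pos = 0;

    private LenientCMapSketch(String in) {
        this.in = in;
    }

    // Returns the next <...> hex string with inner whitespace removed,
    // or null when no further '<' exists in the input.
    private String nextHex() {
        // Searching for '<' directly means a preceding token such as
        // "beginbfchar" needs no separating whitespace.
        int open = in.indexOf('<', pos);
        if (open < 0) {
            return null;
        }
        int close = in.indexOf('>', open);
        if (close < 0) {
            throw new IllegalArgumentException("unterminated hex string");
        }
        pos = close + 1;
        // Skip all whitespace inside the hex value.
        String hex = in.substring(open + 1, close).replaceAll("\\s+", "");
        if (!hex.matches("([0-9A-Fa-f]{2})+")) {
            throw new IllegalArgumentException("bad hex string: " + hex);
        }
        return hex;
    }

    // Parses <src> <dst> pairs and returns whatever was mapped before the
    // first malformed entry, instead of propagating the exception.
    static Map<Integer, String> parseBfChars(String body) {
        LenientCMapSketch parser = new LenientCMapSketch(body);
        Map<Integer, String> map = new LinkedHashMap<>();
        try {
            String src;
            while ((src = parser.nextHex()) != null) {
                String dst = parser.nextHex();
                if (dst == null || dst.length() % 4 != 0) {
                    throw new IllegalArgumentException("bad destination");
                }
                // Decode the destination as UTF-16BE code units.
                StringBuilder text = new StringBuilder();
                for (int i = 0; i < dst.length(); i += 4) {
                    text.append((char) Integer.parseInt(dst.substring(i, i + 4), 16));
                }
                map.put(Integer.parseInt(src, 16), text.toString());
            }
        } catch (IllegalArgumentException e) {
            // Lenient mode: keep the partially constructed map.
        }
        return map;
    }

    public static void main(String[] args) {
        String body = "2 beginbfchar<01><00 41>\n<02> <0042>\n<03> <00ZZ>\nendbfchar";
        System.out.println(parseBfChars(body)); // prints {1=A, 2=B}
    }
}
```

The third entry in the sample input is deliberately malformed, so only the first two mappings survive. A strict parser would lose all three.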
I don't see a problem with the previous test case example (BlackHat...), but
I've modified the test case based on an example from the wild (also attached
to this issue as experienced_java_developer.pdf):
http://www.itsix.com/media/experienced_java_developer.pdf
Edit: I forgot to mention that this patch was developed against 1.8.3 but also
works on trunk.
> Gibberish Output
> ----------------
>
> Key: PDFBOX-399
> URL: https://issues.apache.org/jira/browse/PDFBOX-399
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Reporter: Sushil Duseja
> Attachments: BlackHat-DC-09-Marlinspike-Defeating-SSL.pdf,
> PDFBOX-399__Ignore_badly-formatted_CMap_ToUnicode_instructions.patch,
> experienced_java_developer.pdf
>
>
> While extracting text from a pdf file using PDFBox, I get garbage output
> (*À¾´»*) for a special text value "2007"; this text ("2007") is written in
> CLRDingbats font.
> Any pointer(s)?