Andreas Meier created PDFBOX-4141:
-------------------------------------
Summary: Suppress control characters?
Key: PDFBOX-4141
URL: https://issues.apache.org/jira/browse/PDFBOX-4141
Project: PDFBox
Issue Type: Improvement
Components: Parsing
Reporter: Andreas Meier
At the moment PDFBox extracts all types of characters, so any control
characters that occur in a document are extracted as well.
Unfortunately, some of these control characters can garble the extracted text.
For example, 'MESSAGE WAITING' (U+0095) [MW].
I attached some files and a screenshot showing how text is printed when
MESSAGE WAITING is present.
Should PDFBox handle these kinds of characters? Maybe suppress them in
PDFTextStripper?
I know that PDFBox works correctly in this case, but an option to turn off or
suppress special characters might produce better output than the default
setting, unless some control characters are needed for further processing.
Feedback appreciated.
What other programs do:
a) ignore control characters (Okular PDF Viewer - KDE)
b) replace them (Adobe Reader writes a dot "." in place of MW)
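
To make the two behaviours above concrete, here is a minimal sketch in plain
Java. It is not part of PDFBox: the class name ControlCharFilter and its
methods are hypothetical, and it assumes the filter would be applied to
already extracted text (for example the String returned by
PDFTextStripper.getText(PDDocument)). It also assumes that tab, line feed and
carriage return should be kept, since they carry layout.

```java
// Hypothetical helper, not a PDFBox API: filters control characters
// out of text that has already been extracted.
final class ControlCharFilter {
    private ControlCharFilter() {
    }

    // Variant a): drop control characters entirely.
    static String suppress(String text) {
        return replace(text, "");
    }

    // Variant b): substitute each control character with a replacement string.
    static String replace(String text, String replacement) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Assumption: tab, LF and CR are layout characters worth keeping.
            boolean layout = c == '\t' || c == '\n' || c == '\r';
            // isISOControl covers U+0000..U+001F and U+007F..U+009F,
            // which includes MESSAGE WAITING (U+0095).
            if (Character.isISOControl(c) && !layout) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

ControlCharFilter.suppress(text) would correspond to a), and
ControlCharFilter.replace(text, ".") to b). Whether such filtering belongs
inside PDFTextStripper or in a post-processing step is left open here.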
Regards
Andreas
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)