Andreas Meier created PDFBOX-4141:
-------------------------------------

             Summary: Suppress control characters?
                 Key: PDFBOX-4141
                 URL: https://issues.apache.org/jira/browse/PDFBOX-4141
             Project: PDFBox
          Issue Type: Improvement
          Components: Parsing
            Reporter: Andreas Meier


At the moment PDFBox extracts all types of characters, so any control
characters that occur in the text are extracted as well.

Unfortunately, some of these control characters can garble the extracted text,
for example 'MESSAGE WAITING' (U+0095) [MW].

I attached some files and a screenshot showing how the text is printed when
MESSAGE WAITING is present.

Should PDFBox handle these types of characters? Maybe suppress them in
PDFTextStripper?

I know that PDFBox works correctly in this case. Still, an option to turn off
or suppress special characters might produce better output than the default
setting, unless some control characters are needed for further processing.


Feedback appreciated.


What other programs do:
a) ignore control characters (Okular PDF Viewer - KDE)
b) replace them (Adobe Reader wrote a dot "." in place of MW)
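
To make the two behaviours above concrete, here is a minimal sketch in plain
Java. It is not PDFBox code: the class ControlCharFilter and its methods are
hypothetical names, and the choice to keep tab, line feed and carriage return
is an assumption. It relies only on Character.isISOControl, which covers the
C0 and C1 ranges (including U+0095).

```java
public class ControlCharFilter {

    // Hypothetical helper, not part of PDFBox: true for control characters
    // other than tab, line feed and carriage return (assumed worth keeping).
    static boolean isSuppressible(int codePoint) {
        if (codePoint == '\t' || codePoint == '\n' || codePoint == '\r') {
            return false;
        }
        return Character.isISOControl(codePoint);
    }

    // Behaviour a): drop the control characters entirely.
    static String strip(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints()
            .filter(cp -> !isSuppressible(cp))
            .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    // Behaviour b): put a visible replacement where a control character was.
    static String replace(String text, char replacement) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints()
            .map(cp -> isSuppressible(cp) ? replacement : cp)
            .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        String extracted = "Hello\u0095World"; // contains MESSAGE WAITING
        System.out.println(strip(extracted));        // HelloWorld
        System.out.println(replace(extracted, '.')); // Hello.World
    }
}
```

Such a filter could presumably be applied to the extracted string afterwards,
or inside a PDFTextStripper subclass; where exactly it belongs is part of the
question.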


Regards

Andreas



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
