Tim Allison commented on PDFBOX-4141:

I'm inclined toward [~tilman]'s solution (extending {{PDFTextStripper}}) because I 
agree that different users will have different use cases for what is meant by 

bq.  I think it is important to talk about this topic.
+1 this is important...I agree!
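
To make the subclassing idea concrete, here is a minimal sketch of the filtering step only. The class name {{ControlCharFilter}} and its {{filter}} method are hypothetical helpers, not part of PDFBox, and the choice to keep tab, line feed and carriage return is an assumption about what most users would want. It covers both behaviours mentioned in the issue: dropping control characters, or replacing each one with a substitute string.

```java
// Hypothetical helper, not a PDFBox class: filters control characters from extracted text.
final class ControlCharFilter {

    private ControlCharFilter() {
    }

    /**
     * Returns the text with unwanted control characters handled.
     * If replacement is null the characters are dropped,
     * otherwise each one is replaced by the given string.
     */
    static String filter(String text, String replacement) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isUnwantedControl(cp)) {
                if (replacement != null) {
                    out.append(replacement);
                }
            } else {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    /** Assumption: tab, LF and CR are layout characters that callers want to keep. */
    static boolean isUnwantedControl(int cp) {
        if (cp == '\t' || cp == '\n' || cp == '\r') {
            return false;
        }
        // Covers U+0000..U+001F and U+007F..U+009F, which includes U+0095.
        return Character.isISOControl(cp);
    }
}
```

A {{PDFTextStripper}} subclass could then, as one possible approach, override {{writeString(String text, List<TextPosition> textPositions)}} and pass {{ControlCharFilter.filter(text, null)}} on to the superclass, so each user picks the policy that fits their use case.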

> Suppress control characters?
> ----------------------------
>                 Key: PDFBOX-4141
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4141
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 2.0.8
>            Reporter: Andreas Meier
>            Priority: Minor
>         Attachments: 000016.pdf, Mapping_default_to_adobe.csv, 
> Test_with_MW.pdf, Test_with_MW.txt, Test_with_MW_AdobeReader_export.txt, 
> Test_with_MW_linux.jpg, Test_without_MW.txt
> At the moment PDFBox extracts all types of characters.
> Therefore any control characters that occur will also be extracted.
> Unfortunately some of these control characters might deform the text.
> For example 'MESSAGE WAITING' (U+0095) [MW]
> I attached some files and a screenshot showing how text is printed when 
> MESSAGE WAITING is present.
> Should PDFBox handle this type of character? Maybe suppress them in 
> PDFTextStripper?
> I know that PDFBox works correctly in this case, but a feature to turn off or 
> suppress special characters might produce better output than the default 
> setting, unless some control characters are needed for further processing.
> Feedback appreciated.
> What other programs do:
> a) ignore control characters (Okular PDF Viewer - KDE)
> b) replace them (Adobe Reader wrote a dot "." in place of MW)
> Regards
> Andreas

