[ https://issues.apache.org/jira/browse/PDFBOX-4141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388222#comment-16388222 ]
Tilman Hausherr commented on PDFBOX-4141:
-----------------------------------------
What do the table columns mean? Should the code on the left be converted into
the code on the right? Your table contains the line "0020,Missing", but 0x20 is
a space. You don't want us to optionally remove spaces, do you?
I'm also wondering whether converting or removing such codes is always the
right thing to do. If such a code sits between two words, wouldn't it be better
to replace it with a space? If an application just wants to extract words, the
conversion wouldn't be needed at all, because such characters would already
count as separators.
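To make the alternatives concrete, here is a minimal sketch in plain Java of a
post-processing step that either removes control characters or replaces them
with a space (or a dot). The class and method names are hypothetical and not
part of PDFBox. It assumes that tab, line feed and carriage return should be
kept, and it relies on the Unicode category Cc, which does not include 0x20.

```java
public final class ControlCharFilter {
    private ControlCharFilter() {
    }

    /** Hypothetical helper: true for Cc characters other than tab, LF and CR. */
    static boolean isUnwantedControl(int codePoint) {
        if (codePoint == '\t' || codePoint == '\n' || codePoint == '\r') {
            return false; // assumption: keep the usual whitespace controls
        }
        // Character.CONTROL is the Unicode general category Cc (C0 and C1 controls).
        return Character.getType(codePoint) == Character.CONTROL;
    }

    /**
     * Replaces every unwanted control character with the given replacement.
     * Pass "" to remove, " " to treat it as a word separator, "." to mimic
     * the Adobe Reader export described in the issue.
     */
    static String replaceControls(String text, String replacement) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (isUnwantedControl(cp)) {
                sb.append(replacement);
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }
}
```

Such a filter could be applied to the string returned by
PDFTextStripper.getText(), so the default extraction stays unchanged and the
caller decides which replacement suits the application.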
> Suppress control characters?
> ----------------------------
>
> Key: PDFBOX-4141
> URL: https://issues.apache.org/jira/browse/PDFBOX-4141
> Project: PDFBox
> Issue Type: Improvement
> Components: Parsing
> Reporter: Andreas Meier
> Priority: Minor
> Attachments: Mapping_default_to_adobe.csv, Test_with_MW.pdf,
> Test_with_MW.txt, Test_with_MW_AdobeReader_export.txt,
> Test_with_MW_linux.jpg, Test_without_MW.txt
>
>
> At the moment PDFBox extracts all types of characters, so any control
> characters that occur are extracted as well.
> Unfortunately, some of these control characters can distort the output text,
> for example 'MESSAGE WAITING' (U+0095) [MW].
> I attached some files and a screenshot showing how text is printed when
> MESSAGE WAITING is present.
> Should PDFBox handle this type of characters? Maybe suppress them in
> PDFTextStripper?
> I know that PDFBox works correctly in this case, but a feature to turn off or
> suppress special characters might produce better output than the default
> setting, unless some control characters are needed for further processing.
> Feedback appreciated.
> What other programs do:
> a) ignore control characters (Okular PDF Viewer - KDE)
> b) exchange them (Adobe Reader wrote a dot "." in place of MW)
> Regards
> Andreas