[jira] [Commented] (PDFBOX-4141) Suppress control characters?

Andreas Meier (JIRA) Tue, 06 Mar 2018 23:19:03 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-4141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16389166#comment-16389166
 ]


Andreas Meier commented on PDFBOX-4141:
---------------------------------------

{quote}What is the meaning of the table columns? Convert code to the left into 
code to the right?{quote}

Yes, the first column specifies the code point embedded in the PDF, and the 
second column is the code point that is returned when you copy the text from 
within Adobe Reader and paste it into a plain text file.

{quote}But in your table, there is the line "0020,Missing". 0x20 is a space. 
You don't want us to optionally remove the space, do you?{quote}

No. Overall I would suggest leaving most of the C0 codes (in the upper half 
of the file), as well as space (0x20) and DEL (0x7F), unchanged and only 
changing some of the C1 codes (in the lower half of the file).

{quote}If such a code is between two words, then wouldn't it be better to 
replace it with a space?{quote}

Yes, I think this might be a better solution than printing a dot (0x2E) like 
Adobe does.
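
To make the suggestion concrete, here is a minimal sketch in plain Java of the 
kind of filter I have in mind. It is not PDFBox code: the class and method 
names are hypothetical, and it assumes that "C1 codes" means U+0080 to U+009F 
and that every one of them is replaced, although only some may need it. C0 
codes, space (0x20) and DEL (0x7F) pass through unchanged.

```java
// Hypothetical helper, not part of PDFBox: a sketch of the proposed policy.
final class C1ControlFilter {
    private C1ControlFilter() {
    }

    /** Returns true for the C1 control codes U+0080..U+009F (assumed range). */
    static boolean isC1Control(int codePoint) {
        return codePoint >= 0x80 && codePoint <= 0x9F;
    }

    /**
     * Replaces every C1 control code with a space. C0 codes, space (0x20)
     * and DEL (0x7F) are copied to the output unchanged.
     */
    static String replaceC1WithSpace(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints()
            .forEach(cp -> out.appendCodePoint(isC1Control(cp) ? ' ' : cp));
        return out.toString();
    }
}
```

Such a filter could be applied to the string returned by the text extraction, 
so the default behaviour would stay as it is for anyone who needs the raw 
control characters.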



The problem I see with some of the control characters is that they might break 
the layout of the extracted text.
0x0C (form feed), for example, might force a page break if the extracted text 
is copied into a text processing program that interprets the control character.
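
To check whether an extracted text contains such characters at all, a small 
diagnostic like the following could help. This is only a sketch in plain Java 
with hypothetical names, not something PDFBox provides. It assumes that line 
feed, carriage return and tab are wanted and reports every other control 
character with its position.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic, not part of PDFBox.
final class ControlCharReport {
    private ControlCharReport() {
    }

    /**
     * Lists control characters (C0, DEL and C1) found in the text,
     * skipping line feed, carriage return and tab.
     */
    static List<String> find(String text) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean layoutChar = c == '\n' || c == '\r' || c == '\t';
            if (Character.isISOControl(c) && !layoutChar) {
                hits.add(String.format("index %d: U+%04X", i, (int) c));
            }
        }
        return hits;
    }
}
```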


If I am correctly informed, PDFBox determines the layout of the extracted text 
from the positioning of the characters. (Please correct me if I am wrong, 
Tilman.) If this is the case, some of the control characters might not be 
needed in the extracted text, since they might have been added accidentally by 
the program that created the PDF.


I am not that familiar with the internals of PDFBox. What happens to existing 
line feeds (0x0A) in PDFBox?


I don't think that PDFBox should act exactly like Adobe Reader, but there might 
be a reason why Adobe replaces some of the control characters in the output 
text.


Maybe I just see a problem where there is none, but I think it is important to 
discuss this topic.

> Suppress control characters?
> ----------------------------
>
>                 Key: PDFBOX-4141
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4141
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>            Reporter: Andreas Meier
>            Priority: Minor
>         Attachments: Mapping_default_to_adobe.csv, Test_with_MW.pdf, 
> Test_with_MW.txt, Test_with_MW_AdobeReader_export.txt, 
> Test_with_MW_linux.jpg, Test_without_MW.txt
>
>
> At the moment pdfbox extracts all types of characters.
> Therefore control characters that occur will also be extracted.
> Unfortunately some of these control characters might deform text.
> For example 'MESSAGE WAITING' (U+0095) [MW]
> I attached some files and a screenshot how text is printed when MESSAGE 
> WAITING is present.
> Should PDFBox handle this type of characters? Maybe suppress them in 
> PDFTextStripper?
> I know that PDFBox works correctly in this case, a feature to turn off or 
> suppress special characters might produce better output than the default 
> Setting unless some control characters are used for any further processing!?
> Feedback appreciated.
> What other programs do:
> a) ignore control characters (Okular PDF Viewer - KDE)
> b) exchange them  (Adobe Reader wrote a dot "." in place of MW)
> Regards
> Andreas



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

