[ 
https://issues.apache.org/jira/browse/PDFBOX-4141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16389856#comment-16389856
 ] 

Tilman Hausherr commented on PDFBOX-4141:
-----------------------------------------

{quote}PDFBox determines the layout of the extracted text due to the 
positioning of the characters.
{quote}
Yes and no... PDFBox creates dummy spaces depending on the positions. Most PDFs 
don't contain explicit spaces, so heuristics are used to guess where one 
belongs. But only one space is added, so if there is a lot of room between two 
characters, the text extraction won't show it.

In standard mode, the text appears in the same order as in the page content 
stream. In sorted mode, it is ordered by position on the page.

There is no fixed rule that "unsorted is better" or "sorted is better".

If there are "article beads" (PDFBOX-3110), it gets even more complicated. See 
the three files PDFBOX-3110-poems-beads* in the source code download: open 
the PDF file in Adobe Acrobat, or run DrawPrintTextLocations on it and look at 
the green lines in the resulting image files. Most PDFs don't have beads.
{quote}what happens with existing the line Feeds (0x0A) in PDFBox
{quote}
These shouldn't even be in a PDF, because these aren't glyphs. If they do 
exist, then (I think) they will appear in the extracted text.

My problem is that I am very undecided about what to do, or whether to do 
anything at all. My current opinion is that it should be done by extending 
PDFTextStripper, because there's more than just "convert" or "don't convert".
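
One way to read "more than just convert or don't convert" is as a choice of policy. The sketch below is only an assumption about what such a cleanup could look like, not existing PDFBox behaviour: the class ControlCharCleaner, its Policy enum and the clean method are hypothetical names. It uses only the JDK; a PDFTextStripper subclass could call it from an overridden writeString(String, List<TextPosition>) before passing the text on. Keeping tab, line feed and carriage return is a design choice made for this example.

```java
// Hypothetical helper, not part of PDFBox.
final class ControlCharCleaner {

    // What to do with a control character found in extracted text.
    enum Policy { KEEP, REMOVE, REPLACE }

    // Returns text with C0/C1 control characters handled according to policy.
    // Tab, line feed and carriage return are always kept (an assumption of
    // this sketch). replacementCodePoint is only used for Policy.REPLACE.
    static String clean(String text, Policy policy, int replacementCodePoint) {
        if (policy == Policy.KEEP) {
            return text;
        }
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            boolean control = Character.isISOControl(cp)
                    && cp != '\t' && cp != '\n' && cp != '\r';
            if (!control) {
                sb.appendCodePoint(cp);
            } else if (policy == Policy.REPLACE) {
                sb.appendCodePoint(replacementCodePoint);
            }
            // Policy.REMOVE: drop the character silently.
        });
        return sb.toString();
    }
}
```

With Policy.REPLACE and '.' as the replacement, a U+0095 between two words becomes a dot, similar to what the reporter observed in Adobe Reader; with Policy.REMOVE it disappears, as in Okular.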

Most files don't have these control codes. I ran some test code on the test set 
of [digitalcorpora|http://digitalcorpora.org/corp/files/govdocs1/zipfiles/], 
and most "hits" came from files containing pure trash; the others were caused 
by incorrect font Unicode mappings.

For example, file 000016.pdf has the bytes EF 9B 99 in its extracted text, 
which is UTF-8 for U+F6D9; that should have been a (C) symbol.
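
The byte sequence can be checked with a few lines of Java. This is a minimal sketch using only the JDK; the class and method names are made up for illustration and are not part of PDFBox.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox.
final class Utf8Check {

    // Decodes UTF-8 bytes and returns the first code point of the result.
    static int firstCodePoint(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8).codePointAt(0);
    }

    // True if the code point is classified as private use by the JDK.
    static boolean isPrivateUse(int codePoint) {
        return Character.getType(codePoint) == Character.PRIVATE_USE;
    }
}
```

Calling Utf8Check.firstCodePoint with the bytes EF 9B 99 yields 0xF6D9, and Utf8Check.isPrivateUse reports true for it, which fits a font-specific mapping rather than a standard character.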

Maybe [~talli...@apache.org] can give some opinion on this, i.e. whether PDFBox 
should do some optional cleanup on text extraction, and whether an optional 
cleanup should always be the same.

> Suppress control characters?
> ----------------------------
>
>                 Key: PDFBOX-4141
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4141
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 2.0.8
>            Reporter: Andreas Meier
>            Priority: Minor
>         Attachments: 000016.pdf, Mapping_default_to_adobe.csv, 
> Test_with_MW.pdf, Test_with_MW.txt, Test_with_MW_AdobeReader_export.txt, 
> Test_with_MW_linux.jpg, Test_without_MW.txt
>
>
> At the moment pdfbox extracts all types of characters.
> Therefore control characters that occur will also be extracted.
> Unfortunately some of these control characters might deform text.
> For example 'MESSAGE WAITING' (U+0095) [MW]
> I attached some files and a screenshot how text is printed when MESSAGE 
> WAITING is present.
> Should PDFBox handle this type of characters? Maybe suppress them in 
> PDFTextStripper?
> I know that PDFBox works correctly in this case, a feature to turn off or 
> suppress special characters might produce better output than the default 
> Setting unless some control characters are used for any further processing!?
> Feedback appreciated.
> What other programs do:
> a) ignore control characters (Okular PDF Viewer - KDE)
> b) exchange them  (Adobe Reader wrote a dot "." in place of MW)
> Regards
> Andreas



