[ 
https://issues.apache.org/jira/browse/PDFBOX-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256211#comment-17256211
 ] 

Michael Klink commented on PDFBOX-2138:
---------------------------------------

If you want a text extraction mechanism that extracts exactly the visible 
text, you have to do that (consider clipping paths) and more. In particular, 
you also have to take into account that text may be hidden under other 
content, e.g. paths or bitmaps, and that text may have the same color as the 
background to start with, or the colors may later be made equal by some 
operation in a funny blend mode. This involves some interesting decisions, 
such as how to deal with glyphs whose area is partially subject to clipping, 
covering, or color equalization.
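
To illustrate just the clipping part of that decision, here is a minimal 
sketch in plain Java (java.awt.geom only, no PDFBox calls). The class 
GlyphVisibility, the enum Visibility, and the method classify are hypothetical 
names made up for this illustration; it assumes you already have a glyph's 
bounding box and the current clipping area in the same coordinate space, and 
it ignores covering and color equalization entirely.

```java
import java.awt.geom.Area;
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox: classifies a glyph box against a clip area.
final class GlyphVisibility {

    enum Visibility { VISIBLE, PARTIAL, HIDDEN }

    // Both arguments are assumed to be in the same coordinate space.
    static Visibility classify(Rectangle2D glyphBox, Area clip) {
        // Entirely inside the clip: nothing is cut away.
        if (clip.contains(glyphBox)) {
            return Visibility.VISIBLE;
        }
        // Otherwise intersect the box with the clip and check what is left.
        Area overlap = new Area(glyphBox);
        overlap.intersect(clip);
        return overlap.isEmpty() ? Visibility.HIDDEN : Visibility.PARTIAL;
    }

    public static void main(String[] args) {
        Area clip = new Area(new Rectangle2D.Double(0, 0, 100, 100));
        System.out.println(classify(new Rectangle2D.Double(10, 10, 5, 8), clip));
        System.out.println(classify(new Rectangle2D.Double(95, 10, 10, 8), clip));
        System.out.println(classify(new Rectangle2D.Double(200, 200, 5, 8), clip));
    }
}
```

Even in this reduced form the PARTIAL case has no obvious answer: an 
extractor would have to pick some policy (keep, drop, or apply an area 
threshold) for glyphs that are only partly clipped.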

In my opinion, all this is beyond the scope of text extraction. Text extraction 
is more akin to copy & paste in Adobe Reader, which also catches invisible 
text. For example, in the document at hand, depending on how you select the 
text, you can copy & paste invisible text.

> Corrupted words when using PDFTextStripper
> ------------------------------------------
>
>                 Key: PDFBOX-2138
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2138
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.5, 1.8.6, 2.0.0
>         Environment: Windows 7 / 64 bit
>            Reporter: Walter Kehl
>            Priority: Major
>             Fix For: 4.0.0
>
>         Attachments: PDFBOX-2138-noClip.pdf, PDFBOX-2138-noClip.png, 
> PDFBOX-2138.pdf, PDFBOX-2138.txt, banking-banana-skins-2014.pdf, 
> banking-banana-skins-2014.txt
>
>
> I am using PDFTextStripper (embedded into another application) to get 
> the raw text of PDFs so far with good results but recently a PDF file 
> has appeared where the output of the PDFTextStripper was corrupted. I 
> got sentences like:
>
> "There is al o con ern that b nkers may be pushed to misprice risk 
> (No. 6) by the pres ures of c mpetition and an abunda ce of central b 
> nk-provided liquidity."
>
> Additionally some portions of text appear twice in the output: first 
> correctly and then corrupted. I have attached an output created with 
> PDFBox's command line options.
> If you compare lines 357-365 with lines 421-429 you see that it is 
> the same paragraph, first ok and then with characters missing. In the 
> original source this paragraph is unique.
> The same seems to happen for the other instances where text is corrupted.
> I also tried it directly on the command line with the same results: input and 
> output files are attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)