[ 
https://issues.apache.org/jira/browse/PDFBOX-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256106#comment-17256106
 ] 

Andreas Lehmkühler commented on PDFBOX-2138:
--------------------------------------------

[~mkl] thanks for the explanation. I had already seen those clipping path 
operations, but I hoped that there might be some other mechanism to choose the 
right marked content. Hence we have to adjust the extraction code to take the 
clipping path information into account so that the "invisible" text isn't 
extracted anymore.
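
As a rough illustration of the idea only (this is not PDFBox code; the Glyph 
class and the visibleText method are hypothetical names, and it assumes each 
glyph's bounding box and the active clipping path are already available in the 
same coordinate space), text that lies completely outside the clip could be 
dropped like this:

```java
import java.awt.Shape;
import java.awt.geom.Rectangle2D;
import java.util.List;

public class ClipAwareTextFilter {

    /** Hypothetical holder: a piece of text plus its bounding box on the page. */
    public static final class Glyph {
        final String text;
        final Rectangle2D bounds;

        public Glyph(String text, Rectangle2D bounds) {
            this.text = text;
            this.bounds = bounds;
        }
    }

    /**
     * Concatenates only those glyphs whose bounding box overlaps the clip.
     * Glyphs entirely outside the clip are treated as invisible and skipped.
     */
    public static String visibleText(List<Glyph> glyphs, Shape clip) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            // Shape.intersects tests the interior of the clip against the box
            if (clip.intersects(g.bounds)) {
                sb.append(g.text);
            }
        }
        return sb.toString();
    }
}
```

A real implementation would also have to track the clipping path through the 
graphics state as it changes and decide how to treat partially clipped glyphs; 
the overlap test above is just one possible choice.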

> Corrupted words when using PDFTextStripper
> ------------------------------------------
>
>                 Key: PDFBOX-2138
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2138
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.5, 1.8.6, 2.0.0
>         Environment: Windows 7 / 64 bit
>            Reporter: Walter Kehl
>            Priority: Major
>             Fix For: 4.0.0
>
>         Attachments: PDFBOX-2138-noClip.pdf, PDFBOX-2138-noClip.png, 
> PDFBOX-2138.pdf, PDFBOX-2138.txt, banking-banana-skins-2014.pdf, 
> banking-banana-skins-2014.txt
>
>
> >> I am using PDFTextStripper (embedded into another application) to get 
> >> the raw text of PDFs so far with good results but recently a PDF file 
> >> has appeared where the output of the PDFTextStripper was corrupted. I 
> >> got sentences like:
> >>
> >> "There is al o con ern that b nkers may be pushed to misprice risk 
> >> (No. 6) by the pres ures of c mpetition and an abunda ce of central b 
> >> nk-provided liquidity."
> > Additionally some portions of text appear 
> > twice in the output: first correctly and then corrupted. I have 
> > attached an output created with PDFBox's command line options.
> > If you compare lines 357-365 with lines 421-429 you see that it is 
> > the same paragraph, first ok and then with characters missing. In the 
> > original source this paragraph is unique.
> > The same seems to happen for the other instances where text is corrupted.
> I also tried it directly on the command line with the same results: input and 
> output files are attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
