[ https://issues.apache.org/jira/browse/PDFBOX-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256057#comment-17256057 ]
Michael Klink commented on PDFBOX-2138:
---------------------------------------
There is no magic involved in these marked content sequences. What
determines which text is shown and which is not are the clipping path
definitions.
To demonstrate this I've removed clipping path definitions from the page
content stream: [^PDFBOX-2138-noClip.pdf]
As you can see, now multiple versions of the text are shown.
!PDFBOX-2138-noClip.png!
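
To illustrate what "removing clipping path definitions" can look like, here is a minimal sketch in plain Java. It is not the code used to produce the attached file and it does not use the PDFBox API. It assumes the page content stream is already available as decoded text and simply drops the clipping operators W and W*. The path construction in front of them then ends in a plain "n" and no longer clips anything. The class and method names are hypothetical. Inline images and comments are not handled. A real tool would work on properly parsed tokens rather than raw text.

```java
public class ClipStripSketch {

    /**
     * Returns the content stream text with the clipping operators
     * W and W* dropped. Literal strings "(...)" are copied verbatim,
     * so a "W" inside shown text is left alone.
     * Assumption: no inline images or comments in the stream.
     */
    static String stripClipOperators(String content) {
        StringBuilder out = new StringBuilder();
        int n = content.length();
        int i = 0;
        while (i < n) {
            char c = content.charAt(i);
            if (c == '(') {
                // Literal string: honor backslash escapes and nested parentheses.
                int start = i;
                int depth = 0;
                do {
                    char s = content.charAt(i);
                    if (s == '\\') {
                        i++; // skip the escaped character
                    } else if (s == '(') {
                        depth++;
                    } else if (s == ')') {
                        depth--;
                    }
                    i++;
                } while (i < n && depth > 0);
                out.append(content, start, Math.min(i, n));
            } else if (Character.isWhitespace(c)) {
                out.append(c);
                i++;
            } else {
                // Ordinary token: runs up to whitespace or the start of a string.
                int start = i;
                while (i < n
                        && !Character.isWhitespace(content.charAt(i))
                        && content.charAt(i) != '(') {
                    i++;
                }
                String token = content.substring(start, i);
                if (!token.equals("W") && !token.equals("W*")) {
                    out.append(token);
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String page = "q 0 0 10 10 re W n BT /F1 12 Tf (W is kept) Tj ET Q";
        System.out.println(stripClipOperators(page));
    }
}
```

With the clipping operators gone, every text-showing operation on the page becomes visible, which is why several overlapping versions of the text appear in the screenshot above.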
> Corrupted words when using PDFTextStripper
> ------------------------------------------
>
> Key: PDFBOX-2138
> URL: https://issues.apache.org/jira/browse/PDFBOX-2138
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.5, 1.8.6, 2.0.0
> Environment: Windows 7 / 64 bit
> Reporter: Walter Kehl
> Priority: Major
> Fix For: 4.0.0
>
> Attachments: PDFBOX-2138-noClip.pdf, PDFBOX-2138-noClip.png,
> PDFBOX-2138.pdf, PDFBOX-2138.txt, banking-banana-skins-2014.pdf,
> banking-banana-skins-2014.txt
>
>
> >> I am using PDFTextStripper (embedded into another application) to get
> >> the raw text of PDFs, so far with good results, but recently a PDF file
> >> has appeared where the output of the PDFTextStripper was corrupted. I
> >> got sentences like:
> >>
> >>
> >>
> >> "There is al o con ern that b nkers may be pushed to misprice risk
> >> (No. 6) by the pres ures of c mpetition and an abunda ce of central b
> >> nk-provided liquidity."
> > Additionally some portions of text appear
> > twice in the output: first correctly and then corrupted. I have
> > attached an output created with PDFBox's command line options.
> > If you compare lines 357-365 with lines 421-429 you see that it is
> > the same paragraph, first ok and then with characters missing. In the
> > original source this paragraph is unique.
> > The same seems to happen for the other instances where text is corrupted.
> I also tried it directly on the command line with the same results: input and
> output files are attached.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)