[ https://issues.apache.org/jira/browse/PDFBOX-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr resolved PDFBOX-3019.
-------------------------------------
Resolution: Fixed
Assignee: Tilman Hausherr
> Unwanted spaces in text extraction
> ----------------------------------
>
> Key: PDFBOX-3019
> URL: https://issues.apache.org/jira/browse/PDFBOX-3019
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.0
> Reporter: Ben McCann
> Assignee: Tilman Hausherr
> Labels: regression
> Fix For: 2.0.0
>
> Attachments: jbl-example-com.pdf
>
>
> From testing on my internal dataset I believe there might be some regression
> in the effectiveness of PDFTextStripper.
> Here's an [example
> doc|http://rampages.us/rhodesc1/wp-content/uploads/sites/4737/2014/07/Resume-Connor-Rhodes.pdf]
> I found on the web, which converted better in 1.8 than in 2.0. Notice that 2.0
> extracts "J e a n e t t e A c o s t a ; S e r v i c e M a n a g e r a t
> M a d F o x B r e w i n g C o m p a n y". There doesn't seem to be
> very much space between the letters in the PDF, so I'm surprised that the
> extraction handled it so poorly.
> I realize this is an area where we probably can't strive for perfection. Yet,
> it does seem to me that from 1.8 to 2.0 we may have taken a step backwards. I
> believe there's some sort of regression test for PDFToImage which exports a
> set of PDFs to images at two different commits and reports the
> differences. Do we have the same sort of thing for PDFTextStripper? If
> not, can we build one by pulling docs off the public web? I'd be willing to
> contribute to this endeavor.
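
One way such a text-extraction regression check might flag output like the sample quoted above is to count runs of single letters separated by whitespace and compare the count between two builds. The following is only a rough sketch under stated assumptions: the class `LetterSpacingCheck`, its methods, and the threshold of four letters per run are all hypothetical and not part of PDFBox. The two input strings are assumed to come from running PDFTextStripper.getText() on the same document with each build.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
public class LetterSpacingCheck {

    // A run of at least four single letters, each separated by exactly one
    // whitespace character, e.g. "J e a n e t t e". The threshold of four is
    // an arbitrary assumption chosen to avoid flagging text such as "I a".
    private static final Pattern SPACED_RUN =
            Pattern.compile("(?<!\\S)\\p{L}(?:\\s\\p{L}){3,}(?!\\S)");

    /** Counts letter-spaced runs in a piece of extracted text. */
    public static int countSpacedRuns(String text) {
        Matcher m = SPACED_RUN.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    /**
     * Returns true if the candidate extraction contains more letter-spaced
     * runs than the baseline extraction of the same document.
     */
    public static boolean looksRegressed(String baseline, String candidate) {
        return countSpacedRuns(candidate) > countSpacedRuns(baseline);
    }

    public static void main(String[] args) {
        String baseline = "Jeanette Acosta; Service Manager";
        String candidate = "J e a n e t t e A c o s t a ; S e r v i c e M a n a g e r";
        System.out.println(looksRegressed(baseline, candidate));
    }
}
```

A heuristic like this would only catch the specific symptom reported here. A broader regression test would presumably still need a plain diff of the extracted text between the two commits.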
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)