[
https://issues.apache.org/jira/browse/PDFBOX-3435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tilman Hausherr updated PDFBOX-3435:
------------------------------------
Attachment: PDFBOX-3435-20.txt, PDFBOX-3435-18.txt
> Text extraction - words on same line detection failing in 2.x
> -------------------------------------------------------------
>
> Key: PDFBOX-3435
> URL: https://issues.apache.org/jira/browse/PDFBOX-3435
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.0, 2.0.1, 2.0.2
> Reporter: Lee van Hooff
> Attachments: PDFBOX-3435-18.txt, PDFBOX-3435-20.txt,
> text-extraction-issues.pdf
>
>
> Extracting a line of text as it appears in the PDF no longer works in
> PDFBox 2.x: words that sit on the same line in the PDF are split across
> separate lines in the output.
> java -jar pdfbox-app-1.8.4.jar ExtractText -console -sort
> ~/Desktop/text-extraction-issues.pdf
> results in:
> {noformat}
> . . .
> Your Code Our Code Description
> Qty Price Ex Total Ex
> 11SP 100129630 IRWIN VICE-GRIP 11 C-CLAMP SWIVEL PAD
> 4 00.00 000.00
> IR-0352 100094584 IRWIN 600MM TOOL BAG
> 1 00.00 00.00
> EM81.9 100088913 EMPIRE TORPEDO LEVEL ALUMINIUM
> 1 00.00 00.00
> 20566-618R 100023443 LENOX RECIPRO BLADE 150X20X0.9MM 18TPI 5P
> 3 0.00 00.00
> . . .
> {noformat}
> while
> java -jar pdfbox-app-2.0.2.jar ExtractText -console -sort
> ~/Desktop/text-extraction-issues.pdf
> results in:
> {noformat}
> . . .
> Your Code Our Code Description
> Qty Price Ex Total Ex
> IRWIN VICE-GRIP 11 C-CLAMP SWIVEL PAD
> 11SP 100129630 4 00.00 000.00
> IRWIN 600MM TOOL BAG
> IR-0352 100094584 1 00.00 00.00
> EMPIRE TORPEDO LEVEL ALUMINIUM
> EM81.9 100088913 1 00.00 00.00
> LENOX RECIPRO BLADE 150X20X0.9MM 18TPI 5P
> 20566-618R 100023443 3 0.00 00.00
> . . .
> {noformat}
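
The difference between the two outputs above can be checked mechanically. The following is a minimal sketch, not part of PDFBox: the class name SameLineCheck and the method onSameLine are hypothetical helpers introduced here for illustration, and the sample strings are excerpts copied from the two outputs in the report. It only tests whether two fragments end up on one line of already-extracted text; it does not call PDFBox itself.

```java
import java.util.Arrays;

class SameLineCheck {

    /**
     * Returns true if at least one line of the extracted text
     * contains both fragments.
     */
    static boolean onSameLine(String extracted, String first, String second) {
        // "\\R" matches any line break (\n, \r\n, ...) in Java 8+ regex.
        return Arrays.stream(extracted.split("\\R"))
                .anyMatch(line -> line.contains(first) && line.contains(second));
    }

    public static void main(String[] args) {
        // Excerpt of the 1.8.4 output: code and description share a line.
        String from184 = "11SP 100129630 IRWIN VICE-GRIP 11 C-CLAMP SWIVEL PAD\n"
                + "4 00.00 000.00";
        // Excerpt of the 2.0.2 output: the description is on its own line.
        String from202 = "IRWIN VICE-GRIP 11 C-CLAMP SWIVEL PAD\n"
                + "11SP 100129630 4 00.00 000.00";

        System.out.println(onSameLine(from184, "100129630", "IRWIN VICE-GRIP")); // true
        System.out.println(onSameLine(from202, "100129630", "IRWIN VICE-GRIP")); // false
    }
}
```

A check like this could be run over the full ExtractText output of each version to confirm whether a given product code and its description are still reported on the same line.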