Lee van Hooff created PDFBOX-3435:
-------------------------------------
Summary: Text extraction - words on same line detection failing in 2.x
Key: PDFBOX-3435
URL: https://issues.apache.org/jira/browse/PDFBOX-3435
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 2.0.2, 2.0.1, 2.0.0
Reporter: Lee van Hooff
Attachments: text-extraction-issues.pdf
Extracting a line of text as it appears in the PDF no longer works in
PDFBox 2.x: words that sit on the same line in the document end up on
different lines of the extracted output.
java -jar pdfbox-app-1.8.4.jar ExtractText -console -sort ~/Desktop/text-extraction-issues.pdf
results in:
{noformat}
. . .
Your Code Our Code Description
Qty Price Ex Total Ex
11SP 100129630 IRWIN VICE-GRIP 11 C-CLAMP SWIVEL PAD
4 00.00 000.00
IR-0352 100094584 IRWIN 600MM TOOL BAG
1 00.00 00.00
EM81.9 100088913 EMPIRE TORPEDO LEVEL ALUMINIUM
1 00.00 00.00
20566-618R 100023443 LENOX RECIPRO BLADE 150X20X0.9MM 18TPI 5P
3 0.00 00.00
. . .
{noformat}
while
java -jar pdfbox-app-2.0.2.jar ExtractText -console -sort ~/Desktop/text-extraction-issues.pdf
results in:
{noformat}
. . .
Your Code Our Code Description
Qty Price Ex Total Ex
IRWIN VICE-GRIP 11 C-CLAMP SWIVEL PAD
11SP 100129630 4 00.00 000.00
IRWIN 600MM TOOL BAG
IR-0352 100094584 1 00.00 00.00
EMPIRE TORPEDO LEVEL ALUMINIUM
EM81.9 100088913 1 00.00 00.00
LENOX RECIPRO BLADE 150X20X0.9MM 18TPI 5P
20566-618R 100023443 3 0.00 00.00
. . .
{noformat}
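To make the expected behaviour concrete, here is a minimal, self-contained sketch of same-line grouping: fragments whose vertical positions fall within a tolerance are treated as one line and ordered left to right. This is not PDFBox code and makes no claim about the cause of the regression. The LineGrouper and Fragment names, the tolerance parameter, and all coordinates are hypothetical, chosen only so that a looser tolerance gives output shaped like the 1.8.4 result and a tighter one detaches the description from its codes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative only: a toy same-line grouping, not PDFBox's implementation.
class LineGrouper {

    /** Hypothetical positioned text fragment; y grows downward from the page top. */
    static final class Fragment {
        final String text;
        final float x;
        final float y;

        Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups fragments into lines. A fragment joins the current line when its y
     * is within 'tolerance' of the first fragment of that line; otherwise it
     * starts a new line. Each line is then ordered left to right.
     */
    static List<String> group(List<Fragment> fragments, float tolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<List<Fragment>> rows = new ArrayList<>();
        float rowY = 0f;
        for (Fragment f : sorted) {
            if (rows.isEmpty() || Math.abs(f.y - rowY) > tolerance) {
                rows.add(new ArrayList<>());
                rowY = f.y; // the first fragment anchors the row
            }
            rows.get(rows.size() - 1).add(f);
        }

        List<String> lines = new ArrayList<>();
        for (List<Fragment> row : rows) {
            row.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            StringBuilder sb = new StringBuilder();
            for (Fragment f : row) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(f.text);
            }
            lines.add(sb.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Invented coordinates: the description sits 1.5 units above the codes.
        List<Fragment> fragments = Arrays.asList(
                new Fragment("11SP", 10f, 101.5f),
                new Fragment("100129630", 60f, 101.5f),
                new Fragment("IRWIN VICE-GRIP 11 C-CLAMP SWIVEL PAD", 120f, 100f),
                new Fragment("4", 10f, 112f),
                new Fragment("00.00", 60f, 112f),
                new Fragment("000.00", 110f, 112f));

        // Tolerance 2: codes and description share a line.
        for (String line : group(fragments, 2f)) {
            System.out.println(line);
        }
        System.out.println("--");
        // Tolerance 1: the description is split onto its own line.
        for (String line : group(fragments, 1f)) {
            System.out.println(line);
        }
    }
}
```

With the invented coordinates above, a tolerance of 2 prints the codes and description on one line followed by the quantity and prices, while a tolerance of 1 prints the description on its own line first.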
--
This message was sent by Atlassian JIRA (v6.3.4#6332)