[
https://issues.apache.org/jira/browse/PDFBOX-3435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408055#comment-15408055
]
Tilman Hausherr edited comment on PDFBOX-3435 at 8/4/16 4:27 PM:
-----------------------------------------------------------------
The cause of the problem was that the fonts in your PDF have a (0 0 0 0)
bounding box, and the text was not at exactly the same height, so PDFBox
thought it was on different lines. The modification is to use the CapHeight in
this case. I have attached the text extraction output from both versions
(oops, I didn't use the sort option, but the output seems to be the same).
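
To illustrate why a (0 0 0 0) bounding box can split one visual line into two,
here is a minimal Java sketch. It is not the actual PDFBox code: the class and
method names are hypothetical, the overlap test is a simplified stand-in, and
the CapHeight of 700 and the baselines are made-up example values.

```java
// Hypothetical sketch, not PDFBox source: shows how a zero-height font
// bounding box defeats an overlap-based "same line" test, and how falling
// back to CapHeight restores it.
final class LineOverlapSketch {

    /** Glyph height in 1/1000 em units; uses CapHeight when the bbox is degenerate. */
    static float effectiveHeight(float bboxLowerY, float bboxUpperY, float capHeight) {
        float bboxHeight = bboxUpperY - bboxLowerY;
        return bboxHeight > 0 ? bboxHeight : capHeight;
    }

    /** Two text runs share a line if their vertical extents overlap (y grows upward). */
    static boolean sameLine(float baselineA, float heightA, float baselineB, float heightB) {
        float topA = baselineA + heightA;
        float topB = baselineB + heightB;
        return baselineA < topB && baselineB < topA;
    }

    public static void main(String[] args) {
        float fontSize = 10f;
        float capHeight = 700f;                      // assumed example value
        float yCode = 500.0f, yDescription = 500.8f; // "not exactly the same height"

        // Using the raw (0 0 0 0) bbox: both runs have zero height.
        System.out.println(sameLine(yCode, 0f, yDescription, 0f));        // false

        // Using the CapHeight fallback, scaled to the font size.
        float h = effectiveHeight(0f, 0f, capHeight) * fontSize / 1000f;  // 7.0
        System.out.println(sameLine(yCode, h, yDescription, h));          // true
    }
}
```

With zero height, any vertical offset at all makes the two runs look like
separate lines; with a non-zero fallback height, a small offset is tolerated.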
> Text extraction - words on same line detection failing in 2.x
> -------------------------------------------------------------
>
> Key: PDFBOX-3435
> URL: https://issues.apache.org/jira/browse/PDFBOX-3435
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.0.3, 2.1.0
> Reporter: Lee van Hooff
> Attachments: PDFBOX-3435-18.txt, PDFBOX-3435-20.txt,
> text-extraction-issues.pdf
>
>
> The ability to extract a line of text as it appears in the PDF is no longer
> working in the 2.x version of pdfbox.
> java -jar pdfbox-app-1.8.4.jar ExtractText -console -sort
> ~/Desktop/text-extraction-issues.pdf
> results in:
> {noformat}
> . . .
> Your Code Our Code Description
> Qty Price Ex Total Ex
> 11SP 100129630 IRWIN VICE-GRIP 11 C-CLAMP SWIVEL PAD
> 4 00.00 000.00
> IR-0352 100094584 IRWIN 600MM TOOL BAG
> 1 00.00 00.00
> EM81.9 100088913 EMPIRE TORPEDO LEVEL ALUMINIUM
> 1 00.00 00.00
> 20566-618R 100023443 LENOX RECIPRO BLADE 150X20X0.9MM 18TPI 5P
> 3 0.00 00.00
> . . .
> {noformat}
> while
> java -jar pdfbox-app-2.0.2.jar ExtractText -console -sort
> ~/Desktop/text-extraction-issues.pdf
> results in:
> {noformat}
> . . .
> Your Code Our Code Description
> Qty Price Ex Total Ex
> IRWIN VICE-GRIP 11 C-CLAMP SWIVEL PAD
> 11SP 100129630 4 00.00 000.00
> IRWIN 600MM TOOL BAG
> IR-0352 100094584 1 00.00 00.00
> EMPIRE TORPEDO LEVEL ALUMINIUM
> EM81.9 100088913 1 00.00 00.00
> LENOX RECIPRO BLADE 150X20X0.9MM 18TPI 5P
> 20566-618R 100023443 3 0.00 00.00
> . . .
> {noformat}