[
https://issues.apache.org/jira/browse/PDFBOX-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14962021#comment-14962021
]
Ben McCann edited comment on PDFBOX-3028 at 10/17/15 6:36 PM:
--------------------------------------------------------------
Copying more details here from
https://issues.apache.org/jira/browse/PDFBOX-3019?focusedCommentId=14957228:
Note you can run PrintTextLocations with:
{code}
$ java -cp app/target/pdfbox-app-2.0.0-SNAPSHOT.jar:examples/target/pdfbox-examples-2.0.0-SNAPSHOT.jar \
    org.apache.pdfbox.examples.util.PrintTextLocations jbl-example-com.pdf
{code}
Your file is decoded as {{jb l@example.com}}. Here's the output of
PrintTextLocations:
{code}
String[301.307 ,370.56 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204
width=2.7355347]
String[226.08 ,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204
width=1.820404]j
String[228.76042,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204
width=5.648163]b
String[235.32764,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204
width=1.820404]l
String[238.00806,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204
width=7.871994]@
...
{code}
So "b" starts at 228.76042 and has a width of 5.648163, which puts its right
edge at {{228.76042 + 5.648163 = 234.408583}}.
But the next character, "l", starts at 235.32764, so the gap is {{235.32764 -
234.408583 = 0.919057}}. In the source code, {{spacingTolerance}} is set to
{{0.5}}, with this comment: "we will need to estimate where to add
spaces, these are used to help guess".
That value seems pretty small: the width of a space in that font is
2.7355347, more than 5 times the {{spacingTolerance}}. However, the code in
PDFTextStripper does more than a simple comparison against that value (search
for {{getSpacingTolerance()}}).
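The gap arithmetic above can be repeated for every neighbouring pair of glyphs. The following is only a sketch with the positions and widths copied by hand from the PrintTextLocations output; the class name {{GlyphGaps}} and the helper {{gap}} are made up for illustration, and it does not reproduce whatever PDFTextStripper actually does with {{spacingTolerance}}.

```java
public class GlyphGaps {
    // Horizontal distance between the right edge of one glyph
    // and the left edge of the next one.
    public static double gap(double prevX, double prevWidth, double nextX) {
        return nextX - (prevX + prevWidth);
    }

    public static void main(String[] args) {
        // Values copied from the PrintTextLocations output quoted above.
        String[] glyphs = {"j", "b", "l", "@"};
        double[] x = {226.08, 228.76042, 235.32764, 238.00806};
        double[] width = {1.820404, 5.648163, 1.820404, 7.871994};
        // Space width quoted in the comment for this font.
        double spaceWidth = 2.7355347;

        for (int i = 1; i < glyphs.length; i++) {
            double g = gap(x[i - 1], width[i - 1], x[i]);
            // Print the raw gap and the gap as a fraction of the space width.
            System.out.printf("%s -> %s: gap=%.6f (%.2f of a space)%n",
                    glyphs[i - 1], glyphs[i], g, g / spaceWidth);
        }
    }
}
```

On these four glyphs the gaps come out at roughly 0.86 for j -> b and l -> @, and roughly 0.92 for b -> l, so the b -> l gap under discussion is only slightly wider than its neighbours.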
was (Author: chengas123):
$ java -cp
app/target/pdfbox-app-2.0.0-SNAPSHOT.jar:examples/target/pdfbox-examples-2.0.0-SNAPSHOT.jar
org.apache.pdfbox.examples.util.PrintTextLocations jbl-example-com.pdf
String[301.307 ,370.56 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204
width=2.7355347]
String[226.08 ,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204
width=1.820404]j
String[228.76042,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204
width=5.648163]b
String[235.32764,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204
width=1.820404]l
String[238.00806,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204
width=7.871994]@
String[246.83847,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204
width=5.106964]e
String[252.03284,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204
width=4.733032]x
String[256.85327,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204
width=5.106964]a
String[262.04764,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204
width=8.196716]m
String[270.33176,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204
width=5.648163]p
String[276.06732,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204
width=1.820404]l
String[278.06732,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204
width=5.106964]e
String[283.2617 ,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204
width=2.7355347].
String[286.8769 ,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204
width=5.106964]c
String[292.07126,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204
width=5.471039]o
String[297.6297 ,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204
width=8.196716]m
> Text extraction broken for jbl example
> --------------------------------------
>
> Key: PDFBOX-3028
> URL: https://issues.apache.org/jira/browse/PDFBOX-3028
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.0
> Reporter: Ben McCann
> Attachments: jbl-example-com.pdf, spacing-test.patch
>
>