[ 
https://issues.apache.org/jira/browse/PDFBOX-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14962021#comment-14962021
 ] 

Ben McCann edited comment on PDFBOX-3028 at 10/17/15 6:36 PM:
--------------------------------------------------------------

Copying more details here from 
https://issues.apache.org/jira/browse/PDFBOX-3019?focusedCommentId=14957228:

Note you can run PrintTextLocations with:

{code}
$ java -cp app/target/pdfbox-app-2.0.0-SNAPSHOT.jar:examples/target/pdfbox-examples-2.0.0-SNAPSHOT.jar org.apache.pdfbox.examples.util.PrintTextLocations jbl-example-com.pdf
{code}

Your file is decoded as {{jb l@example.com}}. Here is the output of
PrintTextLocations:
{code}
String[301.307  ,370.56 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=2.7355347] 
String[226.08   ,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=1.820404]j
String[228.76042,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=5.648163]b
String[235.32764,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=1.820404]l
String[238.00806,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=7.871994]@
...
{code}

So "b" starts at 228.76042 and has a width of 5.648163, which puts its right
edge at {{228.76042 + 5.648163 = 234.408583}}.
But the next character, "l", starts at 235.32764, so the gap is
{{235.32764 - 234.408583 = 0.919057}}. In the source code,
{{spacingTolerance}} is set to {{0.5}}, with the comment: "we will need to
estimate where to add spaces, these are used to help guess".
That tolerance seems pretty small: the width of the space character in that
font is 2.7355347, more than 5 times the {{spacingTolerance}}. However,
PDFTextStripper does more than just compare the gap against this value
(search for {{getSpacingTolerance()}}).
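
To make the arithmetic above easy to repeat for other glyph pairs, here is a
minimal Java sketch. The class {{GlyphGap}} and its {{gap}} method are
hypothetical names made up for illustration; they are not part of PDFBox, and
the sketch does not try to reproduce what PDFTextStripper actually does with
{{getSpacingTolerance()}}. It only recomputes gaps from the numbers printed by
PrintTextLocations.

```java
// Hypothetical helper, not part of PDFBox: it only redoes the arithmetic
// on the values shown in the PrintTextLocations output.
final class GlyphGap {

    // Horizontal gap between the right edge of one glyph (x + width)
    // and the start of the next glyph.
    static double gap(double x, double width, double nextX) {
        return nextX - (x + width);
    }

    public static void main(String[] args) {
        double spacingTolerance = 0.5;  // value quoted from the source code
        double spaceWidth = 2.7355347;  // width of the space glyph in the output

        // "b" -> "l": the pair where the unwanted space shows up
        double bToL = gap(228.76042, 5.648163, 235.32764);
        // "j" -> "b": the preceding pair, for comparison
        double jToB = gap(226.08, 1.820404, 228.76042);

        System.out.printf("gap b->l = %.6f%n", bToL);
        System.out.printf("gap j->b = %.6f%n", jToB);
        System.out.printf("b->l / spacingTolerance = %.2f%n", bToL / spacingTolerance);
        System.out.printf("b->l / spaceWidth = %.2f%n", bToL / spaceWidth);
    }
}
```

The b-to-l gap comes out at almost twice the {{0.5}} value but only about a
third of the space width. The j-to-b gap computed the same way is about 0.86,
also above {{0.5}}, yet no space was emitted there, which fits the remark that
the stripper does more than a plain comparison.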


was (Author: chengas123):
$ java -cp app/target/pdfbox-app-2.0.0-SNAPSHOT.jar:examples/target/pdfbox-examples-2.0.0-SNAPSHOT.jar org.apache.pdfbox.examples.util.PrintTextLocations jbl-example-com.pdf
String[301.307  ,370.56 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=2.7355347] 
String[226.08   ,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=1.820404]j
String[228.76042,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=5.648163]b
String[235.32764,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=1.820404]l
String[238.00806,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=7.871994]@
String[246.83847,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=5.106964]e
String[252.03284,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=4.733032]x
String[256.85327,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=5.106964]a
String[262.04764,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=8.196716]m
String[270.33176,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=5.648163]p
String[276.06732,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=1.820404]l
String[278.06732,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=5.106964]e
String[283.2617 ,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=2.7355347].
String[286.8769 ,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=5.106964]c
String[292.07126,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=5.471039]o
String[297.6297 ,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=8.196716]m


> Text extraction broken for jbl example
> --------------------------------------
>
>                 Key: PDFBOX-3028
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3028
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Ben McCann
>         Attachments: jbl-example-com.pdf, spacing-test.patch
>
>

