[ 
https://issues.apache.org/jira/browse/PDFBOX-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14962021#comment-14962021
 ] 

Ben McCann edited comment on PDFBOX-3028 at 10/17/15 8:03 PM:
--------------------------------------------------------------

This logic seems inconsistent. The amount of space required before the next 
word is started varies quite a bit from character to character. 

j: averageCharWidth is 2.2779694 - Using deltaCharWidth 0.68339086
b: averageCharWidth is 3.963066 - Using deltaCharWidth 1.1889199
l: averageCharWidth is 2.891735 - Using deltaCharWidth 0.8675206
@: averageCharWidth is 5.3818645 - Using deltaSpace 1.3677602
e: averageCharWidth is 5.2444143 - Using deltaSpace 1.3677602
x: averageCharWidth is 4.9887233 - Using deltaSpace 1.3677602
a: averageCharWidth is 5.047844 - Using deltaSpace 1.3677602
m: averageCharWidth is 6.62228 - Using deltaSpace 1.3677602
p: averageCharWidth is 6.1352215 - Using deltaSpace 1.3677602
l: averageCharWidth is 3.9778128 - Using deltaCharWidth 1.1933439
e: averageCharWidth is 4.5423884 - Using deltaCharWidth 1.3627166
.: averageCharWidth is 3.6389616 - Using deltaCharWidth 1.0916885
c: averageCharWidth is 4.372963 - Using deltaCharWidth 1.3118889
o: averageCharWidth is 4.922001 - Using deltaSpace 1.3677602
m: averageCharWidth is 6.5593586 - Using deltaSpace 1.367760

expectedStartOfNextWordX < positionX for l: 235.27611 < 235.32764

So basically what's happening is that because "l" is a skinny letter, we 
expect there to be very little whitespace before it. When there turns out to 
be a normal amount of whitespace before it, we interpret that whitespace as a 
space character. It seems pretty odd to me that it should matter whether a 
letter is skinny or fat.
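
To make the numbers above concrete, here is a minimal Java sketch of the threshold selection as it can be inferred from the log. It is not the PDFTextStripper code itself: the class and method names are made up for illustration, and the tolerances 0.5 and 0.3 are assumptions derived from the logged values (1.3677602 / 2.7355204 and 0.68339086 / 2.2779694). The running average is likewise inferred from the log, where each averageCharWidth is the mean of the previous average and the current glyph width.

```java
public class SpacingThresholdSketch {
    // Assumed tolerances, inferred from the logged values (not read from the PDFBox source).
    static final float SPACING_TOLERANCE = 0.5f;
    static final float AVERAGE_CHAR_TOLERANCE = 0.3f;

    /** Running average as it appears in the log: mean of the previous average and this glyph's width. */
    static float averageCharWidth(float previousAverage, float glyphWidth) {
        return (previousAverage + glyphWidth) / 2f;
    }

    /** The smaller of deltaSpace and deltaCharWidth is the one the log reports as "Using ...". */
    static float delta(float spaceWidth, float averageCharWidth) {
        float deltaSpace = spaceWidth * SPACING_TOLERANCE;
        float deltaCharWidth = averageCharWidth * AVERAGE_CHAR_TOLERANCE;
        return Math.min(deltaSpace, deltaCharWidth);
    }

    /** A space is guessed when the glyph starts to the right of the expected start of the next word. */
    static boolean insertsSpace(float endOfLastTextX, float delta, float positionX) {
        return endOfLastTextX + delta < positionX;
    }

    public static void main(String[] args) {
        float spaceWidth = 2.7355204f;                           // "space=" from PrintTextLocations
        float endOfB = 228.76042f + 5.648163f;                   // x + width of "b"
        float avgForL = averageCharWidth(3.963066f, 1.820404f);  // previous average, width of "l"
        float d = delta(spaceWidth, avgForL);

        System.out.println("delta used for l: " + d);
        System.out.println("space guessed before l: " + insertsSpace(endOfB, d, 235.32764f));
    }
}
```

Under these assumptions the threshold for a narrow glyph shrinks with its width, which is why the same gap can be accepted before "b" and treated as a space before "l".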


was (Author: chengas123):
Copying more details here from 
https://issues.apache.org/jira/browse/PDFBOX-3019?focusedCommentId=14957228:

Note you can run PrintTextLocations with:

{code}
$ java -cp app/target/pdfbox-app-2.0.0-SNAPSHOT.jar:examples/target/pdfbox-examples-2.0.0-SNAPSHOT.jar \
    org.apache.pdfbox.examples.util.PrintTextLocations jbl-example-com.pdf
{code}

Your file is decoded as {{jb l@example.com}}. Here is the output of 
PrintTextLocations:
{code}
String[301.307  ,370.56 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=2.7355347] 
String[226.08   ,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=1.820404]j
String[228.76042,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=5.648163]b
String[235.32764,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=1.820404]l
String[238.00806,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=7.871994]@
String[246.83847,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=5.106964]e
String[252.03284,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=4.733032]x
String[256.85327,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=5.106964]a
String[262.04764,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=8.196716]m
String[270.33176,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=5.648163]p
String[276.06732,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=1.820404]l
String[278.06732,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=5.106964]e
String[283.2617 ,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=2.7355347].
String[286.8769 ,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=5.106964]c
String[292.07126,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=5.471039]o
String[297.6297 ,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=8.196716]m
{code}

So "b" starts at 228.76042 and has a width of 5.648163. The sum is
{{228.76042 + 5.648163 = 234.408583}}
But the next character "l" starts at 235.32764. The gap is {{235.32764 - 
234.408583 = 0.919057}}. In the source code, {{spacingTolerance}} is set to 
{{0.5}}. There is a comment there: "we will need to estimate where to add 
spaces, these are used to help guess".
That value seems pretty small. The size of the space for that font is 
2.7355347, more than 5 times the {{spacingTolerance}}. However, there is some 
code in PDFTextStripper that does more than just compare (search for 
{{getSpacingTolerance()}}).
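
The gap arithmetic above can be repeated for the neighbouring glyphs with a short Java sketch. The class and method names are hypothetical, and the coordinates are copied from the PrintTextLocations output for the first four glyphs only. It just subtracts the end of one glyph from the start of the next and relates the result to the font's space width; it does not call any PDFBox API.

```java
import java.util.Locale;

public class GlyphGapSketch {
    /** Horizontal gap between the end of the previous glyph and the start of the next one. */
    static double gap(double prevX, double prevWidth, double nextX) {
        return nextX - (prevX + prevWidth);
    }

    public static void main(String[] args) {
        // Values taken from the PrintTextLocations listing (x position and width per glyph).
        String[] glyphs = {"j", "b", "l", "@"};
        double[] x = {226.08, 228.76042, 235.32764, 238.00806};
        double[] width = {1.820404, 5.648163, 1.820404, 7.871994};
        double spaceWidth = 2.7355204; // "space=" field of the same listing

        for (int i = 1; i < glyphs.length; i++) {
            double g = gap(x[i - 1], width[i - 1], x[i]);
            // Locale.ROOT keeps the decimal separator a dot regardless of the system locale.
            System.out.printf(Locale.ROOT, "gap before %s: %.6f (%.0f%% of the space width)%n",
                    glyphs[i], g, 100.0 * g / spaceWidth);
        }
    }
}
```

Printing the gaps side by side makes it easier to see whether the gap before "l" really differs from its neighbours or whether only the threshold applied to it does.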

> Text extraction broken for jbl example
> --------------------------------------
>
>                 Key: PDFBOX-3028
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3028
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Ben McCann
>         Attachments: jbl-example-com.pdf, spacing-test.patch
>
>



