[
https://issues.apache.org/jira/browse/PDFBOX-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959082#comment-14959082
]
Ben McCann edited comment on PDFBOX-3028 at 10/17/15 8:00 PM:
--------------------------------------------------------------
This file contains a single word, "jbl@example.com". However, it is not being
extracted properly: a space is inserted into the text.
Copying more details here from
https://issues.apache.org/jira/browse/PDFBOX-3019?focusedCommentId=14957228:
Note you can run PrintTextLocations with:
{code}
$ java -cp app/target/pdfbox-app-2.0.0-SNAPSHOT.jar:examples/target/pdfbox-examples-2.0.0-SNAPSHOT.jar \
    org.apache.pdfbox.examples.util.PrintTextLocations jbl-example-com.pdf
{code}
Your file is decoded as {{jb l@example.com}}. Here is the output of
PrintTextLocations:
{code}
String[301.307 ,370.56 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=2.7355347]
String[226.08 ,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=1.820404]j
String[228.76042,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=5.648163]b
String[235.32764,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=1.820404]l
String[238.00806,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=7.871994]@
String[246.83847,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=5.106964]e
String[252.03284,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=4.733032]x
String[256.85327,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=5.106964]a
String[262.04764,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=8.196716]m
String[270.33176,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=5.648163]p
String[276.06732,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=1.820404]l
String[278.06732,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=5.106964]e
String[283.2617 ,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=2.7355347].
String[286.8769 ,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=5.106964]c
String[292.07126,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=5.471039]o
String[297.6297 ,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=8.196716]m
{code}
So "b" starts at 228.76042 and has a width of 5.648163, which puts its right
edge at {{228.76042 + 5.648163 = 234.408583}}.
But the next character, "l", starts at 235.32764, so the gap is {{235.32764 -
234.408583 = 0.919057}}. In the source code, {{spacingTolerance}} is set to
{{0.5}}, with the comment: "we will need to estimate where to add
spaces, these are used to help guess".
That space size seems to be pretty small. The size of the space for that font
is 2.7355347, more than 5 times the {{spacingTolerance}}. However, there is some
code in PDFTextStripper that does more than just compare (search for
{{getSpacingTolerance()}}).
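The gap arithmetic above can be reproduced with a small standalone sketch. The class and method names here ({{GlyphGapSketch}}, {{gap}}, {{naiveSpaceCheck}}) are hypothetical and not part of PDFBox. The rule "insert a space when the gap exceeds space width times {{spacingTolerance}}" is an assumption made for illustration only; as noted, PDFTextStripper does more than a plain comparison.

```java
import java.util.Locale;

/** Hypothetical helper, not part of PDFBox: measures the gap between two adjacent glyphs. */
final class GlyphGapSketch {

    /** Distance from the right edge of the previous glyph to the start of the next one. */
    static double gap(double prevX, double prevWidth, double nextX) {
        return nextX - (prevX + prevWidth);
    }

    /**
     * Assumed, simplified rule: report a space when the gap exceeds
     * spaceWidth * spacingTolerance. This is NOT PDFTextStripper's full logic.
     */
    static boolean naiveSpaceCheck(double gap, double spaceWidth, double spacingTolerance) {
        return gap > spaceWidth * spacingTolerance;
    }

    public static void main(String[] args) {
        // Values copied from the PrintTextLocations output for "b" and the following "l".
        double g = gap(228.76042, 5.648163, 235.32764);
        double spaceWidth = 2.7355347;  // space size quoted in the comment above
        double spacingTolerance = 0.5;  // value quoted from the source code

        System.out.println(String.format(Locale.ROOT, "gap = %.6f", g));
        System.out.println(String.format(Locale.ROOT, "gap / space width = %.3f", g / spaceWidth));
        System.out.println("naive check inserts space: "
                + naiveSpaceCheck(g, spaceWidth, spacingTolerance));
    }
}
```

Under this assumed rule the gap stays below half the space width, so the rule alone would not insert a space. That is consistent with the remark that the stripper's decision involves more than this one comparison.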
was (Author: chengas123):
This file contains a single word, "jbl@example.com". However, it is not being
extracted properly: a space is inserted into the text.
> Text extraction broken for jbl example
> --------------------------------------
>
> Key: PDFBOX-3028
> URL: https://issues.apache.org/jira/browse/PDFBOX-3028
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.0
> Reporter: Ben McCann
> Attachments: jbl-example-com.pdf, spacing-test.patch
>
>