[jira] [Comment Edited] (PDFBOX-3019) Unwanted spaces in text extraction

Tilman Hausherr (JIRA) Wed, 14 Oct 2015 13:54:21 -0700

    [ https://issues.apache.org/jira/browse/PDFBOX-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957228#comment-14957228 ]


Tilman Hausherr edited comment on PDFBOX-3019 at 10/14/15 8:53 PM:
-------------------------------------------------------------------

Your file is decoded as {{jb [email protected]}}. Here's an output of 
PrintTextLocations:
{code}
String[301.307,370.56 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=2.7355347] 
String[226.08,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=1.820404]j
String[228.76042,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=5.648163]b
String[235.32764,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=1.820404]l
String[238.00806,394.08 fs=1.0 xscale=9.84 height=7.38492 space=2.7355204 width=7.871994]@
...
{code}

So "b" starts at 228.76042 and has a width of 5.648163, which puts its right 
edge at
{{228.76042 + 5.648163 = 234.408583}}

But the next character, "l", starts at 235.32764, so the gap is {{235.32764 - 
234.408583 = 0.919057}}. In the source code, {{spacingTolerance}} is set to 
{{0.5}}, with this comment: "we will need to estimate where to add 
spaces, these are used to help guess".
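
The arithmetic above can be replayed with a small sketch. The class and method names here ({{GlyphGap}}, {{gap}}) are made up for illustration and are not part of PDFBox; the numbers are the ones from the PrintTextLocations output. It only computes the gap and how it relates to the reported space width, it does not reproduce the decision logic in PDFTextStripper.

```java
public class GlyphGap {
    // Horizontal distance between the right edge of one glyph
    // (start x + width) and the start x of the next glyph.
    static double gap(double prevX, double prevWidth, double nextX) {
        return nextX - (prevX + prevWidth);
    }

    public static void main(String[] args) {
        double prevX = 228.76042;      // start of "b"
        double prevWidth = 5.648163;   // width of "b"
        double nextX = 235.32764;      // start of "l"
        double spaceWidth = 2.7355204; // "space=" value reported for this font

        double g = gap(prevX, prevWidth, nextX);
        System.out.printf("gap = %.6f%n", g);
        System.out.printf("gap / space width = %.3f%n", g / spaceWidth);
    }
}
```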

That space size seems to be pretty small. The size of the space for that font 
is 2.7355347, more than 5 times the {{spacingTolerance}}. However there is some 
code in PDFTextStripper that does more than just compare (search for 
{{getSpacingTolerance()}}).

After writing all this I tested with 1.8. However, I get the same 
result there, i.e. {{jb [email protected]}}, so this is a dead end.

So I looked at the resume. Here's Jeanette:

2.0 ("J e a n e t t e"):
{code}
String[190.0801,417.6 fs=1.0 xscale=9.84 height=7.38492 space=0.6565249 width=4.919998]J
String[195.9103,417.6 fs=1.0 xscale=9.84 height=7.38492 space=0.6565249 width=5.106964]e
String[201.92746,417.6 fs=1.0 xscale=9.84 height=7.38492 space=0.6565249 width=5.106949]a
String[207.94461,417.6 fs=1.0 xscale=9.84 height=7.38492 space=0.6565249 width=5.284088]n
String[214.14873,417.6 fs=1.0 xscale=9.84 height=7.38492 space=0.6565249 width=5.106964]e
String[220.1659,417.6 fs=1.0 xscale=9.84 height=7.38492 space=0.6565249 width=2.9126434]t
String[223.95921,417.6 fs=1.0 xscale=9.84 height=7.38492 space=0.6565249 width=2.9126434]t
String[227.75253,417.6 fs=1.0 xscale=9.84 height=7.38492 space=0.6565249 width=5.106964]e
String[233.7686,417.6 fs=1.0 xscale=9.84 height=7.38492 space=0.6565249 width=2.7355194] 
{code}

1.8 ("Jeanette"):
{code}
String[190.0801,417.6 fs=1.0 xscale=9.84 height=7.38492 space=2.7355201 width=4.919998]J
String[195.9103,417.6 fs=1.0 xscale=9.84 height=7.38492 space=2.7355201 width=5.106964]e
String[201.92746,417.6 fs=1.0 xscale=9.84 height=7.38492 space=2.7355201 width=5.106964]a
String[207.94461,417.6 fs=1.0 xscale=9.84 height=7.38492 space=2.7355201 width=5.284073]n
String[214.14873,417.6 fs=1.0 xscale=9.84 height=7.38492 space=2.7355201 width=5.106964]e
String[220.1659,417.6 fs=1.0 xscale=9.84 height=7.38492 space=2.7355201 width=2.9126434]t
String[223.95921,417.6 fs=1.0 xscale=9.84 height=7.38492 space=2.7355201 width=2.9126434]t
String[227.75253,417.6 fs=1.0 xscale=9.84 height=7.38492 space=2.7355201 width=5.106964]e
String[233.7686,417.6 fs=1.0 xscale=9.84 height=7.38492 space=2.7355201 width=2.7355194] 
{code}

The reported space width is different: {{space=0.6565249}} in 2.0 versus 
{{space=2.7355201}} in 1.8.



1.8:
{code}
float spaceWidthDisp = spaceWidthText * fontSizeText * horizontalScalingText 
                                    * textMatrix.getXScale() * ctm.getXScale();
{code}
spaceWidthText: 0.27800003
fontSizeText: 1.0
horizontalScalingText: 1.0
textMatrix.getXScale(): 41.0
ctm.getXScale(): 0.24


2.0:
{code}
float spaceWidthDisplay = spaceWidthText * fontSizeText * horizontalScalingText
        * textRenderingMatrix.getScalingFactorX() * ctm.getScalingFactorX();
{code}
spaceWidthText: 0.27800003
fontSizeText: 1.0
horizontalScalingText: 1.0
textRenderingMatrix.getScalingFactorX(): 9.84
ctm.getScalingFactorX(): 0.24


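9.84 happens to be {{41 * 0.24}}, i.e. the 2.0 {{textRenderingMatrix}} scale 
already contains the CTM scale, and multiplying by {{ctm.getScalingFactorX()}} 
applies the 0.24 factor a second time.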
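
To make the difference concrete, here is a small sketch that plugs the values listed above into both formulas. The class and method names ({{SpaceWidthDemo}}, {{product}}) are invented for this illustration and are not PDFBox code; only the input numbers come from the debugging output.

```java
public class SpaceWidthDemo {
    // Multiplies the factors of the space width formula together.
    static float product(float... factors) {
        float result = 1.0f;
        for (float f : factors) {
            result *= f;
        }
        return result;
    }

    public static void main(String[] args) {
        float spaceWidthText = 0.27800003f;
        float fontSizeText = 1.0f;
        float horizontalScalingText = 1.0f;

        // 1.8: text matrix scale (41) times CTM scale (0.24)
        float v18 = product(spaceWidthText, fontSizeText, horizontalScalingText, 41.0f, 0.24f);
        // 2.0: the matrix scale is already 9.84, then the CTM scale is applied again
        float v20 = product(spaceWidthText, fontSizeText, horizontalScalingText, 9.84f, 0.24f);
        // 2.0 with the extra CTM factor dropped
        float v20NoCtm = product(spaceWidthText, fontSizeText, horizontalScalingText, 9.84f);

        System.out.println("1.8:             " + v18);
        System.out.println("2.0:             " + v20);
        System.out.println("2.0 without ctm: " + v20NoCtm);
    }
}
```

Up to float rounding, the first and third values should agree with the {{space=2.7355201}} reported by 1.8, and the second with the {{space=0.6565249}} reported by 2.0.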


Here's the relevant part of the content stream:

{code}
  q
    0.24 0 0 0.24 190.0801 374.4 cm
    BT
      0.0925 Tc
      41 0 0 41 0 0 Tm
      /TT3 1 Tf
      [ (Jean) -1 (et) 3 (t) 3 (e) ] TJ
    ET
  Q
{code}

So I removed {{* ctm.getScalingFactorX()}} from the 2.0 code; it built 
successfully and the text is now extracted the way you want. That the text 
extraction tests pass both with and without the change means that they 
*really* need a better test set. {{spaceWidthDisplay}} is used only in text 
extraction, not in rendering, which could be why the problem wasn't discovered.

Btw, earlier in the code, there is this:

{code}
Matrix textRenderingMatrix = parameters.multiply(textMatrix).multiply(ctm);
{code}
So {{textRenderingMatrix}} already has the CTM multiplied in; it is no longer 
just a text rendering matrix, it is rather an "effectiveTextRenderingMatrix".
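
As a rough check of where the 9.84 comes from, the two matrices from the content stream can be combined with {{java.awt.geom.AffineTransform}} from the JDK. This is only a sketch and does not use the PDFBox {{Matrix}} class; the class and method names ({{EffectiveMatrixDemo}}, {{combine}}) are assumptions for illustration. The font size parameters are left out because the font size is 1 here.

```java
import java.awt.geom.AffineTransform;

public class EffectiveMatrixDemo {
    // Returns a transform that applies textMatrix first and ctm second,
    // which corresponds to the PDF product "textMatrix x ctm".
    static AffineTransform combine(AffineTransform textMatrix, AffineTransform ctm) {
        AffineTransform result = new AffineTransform(ctm);
        result.concatenate(textMatrix);
        return result;
    }

    public static void main(String[] args) {
        // "0.24 0 0 0.24 190.0801 374.4 cm"
        AffineTransform ctm = new AffineTransform(0.24, 0, 0, 0.24, 190.0801, 374.4);
        // "41 0 0 41 0 0 Tm"
        AffineTransform textMatrix = new AffineTransform(41, 0, 0, 41, 0, 0);

        AffineTransform combined = combine(textMatrix, ctm);
        System.out.println("x scale of combined matrix: " + combined.getScaleX());
        System.out.println("x scale of ctm alone:       " + ctm.getScaleX());
    }
}
```

The x scale of the combined matrix is {{41 * 0.24}}, so multiplying it by the CTM scale again would count the 0.24 twice.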



> Unwanted spaces in text extraction
> ----------------------------------
>
>                 Key: PDFBOX-3019
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3019
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Ben McCann
>              Labels: regression
>             Fix For: 2.0.0
>
>         Attachments: jbl-example-com.pdf
>
>
> From testing on my internal dataset I believe there might be some regression 
> in the effectiveness of PDFTextStripper.
> Here's an [example 
> doc|http://rampages.us/rhodesc1/wp-content/uploads/sites/4737/2014/07/Resume-Connor-Rhodes.pdf]
>  I found on the web, which converted better in 1.8 than 2.0. Notice that it 
> extracts "J e a n e t t e  A c o s t a ;  S e r v i c e  M a n a g e r  a t  
> M a d  F o x  B r e w i n g  C o m p a n y". It doesn't seem like there's 
> very much space between the letters in the pdf, so it's curious to me that it 
> didn't do too well.
> I realize this is an area where we probably can't strive for perfection. Yet, 
> it does seem to me that from 1.8 to 2.0 we may have taken a step backwards. I 
> believe there's some sort of regression test for PDFToImage which exports a 
> set of pdfs to images at two different commits and looks at what the 
> differences are. Do we have the same sort of thing for PDFTextStripper? If 
> not, can we build one by pulling docs off the public web? I'd be willing to 
> contribute to this endeavor.



