[ https://issues.apache.org/jira/browse/PDFBOX-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959232#comment-14959232 ]

Tilman Hausherr commented on PDFBOX-3028:
-----------------------------------------

I have removed the 2.0 target. I doubt that this will be solved in time.

Some debug output:

positionX: 235.32764, char: l
deltaSpace2: 1.3677602, getSpacingTolerance: 0.5
deltaCharWidth: 0.8675206, endOfLastTextX: 234.40858
expectedStartOfNextWordX: 235.27611
separator: expectedStartOfNextWordX: 235.27611, positionX: 235.32764

I recommend looking at the source code and the comments of PDFTextStripper. 
Its strategy uses either an average character width or the space width to 
decide how far past the previous glyph the next glyph has to start in order 
to be considered the beginning of a new word. Here the previous glyph (b) 
ended at 234.40858. A delta of 0.8675206 is calculated, which puts the 
expected start of the next word at 234.40858 + 0.8675206 = 235.27611. Because 
the next glyph (l) starts at 235.32764, which is past 235.27611, the algorithm 
assumes a new word.
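
For readers who want to replay the numbers, here is a minimal sketch of that 
decision. It is not the PDFTextStripper code: the class and method names are 
made up for illustration, and it assumes that the smaller of the two deltas 
(space-based and average-character-width-based) is added to the end of the 
previous glyph, which is consistent with the debug output above.

```java
// Hypothetical sketch of the word-break decision; not the actual PDFBox code.
class WordBreakSketch {

    // Assumption: the threshold is the end of the previous glyph plus the
    // smaller of the two deltas shown in the debug output.
    static float expectedStartOfNextWordX(float endOfLastTextX,
                                          float deltaSpace,
                                          float deltaCharWidth) {
        return endOfLastTextX + Math.min(deltaSpace, deltaCharWidth);
    }

    // A glyph that starts past the threshold is treated as starting a new word.
    static boolean startsNewWord(float positionX, float expectedStartOfNextWordX) {
        return positionX > expectedStartOfNextWordX;
    }

    public static void main(String[] args) {
        // Values taken from the debug output in this comment.
        float expected = expectedStartOfNextWordX(234.40858f, 1.3677602f, 0.8675206f);
        System.out.println("expectedStartOfNextWordX: " + expected);
        System.out.println("separator inserted: " + startsNewWord(235.32764f, expected));
    }
}
```

With these inputs the glyph "l" starts only about 0.05 units past the 
threshold, which shows how sensitive the result is to small changes in either 
delta.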

The spacingTolerance can be set externally... Before changing the algorithm, 
very extensive tests should be done. I also suspect that a change would bring 
different problems; the comment in the code indicates that the current 
strategy is based on research.

> Text extraction broken for jbl example
> --------------------------------------
>
>                 Key: PDFBOX-3028
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3028
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Ben McCann
>         Attachments: jbl-example-com.pdf, spacing-test.patch
>
>



