[ 
https://issues.apache.org/jira/browse/PDFBOX-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14962557#comment-14962557
 ] 

John Hewson commented on PDFBOX-3028:
-------------------------------------

{quote}
Does anyone know if it's because that's a really screwy pdf or if it's because 
there's some bug in the way we calculate the width of a space?
{quote}

I couldn't say, but PDFont#getSpaceWidth() has some pretty questionable 
behaviour, such as assuming that char 32 is a space and using the average font 
width as a fallback (0.25em would be a far better fallback).
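
A minimal sketch of the 0.25em fallback idea, not PDFBox's actual code. The 
class and method names are hypothetical, and the 1000-units-per-em glyph space 
is an assumption that holds for typical Type 1 fonts but not for every font.

```java
/** Hypothetical helper for illustration only; not part of PDFBox. */
final class SpaceWidthSketch {
    /** Assumption: glyph space uses 1000 units per em. */
    static final float UNITS_PER_EM = 1000f;

    private SpaceWidthSketch() {
    }

    /**
     * Returns the width to use for a space, in glyph units.
     *
     * @param measuredSpaceWidth the width the font reports for its space
     *        glyph; zero, negative or NaN is treated as "not available"
     */
    static float spaceWidth(float measuredSpaceWidth) {
        if (!Float.isNaN(measuredSpaceWidth) && measuredSpaceWidth > 0f) {
            return measuredSpaceWidth;
        }
        // Fall back to a quarter of an em rather than the average glyph width.
        return 0.25f * UNITS_PER_EM;
    }
}
```

With these assumptions, a font that reports no usable space width gets 250 
glyph units, while a font that does report one keeps its own value.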

{quote}
sometimes a space is an actual unicode character for a space and sometimes it's 
just two characters not being near each other?
{quote}

Yes, documents generated using LaTeX are a good example of the latter.
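
When a document has no space characters at all, a text extractor has to infer 
word breaks from the horizontal gap between glyphs. The following is a rough 
sketch of that idea under simplifying assumptions (one line of text, left to 
right, glyphs already sorted). The Glyph class, the method name and the 
threshold parameter are all hypothetical and do not mirror PDFBox's 
implementation.

```java
import java.util.List;

/** Hypothetical illustration of gap-based space inference; not PDFBox code. */
final class GapSpaceSketch {
    /** A positioned glyph on a single left-to-right line. */
    static final class Glyph {
        final String text;
        final float startX;
        final float endX;

        Glyph(String text, float startX, float endX) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
        }
    }

    private GapSpaceSketch() {
    }

    /**
     * Joins glyph text, inserting a space wherever the gap to the previous
     * glyph exceeds threshold * spaceWidth (both in the same units as x).
     */
    static String joinWithInferredSpaces(List<Glyph> glyphs, float spaceWidth,
                                         float threshold) {
        StringBuilder out = new StringBuilder();
        Glyph previous = null;
        for (Glyph glyph : glyphs) {
            if (previous != null
                    && glyph.startX - previous.endX > spaceWidth * threshold) {
                out.append(' ');
            }
            out.append(glyph.text);
            previous = glyph;
        }
        return out.toString();
    }
}
```

The result depends directly on the space width estimate: if it is too large, 
words run together, and if it is too small, spurious spaces appear inside 
words.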

> Text extraction broken for jbl example
> --------------------------------------
>
>                 Key: PDFBOX-3028
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3028
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Ben McCann
>         Attachments: jbl-example-com.pdf, spacing-test.patch
>
>



