[jira] [Closed] (PDFBOX-4287) Pdf box does not detecting some white spaces while converting a pdf file to html

Tilman Hausherr (JIRA) Sat, 04 Aug 2018 01:27:51 -0700


     [ 
https://issues.apache.org/jira/browse/PDFBOX-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Tilman Hausherr closed PDFBOX-4287.
-----------------------------------
    Resolution: Invalid

Closing as invalid, the problem is in PDFDomTree. Please open an issue there. 
It's possible that PDFDomTree does not create spaces for areas between glyphs 
like we do. Most PDFs don't have spaces in it, text extraction creates them.
https://github.com/radkovo/Pdf2Dom

> Pdf box does not detecting some white spaces while converting a pdf file to 
> html
> --------------------------------------------------------------------------------
>
>                 Key: PDFBOX-4287
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4287
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.9
>            Reporter: Vedang
>            Priority: Major
>              Labels: PDFDomTree
>         Attachments: demo.html, demo.pdf, demo.txt
>
>
> Converter is not able to detect whitespace between two words due to which 
> some words are getting concatenated.
> Check the comparison below: 
> [!https://i.stack.imgur.com/9PT8k.png!|https://i.stack.imgur.com/9PT8k.png]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Closed] (PDFBOX-4287) Pdf box does not detecting some white spaces while converting a pdf file to html

Reply via email to