[jira] [Commented] (PDFBOX-4758) Text Extractor does not handle common typographic ligatures

Michael Reynolds (Jira) Thu, 30 Jan 2020 12:23:18 -0800


    [ 
https://issues.apache.org/jira/browse/PDFBOX-4758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026979#comment-17026979
 ]


Michael Reynolds commented on PDFBOX-4758:
------------------------------------------

The unit test contains test cases with failing outputs. It would be acceptable 
to extract either the normalized characters (preferable) or the ligatures 
themselves, so that they can be corrected post-extraction. In these test cases 
it appears that the information is lost altogether.
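
To illustrate what "correct them post-extraction" could look like, here is a 
minimal sketch. It assumes the extractor emits the Unicode ligature code points 
(U+FB00..U+FB06) rather than dropping them, which is not what the failing test 
cases show. The class and method names are hypothetical and not part of PDFBox. 
Only the JDK's java.text.Normalizer is used.

```java
import java.text.Normalizer;

public class LigatureNormalizer {

    // Hypothetical helper, not part of PDFBox. Applies Unicode compatibility
    // normalization (NFKC), which decomposes the ligature code points in the
    // Alphabetic Presentation Forms block (U+FB00..U+FB06) into plain letters.
    public static String expandLigatures(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Stand-in for extractor output that still contains ligature
        // characters: "fi" (U+FB01), "ffi" (U+FB03) and "fl" (U+FB02).
        String extracted = "\uFB01nal o\uFB03ce \uFB02ow";
        System.out.println(expandLigatures(extracted)); // prints "final office flow"
    }
}
```

Note that NFKC also folds other compatibility characters (full-width forms, 
superscripts and so on), so a caller that wants to touch only ligatures would 
need a narrower mapping. This does not help when the ligature is missing from 
the extracted text entirely, as in the attached test cases.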

> Text Extractor does not handle common typographic ligatures
> -----------------------------------------------------------
>
>                 Key: PDFBOX-4758
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4758
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.1, 2.0.18
>            Reporter: Michael Reynolds
>            Priority: Major
>         Attachments: TestExtractText.java, libreoffice-ligatures-test.pdf, 
> msword-ligatures-test.pdf
>
>
> TextExtractor mishandles typographic ligatures. I've attached test documents 
> from both Microsoft Word and LibreOffice.
> I've checked PDFBox's output against xPDF on CentOS, and the ligatures are 
> properly handled with that utility, so it appears that this is a PDFBox 
> defect.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

