[ 
https://issues.apache.org/jira/browse/PDFBOX-4758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031003#comment-17031003
 ] 

Michael Reynolds edited comment on PDFBOX-4758 at 2/5/20 8:30 PM:
------------------------------------------------------------------

Okay, so I see what was wrong. I tried running from the jar and noticed that you 
used -sort whereas in my unit test I put --sort, which means I just didn't read 
the source closely enough.

After running it, the LibreOffice document indeed comes out mostly right; there 
is an extra space added, but that's not really the end of the world. The 
Microsoft Word one is still a disaster, but that's not as important. This looks 
like something that can be fixed in the downstream library. I'm going to close 
this issue, thanks.
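
For illustration, a downstream fix of this kind might look like the sketch 
below. It assumes the extracted text still contains the Unicode ligature code 
points (U+FB00 to U+FB06) and expands them with the JDK's NFKC normalization. 
The class and method names are hypothetical and not part of PDFBox, and this 
would not help if the PDF's fonts lack a usable Unicode mapping for the 
ligature glyphs.

```java
import java.text.Normalizer;

// Hypothetical downstream helper; not part of PDFBox.
final class LigatureNormalizer {

    // Expands compatibility characters such as the "fi" ligature (U+FB01)
    // into their plain-letter equivalents using NFKC normalization.
    // Assumption: the input is text already extracted from a PDF.
    static String expandLigatures(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+FB01 = fi, U+FB03 = ffi, U+FB02 = fl
        String extracted = "\uFB01le o\uFB03ce \uFB02ow";
        System.out.println(expandLigatures(extracted)); // prints "file office flow"
    }
}
```

Note that NFKC also rewrites other compatibility characters (for example, 
non-breaking spaces and superscript digits), so it may be too aggressive if 
those need to be preserved.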


was (Author: reynoldsm88):
Okay, so I see what was wrong. I tried running from the jar and noticed that you 
used `-sort` whereas in my unit test I put `--sort`, which means I just didn't 
read the source closely enough.

After running it, the LibreOffice document indeed comes out mostly right; there 
is an extra space added, but that's not really the end of the world. The 
Microsoft Word one is still a disaster, but that's not as important. This looks 
like something that can be fixed in the downstream library. I'm going to close 
this issue, thanks.

> Text Extractor does not handle common typographic ligatures
> -----------------------------------------------------------
>
>                 Key: PDFBOX-4758
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4758
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.1, 2.0.18
>            Reporter: Michael Reynolds
>            Priority: Major
>         Attachments: TestExtractText.java, libreoffice-ligatures-test.pdf, 
> msword-ligatures-test.pdf
>
>
> TextExtractor mishandles typographic ligatures. I've attached test documents 
> from both Microsoft Word and LibreOffice.
> I've checked PDFBox's output against xPDF on CentOS, and the ligatures are 
> properly handled with that utility, so it appears that this is a PDFBox 
> defect.


