[jira] [Commented] (PDFBOX-4758) Text Extractor does not handle common typographic ligatures

Michael Reynolds (Jira) Fri, 31 Jan 2020 14:06:42 -0800


    [ 
https://issues.apache.org/jira/browse/PDFBOX-4758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027859#comment-17027859
 ]


Michael Reynolds commented on PDFBOX-4758:
------------------------------------------

I updated the tests accordingly and will re-upload the reproducer. Note that I 
added the *--sort* option, as you suggested:

{code:java}
    // Runs ExtractText on the LibreOffice sample with stdout captured, then accepts
    // either the ligature code points or their plain-letter equivalents.
    public void testExtractLibreOfficeLigatures() {
        ByteArrayOutputStream outBytes = new ByteArrayOutputStream();
        PrintStream stdout = System.out;
        System.setOut( new PrintStream( outBytes ) );
        String result = null;
        try {
            // Each option and its value must be a separate array element.
            ExtractText.main( new String[]{
                "src/test/resources/org/apache/pdfbox/libreoffice-ligatures-test.pdf",
                "-console", "--sort", "-encoding", "UTF-8"} );
            result = outBytes.toString( "UTF-8" );
            boolean isAcceptableExtraction =
                result.equals( "conﬂict, diﬀerence, bafﬂing, ﬁnished, aﬃrmation" )
                || result.equals( "conflict, difference, baffling, finished, affirmation" );
            assertTrue( isAcceptableExtraction );
        } catch ( IOException e ) {
            // Print the trace first: fail() throws, so nothing after it would run.
            e.printStackTrace();
            fail( e.getMessage() );
        } finally {
            // Restore stdout before reporting what was extracted.
            System.setOut( stdout );
            System.out.println( "Libre Office Extraction was:" );
            System.out.println( result );
        }
    }

    // Same check against the Microsoft Word sample.
    public void testExtractMicrosoftWordLigatures() {
        ByteArrayOutputStream outBytes = new ByteArrayOutputStream();
        PrintStream stdout = System.out;
        System.setOut( new PrintStream( outBytes ) );
        String result = null;
        try {
            // Each option and its value must be a separate array element.
            ExtractText.main( new String[]{
                "src/test/resources/org/apache/pdfbox/msword-ligatures-test.pdf",
                "-console", "--sort", "-encoding", "UTF-8"} );
            result = outBytes.toString( "UTF-8" );
            boolean isAcceptableExtraction =
                result.equals( "conﬂict, diﬀerence, bafﬂing, ﬁnished, aﬃrmation" )
                || result.equals( "conflict, difference, baffling, finished, affirmation" );
            assertTrue( isAcceptableExtraction );
        } catch ( IOException e ) {
            // Print the trace first: fail() throws, so nothing after it would run.
            e.printStackTrace();
            fail( e.getMessage() );
        } finally {
            // Restore stdout before reporting what was extracted.
            System.setOut( stdout );
            System.out.println( "Microsoft Word Extraction was:" );
            System.out.println( result );
        }
    }
{code}
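
As a possible simplification, the two accepted strings in each test could be collapsed into one by folding ligatures before comparing. The sketch below uses only java.text.Normalizer from the JDK; the class name LigatureNormalizer and the method foldLigatures are hypothetical names chosen for illustration and are not part of PDFBox or of the attached test.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of PDFBox: folds ligature code points
// into plain letters so a test can compare against one expected string.
class LigatureNormalizer {

    // NFKC normalization replaces compatibility characters such as
    // U+FB00 (ff), U+FB01 (fi), U+FB02 (fl) and U+FB03 (ffi) with their
    // plain-letter equivalents.
    static String foldLigatures(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // The ligature form of the expected text, written with Unicode escapes.
        String withLigatures =
            "con\uFB02ict, di\uFB00erence, baf\uFB02ing, \uFB01nished, a\uFB03rmation";
        String expected = "conflict, difference, baffling, finished, affirmation";

        // trim() removes a trailing line separator, which an extractor may append.
        String normalized = foldLigatures(withLigatures + "\n").trim();
        System.out.println(normalized.equals(expected));
    }
}
```

Note that NFKC also rewrites other compatibility characters (for example, a no-break space becomes a regular space), so this is only suitable when the test does not care about those distinctions.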

The output is still not correct. Here is what I get in a local Mac OS X 
development environment (I will test on CentOS as soon as possible):

{noformat}
Microsoft Word Extraction was:
con$lict,       difference,     baf$ling,       $inished,       affirmation 

Libre Office Extraction was:
conflict, di erence,  baffling, finished, a rmationff ffi

[ERROR] Tests run: 3, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 0.72 s <<< FAILURE! - in org.apache.pdfbox.tools.TestExtractText

{noformat}
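
To check whether a character such as the `$` in the Microsoft Word output above is a literal dollar sign or some other code point that the console merely renders that way, it may help to print the code points of the captured string. This is a minimal diagnostic sketch using only the JDK (Java 8 or later); the class name CodePointDump and the method describe are hypothetical and do not come from the attached test.

```java
// Hypothetical diagnostic helper, not part of PDFBox.
class CodePointDump {

    // Renders each code point of the text as "U+XXXX", separated by spaces.
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample fragment as it appears in the console output; in the test,
        // the captured result string would be passed instead.
        System.out.println(describe("con$lict"));
    }
}
```

In the tests, calling such a helper on `result` in the `finally` block would show exactly which code points the extractor produced in place of each ligature.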


> Text Extractor does not handle common typographic ligatures
> -----------------------------------------------------------
>
>                 Key: PDFBOX-4758
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4758
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.1, 2.0.18
>            Reporter: Michael Reynolds
>            Priority: Major
>         Attachments: TestExtractText.java, libreoffice-ligatures-test.pdf, 
> msword-ligatures-test.pdf
>
>
> TextExtractor mishandles typographic ligatures. I've attached test documents 
> from both Microsoft Word and LibreOffice.
> I've checked PDFBox's output against xPDF on CentOS, and the ligatures are 
> properly handled with that utility, so it appears that this is a PDFBox 
> defect.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
