[
https://issues.apache.org/jira/browse/PDFBOX-4758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027859#comment-17027859
]
Michael Reynolds commented on PDFBOX-4758:
------------------------------------------
I updated the tests accordingly (I will re-upload the reproducer). Note that I
added the *--sort* option, as you suggested:
{code:java}
// Each test runs the ExtractText CLI on a sample PDF with System.out
// redirected to a buffer, then compares the captured text to the expected words.
public void testExtractLibreOfficeLigatures() {
    ByteArrayOutputStream outBytes = new ByteArrayOutputStream();
    PrintStream stdout = System.out;
    System.setOut( new PrintStream( outBytes ) );
    String result = null;
    try {
        ExtractText.main( new String[]{
            "src/test/resources/org/apache/pdfbox/libreoffice-ligatures-test.pdf",
            "-console", "--sort", "-encoding UTF-8"} );
        result = outBytes.toString( "UTF-8" );
        boolean isAcceptableExtraction =
            result.equals( "conflict, difference, baffling, finished, affirmation" )
            || result.equals( "conflict, difference, baffling, finished, affirmation" );
        assertTrue( isAcceptableExtraction );
    } catch ( IOException e ) {
        // Print the trace first: fail() throws, so nothing after it would run.
        e.printStackTrace();
        fail( e.getMessage() );
    } finally {
        // Restore the real stdout before reporting what was captured.
        System.setOut( stdout );
        System.out.println( "Libre Office Extraction was:" );
        System.out.println( result );
    }
}

public void testExtractMicrosoftWordLigatures() {
    ByteArrayOutputStream outBytes = new ByteArrayOutputStream();
    PrintStream stdout = System.out;
    System.setOut( new PrintStream( outBytes ) );
    String result = null;
    try {
        ExtractText.main( new String[]{
            "src/test/resources/org/apache/pdfbox/msword-ligatures-test.pdf",
            "-console", "--sort", "-encoding UTF-8"} );
        result = outBytes.toString( "UTF-8" );
        boolean isAcceptableExtraction =
            result.equals( "conflict, difference, baffling, finished, affirmation" )
            || result.equals( "conflict, difference, baffling, finished, affirmation" );
        assertTrue( isAcceptableExtraction );
    } catch ( IOException e ) {
        // Print the trace first: fail() throws, so nothing after it would run.
        e.printStackTrace();
        fail( e.getMessage() );
    } finally {
        // Restore the real stdout before reporting what was captured.
        System.setOut( stdout );
        System.out.println( "Microsoft Word Extraction was:" );
        System.out.println( result );
    }
}
{code}
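The two alternatives in the assertion above read identically here. If one of them was meant to hold the Unicode ligature characters, the comparison could instead fold ligatures to plain letters before matching. The following is only a sketch: {{LigatureAssertions}} and its methods are hypothetical names, not part of PDFBox or of the attached TestExtractText.java. It relies on NFKC normalization from {{java.text.Normalizer}}, which maps U+FB00 to U+FB04 onto their letter sequences, and it trims the captured text because console output may end with a line break. The {{extractArgs}} helper reflects an assumption about the 2.0.x command line, namely that the sort flag is spelled {{-sort}} and that the encoding name is a separate token after {{-encoding}}; this should be checked against the ExtractText usage message.

```java
import java.text.Normalizer;

/** Hypothetical helper, not part of PDFBox or of the attached test class. */
final class LigatureAssertions {

    static final String EXPECTED =
            "conflict, difference, baffling, finished, affirmation";

    /**
     * Folds compatibility characters such as U+FB01 (fi ligature) and
     * U+FB04 (ffl ligature) to plain letters via NFKC, then trims
     * leading and trailing whitespace such as a final line break.
     */
    static String normalize(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC).trim();
    }

    /** True if the text matches EXPECTED once ligatures are folded. */
    static boolean isAcceptableExtraction(String extracted) {
        return extracted != null && EXPECTED.equals(normalize(extracted));
    }

    /**
     * Assumption: ExtractText 2.0.x spells the flag "-sort" and reads the
     * encoding name from the token that follows "-encoding".
     */
    static String[] extractArgs(String pdfPath) {
        return new String[] {pdfPath, "-console", "-sort", "-encoding", "UTF-8"};
    }

    public static void main(String[] args) {
        // Extraction that kept the ligature code points, plus a trailing newline.
        String withLigatures =
                "con\uFB02ict, di\uFB00erence, ba\uFB04ing, \uFB01nished, a\uFB03rmation\n";
        System.out.println(isAcceptableExtraction(withLigatures));
        // The Microsoft Word result reported in this comment still fails.
        System.out.println(isAcceptableExtraction(
                "con$lict, difference, baf$ling, $inished, affirmation"));
    }
}
```

With a check like this, an extraction that preserves the ligature code points and one that expands them to ASCII would both pass, while output with dropped or substituted glyphs would still fail.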
The output is still incorrect. Here is what I get in a local Mac OS X
development environment (I will test on CentOS as soon as possible):
{noformat}
Microsoft Word Extraction was:
con$lict, difference, baf$ling, $inished, affirmation
Libre Office Extraction was:
conflict, di erence, baffling, finished, a rmationff ffi
[ERROR] Tests run: 3, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 0.72 s <<< FAILURE! - in org.apache.pdfbox.tools.TestExtractText
{noformat}
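The two outputs fail in different ways. In the Word file the fi and fl ligatures come out as "$", while in the LibreOffice file the ff and ffi ligatures drop out of their words and appear at the end of the line. A small diagnostic can list which expected words are absent from each result. This is a sketch only; {{MissingWords}} is a hypothetical name and is not part of the attached test class.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical diagnostic helper; the names here are illustrative only. */
final class MissingWords {

    /** Returns the expected words that do not occur verbatim in the extracted text. */
    static List<String> missing(String extracted, String... expectedWords) {
        List<String> absent = new ArrayList<>();
        for (String word : expectedWords) {
            if (!extracted.contains(word)) {
                absent.add(word);
            }
        }
        return absent;
    }

    public static void main(String[] args) {
        String[] words = {"conflict", "difference", "baffling", "finished", "affirmation"};
        // Microsoft Word result: prints [conflict, baffling, finished]
        System.out.println(missing(
                "con$lict, difference, baf$ling, $inished, affirmation", words));
        // LibreOffice result: prints [difference, affirmation]
        System.out.println(missing(
                "conflict, di erence, baffling, finished, a rmationff ffi", words));
    }
}
```

Printing this list from the finally block alongside the raw text would show at a glance which ligatures each producer's PDF loses.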
> Text Extractor does not handle common typographic ligatures
> -----------------------------------------------------------
>
> Key: PDFBOX-4758
> URL: https://issues.apache.org/jira/browse/PDFBOX-4758
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.1, 2.0.18
> Reporter: Michael Reynolds
> Priority: Major
> Attachments: TestExtractText.java, libreoffice-ligatures-test.pdf,
> msword-ligatures-test.pdf
>
>
> TextExtractor mishandles typographic ligatures. I've attached test documents
> from both Microsoft Word and LibreOffice.
> I've checked PDFBox's output against xPDF on CentOS, and the ligatures are
> properly handled with that utility, so it appears that this is a PDFBox
> defect.