Angela Onslow created TIKA-2054:
-----------------------------------

             Summary: Problem with ligatures converting from PDF to HTML with Tika
                 Key: TIKA-2054
                 URL: https://issues.apache.org/jira/browse/TIKA-2054
             Project: Tika
          Issue Type: Bug
    Affects Versions: 1.13, 1.11
            Reporter: Angela Onslow


When converting certain PDFs to HTML, I am having trouble with ligature 
characters being displayed as U+FFFD � REPLACEMENT CHARACTER.

I have tried Apache Tika 1.11 and 1.13, converting on the command line 
with the .jar, and I get the same results with both versions.

If I use pdfbox-app-2.0.1.jar and 'ExtractText' with the icu4j-57_1.jar in the 
path and I convert to text rather than HTML then I am able to at least preserve 
information about what each ligature was originally, even if they are still 
represented as unprintable characters. 

I.e. if I run the following from the command line:
java -jar pdfbox-app-1.8.12.jar ExtractText 'test.pdf' 'test.txt'

Then the resulting test.txt, when viewed in Sublime2, has "fi" represented as 
US (the unit separator character), "ff" represented as RS, "fl" represented as 
GS and "ffl" represented as FS, which I could then replace with the appropriate 
characters.
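
The replacement step described above could be sketched in Java roughly as 
follows. This is only a minimal sketch: the class name LigatureRestorer and 
the method restore are hypothetical, and the mapping (US to "fi", RS to "ff", 
GS to "fl", FS to "ffl") is simply what was observed for this one test PDF. 
Other PDFs may well encode their ligatures differently.

```java
// Sketch only: the control-character mapping is the one observed for a single
// test PDF and is an assumption for any other document.
final class LigatureRestorer {

    private LigatureRestorer() {}

    /** Replaces the four observed control characters with the letters they stood for. */
    static String restore(String extracted) {
        StringBuilder out = new StringBuilder(extracted.length() + 16);
        for (int i = 0; i < extracted.length(); i++) {
            char c = extracted.charAt(i);
            switch (c) {
                case 0x1F: out.append("fi");  break; // US, unit separator
                case 0x1E: out.append("ff");  break; // RS, record separator
                case 0x1D: out.append("fl");  break; // GS, group separator
                case 0x1C: out.append("ffl"); break; // FS, file separator
                default:   out.append(c);            // everything else is kept as-is
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 'o' + RS + "ice" stands in for a line of extracted text
        System.out.println(restore("o" + (char) 0x1E + "ice")); // prints "office"
    }
}
```

A real document could legitimately contain these control characters for other 
reasons, so a blanket replacement like this is only safe once the mapping has 
been checked against the PDF in question.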

I was under the impression that Tika uses icu4j. Is there a way to get the 
same behaviour with Tika that I see with PDFBox when converting from PDF to 
HTML? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
