Unicode text getting mangled via TextToPDF + PDFTextStripper
------------------------------------------------------------

                 Key: PDFBOX-903
                 URL: https://issues.apache.org/jira/browse/PDFBOX-903
             Project: PDFBox
          Issue Type: Bug
    Affects Versions: 1.4.0
            Reporter: Nick Burch
         Attachments: TestUnicodeText.java

I'm trying to round-trip some text through PDFBox, but along the way Unicode 
text is getting mangled and comes back as the wrong characters.

The process I'm following is to use TextToPDF to generate a PDF, then read it 
back in with PDFTextStripper. I'm not yet sure whether the problem arises 
during generation or reading, but I have a nasty feeling there may be an issue 
with both. (I've seen issues with code that does one part or the other.)

Attached is a unit test written against trunk. It creates a series of Reader 
objects from both ASCII and non-ASCII text, creates a PDF from each using 
TextToPDF, extracts the text again with PDFTextStripper, then compares the 
result with the original. It includes one test verifying that the corruption 
isn't caused by the readers, and another that fails, showing that the text was 
corrupted by the round trip.
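
For illustration only, the reader sanity check described above might look 
roughly like the sketch below. This is not the attached TestUnicodeText.java: 
the class name, helper names and sample strings are all hypothetical, and it 
uses only the JDK so that it runs without PDFBox. It shows the control case, 
where text pushed through a Reader comes back unchanged. The failing case 
would run the same comparison on the text recovered from the generated PDF.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper class, not part of PDFBox or of the attached test.
class ReaderSanityCheck {

    /** Drains a Reader completely and returns its contents as a String. */
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    /**
     * Returns the index of the first char at which the two strings differ,
     * or -1 if they are identical. If one string is a prefix of the other,
     * the length of the shorter string is returned.
     */
    static int firstMismatch(String expected, String actual) {
        int common = Math.min(expected.length(), actual.length());
        for (int i = 0; i < common; i++) {
            if (expected.charAt(i) != actual.charAt(i)) {
                return i;
            }
        }
        return expected.length() == actual.length() ? -1 : common;
    }

    public static void main(String[] args) throws IOException {
        // Sample inputs are assumptions: one ASCII string, two non-ASCII ones.
        String[] samples = {
            "Hello, world",
            "caf\u00e9 \u00fcber",      // Latin-1 range
            "\u0416\u0443\u043a"        // Cyrillic
        };
        for (String s : samples) {
            // Control case: the Reader alone must not alter the text.
            String back = readAll(new StringReader(s));
            int at = firstMismatch(s, back);
            System.out.println(at == -1 ? "unchanged" : "differs at index " + at);
        }
    }
}
```

Reporting the index of the first differing character, rather than a plain 
equals() result, makes it easier to see which character was substituted when 
the comparison is later applied to the text extracted from the PDF.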

Ideally the test would also look in the dictionary to check what was stored 
there, but I don't know enough about the file format to manage that. Will 
hopefully look into that shortly though.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
