[jira] [Commented] (PDFBOX-903) Unicode text getting mangled via TextToPDF + PDFTextStripper

Joseph Vychtrle (JIRA) Tue, 31 May 2011 07:50:36 -0700

    [ https://issues.apache.org/jira/browse/PDFBOX-903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13041603#comment-13041603 ]


Joseph Vychtrle commented on PDFBOX-903:
----------------------------------------

Seconded...

There is also a NullPointerException in the test case when using PDFBox 
1.6.0-SNAPSHOT.

Because of this issue, something like the following is impossible. Even with 
a Unicode font loaded, the text will not come out correctly until this issue 
is fixed:

        // Requires PDFBox 1.x and Apache Commons IO on the classpath.
        private static void createPdfFromTxtFile(File from, File to)
                        throws IOException, URISyntaxException, COSVisitorException {
                // URL.toURI() throws the checked URISyntaxException declared above.
                File f = new File(EntitiesGenerator.class.getClassLoader()
                                .getResource("fonts/SomeUnicodeTrueTypeFonts.ttf").toURI());

                PDDocument document = null;
                try {
                        document = new PDDocument();
                        PDPage page = new PDPage();
                        document.addPage(page);

                        // Load the TrueType font into the document.
                        PDFont font = PDTrueTypeFont.loadTTF(document, f);

                        PDPageContentStream contentStream =
                                        new PDPageContentStream(document, page);

                        contentStream.beginText();
                        contentStream.setFont(font, 5);
                        // Assumes the input file is UTF-8; without an explicit
                        // charset the platform default would be used.
                        contentStream.drawString(FileUtils.readFileToString(from, "UTF-8"));
                        contentStream.endText();
                        contentStream.close();
                        document.save(new FileOutputStream(to));
                } finally {
                        if (document != null)
                                document.close();
                }
        }

> Unicode text getting mangled via TextToPDF + PDFTextStripper
> ------------------------------------------------------------
>
>                 Key: PDFBOX-903
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-903
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 1.4.0
>            Reporter: Nick Burch
>         Attachments: TestUnicodeText.java, TestUnicodeText.java
>
>
> I'm trying to round trip some text through PDFBox, but I'm finding that along 
> the way Unicode text is getting mangled and coming back as the wrong 
> characters.
> The process I'm following is to use TextToPDF to generate a PDF, then read 
> it back in again with PDFTextStripper. I'm not sure yet whether the problem 
> arises during generation or reading, but I've a nasty feeling there might 
> be an issue with both. (I've seen issues with code that does one part or the 
> other.)
> Attached is a unit test written against trunk. It creates a series of Reader 
> objects based on both ASCII and non-ASCII text, creates a PDF using 
> TextToPDF, then compares the text. It includes a test that verifies that the 
> corruption isn't caused by the readers, and another that fails showing that 
> the text was corrupted by the roundtrip.
> Ideally the test would also look in the dictionary to check what was stored 
> there, but I don't know enough about the file format to manage that. Will 
> hopefully look into that shortly though.
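
For what it's worth, the reader-side check described in the report can be
sketched with the JDK alone. This is only a minimal sketch and not the attached
TestUnicodeText.java: the class ReaderRoundTrip and the helpers readAll and
roundTrips are hypothetical names, and neither TextToPDF nor PDFTextStripper is
involved. It shows that a Reader over correctly decoded bytes preserves
non-ASCII text, while forcing the same text through a single-byte charset such
as ISO-8859-1 loses it. Whether anything similar happens inside PDFBox is an
assumption that the issue does not confirm.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; not part of PDFBox or of the attached test.
final class ReaderRoundTrip {

    /** Drains a Reader into a String. */
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    /**
     * Encodes the text with the given charset, reads it back through a
     * Reader using the same charset, and reports whether it survived.
     */
    static boolean roundTrips(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset); // unmappable chars become '?'
        try (Reader reader =
                new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            return text.equals(readAll(reader));
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String ascii = "Hello, world";
        // Cyrillic and CJK sample text, written as escapes to keep the source ASCII.
        String nonAscii = "\u041f\u0440\u0438\u0432\u0435\u0442 \u4e16\u754c";

        System.out.println(roundTrips(ascii, StandardCharsets.UTF_8));         // true
        System.out.println(roundTrips(nonAscii, StandardCharsets.UTF_8));      // true
        System.out.println(roundTrips(ascii, StandardCharsets.ISO_8859_1));    // true
        System.out.println(roundTrips(nonAscii, StandardCharsets.ISO_8859_1)); // false
    }
}
```

A check like this only rules the readers in or out. Telling generation
problems apart from extraction problems would still need a look at what is
actually stored in the PDF, as the report says.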

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
