[jira] [Commented] (PDFBOX-922) True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)

Antti Lankila (JIRA) Fri, 13 Jun 2014 02:02:31 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030399#comment-14030399
 ]


Antti Lankila commented on PDFBOX-922:
--------------------------------------

Anyway, let's take a look at the changes required in PDFBox to get the text 
writing to work properly.

- drawString() in PDPageContentStream just writes the text into the PDF however 
COSString chooses to represent it. This is not the right thing to do. When 
the font is a CID-keyed font, every glyph code is 16 bits wide by definition, and 
COSString won't necessarily notice and write it correctly. Therefore, 
drawString() must know which font is currently in use, and ask that font to 
encode the String to whatever byte sequence it takes to draw those glyphs. So, 
PDFont must be added to the drawString() API, and PDFont ought to have a method 
"public byte[] encode(String)". I would suggest always writing displayable text 
as <hex chars> hex strings, because this form is the simplest to implement and 
the easiest to make bug free.
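
To illustrate the hex-string suggestion, here is a minimal sketch in Java. 
HexStringSketch and toHexString are made-up names, not PDFBox API, and the 
sample codes are arbitrary; the only assumption is that the font has already 
turned the text into bytes.

```java
// Hypothetical helper, not part of PDFBox: formats already-encoded font bytes
// as a PDF hexadecimal string, which is delimited by angle brackets.
final class HexStringSketch {
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    static String toHexString(byte[] encoded) {
        StringBuilder sb = new StringBuilder(encoded.length * 2 + 2);
        sb.append('<');
        for (byte b : encoded) {
            sb.append(HEX[(b >> 4) & 0x0F]); // high nibble
            sb.append(HEX[b & 0x0F]);        // low nibble
        }
        return sb.append('>').toString();
    }

    public static void main(String[] args) {
        // Two arbitrary 16-bit codes, 0x0003 and 0x01A9, high byte first.
        byte[] codes = {0x00, 0x03, 0x01, (byte) 0xA9};
        System.out.println(toHexString(codes) + " Tj"); // prints "<000301A9> Tj"
    }
}
```

Every byte becomes exactly two hex digits, so no escaping rules for 
parentheses or backslashes are needed, which is why this form is hard to get 
wrong.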

- PDFont needs a clearly specified API for converting a Java String into the 
byte sequence the font expects. That process is usually called encoding, and the 
reverse process of taking a byte array and interpreting it as a String is called 
decoding. Observe that there are no methods in PDFont called decode(), and I 
have a hard time figuring out what any one of these methods actually does, 
because everything seems to be called "encode" or "lookup". It seems that 
encode(byte[], int, int) performs decoding, so it should be renamed accordingly. 
In general I'd recommend pushing the encode/decode job down to the font layer. 
Provide just two methods: "byte[] encode(String)" and "String decode(byte[])". 
Their job is to convert between the byte sequences required by that font and 
Java Strings, and they handle full runs of text, not just single characters. 
They will then use single- or multibyte encodings as the font requires, without 
the higher level having to do crazy stuff like processEncodedText() currently 
does in PDFStreamEngine.
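
A rough sketch of what such an encode/decode pair could look like for a font 
that uses two-byte codes. TwoByteFontCodec and its map() method are 
hypothetical names, not PDFBox API, and the mapping table is filled by hand 
here; a real font would presumably derive it from its own tables.

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the PDFBox API: converts whole runs of text to and
// from the two-byte codes that a CID-keyed font might use.
final class TwoByteFontCodec {
    private final Map<Integer, Integer> codePointToCode = new HashMap<>();
    private final Map<Integer, Integer> codeToCodePoint = new HashMap<>();

    // Registers one Unicode code point <-> 16-bit code mapping (assumed input).
    void map(int codePoint, int code) {
        codePointToCode.put(codePoint, code);
        codeToCodePoint.put(code, codePoint);
    }

    byte[] encode(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            Integer code = codePointToCode.get(cp);
            if (code == null) {
                throw new IllegalArgumentException("No code for U+" + Integer.toHexString(cp));
            }
            out.write(code >> 8); // high byte first
            out.write(code);      // write(int) keeps only the low 8 bits
        }
        return out.toByteArray();
    }

    String decode(byte[] bytes) {
        if (bytes.length % 2 != 0) {
            throw new IllegalArgumentException("Expected an even number of bytes");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i += 2) {
            int code = ((bytes[i] & 0xFF) << 8) | (bytes[i + 1] & 0xFF);
            Integer cp = codeToCodePoint.get(code);
            if (cp == null) {
                throw new IllegalArgumentException("Unknown code " + code);
            }
            sb.appendCodePoint(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        TwoByteFontCodec codec = new TwoByteFontCodec();
        codec.map(0x03B5, 0x0123); // Greek small epsilon, arbitrary code
        byte[] encoded = codec.encode("\u03B5");
        System.out.println(codec.decode(encoded).equals("\u03B5")); // prints "true"
    }
}
```

The caller only ever sees Strings and byte arrays; whether the codes are one or 
two bytes wide stays inside the font-level class.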

- When implementing encoding, never ask for the char[] array of a Java String. 
Instead, iterate by code point: "for (int i = 0, cp; i < string.length(); i += 
Character.charCount(cp)) { cp = string.codePointAt(i); ... now encode the 
codepoint ... }". This will handle the UTF-16 surrogate pairs correctly.
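
A small runnable variant of that loop, to show the difference between UTF-16 
code units and code points. CodePointWalk is a made-up class name, and the 
loop is written with the increment inside the body, which is equivalent.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: collects the Unicode code points of a String.
final class CodePointWalk {
    static List<Integer> codePoints(String s) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);    // reads one or two chars as needed
            result.add(cp);
            i += Character.charCount(cp); // advance by 1, or by 2 for a surrogate pair
        }
        return result;
    }

    public static void main(String[] args) {
        // 'a' followed by U+1D538, which Java stores as the pair D835 DD38.
        String s = "a\uD835\uDD38";
        System.out.println(s.length());    // prints 3 (UTF-16 code units)
        System.out.println(codePoints(s)); // prints [97, 120120]
    }
}
```

Iterating over the char[] instead would hand the encoder the two surrogate 
halves as if they were separate characters.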

- There are unfortunately very many ways to encode text in PDF, and especially 
if text needs to be decodable from the byte stream generated by other programs, 
the full complexity must be faced and implemented. These are to be solved on a 
case-by-case basis in the PDFont hierarchy. The encode and decode methods in the 
top-level PDFont class should be declared abstract to reflect the fact that 
encoding depends on the particular subtype of the font. It seems that Type1, 
TrueType, Type3, and CIDType0 and CIDType2 fonts require different handling 
from each other. It may be that for some of these fonts the implementation is 
the same, because the actual mechanics can be handled by varying the Encoding 
instance, though.
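
A sketch of that last point: an abstract base with one single-byte 
implementation whose behaviour depends only on the table it is given. 
SketchFont and TableSketchFont are hypothetical names, not the PDFBox class 
hierarchy, and the two tables below are tiny hand-made examples.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical base class: each font subtype supplies its own encode/decode.
abstract class SketchFont {
    abstract byte[] encode(String text);
    abstract String decode(byte[] bytes);

    public static void main(String[] args) {
        Map<Integer, Integer> latin = new HashMap<>();
        latin.put(0xE9, 0x00E9); // code 0xE9 -> e with acute
        Map<Integer, Integer> greek = new HashMap<>();
        greek.put(0xE5, 0x03B5); // made-up table entry: code 0xE5 -> epsilon

        // Same class, different tables, different bytes on the wire.
        SketchFont a = new TableSketchFont(latin);
        SketchFont b = new TableSketchFont(greek);
        System.out.println(a.encode("\u00E9")[0] & 0xFF); // prints 233
        System.out.println(b.encode("\u03B5")[0] & 0xFF); // prints 229
    }
}

// One implementation shared by all single-byte fonts in this sketch;
// only the code table passed to the constructor varies.
final class TableSketchFont extends SketchFont {
    private final int[] codeToCodePoint = new int[256];
    private final Map<Integer, Integer> codePointToCode = new HashMap<>();

    // table maps a byte code (0..255) to a Unicode code point.
    TableSketchFont(Map<Integer, Integer> table) {
        Arrays.fill(codeToCodePoint, -1);
        for (Map.Entry<Integer, Integer> e : table.entrySet()) {
            codeToCodePoint[e.getKey()] = e.getValue();
            codePointToCode.put(e.getValue(), e.getKey());
        }
    }

    @Override
    byte[] encode(String text) {
        byte[] out = new byte[text.codePointCount(0, text.length())];
        int n = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            Integer code = codePointToCode.get(cp);
            if (code == null) {
                throw new IllegalArgumentException("No code for U+" + Integer.toHexString(cp));
            }
            out[n++] = (byte) code.intValue();
        }
        return out;
    }

    @Override
    String decode(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            int cp = codeToCodePoint[b & 0xFF];
            if (cp < 0) {
                throw new IllegalArgumentException("Unmapped code " + (b & 0xFF));
            }
            sb.appendCodePoint(cp);
        }
        return sb.toString();
    }
}
```

A multibyte font would be a second subclass with its own encode and decode, 
while callers keep working against the abstract type.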

> True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)
> --------------------------------------------------------------------
>
>                 Key: PDFBOX-922
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-922
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Writing
>    Affects Versions: 1.3.1
>         Environment: JDK 1.6 / OS irrelevant, tried against 1.3.1 and 1.2.0
>            Reporter: Thanos Agelatos
>            Assignee: Andreas Lehmkühler
>         Attachments: pdfbox-unicode.diff, pdfbox-unicode2.diff
>
>
> PDFBox cannot embed Identity-H or Identity-V type TTF fonts in the PDF it 
> creates, making it impossible to create PDFs in any language apart from 
> English and ones supported in WinAnsiEncoding. This happens because the 
> method PDTrueTypeFont.loadTTF has WinAnsiEncoding hardcoded inside, 
> and there are no Identity-H or Identity-V Encoding classes provided (to set 
> afterwards via PDFont.setFont() ).
> This excludes the following languages plus many others:
> - Greek
> - Bulgarian
> - Swedish
> - Baltic languages
> - Maltese
> The PDF created contains garbled characters and/or squares.
> Simple test case:
>     PDDocument doc = null;
>     try {
>         doc = new PDDocument();
>         PDPage page = new PDPage();
>         doc.addPage(page);
>         // extract fonts for fields
>         byte[] arialNorm = extractFont("arial.ttf");
>         //byte[] arialBold = extractFont("arialbd.ttf");
>         //PDFont font = PDType1Font.HELVETICA;
>         PDFont font = PDTrueTypeFont.loadTTF(doc, new ByteArrayInputStream(arialNorm));
>
>         PDPageContentStream contentStream = new PDPageContentStream(doc, page);
>         contentStream.beginText();
>         contentStream.setFont(font, 12);
>         contentStream.moveTextPositionByAmount(100, 700);
>         // text here may appear garbled; insert any text in Greek or Bulgarian or Maltese
>         contentStream.drawString("Hello world from PDFBox ελληνικά");
>         contentStream.endText();
>         contentStream.close();
>         doc.save("pdfbox.pdf");
>         System.out.println(" created!");
>     } catch (Exception ioe) {
>         ioe.printStackTrace();
>     } finally {
>         if (doc != null) {
>             try { doc.close(); } catch (Exception e) {}
>         }
>     }



--
This message was sent by Atlassian JIRA
(v6.2#6252)
