[jira] [Commented] (PDFBOX-922) True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)

John Hewson (JIRA) Sat, 14 Jun 2014 13:55:20 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031695#comment-14031695
 ]


John Hewson commented on PDFBOX-922:
------------------------------------

{quote}
drawString() in PDPageContentStream just writes the text into PDF as any 
COSString would choose to represent it. This is not the right thing to do. When 
the font is a CID keyed font, every glyph is 16 bit wide by definition, and 
COSString won't necessarily notice and write it correctly.
{quote}

Not quite: every CID can be up to 16 bits wide, but many (or, for fonts with 
fewer than 256 glyphs, all) will fit inside 8 bits.

{quote}
Therefore, drawString() must know what font is currently being drawn, and ask 
that font to encode the String to whatever byte sequence it takes to draw those 
glyphs. So, PDFont must be added to the drawString() API, and PDFont ought to 
have a method for "public byte[] encode(String)".
{quote}

drawString() is only valid after setFont() has been called, so the font doesn't 
need adding to the drawString() API; we can just use the current font. 
PDFont#encode is a good idea, yes.
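
As a rough illustration of the PDFont#encode idea, here is a minimal sketch. It 
is not the PDFBox API: SketchFont, SingleByteSketchFont and TwoByteSketchFont 
are hypothetical names, and the character-to-code tables are assumed to be 
supplied by the caller. It only shows the font, not the content stream, 
deciding how many bytes each character becomes.

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for PDFont; not the actual PDFBox class.
abstract class SketchFont {
    /** Converts a Java String to the byte sequence this font expects. */
    abstract byte[] encode(String text);
}

// Assumption: a simple font where every supported character has a one-byte code.
class SingleByteSketchFont extends SketchFont {
    private final Map<Integer, Integer> unicodeToCode;

    SingleByteSketchFont(Map<Integer, Integer> unicodeToCode) {
        this.unicodeToCode = new HashMap<>(unicodeToCode);
    }

    @Override
    byte[] encode(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            Integer code = unicodeToCode.get(cp);
            if (code == null || code < 0 || code > 0xFF) {
                throw new IllegalArgumentException(
                        "No single-byte code for U+" + Integer.toHexString(cp));
            }
            out.write(code);
            i += Character.charCount(cp);
        }
        return out.toByteArray();
    }
}

// Assumption: a CID-keyed font whose codes are always written as two bytes,
// high byte first.
class TwoByteSketchFont extends SketchFont {
    private final Map<Integer, Integer> unicodeToCid;

    TwoByteSketchFont(Map<Integer, Integer> unicodeToCid) {
        this.unicodeToCid = new HashMap<>(unicodeToCid);
    }

    @Override
    byte[] encode(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            Integer cid = unicodeToCid.get(cp);
            if (cid == null || cid < 0 || cid > 0xFFFF) {
                throw new IllegalArgumentException(
                        "No CID for U+" + Integer.toHexString(cp));
            }
            out.write((cid >> 8) & 0xFF); // high byte
            out.write(cid & 0xFF);        // low byte
            i += Character.charCount(cp);
        }
        return out.toByteArray();
    }
}
```

With something like this, drawString() would only need to call encode() on the 
current font and write the resulting bytes as a string operand.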

{quote}
PDFont needs a clearly specified API which performs java String to 
font-specific encoding transformation.
{quote}

Yes, as above.

{quote}
Observe that there are no methods in PDFont called decode(), and I have a hard 
time figuring out what any one of these methods actually does, because 
everything seems to be called "encode" or "lookup". It seems that 
encode(byte[], int, int) performs decoding, so it should be renamed accordingly.
{quote}

Yes, I don't know if anybody knows what these methods are actually doing, 
including the original author.

{quote}
In general I'd recommend pushing the encode/decode job down to the font layer. 
Provide just two methods: "byte[] encode(String)" and "String decode(byte[])". 
Their job is to convert between the byte sequences required by that font and 
java Strings, and they handle full runs of text, not just single characters. 
They will then use single- or multibyte encodings as the font requires without 
the higher level having to do crazy stuff like processEncodedText() currently 
does in PDFStreamEngine.
{quote}

processEncodedText() is indeed crazy and needs fixing, but what you propose 
won't work because the 16-bit string encoding is not set by the font, it's set 
on a per-string basis by having that string start with a BOM.
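
A minimal sketch of that per-string check, assuming the marker is the UTF-16BE 
byte order mark (bytes FE FF) at the start of the string. TextStringSketch is a 
hypothetical helper, not a PDFBox class, and ISO-8859-1 is used purely as a 
stand-in for the single-byte fallback encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of PDFBox.
class TextStringSketch {

    /** True if the raw string bytes start with the UTF-16BE BOM (FE FF). */
    static boolean hasUtf16Bom(byte[] raw) {
        return raw.length >= 2
                && (raw[0] & 0xFF) == 0xFE
                && (raw[1] & 0xFF) == 0xFF;
    }

    /** Decodes one string, choosing the encoding from the string itself. */
    static String decode(byte[] raw) {
        if (hasUtf16Bom(raw)) {
            // Skip the two BOM bytes and read the rest as 16-bit units.
            return new String(raw, 2, raw.length - 2, StandardCharsets.UTF_16BE);
        }
        // Assumption: ISO-8859-1 stands in for the real single-byte encoding.
        return new String(raw, StandardCharsets.ISO_8859_1);
    }
}
```

The point of the sketch is only that the decision is taken per string, from the 
bytes of that string, rather than from a property of the font.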

{quote}
There are unfortunately very many ways to encode text in PDF, and especially if 
text needs to be decodable from the byte stream generated by other programs, 
the full complexity must be faced and implemented. These are to be solved in a 
case-by-case basis in the PDFont hierarchy. The PDFont highest class methods 
for encode and decode should be defined as abstract to reflect the fact that 
encoding depends on the particular subtype of the font.
{quote}

Yes, though as far as decoding the correct text is concerned, all you have to 
do is make sure that the ToUnicode map is built correctly - you can put any old 
garbage in the actual strings (and many PDFs do).
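
To illustrate why the ToUnicode map is what matters for extraction, here is a 
small sketch. ToUnicodeSketch is a hypothetical class, not the PDFBox CMap 
implementation, and it assumes a fixed number of bytes per character code, 
which real CMaps do not guarantee.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified code-to-Unicode table; not the PDFBox CMap class.
class ToUnicodeSketch {
    private final Map<Integer, String> codeToUnicode = new HashMap<>();
    private final int bytesPerCode; // assumption: fixed-width codes

    ToUnicodeSketch(int bytesPerCode) {
        this.bytesPerCode = bytesPerCode;
    }

    void put(int code, String unicode) {
        codeToUnicode.put(code, unicode);
    }

    /** Maps each character code in the string bytes to its Unicode text. */
    String decode(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + bytesPerCode <= bytes.length; i += bytesPerCode) {
            int code = 0;
            for (int j = 0; j < bytesPerCode; j++) {
                code = (code << 8) | (bytes[i + j] & 0xFF);
            }
            String unicode = codeToUnicode.get(code);
            // Unmapped codes become U+FFFD instead of failing.
            sb.append(unicode != null ? unicode : "\uFFFD");
        }
        return sb.toString();
    }
}
```

The codes themselves can be arbitrary: as long as the map sends each one to the 
right Unicode text, extraction gives the correct result.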

{quote}
It may be that for some of these fonts the implementation is same because the 
actual mechanics can be handled by varying the Encoding instance, though.
{quote}

Maybe, though the Encoding class is for Type1 fonts (and equivalent, e.g. 
Type1C) only.

> True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)
> --------------------------------------------------------------------
>
>                 Key: PDFBOX-922
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-922
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Writing
>    Affects Versions: 1.3.1
>         Environment: JDK 1.6 / OS irrelevant, tried against 1.3.1 and 1.2.0
>            Reporter: Thanos Agelatos
>            Assignee: Andreas Lehmkühler
>         Attachments: pdfbox-unicode.diff, pdfbox-unicode2.diff
>
>
> PDFBox cannot embed Identity-H or Identity-V type TTF fonts in the PDF it 
> creates, making it impossible to create PDFs in any language apart from 
> English and the ones supported in WinAnsiEncoding. This behaviour occurs 
> because the method PDTrueTypeFont.loadTTF has WinAnsiEncoding hardcoded 
> inside, and there are no Identity-H or Identity-V Encoding classes provided 
> (to set afterwards via PDFont.setFont() ).
> This excludes the following languages plus many others:
> - Greek
> - Bulgarian
> - Swedish
> - Baltic languages
> - Maltese
> The PDF created contains garbled characters and/or squares.
> Simple test case:
>     PDDocument doc = null;
>     try {
>         doc = new PDDocument();
>         PDPage page = new PDPage();
>         doc.addPage(page);
>         // extract fonts for fields
>         byte[] arialNorm = extractFont("arial.ttf");
>         //byte[] arialBold = extractFont("arialbd.ttf");
>         //PDFont font = PDType1Font.HELVETICA;
>         PDFont font = PDTrueTypeFont.loadTTF(doc, new ByteArrayInputStream(arialNorm));
>
>         PDPageContentStream contentStream = new PDPageContentStream(doc, page);
>         contentStream.beginText();
>         contentStream.setFont(font, 12);
>         contentStream.moveTextPositionByAmount(100, 700);
>         // text here may appear garbled; insert any text in Greek, Bulgarian or Maltese
>         contentStream.drawString("Hello world from PDFBox ελληνικά");
>         contentStream.endText();
>         contentStream.close();
>         doc.save("pdfbox.pdf");
>         System.out.println(" created!");
>     } catch (Exception ioe) {
>         ioe.printStackTrace();
>     } finally {
>         if (doc != null) {
>             try { doc.close(); } catch (Exception e) {}
>         }
>     }



--
This message was sent by Atlassian JIRA
(v6.2#6252)
