[jira] [Commented] (PDFBOX-922) True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)

Antti Lankila (JIRA) Sat, 14 Jun 2014 14:25:23 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031702#comment-14031702
 ]


Antti Lankila commented on PDFBOX-922:
--------------------------------------

Going to combine two posts into one...

"You said that you were using "Identity-H for charcode -> CID, and Identity for 
CID -> GID", which doesn't involve updating any cmaps."

Ah. I actually meant the cmap table in the TTF file. TrueType fonts contain 
cmap subtables that map the values of some specific encoding to glyph indexes. 
I can see that my phrasing was confusing.

Full ack on the CIDToGIDMap approach. That is a way to allow manipulating a 
font without having to re-encode text already written with the font.
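
To make the CIDToGIDMap idea concrete, here is a minimal sketch of the byte 
layout of such a stream: two bytes per CID, high-order byte first, so the glyph 
index for CID c sits at bytes 2c and 2c+1. The class and method names are 
hypothetical and this is not PDFBox API; it only illustrates why the table can 
be rewritten (for example after reordering glyphs) while the CIDs already 
written in content streams stay untouched.

```java
final class CidToGidMapSketch {
    /** Encodes gidForCid[cid] as a 2-byte big-endian entry per CID. */
    static byte[] encode(int[] gidForCid) {
        byte[] out = new byte[gidForCid.length * 2];
        for (int cid = 0; cid < gidForCid.length; cid++) {
            int gid = gidForCid[cid];
            if (gid < 0 || gid > 0xFFFF) {
                throw new IllegalArgumentException("GID out of range: " + gid);
            }
            out[2 * cid] = (byte) (gid >>> 8);   // high-order byte
            out[2 * cid + 1] = (byte) gid;       // low-order byte
        }
        return out;
    }

    /** Reads the glyph index stored for the given CID. */
    static int lookup(byte[] map, int cid) {
        return ((map[2 * cid] & 0xFF) << 8) | (map[2 * cid + 1] & 0xFF);
    }

    public static void main(String[] args) {
        // Hypothetical mapping: CID 0 -> GID 0, CID 1 -> GID 3, CID 2 -> GID 258
        byte[] map = encode(new int[] {0, 3, 258});
        System.out.println(lookup(map, 2));
    }
}
```

With an Identity CIDToGIDMap no such table is needed at all, since every CID 
is used directly as the glyph index.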

There must be some confusion about the 0x10000 CID limit. I simply meant that 
if a font contains a glyph whose Unicode code point is 0x10000 or above, then 
rendering that glyph requires that the CIDs not be treated as UCS-2 values, 
because UCS-2 has no way to represent such a code point. I was mostly trying 
to weigh the different alternatives. I still like identity mappings because 
they make the conversion from Unicode to the appropriate GID as simple as 
possible, at least for TTF fonts with a Windows Unicode cmap table.
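
The UCS-2 point can be checked with plain Java, independent of PDFBox. The 
sketch below (class and method names are made up for illustration) shows that 
a code point at or above 0x10000 does not fit into one 16-bit unit and needs a 
surrogate pair in UTF-16, so a 2-byte CID cannot simply be read as "the 
Unicode value" for such glyphs.

```java
final class Ucs2LimitDemo {
    /** True if the code point fits into a single 16-bit unit. */
    static boolean fitsInSixteenBits(int codePoint) {
        return codePoint >= 0 && codePoint <= 0xFFFF;
    }

    /** Number of 16-bit UTF-16 code units needed for the code point. */
    static int utf16Units(int codePoint) {
        return Character.charCount(codePoint);
    }

    /** The UTF-16 code units of the code point as hex, e.g. "03B5". */
    static String utf16Hex(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int epsilon = 0x03B5;   // GREEK SMALL LETTER EPSILON, inside the BMP
        int gClef = 0x1D11E;    // MUSICAL SYMBOL G CLEF, outside the BMP
        System.out.println(utf16Hex(epsilon) + " -> " + utf16Units(epsilon) + " unit(s)");
        System.out.println(utf16Hex(gClef) + " -> " + utf16Units(gClef) + " unit(s)");
    }
}
```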

On to the next one...

"Not quite: every CID can be up to 16-bits wide, but many (or for < 256 glyphs, 
all) will fit inside 8 bits. The byte-width of a string is controlled by 
whether or not it starts with a BOM, not which font it uses."

In my experience this is not the case. I know the standard says that the 
encoding of a PDF string is controlled by a BOM at its beginning, but this 
probably refers to other kinds of text, not the kind of text you show on a 
page! According to my testing, if you write text with a CID-keyed font, the 
BOM is treated as a CID and mapped to a character. Likewise, if you write with 
a font that is defined to have 8-bit characters, prepending a BOM just puts 
the characters for the BOM's bytes into the text. It was this latter behavior 
that I spotted originally: I tried to generate the ellipsis ("…") character 
with PDType1Font.HELVETICA, and saw the BOM characters appear in the text 
string, along with extra spaces between glyphs that came from the null bytes 
of the UTF-16 encoding.
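
The byte sequence behind that observation can be reproduced without PDFBox. 
The following sketch (hypothetical class name, standard JDK calls only) 
encodes a short string as UTF-16 with a BOM and prints the bytes. A simple 
8-bit font treats each of these bytes as its own character code, which would 
explain the two BOM characters (0xFE, 0xFF) and the gaps produced by the 0x00 
bytes.

```java
import java.nio.charset.StandardCharsets;

final class BomBytesDemo {
    /** Formats bytes as space-separated upper-case hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Java's UTF-16 charset encodes big-endian and prepends the BOM FE FF. */
    static byte[] utf16WithBom(String text) {
        return text.getBytes(StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        // "A" followed by U+2026 HORIZONTAL ELLIPSIS
        System.out.println(hex(utf16WithBom("A\u2026")));
        // prints "FE FF 00 41 20 26"
    }
}
```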


> True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)
> --------------------------------------------------------------------
>
>                 Key: PDFBOX-922
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-922
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Writing
>    Affects Versions: 1.3.1
>         Environment: JDK 1.6 / OS irrelevant, tried against 1.3.1 and 1.2.0
>            Reporter: Thanos Agelatos
>            Assignee: Andreas Lehmkühler
>         Attachments: pdfbox-unicode.diff, pdfbox-unicode2.diff
>
>
> PDFBox cannot embed Identity-H or Identity-V type TTF fonts in the PDF it 
> creates, making it impossible to create PDFs in any language apart from 
> English and the ones supported by WinAnsiEncoding. This behaviour occurs 
> because the method PDTrueTypeFont.loadTTF hardcodes WinAnsiEncoding, 
> and there are no Identity-H or Identity-V Encoding classes provided (to set 
> afterwards via PDFont.setFont() ).
> This excludes the following languages plus many others:
> - Greek
> - Bulgarian
> - Swedish
> - Baltic languages
> - Maltese
> The PDF created contains garbled characters and/or squares.
> Simple test case:
>     PDDocument doc = null;
>     try {
>         doc = new PDDocument();
>         PDPage page = new PDPage();
>         doc.addPage(page);
>         // extract fonts for fields
>         byte[] arialNorm = extractFont("arial.ttf");
>         //byte[] arialBold = extractFont("arialbd.ttf");
>         //PDFont font = PDType1Font.HELVETICA;
>         PDFont font = PDTrueTypeFont.loadTTF(doc,
>                 new ByteArrayInputStream(arialNorm));
>
>         PDPageContentStream contentStream =
>                 new PDPageContentStream(doc, page);
>         contentStream.beginText();
>         contentStream.setFont(font, 12);
>         contentStream.moveTextPositionByAmount(100, 700);
>         // text here may appear garbled; insert any text in Greek or
>         // Bulgarian or Maltese
>         contentStream.drawString("Hello world from PDFBox ελληνικά");
>         contentStream.endText();
>         contentStream.close();
>         doc.save("pdfbox.pdf");
>         System.out.println(" created!");
>     } catch (Exception ioe) {
>         ioe.printStackTrace();
>     } finally {
>         if (doc != null) {
>             try { doc.close(); } catch (Exception e) {}
>         }
>     }



--
This message was sent by Atlassian JIRA
(v6.2#6252)
