[jira] [Commented] (PDFBOX-922) True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)

Antti Lankila (JIRA) Fri, 13 Jun 2014 00:44:24 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030350#comment-14030350
 ]


Antti Lankila commented on PDFBOX-922:
--------------------------------------

I do not really understand what makes you say that. Isn't a subsetted font 
basically just a wholly different font file, with a bunch of glyphs 
removed from the original one? For instance, assuming it is a TTF file, you 
drop a bunch of glyphs, update the cmaps to reference the appropriate 
glyph indexes, and then you have a new TTF file. If so, I can't see the 
problem, because you are providing all the same information as with the 
original font, only with fewer glyphs included.
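
To make the subsetting step concrete, here is a minimal sketch of the cmap 
update described above. It assumes the cmap is available as a plain map from 
code point to glyph index; the class and method names are hypothetical and are 
not part of PDFBox, and real subsetting also has to rewrite the glyf, loca and 
hmtx tables.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; this is not PDFBox or jPod code.
class SubsetCmapSketch {

    // Returns the subset font's cmap (code point -> new glyph index), given the
    // original cmap (code point -> old glyph index) and the text that must stay
    // renderable. New indexes are dense; index 0 stays reserved for .notdef.
    static Map<Integer, Integer> subsetCmap(Map<Integer, Integer> originalCmap,
                                            String usedText) {
        // Old glyph indexes that survive, sorted so the numbering is deterministic.
        TreeMap<Integer, Integer> oldToNew = new TreeMap<>();
        usedText.codePoints().forEach(cp -> {
            Integer oldGid = originalCmap.get(cp);
            if (oldGid != null && oldGid != 0) {
                oldToNew.put(oldGid, 0); // placeholder, renumbered below
            }
        });
        int next = 1;
        for (Map.Entry<Integer, Integer> e : oldToNew.entrySet()) {
            e.setValue(next++);
        }
        // Rewrite the cmap so each kept code point references its new glyph index.
        Map<Integer, Integer> newCmap = new TreeMap<>();
        usedText.codePoints().forEach(cp -> {
            Integer oldGid = originalCmap.get(cp);
            if (oldGid != null && oldGid != 0) {
                newCmap.put(cp, oldToNew.get(oldGid));
            }
        });
        return newCmap;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> cmap = new HashMap<>();
        cmap.put((int) 'A', 36);
        cmap.put((int) 'B', 37);
        cmap.put((int) 'C', 38);
        cmap.put(0x3B1, 200); // GREEK SMALL LETTER ALPHA
        // 'B' is unused, so its glyph is dropped and the rest are renumbered.
        System.out.println(subsetCmap(cmap, "CA\u03B1")); // prints {65=1, 67=2, 945=3}
    }
}
```

The point of the sketch is that glyph indexes change whenever the subset 
changes, so any text already encoded with the old indexes would need 
re-encoding.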

On the other hand, I do understand that if you write the text stream using 
the encoding of one font, then change the definition of the TTF font without 
re-encoding the text, you definitely run into problems. But the only 
possible way to keep CIDs stable is to define a standard for them, such as that 
CIDs are UCS-2. This can be done, but as far as I can tell it limits code 
points to the range below 0x10000, because CID font writing writes 16-bit 
character indexes by definition, and there is no notion of the surrogate pairs 
of UTF-16. It might not be a real problem in practice, but it's nevertheless a 
limitation that the identity mapping for glyph indexes does not have. The only 
limitation of the latter approach is that a single font can't have more than 
65536 glyphs.
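
The difference between the two conventions can be sketched as follows. This is 
an illustration under stated assumptions, not code from PDFBox: the class and 
method names are made up, and the cmap is assumed to be a plain map from code 
point to glyph index.

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; this is not PDFBox or jPod code.
class CidEncodingSketch {

    // "CIDs are UCS-2": each 2-byte code is the Unicode code point itself,
    // so anything at or above 0x10000 cannot be written.
    static byte[] encodeAsUcs2Cids(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        text.codePoints().forEach(cp -> {
            if (cp > 0xFFFF) {
                throw new IllegalArgumentException(
                        String.format("U+%X does not fit in a 16-bit CID", cp));
            }
            out.write(cp >> 8);   // high byte
            out.write(cp & 0xFF); // low byte
        });
        return out.toByteArray();
    }

    // Identity mapping: each 2-byte code is the glyph index looked up in the
    // font's cmap, so only the glyph count (at most 65536) is limited.
    static byte[] encodeAsGlyphIndexes(String text, Map<Integer, Integer> cmap) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        text.codePoints().forEach(cp -> {
            int gid = cmap.getOrDefault(cp, 0); // 0 is .notdef
            if (gid > 0xFFFF) {
                throw new IllegalArgumentException("glyph index too large: " + gid);
            }
            out.write(gid >> 8);
            out.write(gid & 0xFF);
        });
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String clef = "\uD834\uDD1E"; // U+1D11E, outside the 16-bit range
        Map<Integer, Integer> cmap = new HashMap<>();
        cmap.put(0x1D11E, 500);
        System.out.println(encodeAsGlyphIndexes(clef, cmap).length); // prints 2
        try {
            encodeAsUcs2Cids(clef);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```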

BTW, I've been quiet on this front because I solved my immediate problem by 
switching to a PDF rendering library called jPod. It's not as advanced as 
PDFBox, and it didn't support Unicode text either, but it was possible to get 
CID-keyed fonts to work on it without touching the library itself, just by 
providing appropriate COS objects and setting up an encoding based on the 
font's Windows Unicode cmap. I even managed to get copy-paste working by 
providing the ToUnicode PostScript program, so I got everything working nicely 
using that 2008-era library, but I had to write most of the PDF object 
factories myself.
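
For readers unfamiliar with it, a ToUnicode program of the kind mentioned 
above could be generated roughly like this. It is only a sketch of the general 
CMap layout, not the code used with jPod: the class and method names are 
hypothetical, the input is assumed to be a map from 16-bit character code to 
Unicode code point, and the 100-entry block size follows the usual CMap 
convention.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical illustration only; this is not PDFBox or jPod code.
class ToUnicodeSketch {

    // Builds the text of a ToUnicode CMap mapping 2-byte codes to UTF-16BE.
    static String buildToUnicodeCMap(SortedMap<Integer, Integer> codeToCodePoint) {
        StringBuilder sb = new StringBuilder();
        sb.append("/CIDInit /ProcSet findresource begin\n")
          .append("12 dict begin\n")
          .append("begincmap\n")
          .append("/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def\n")
          .append("/CMapName /Adobe-Identity-UCS def\n")
          .append("/CMapType 2 def\n")
          .append("1 begincodespacerange\n")
          .append("<0000> <FFFF>\n")
          .append("endcodespacerange\n");

        List<Map.Entry<Integer, Integer>> entries =
                new ArrayList<>(codeToCodePoint.entrySet());
        // Emit the mappings in blocks of at most 100 entries each.
        for (int start = 0; start < entries.size(); start += 100) {
            int end = Math.min(start + 100, entries.size());
            sb.append(end - start).append(" beginbfchar\n");
            for (Map.Entry<Integer, Integer> e : entries.subList(start, end)) {
                sb.append(String.format("<%04X> <", e.getKey()));
                // The target is UTF-16BE, so surrogate pairs are allowed here.
                for (char c : Character.toChars(e.getValue())) {
                    sb.append(String.format("%04X", (int) c));
                }
                sb.append(">\n");
            }
            sb.append("endbfchar\n");
        }

        sb.append("endcmap\n")
          .append("CMapName currentdict /CMap defineresource pop\n")
          .append("end\n")
          .append("end\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        SortedMap<Integer, Integer> map = new TreeMap<>();
        map.put(1, (int) 'A');
        map.put(2, 0x1D11E);
        System.out.print(buildToUnicodeCMap(map));
    }
}
```

Note that the Unicode side of the mapping can hold surrogate pairs even though 
the character codes themselves stay 16 bits wide.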

> True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)
> --------------------------------------------------------------------
>
>                 Key: PDFBOX-922
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-922
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Writing
>    Affects Versions: 1.3.1
>         Environment: JDK 1.6 / OS irrelevant, tried against 1.3.1 and 1.2.0
>            Reporter: Thanos Agelatos
>            Assignee: Andreas Lehmkühler
>         Attachments: pdfbox-unicode.diff, pdfbox-unicode2.diff
>
>
> PDFBox cannot embed Identity-H or Identity-V type TTF fonts in the PDF it 
> creates, making it impossible to create PDFs in any language apart from 
> English and ones supported in WinAnsiEncoding. This behaviour occurs 
> because method PDTrueTypeFont.loadTTF has WinAnsiEncoding hardcoded inside, 
> and there are no Identity-H or Identity-V Encoding classes provided (to set 
> afterwards via PDFont.setFont() )
> This excludes the following languages plus many others:
> - Greek
> - Bulgarian
> - Swedish
> - Baltic languages
> - Maltese
> The PDF created contains garbled characters and/or squares.
> Simple test case:
>               PDDocument doc = null;
>               try {
>                       doc = new PDDocument();
>                       PDPage page = new PDPage();
>                       doc.addPage(page);
>                       // extract fonts for fields
>                       byte[] arialNorm = extractFont("arial.ttf");
>                       //byte[] arialBold = extractFont("arialbd.ttf");
>                       //PDFont font = PDType1Font.HELVETICA;
>                       PDFont font = PDTrueTypeFont.loadTTF(doc,
>                                       new ByteArrayInputStream(arialNorm));
>
>                       PDPageContentStream contentStream =
>                                       new PDPageContentStream(doc, page);
>                       contentStream.beginText();
>                       contentStream.setFont(font, 12);
>                       contentStream.moveTextPositionByAmount(100, 700);
>                       // text here may appear garbled; insert any text in
>                       // Greek or Bulgarian or Maltese
>                       contentStream.drawString("Hello world from PDFBox ελληνικά");
>                       contentStream.endText();
>                       contentStream.close();
>                       doc.save("pdfbox.pdf");
>                       System.out.println(" created!");
>               } catch (Exception ioe) {
>                       ioe.printStackTrace();
>               } finally {
>                       if (doc != null) {
>                               try { doc.close(); } catch (Exception e) {}
>                       }
>               }



--
This message was sent by Atlassian JIRA
(v6.2#6252)
