[jira] [Commented] (PDFBOX-922) True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)

Antti Lankila (JIRA) Mon, 16 Jun 2014 04:13:32 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14032334#comment-14032334
 ]


Antti Lankila commented on PDFBOX-922:
--------------------------------------

Ah... there are multiple ways to understand what "identity mapping" means. I've 
been using it in the sense that the PDF standard uses: Identity means f(x) = x, 
which implies that once you have CIDToGIDMap as Identity and Encoding as 
Identity-H, all the character codes and CIDs are just GIDs. When I discuss the 
possibility that CID values would be constrained to be valid Unicode code 
points, I use phrasing such as "CIDs are UCS-2". In that case, of course, we 
would still have the Identity-H mapping at the character code -> CID layer, 
but not at the CID -> GID layer.
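
To make the two layers concrete, here is a minimal sketch in plain Java. It is 
not PDFBox code: the class and method names are made up for illustration. It 
assumes a string encoded with Identity-H (two bytes per code, high byte first) 
and a CIDToGIDMap of Identity.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of PDFBox: illustrates the two mapping layers. */
final class IdentityMappingSketch {

    /** Layer 1, Identity-H: each code is two bytes, high byte first, and code == CID. */
    static int[] decodeIdentityH(byte[] text) {
        if (text.length % 2 != 0) {
            throw new IllegalArgumentException("Identity-H strings have an even length");
        }
        int[] cids = new int[text.length / 2];
        for (int i = 0; i < cids.length; i++) {
            cids[i] = ((text[2 * i] & 0xFF) << 8) | (text[2 * i + 1] & 0xFF);
        }
        return cids;
    }

    /** Layer 2, CIDToGIDMap /Identity: f(x) = x, so the CID already is the glyph ID. */
    static int cidToGidIdentity(int cid) {
        return cid;
    }

    public static void main(String[] args) {
        // Two codes, <0024> and <012C>, as they would appear in a content stream string.
        byte[] shown = {0x00, 0x24, 0x01, 0x2C};
        System.out.println(Arrays.toString(decodeIdentityH(shown)));
    }
}
```

With both layers being Identity, the bytes in the content stream are glyph IDs 
directly. In the "CIDs are UCS-2" variant only the second method would change.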

I believe that subsetting fonts is not a problem as long as the subsetting is 
not done after the fact by replacing the FontFile parameter. (Or if it is, 
then a CIDToGIDMap that matches the new glyph IDs must be provided, as you 
pointed out.)
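
As a rough sketch of what providing such a map involves: a CIDToGIDMap stream 
stores two bytes per CID, high byte first, so the glyph ID for CID c sits at 
bytes 2c and 2c+1. The class below is hypothetical (not PDFBox API) and takes 
the renumbered glyph IDs as a plain array indexed by CID. Returning glyph 0 for 
CIDs past the end of the map is an assumption of this sketch.

```java
/** Hypothetical helper, not part of PDFBox: serializes a CID -> GID table. */
final class CidToGidMapSketch {

    /** Encodes the table as stream data: two bytes per CID, high byte first. */
    static byte[] encode(int[] gidForCid) {
        byte[] out = new byte[gidForCid.length * 2];
        for (int cid = 0; cid < gidForCid.length; cid++) {
            int gid = gidForCid[cid];
            if (gid < 0 || gid > 0xFFFF) {
                throw new IllegalArgumentException("GID out of 16-bit range: " + gid);
            }
            out[2 * cid] = (byte) (gid >>> 8);
            out[2 * cid + 1] = (byte) gid;
        }
        return out;
    }

    /** Reads the GID for a CID back out of the encoded map. */
    static int lookup(byte[] map, int cid) {
        int offset = 2 * cid;
        if (cid < 0 || offset + 1 >= map.length) {
            return 0; // assumption of this sketch: unmapped CIDs fall back to glyph 0
        }
        return ((map[offset] & 0xFF) << 8) | (map[offset + 1] & 0xFF);
    }
}
```

After subsetting, the array would hold the new glyph ID at each CID that is 
still in use and 0 everywhere else.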

Of course, this only applies to TrueType fonts. Some font types apparently 
define CIDs to have a particular meaning, and they come with their own CID to 
GID programs. I assume such fonts also provide a meaning for each CID that we 
could use, such as the Unicode value or PostScript name for the CID, or some 
predefined encoding map that defines all valid CIDs and their interpretation.

You are right that the CMap will control the code length. I also can't see any 
good reason to generate anything but 16-bit character codes -- all that matters 
is that indexing all the glyphs is possible, and I'm going to guess that there 
are no non-composite fonts that have more than 65536 glyphs, so that keeps 
things simple on the generating side. However, existing PDF files could have 
combined single/multibyte CMaps. Those must leave no possibility of confusing 
which code length applies, so the byte ranges used for 8-bit codes can't be 
used as the prefix of 16-bit codes, and so on. This is rather complicated, and 
I doubt that the current code (which is also pretty ugly to look at) handles 
things correctly -- CodespaceRanges are not sorted by length as far as I can 
see.
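
For illustration, this is roughly how I would expect the code length to be 
determined with mixed-length codespace ranges. It is a sketch, not the current 
PDFBox code, and the class names are invented. It assumes that a code matches 
a range when every byte lies between the corresponding low and high bytes, and 
it tries the shorter ranges first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch, not PDFBox code: picks the code length from codespace ranges. */
final class CodespaceSketch {

    /** One codespace range, e.g. <8140> <9FFC>, stored as unsigned byte values. */
    static final class Range {
        final int[] low;
        final int[] high;

        Range(int[] low, int[] high) {
            if (low.length != high.length) {
                throw new IllegalArgumentException("low and high must have the same length");
            }
            this.low = low;
            this.high = high;
        }

        int length() {
            return low.length;
        }

        /** True if every byte at the offset lies within the matching low/high byte. */
        boolean matches(byte[] data, int offset) {
            if (offset + low.length > data.length) {
                return false;
            }
            for (int i = 0; i < low.length; i++) {
                int b = data[offset + i] & 0xFF;
                if (b < low[i] || b > high[i]) {
                    return false;
                }
            }
            return true;
        }
    }

    /** Returns the byte length of the code starting at offset, or -1 if no range matches. */
    static int codeLengthAt(List<Range> ranges, byte[] data, int offset) {
        List<Range> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingInt(Range::length)); // shortest ranges first
        for (Range r : sorted) {
            if (r.matches(data, offset)) {
                return r.length();
            }
        }
        return -1;
    }
}
```

With ranges <00> <80> and <8140> <9FFC>, the byte 0x41 is read as a one-byte 
code, while 0x81 0x40 is read as a two-byte code because 0x81 falls outside the 
single-byte range. That only works because the ranges don't overlap as 
prefixes.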

> True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)
> --------------------------------------------------------------------
>
>                 Key: PDFBOX-922
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-922
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Writing
>    Affects Versions: 1.3.1
>         Environment: JDK 1.6 / OS irrelevant, tried against 1.3.1 and 1.2.0
>            Reporter: Thanos Agelatos
>            Assignee: Andreas Lehmkühler
>         Attachments: pdfbox-unicode.diff, pdfbox-unicode2.diff
>
>
> PDFBox cannot embed Identity-H or Identity-V type TTF fonts in the PDF it 
> creates, making it impossible to create PDFs in any language apart from 
> English and the ones supported by WinAnsiEncoding. This behaviour occurs 
> because the method PDTrueTypeFont.loadTTF has WinAnsiEncoding hardcoded, 
> and there are no Identity-H or Identity-V Encoding classes provided (to set 
> afterwards via PDFont.setFont() )
> This excludes the following languages plus many others:
> - Greek
> - Bulgarian
> - Swedish
> - Baltic languages
> - Maltese
> The PDF created contains garbled characters and/or squares.
> Simple test case:
>     PDDocument doc = null;
>     try {
>         doc = new PDDocument();
>         PDPage page = new PDPage();
>         doc.addPage(page);
>         // extract fonts for fields
>         byte[] arialNorm = extractFont("arial.ttf");
>         //byte[] arialBold = extractFont("arialbd.ttf");
>         //PDFont font = PDType1Font.HELVETICA;
>         PDFont font = PDTrueTypeFont.loadTTF(doc, new ByteArrayInputStream(arialNorm));
>
>         PDPageContentStream contentStream = new PDPageContentStream(doc, page);
>         contentStream.beginText();
>         contentStream.setFont(font, 12);
>         contentStream.moveTextPositionByAmount(100, 700);
>         // text here may appear garbled; insert any text in Greek or
>         // Bulgarian or Maltese
>         contentStream.drawString("Hello world from PDFBox ελληνικά");
>         contentStream.endText();
>         contentStream.close();
>         doc.save("pdfbox.pdf");
>         System.out.println(" created!");
>     } catch (Exception ioe) {
>         ioe.printStackTrace();
>     } finally {
>         if (doc != null) {
>             try { doc.close(); } catch (Exception e) {}
>         }
>     }



--
This message was sent by Atlassian JIRA
(v6.2#6252)
