[jira] [Commented] (PDFBOX-922) True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)

Antti Lankila (JIRA) Sun, 01 Jun 2014 01:51:17 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014939#comment-14014939
 ]


Antti Lankila commented on PDFBOX-922:
--------------------------------------

I'm no expert with PDF, but I looked into the problem yesterday and this 
morning, and came up with this.

h5. Candidate specification for Unicode text writing support

1. Each TTF font, when loaded, will be embedded as a stream in the document. 
Two font descriptors will be created per call:
* the TTF descriptor itself
* a CIDFont Type 2 descriptor, which will be referenced by the TTF descriptor

2. The CMap maps from character code to character id (CID). COSString will 
write Unicode strings when required, and it's probably simplest if the CIDs 
are also just Unicode code points.
* Encoding will be Identity-H.
* To support copy-paste, the ToUnicode table needs to be provided, and it is 
also an identity map.
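
A minimal sketch of what point 2 implies for the bytes written into a content stream, assuming CIDs equal Unicode code points and the font uses Identity-H. Identity-H reads two bytes per character code, high byte first, so for characters in the Basic Multilingual Plane the codes coincide with UTF-16BE. The class and method names are hypothetical and this is not PDFBox API; characters outside the BMP would come out as surrogate pairs and are not handled here.

```java
import java.nio.charset.StandardCharsets;

class IdentityHText {
    /** Encodes text as a PDF hex string of 2-byte codes, one per UTF-16 code unit. */
    static String toHexString(String text) {
        // UTF-16BE gives two bytes per BMP character, high byte first, no BOM.
        byte[] codes = text.getBytes(StandardCharsets.UTF_16BE);
        StringBuilder sb = new StringBuilder("<");
        for (byte b : codes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.append('>').toString();
    }

    public static void main(String[] args) {
        // Greek epsilon (U+03B5) followed by lambda (U+03BB)
        System.out.println(toHexString("\u03b5\u03bb")); // prints <03B503BB>
    }
}
```

The resulting string is what would precede a Tj operator; with an identity ToUnicode map the same two-byte values are what a viewer would hand back on copy-paste.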

3. Character id is mapped to glyph id (GID). There are actually two major 
CIDFont types:
* CIDFont Type 0: contains CFF or OpenType fonts that have intrinsic CID->glyph 
mapping.
** this presumably means that the CIDs are font-specific, and therefore the 
CMap must supply a mapping from character code to CID that is not Identity-H 
for these fonts.
* CIDFont Type 2: contains TrueType fonts which must have a CIDToGIDMap that 
declares how to map from CID to GID.
** TTF files will probably have a Windows platform Unicode encoding table, 
which is the Unicode code point to glyph id map, and thus the CIDToGIDMap we 
must write. The map can be streamed and compressed and should not take much 
space.
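
To make the CIDToGIDMap part of point 3 concrete, here is a rough sketch assuming the code point to glyph id pairs have already been read from the font's Windows Unicode table into a Map (the parsing itself is not shown, and the glyph ids used in main are made up). The stream layout is two bytes per CID, high byte first, with the GID for CID c at offsets 2*c and 2*c + 1. The class name and the choice to size the table up to the highest mapped CID are assumptions of this sketch, not PDFBox code.

```java
import java.io.ByteArrayOutputStream;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.Deflater;

class CidToGidMapSketch {
    /** Builds the uncompressed CIDToGIDMap data from a code point -> GID map. */
    static byte[] build(Map<Integer, Integer> codePointToGid) {
        int maxCid = 0;
        for (int cp : codePointToGid.keySet()) {
            if (cp >= 0 && cp <= 0xFFFF) {
                maxCid = Math.max(maxCid, cp);
            }
        }
        // Unmapped CIDs stay 0, i.e. they point at glyph 0 (.notdef).
        byte[] table = new byte[(maxCid + 1) * 2];
        for (Map.Entry<Integer, Integer> e : codePointToGid.entrySet()) {
            int cid = e.getKey();
            int gid = e.getValue();
            if (cid < 0 || cid > 0xFFFF) {
                continue; // 2-byte codes cannot address anything beyond the BMP
            }
            table[2 * cid] = (byte) (gid >> 8);  // high byte
            table[2 * cid + 1] = (byte) gid;     // low byte
        }
        return table;
    }

    /** Deflates the table, as a stream with a Flate filter would store it. */
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        Map<Integer, Integer> cmap = new TreeMap<>();
        cmap.put(0x0041, 36);   // 'A' -> hypothetical glyph id
        cmap.put(0x03B5, 304);  // Greek epsilon -> hypothetical glyph id
        byte[] raw = build(cmap);
        byte[] packed = deflate(raw);
        System.out.println(raw.length + " bytes raw, " + packed.length + " bytes deflated");
    }
}
```

Because most entries are zero for a typical font, the table deflates well, which is consistent with the expectation that it should not take much space.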

h5. Consequences of the design

* PDF as a document will be remarkably readable, though COSString tends to use 
hexadecimal format way too often. (Bug to be fixed? I feel that COSString 
should be based on chars (e.g. StringBuilder), not bytes 
(ByteArrayOutputStream).)
* design is relatively simple; the hard work will be writing the CIDToGIDMap 
table, but this will be based on the Windows Unicode encoding table in TTF and 
should be trivial to generate.
* fonts will have all of their characters embedded in the PDF

I can't promise when I'll have time to implement this, but as far as I 
understand it, something like this is what it takes.

> True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)
> --------------------------------------------------------------------
>
>                 Key: PDFBOX-922
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-922
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Writing
>    Affects Versions: 1.3.1
>         Environment: JDK 1.6 / OS irrelevant, tried against 1.3.1 and 1.2.0
>            Reporter: Thanos Agelatos
>            Assignee: Andreas Lehmkühler
>
> PDFBox cannot embed Identity-H or Identity-V type TTF fonts in the PDF it 
> creates, making it impossible to create PDFs in any language apart from 
> English and ones supported in WinAnsiEncoding. This happens because the 
> method PDTrueTypeFont.loadTTF has WinAnsiEncoding hardcoded inside, and 
> there are no Identity-H or Identity-V Encoding classes provided (to set 
> afterwards via PDFont.setFont() )
> This excludes the following languages plus many others:
> - Greek
> - Bulgarian
> - Swedish
> - Baltic languages
> - Maltese
> The PDF created contains garbled characters and/or squares.
> Simple test case:
>     PDDocument doc = null;
>     try {
>         doc = new PDDocument();
>         PDPage page = new PDPage();
>         doc.addPage(page);
>         // extract fonts for fields
>         byte[] arialNorm = extractFont("arial.ttf");
>         //byte[] arialBold = extractFont("arialbd.ttf");
>         //PDFont font = PDType1Font.HELVETICA;
>         PDFont font = PDTrueTypeFont.loadTTF(doc,
>                 new ByteArrayInputStream(arialNorm));
>
>         PDPageContentStream contentStream =
>                 new PDPageContentStream(doc, page);
>         contentStream.beginText();
>         contentStream.setFont(font, 12);
>         contentStream.moveTextPositionByAmount(100, 700);
>         // text here may appear garbled; insert any text in Greek,
>         // Bulgarian or Maltese
>         contentStream.drawString("Hello world from PDFBox ελληνικά");
>         contentStream.endText();
>         contentStream.close();
>         doc.save("pdfbox.pdf");
>         System.out.println(" created!");
>     } catch (Exception ioe) {
>         ioe.printStackTrace();
>     } finally {
>         if (doc != null) {
>             try { doc.close(); } catch (Exception e) {}
>         }
>     }
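
The failure mode in the test case can be illustrated without PDFBox. The sketch below approximates WinAnsiEncoding with the JDK's windows-1252 charset (an assumption: the two differ in a few code points) and reports which characters of a string have no code in it. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class WinAnsiCheck {
    /** Returns the characters of text that windows-1252 cannot encode. */
    static String unencodable(String text) {
        CharsetEncoder encoder = Charset.forName("windows-1252").newEncoder();
        StringBuilder missing = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!encoder.canEncode(c)) {
                missing.append(c);
            }
        }
        return missing.toString();
    }

    public static void main(String[] args) {
        // The Greek word from the test case; none of its letters fit in one WinAnsi byte.
        String greek = "\u03b5\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac";
        System.out.println(unencodable("Hello world from PDFBox " + greek).length()
                + " characters cannot be encoded");
    }
}
```

Any character reported here has no single-byte code under a WinAnsi-style encoding, which is where the garbled characters and squares come from.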



