[jira] [Comment Edited] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

John Hewson (JIRA) Sat, 14 Apr 2018 17:58:58 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438540#comment-16438540
 ]


John Hewson edited comment on PDFBOX-4189 at 4/15/18 12:57 AM:
---------------------------------------------------------------

Hi guys, this is a really welcome contribution, thank you. With regard to 
PDFont#encode(String text) being non-final, I can add some insight, as I was the 
original designer of our current PDFont#encode mechanism.

Basically, the PDFont classes are designed to represent fonts identically to 
how they are represented when embedded in PDF files. So there's no support for 
OpenType, by design. A Type0 font knows nothing about OpenType (but we can 
relax this a bit, as I explain below).

So how can we use OpenType in PDFBox? The answer is that we do it one layer of 
abstraction up, during text _layout_ instead of text _encoding_. So you 
want to put your glyph substitution code inside PDPageContentStream#showText, 
or more precisely inside 
[PDPageContentStream#showTextInternal|https://github.com/apache/pdfbox/blob/7e721643c0b1fca9fdc349f78431f36e68abc097/pdfbox/src/main/java/org/apache/pdfbox/contentstream/PDAbstractContentStream.java#L256].

That way PDFont#encode(String text) can stay final :)

*OpenType*: In general, OpenType layouts consist of glyph _substitutions_ (via 
GSUB) and _positionings_ (via GPOS). Obviously it's not possible to handle 
positionings in PDFont#encode(), so that helps explain why showText() is the 
right place for OpenType, as showText performs both positioning and encoding. 
We also need to keep track of glyphs for subsetting, which is not possible in 
encode().

*Subsetting*: We currently track which glyphs need to be included in a subset 
by using their Unicode code point, but with GSUB enabled we will have to keep 
track of some substituted glyphs via their glyph id (GID), because the glyphs 
which result from a substitution don't necessarily have their own code points 
(no entry in the cmap table). This should be easy to add to TTFSubsetter as it 
already tracks glyph ids internally; we just need the ability to pass them in 
too, e.g. addGlyphId(int). Then PDPageContentStream#showText will be 
responsible for passing the glyph ids. But now we need showText to know about 
those glyph ids, which leads me to....
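
To make the subsetting point concrete, here is a minimal, self-contained sketch of 
a subsetter that accepts glyph ids directly as well as code points. It is not the 
real TTFSubsetter: the class SketchSubsetter, its constructor and its glyphIds() 
accessor are hypothetical names made up for illustration, and the cmap is modelled 
as a plain Map.

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical stand-in for a subsetter; NOT the real TTFSubsetter API. */
final class SketchSubsetter {
    // Assumption: the font's cmap is modelled as code point -> glyph id.
    private final Map<Integer, Integer> cmap;
    private final SortedSet<Integer> glyphIds = new TreeSet<>();

    SketchSubsetter(Map<Integer, Integer> cmap) {
        this.cmap = cmap;
    }

    /** Tracking by Unicode code point: only works for glyphs with a cmap entry. */
    void add(int codePoint) {
        Integer gid = cmap.get(codePoint);
        if (gid != null) {
            glyphIds.add(gid);
        }
    }

    /** Proposed addition: track a substituted glyph that has no code point. */
    void addGlyphId(int gid) {
        glyphIds.add(gid);
    }

    /** The glyph ids that the subset would have to contain. */
    SortedSet<Integer> glyphIds() {
        return glyphIds;
    }
}
```

With this shape, the caller that performs substitution would call addGlyphId for 
each glyph that came out of GSUB, and the code-point path stays unchanged for 
everything else.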

*Glyph IDs:* The JDK represents text which has been through OpenType layout as 
a 
[GlyphVector|https://docs.oracle.com/javase/7/docs/api/java/awt/font/GlyphVector.html]
 which encapsulates substitutions via GID and positioning via a transform 
associated with each glyph. PDFBox might want to do something similar. I think 
it would even be ok to add this to PDType0Font (because I'm suggesting a 
specific OpenType API so it doesn't interfere with our PDType0Font's 
non-OpenType assumption) in the form of a method such as: {{final 
PDFGlyphVector layout(String text)}} which is called from 
PDPageContentStream#showText instead of encode(text). I also think it would be 
fine to use instanceof to detect this case, because only PDType0Font need have 
this capability. I'm assuming PDFGlyphVector is our own very simple version of 
the JDK's GlyphVector, which is effectively just a vector of (gid, dx, dy) 
tuples. Then all that PDPageContentStream#showText needs to know how to do is 
to draw a PDFGlyphVector on the page, by converting it into the equivalent text 
drawing operations (Tj and the like). Because this patch is just for GSUB, all 
of those positioning values can just be zero, and we need not implement any 
actual glyph positioning in showText() yet :). Thus, for now, PDFGlyphVector 
will serve simply as an array of GIDs.
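
A rough sketch of that idea, assuming nothing about the eventual PDFBox API: every 
class name below (SketchGlyph, SketchGlyphVector, SketchFont, SketchType0Font, 
SketchContentStream) is hypothetical, the cmap is a plain Map, and the "GSUB" 
table is a toy pair-to-single substitution map rather than a real OpenType lookup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** One laid-out glyph: a glyph id plus offsets (always zero in this GSUB-only sketch). */
final class SketchGlyph {
    final int gid;
    final float dx;
    final float dy;

    SketchGlyph(int gid, float dx, float dy) {
        this.gid = gid;
        this.dx = dx;
        this.dy = dy;
    }
}

/** Hypothetical minimal glyph vector: an immutable list of (gid, dx, dy). */
final class SketchGlyphVector {
    private final List<SketchGlyph> glyphs;

    SketchGlyphVector(List<SketchGlyph> glyphs) {
        this.glyphs = Collections.unmodifiableList(new ArrayList<>(glyphs));
    }

    List<SketchGlyph> glyphs() {
        return glyphs;
    }
}

/** Stand-in for a font type that has no layout support. */
class SketchFont {
}

/** Stand-in for a Type0 font that offers an OpenType-specific layout method. */
final class SketchType0Font extends SketchFont {
    private final Map<Integer, Integer> cmap;            // code point -> gid
    private final Map<List<Integer>, Integer> pairSubst; // toy GSUB: [gid, gid] -> gid

    SketchType0Font(Map<Integer, Integer> cmap, Map<List<Integer>, Integer> pairSubst) {
        this.cmap = cmap;
        this.pairSubst = pairSubst;
    }

    SketchGlyphVector layout(String text) {
        // Step 1: map code points to gids (unmapped code points become gid 0).
        List<Integer> gids = new ArrayList<>();
        text.codePoints().forEach(cp -> gids.add(cmap.getOrDefault(cp, 0)));

        // Step 2: apply pair substitutions left to right; positions stay zero.
        List<SketchGlyph> out = new ArrayList<>();
        int i = 0;
        while (i < gids.size()) {
            Integer sub = (i + 1 < gids.size())
                    ? pairSubst.get(Arrays.asList(gids.get(i), gids.get(i + 1)))
                    : null;
            if (sub != null) {
                out.add(new SketchGlyph(sub, 0f, 0f));
                i += 2;
            } else {
                out.add(new SketchGlyph(gids.get(i), 0f, 0f));
                i += 1;
            }
        }
        return new SketchGlyphVector(out);
    }
}

final class SketchContentStream {
    /** Returns the gids that would be drawn; a real stream would emit Tj-style operators. */
    static List<Integer> showText(SketchFont font, String text) {
        List<Integer> written = new ArrayList<>();
        if (font instanceof SketchType0Font) {
            for (SketchGlyph g : ((SketchType0Font) font).layout(text).glyphs()) {
                written.add(g.gid);
            }
        }
        // Other font types would go through encode(text); not modelled here.
        return written;
    }
}
```

For example, with a cmap of f=10, i=11, x=12 and a single substitution 
[10, 11] -> 99, showText(font, "fix") yields the gids [99, 12]. Real Indic 
shaping needs far more than pair substitution, so treat this only as the shape of 
the data flow: layout produces gids, and showText consumes them.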

Phew! That was a lot of information. Just to be clear, the current patch is not 
compatible with subsetting without making some changes. P.S. Make sure any new 
APIs are {{final}}.


> Enable rendering of Indian languages, by reading and utilizing the GSUB table
> -----------------------------------------------------------------------------
>
>                 Key: PDFBOX-4189
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4189
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: FontBox, PDModel
>            Reporter: Palash Ray
>            Priority: Major
>         Attachments: Bengali-text-after.pdf, Bengali-text-before.pdf
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Implemented proper rendering of Indian languages, which need extensive Glyph 
> substitution. The GSUB table has been read and used effectively to replace 
> some compound words with their respective Glyphs. All tests are passing. I 
> have tested this for the Bengali font. Please review these changes and let me 
> know if it makes sense to incorporate these.


