[
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438540#comment-16438540
]
John Hewson edited comment on PDFBOX-4189 at 4/15/18 12:57 AM:
---------------------------------------------------------------
Hi guys, this is a really welcome contribution, thank you. With regard to
PDFont#encode(String text) being non-final, I can add some insight, as I was the
original designer of our current PDFont#encode mechanism.
Basically, the PDFont classes are designed to represent fonts identically to
how they are represented when embedded in PDF files. So there's no support for
OpenType, by design. A Type0 font knows nothing about OpenType (but we can
relax this a bit, as I explain below).
So how can we use OpenType in PDFBox? The answer is that we do it one layer of
abstraction up, during text _layout_ instead of text _encoding_. So you
want to put your glyph substitution code inside PDPageContentStream#showText,
or more precisely inside
[PDPageContentStream#showTextInternal|https://github.com/apache/pdfbox/blob/7e721643c0b1fca9fdc349f78431f36e68abc097/pdfbox/src/main/java/org/apache/pdfbox/contentstream/PDAbstractContentStream.java#L256].
That way PDFont#encode(String text) can stay final :)
*OpenType*: In general, OpenType layouts consist of glyph _substitutions_ (via
GSUB) and _positionings_ (via GPOS). Obviously it's not possible to handle
positionings in PDFont#encode(), so that helps explain why showText() is the
right place for OpenType, as showText performs both positioning and encoding.
We also need to keep track of glyphs for subsetting, which is not possible in
encode().
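To illustrate why substitution belongs in layout, here is a minimal sketch of a GSUB-style ligature substitution working on glyph ids rather than on encoded bytes. Everything in it is hypothetical: ToyGsubLayout, mapCodePoint, addLigature and the toy tables are made up for illustration and are not PDFBox or FontBox APIs, and real GSUB lookups are considerably richer than this greedy longest-match rule.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy layout step (hypothetical): cmap lookup followed by ligature substitution. */
final class ToyGsubLayout
{
    // Hypothetical cmap: Unicode code point -> glyph id.
    private final Map<Integer, Integer> cmap = new HashMap<Integer, Integer>();
    // Hypothetical ligature rules: sequence of glyph ids -> one replacement glyph id.
    private final Map<List<Integer>, Integer> ligatures = new HashMap<List<Integer>, Integer>();

    void mapCodePoint(int codePoint, int gid)
    {
        cmap.put(codePoint, gid);
    }

    void addLigature(int replacementGid, Integer... sequence)
    {
        ligatures.put(Arrays.asList(sequence), replacementGid);
    }

    /** Returns the glyph ids for the text after substitution. */
    List<Integer> layout(String text)
    {
        // Step 1: map each code point to a glyph id (0 is .notdef).
        List<Integer> gids = new ArrayList<Integer>();
        for (int i = 0; i < text.length(); )
        {
            int codePoint = text.codePointAt(i);
            Integer gid = cmap.get(codePoint);
            gids.add(gid == null ? 0 : gid);
            i += Character.charCount(codePoint);
        }
        // Step 2: replace the longest matching glyph sequence at each position.
        List<Integer> out = new ArrayList<Integer>();
        int pos = 0;
        while (pos < gids.size())
        {
            boolean matched = false;
            for (int end = gids.size(); end > pos + 1; end--)
            {
                Integer replacement = ligatures.get(gids.subList(pos, end));
                if (replacement != null)
                {
                    out.add(replacement);
                    pos = end;
                    matched = true;
                    break;
                }
            }
            if (!matched)
            {
                out.add(gids.get(pos));
                pos++;
            }
        }
        return out;
    }
}
```

In this toy, the replacement glyph id is never registered in the cmap, so a code-point-only view of the text has no way to refer to it.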
*Subsetting*: We currently track which glyphs need to be included in a subset
by using their Unicode code point, but with GSUB enabled we will have to keep
track of some substituted glyphs via their glyph id (GID), because the glyphs
which result from a substitution don't necessarily have their own code points
(no entry in the cmap table). This should be easy to add to TTFSubsetter, as it
already tracks glyph ids internally; we just need the ability to pass them in
too, e.g. addGlyphId(integer). Then PDPageContentStream#showText will be
responsible for passing the glyph ids. But now we need showText to know about
those glyph ids, which leads me to...
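A rough sketch of what tracking both kinds of entries could look like. ToySubsetter is a stand-in written for this comment, not the real TTFSubsetter, and addGlyphId is the hypothetical method suggested above. The sketch assumes glyph 0 (.notdef) is always kept.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy subsetter (hypothetical): collects the glyph ids that a subset must contain. */
final class ToySubsetter
{
    // Code point -> glyph id, as a cmap lookup would provide.
    private final Map<Integer, Integer> cmap;
    // Glyph ids to keep, in ascending order.
    private final Set<Integer> glyphIds = new TreeSet<Integer>();

    ToySubsetter(Map<Integer, Integer> cmap)
    {
        this.cmap = new HashMap<Integer, Integer>(cmap);
        glyphIds.add(0); // assumption: .notdef is always part of the subset
    }

    /** Existing style of tracking: by Unicode code point. */
    void add(int codePoint)
    {
        Integer gid = cmap.get(codePoint);
        if (gid != null)
        {
            glyphIds.add(gid);
        }
    }

    /** Hypothetical addition: track a glyph that has no code point of its own. */
    void addGlyphId(int gid)
    {
        glyphIds.add(gid);
    }

    Set<Integer> glyphIds()
    {
        return Collections.unmodifiableSet(glyphIds);
    }
}
```

Both paths end up in the same set of glyph ids, which is why passing glyph ids in directly should not disturb the existing code-point path.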
*Glyph IDs:* The JDK represents text which has been through OpenType layout as
a
[GlyphVector|https://docs.oracle.com/javase/7/docs/api/java/awt/font/GlyphVector.html]
which encapsulates substitutions via GID and positioning via a transform
associated with each glyph. PDFBox might want to do something similar. I think
it would even be ok to add this to PDType0Font (because I'm suggesting a
specific OpenType API so it doesn't interfere with our PDType0Font's
non-OpenType assumption) in the form of a method such as: {{final
PDFGlyphVector layout(String text)}} which is called from
PDPageContentStream#showText instead of encode(text). I also think it would be
fine to use instanceof to detect this case, because only PDType0Font need have
this capability. I'm assuming PDFGlyphVector is our own very simple version of
the JDK's GlyphVector, which is effectively just a vector of (gid, dx, dy)
tuples. Then all that PDPageContentStream#showText needs to know how to do is
to draw a PDFGlyphVector on the page, by converting it into the equivalent text
drawing operations (Tj and the like). Because this patch is just for GSUB, all
of those positioning values can just be zero, and we need not implement any
actual glyph positioning in showText() yet :). Thus PDFGlyphVector will serve
simply as an array of GIDs.
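A minimal sketch of such a vector of (gid, dx, dy) tuples, and of turning it into a string operand for Tj. The class body, the method names add, glyphs and toHexString, and the nested Glyph type are all assumptions made for illustration. The sketch further assumes the font is written with 2-byte codes that equal the glyph id (Identity-H with an identity CIDToGIDMap), and it rejects non-zero offsets because positioning is out of scope here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical, minimal stand-in for the PDFGlyphVector described above. */
final class PDFGlyphVector
{
    /** One laid-out glyph: glyph id plus a positioning offset. */
    static final class Glyph
    {
        final int gid;
        final float dx;
        final float dy;

        Glyph(int gid, float dx, float dy)
        {
            this.gid = gid;
            this.dx = dx;
            this.dy = dy;
        }
    }

    private final List<Glyph> glyphs = new ArrayList<Glyph>();

    void add(int gid, float dx, float dy)
    {
        glyphs.add(new Glyph(gid, dx, dy));
    }

    List<Glyph> glyphs()
    {
        return Collections.unmodifiableList(glyphs);
    }

    /**
     * Builds a hex string operand for Tj, assuming each code is the 2-byte
     * big-endian glyph id. Non-zero offsets are not supported in this sketch.
     */
    String toHexString()
    {
        StringBuilder sb = new StringBuilder("<");
        for (Glyph glyph : glyphs)
        {
            if (glyph.dx != 0 || glyph.dy != 0)
            {
                throw new UnsupportedOperationException("positioning is not part of this sketch");
            }
            sb.append(String.format("%04X", glyph.gid));
        }
        return sb.append('>').toString();
    }
}
```

For a vector holding glyph ids 10 and 3 with zero offsets, toHexString() yields <000A0003>, which showText could then write followed by Tj.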
Phew! That was a lot of information. Just to be clear, the current patch is not
compatible with subsetting without making some changes. P.S. Make sure any new
APIs are {{final}}.
> Enable rendering of Indian languages, by reading and utilizing the GSUB table
> -----------------------------------------------------------------------------
>
> Key: PDFBOX-4189
> URL: https://issues.apache.org/jira/browse/PDFBOX-4189
> Project: PDFBox
> Issue Type: New Feature
> Components: FontBox, PDModel
> Reporter: Palash Ray
> Priority: Major
> Attachments: Bengali-text-after.pdf, Bengali-text-before.pdf
>
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> Implemented proper rendering of Indian languages, which need extensive Glyph
> substitution. The GSUB table has been read and used effectively to replace
> some compound words with their respective Glyphs. All tests are passing. I
> have tested this for the Bengali font. Please review these changes and let me
> know if it makes sense to incorporate these.