[
https://issues.apache.org/jira/browse/PDFBOX-5808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841093#comment-17841093
]
Tilman Hausherr commented on PDFBOX-5808:
-----------------------------------------
I tested just the tokenizer changes (and the related ones), and "affine" now
looks better. However, text extraction doesn't work for "in", regardless of
whether it appears alone or within "affine". The cause is probably in the font
itself: "ff" maps to Unicode U+FB00, but "in" maps to U+E0A2, which is a
"private use" code point according to
https://www.compart.com/de/unicode/U+E0A2 .
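
The private-use classification can also be checked directly from Java. This is
a minimal sketch and not part of PDFBox; the class name PrivateUseCheck and the
method isPrivateUse are made up for illustration, and only java.lang.Character
is used.

```java
final class PrivateUseCheck {

    /**
     * Returns true if the code point has the Unicode general category
     * "Co" (private use). Hypothetical helper, not a PDFBox API.
     */
    static boolean isPrivateUse(int codePoint) {
        return Character.getType(codePoint) == Character.PRIVATE_USE;
    }

    public static void main(String[] args) {
        // U+FB00 is the standard "ff" ligature.
        System.out.println(isPrivateUse(0xFB00)); // prints "false"
        // U+E0A2 is the code point the font reports for "in".
        System.out.println(isPrivateUse(0xE0A2)); // prints "true"
    }
}
```

Unicode assigns no standard meaning to a private-use code point, which would be
consistent with the extraction problem for "in" described above.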
> Add support for GSUB Lookup Type 3
> ----------------------------------
>
> Key: PDFBOX-5808
> URL: https://issues.apache.org/jira/browse/PDFBOX-5808
> Project: PDFBox
> Issue Type: New Feature
> Components: FontBox
> Affects Versions: 3.0.2 PDFBox
> Reporter: Fabrice Calafat
> Priority: Major
>
> Add support for lookup type 3 (Alternate Substitution) when handling GSUB:
> [https://learn.microsoft.com/en-us/typography/opentype/spec/gsub#AS]
> The first available substitution glyph can be used (as is done in other
> libraries).
>
> Also, the current implementation of CompoundCharacterTokenizer doesn't
> account for collisions between ligatures.
> For example, if a font supports ligatures for _att_ and _en_, the current
> implementation will not tokenize the word _attention_ properly. This is
> because the regex implementation doesn't allow for a proper split.
>
> I'll open a proposed implementation for the above
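
To illustrate the tokenization problem from the issue description, here is a
minimal sketch of a greedy longest-match split that avoids a regex. It is
neither the CompoundCharacterTokenizer code nor the proposed patch; the class
LigatureTokenizerSketch and its tokenize method are hypothetical, and it works
on plain characters for readability, which is an assumption made for the
example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class LigatureTokenizerSketch {

    /**
     * Splits text into tokens, preferring the longest known ligature at
     * each position and falling back to a single character otherwise.
     */
    static List<String> tokenize(String text, Set<String> ligatures) {
        // Longest ligature length bounds the lookahead at each position.
        int maxLen = 1;
        for (String ligature : ligatures) {
            maxLen = Math.max(maxLen, ligature.length());
        }

        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            String match = null;
            // Try the longest candidate first, then shrink.
            for (int len = Math.min(maxLen, text.length() - i); len > 1; len--) {
                String candidate = text.substring(i, i + len);
                if (ligatures.contains(candidate)) {
                    match = candidate;
                    break;
                }
            }
            if (match == null) {
                match = text.substring(i, i + 1); // no ligature starts here
            }
            tokens.add(match);
            i += match.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> ligatures = new HashSet<>(Arrays.asList("att", "en"));
        // prints "[att, en, t, i, o, n]"
        System.out.println(tokenize("attention", ligatures));
    }
}
```

Scanning left to right and consuming the longest match means the _en_ that
follows _att_ is still found, even though the two ligatures sit next to each
other in _attention_.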