[ https://issues.apache.org/jira/browse/PDFBOX-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039896#comment-14039896 ]
John Hewson commented on PDFBOX-2149:
-------------------------------------

{quote}
it shouldn't be possible for *getFontDescriptor()* to return null, either the font is embedded in which case it must have a FontDescriptor, as this is where the embedded file is stored, or it is a Type 1 system font in which case it will have an AFM file, or it is a TTF system font *in which case its FontDescriptor is populated in PDTrueTypeFont's constructor*.
{quote}

That's not a quote from the spec; I'm specifically discussing PDFBox's getFontDescriptor() method. I also mention the FontDescriptor being populated by PDFBox for system AFMs and TTFs (i.e. "synthesised", which is what I've been discussing all along).

The behaviour of PDFBox when it encounters a missing FontDescriptor is, in general, to synthesise a new FontDescriptor, but there are cases where this isn't done, which is why I call it a bug: there's an established approach to solving this problem, but in one case it's not being done and a questionable workaround has been used in its place.

But we seem to have figured this out now: I'll work on fixing it.

> Font Refactoring
> ----------------
>
>                 Key: PDFBOX-2149
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2149
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: FontBox, PDModel
>    Affects Versions: 2.0.0
>            Reporter: John Hewson
>            Assignee: John Hewson
>         Attachments: 000039.pdf, 000467.pdf
>
>
> To fix bugs such as PDFBOX-2140 and to enable Unicode TTF embedding we need
> to sort out long-standing font/text encoding issues. The main issue is that
> encoding is done in an ad-hoc manner, sometimes in the PDFont subclasses,
> sometimes elsewhere. For example TTFGlyph2D does its own decoding, and this
> code is copy & pasted into PDTrueTypeFont. Likewise, PDFont handles CMaps and
> Encodings despite the fact that these two encoding methods are mutually
> exclusive. The end result is that the process of reading Encodings/CMaps is
> often following rules which are completely invalid for that font type but
> mostly work by luck.
> Phase 1
> - Refactor PDFont subclasses to remove setXXX methods which allow the object
> to be corrupted. Proper use of inheritance can remove all cases where public
> setXXX methods are used during font loading.
> - Clean up TTF loading and the loadTTF in anticipation of Unicode TTF
> embedding, FontBox's TrueTypeFont class is externally mutable via setXXX
> methods used only by TTFParser: these can be made package-private.
> - The Encoding class and EncodingManager could do with some cleaning up prior
> to further refactoring.
> - PDSimpleFont does not do anything, its functionality should be moved into
> its superclass, PDFont.
> - PDFont#determineEncoding() loads CMaps when only Encodings are applicable,
> and vice versa. Loading needs to be pushed down into the appropriate
> subclasses, as a starting point the relevant code should at least be copied
> into the relevant subclasses ready for further refactoring.
> - TTFGlyph2D does its own decoding of char codes, rather than using the
> font's #encode method (fair enough because #encode is broken) and there's a
> copy and pasted version of the same code in PDTrueTypeFont - we need to
> consolidate this code into PDTrueTypeFont where it belongs.
> Phase 2
> - Refactor loading of CMaps and Encodings from font dictionaries, this will
> involve changes to PDFont and its subclasses to delegate loading to
> subclasses where it can be properly encapsulated
> - May need to alter the class hierarchy w.r.t. CIDFont to facilitate this, as
> CIDFont isn't really a PDFont - its parent Type0 font is responsible for its
> CMap. We'll see.
> Phase 3
> - Refactor the decoding of character codes by PDFont and its subclasses, this
> will involve replacing the #getCodeFromArray, #encode and #encodeToCID
> methods.
> - Fix decoding of content stream character codes in PDFStreamEngine, using
> the newly refactored PDFont and using the current font's CMap to determine
> the code width.
> Phase 4
> - Add support for generating embedded TTFs with Unicode

--
This message was sent by Atlassian JIRA
(v6.2#6252)
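
For reference, the fallback order described in the comment above (embedded descriptor first, otherwise a descriptor synthesised from a system AFM or a system TTF, never null) could be sketched roughly as follows. This is only an illustration under stated assumptions: DescriptorFallbackSketch, Descriptor, FontSource, Origin and resolve are hypothetical names invented for this sketch and are not PDFBox classes or methods, and the real descriptor would carry far more data than a font name.

```java
/** Hypothetical sketch only: none of these types are PDFBox classes. */
final class DescriptorFallbackSketch {

    /** Where a descriptor came from (assumed categories, taken from the comment). */
    enum Origin { EMBEDDED, AFM, SYSTEM_TTF }

    /** Minimal stand-in for a font descriptor. */
    static final class Descriptor {
        final String fontName;
        final Origin origin;

        Descriptor(String fontName, Origin origin) {
            this.fontName = fontName;
            this.origin = origin;
        }
    }

    /** Stand-in for a font dictionary plus whatever system font data was found. */
    static final class FontSource {
        final String baseFont;
        final Descriptor embedded;   // null when the PDF carries no FontDescriptor
        final boolean hasAfm;        // a Type 1 system font with an AFM file exists
        final boolean hasSystemTtf;  // a TTF system font exists

        FontSource(String baseFont, Descriptor embedded,
                   boolean hasAfm, boolean hasSystemTtf) {
            this.baseFont = baseFont;
            this.embedded = embedded;
            this.hasAfm = hasAfm;
            this.hasSystemTtf = hasSystemTtf;
        }
    }

    /**
     * Returns the embedded descriptor if present, otherwise synthesises one
     * from the available system font data. Never returns null: a font with
     * no descriptor source at all is reported as an error instead.
     */
    static Descriptor resolve(FontSource src) {
        if (src.embedded != null) {
            return src.embedded;
        }
        if (src.hasAfm) {
            return new Descriptor(src.baseFont, Origin.AFM);
        }
        if (src.hasSystemTtf) {
            return new Descriptor(src.baseFont, Origin.SYSTEM_TTF);
        }
        throw new IllegalStateException("No descriptor source for " + src.baseFont);
    }

    private DescriptorFallbackSketch() {
    }
}
```

The point of the sketch is that every branch either returns a descriptor or fails loudly, so callers never need a null check or a workaround for a missing descriptor.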