[jira] [Commented] (PDFBOX-2149) Font Refactoring

John Hewson (JIRA) Sat, 21 Jun 2014 10:00:49 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039896#comment-14039896
 ]


John Hewson commented on PDFBOX-2149:
-------------------------------------

{quote}
it shouldn't be possible for *getFontDescriptor()* to return null, either the 
font is embedded in which case it must have a FontDescriptor, as this is where 
the embedded file is stored, or it is a Type 1 system font in which case it 
will have an AFM file, or it is a TTF system font *in which case its 
FontDescriptor is populated in PDTrueTypeFont's constructor*.
{quote}

That's not a quote from the spec; I'm specifically discussing PDFBox's 
getFontDescriptor() method. I also mention that PDFBox populates the 
FontDescriptor for system AFMs and TTFs (i.e. it is "synthesised", which is 
what I've been discussing all along). When PDFBox encounters a missing 
FontDescriptor, its general behaviour is to synthesise a new one, but there are 
cases where this isn't done, which is why I call it a bug: there's an 
established approach to solving this problem, but in one case it isn't 
followed and a questionable workaround has been used in its place.
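
To make the "synthesise rather than return null" approach concrete, here is a 
rough sketch. The class names (DescriptorSketch, Descriptor, Font) are 
hypothetical stand-ins and not PDFBox's actual classes or constructors; the 
only point is that the fallback is resolved once at construction, so the 
getter can never return null.

```java
import java.util.Objects;

// Hypothetical stand-ins for illustration only; these are not PDFBox classes.
final class DescriptorSketch {

    /** Minimal descriptor: a font name plus a flag recording whether it was synthesised. */
    static final class Descriptor {
        final String fontName;
        final boolean synthesised;

        Descriptor(String fontName, boolean synthesised) {
            this.fontName = Objects.requireNonNull(fontName);
            this.synthesised = synthesised;
        }
    }

    /** A font whose dictionary may or may not have supplied a descriptor. */
    static final class Font {
        private final Descriptor descriptor;

        Font(String baseName, Descriptor fromDictionary) {
            // Resolve the descriptor once, at construction time: use the one from
            // the dictionary if present, otherwise synthesise one from the name.
            this.descriptor = fromDictionary != null
                    ? fromDictionary
                    : new Descriptor(baseName, true);
        }

        /** Never returns null, so callers need no null-check workaround. */
        Descriptor getDescriptor() {
            return descriptor;
        }
    }
}
```

With this shape, a system font built as new DescriptorSketch.Font("Helvetica", 
null) still hands back a usable (synthesised) descriptor.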

But we seem to have figured this out now: I'll work on fixing it.

> Font Refactoring
> ----------------
>
>                 Key: PDFBOX-2149
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2149
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: FontBox, PDModel
>    Affects Versions: 2.0.0
>            Reporter: John Hewson
>            Assignee: John Hewson
>         Attachments: 000039.pdf, 000467.pdf
>
>
> To fix bugs such as PDFBOX-2140 and to enable Unicode TTF embedding we need 
> to sort out long-standing font/text encoding issues. The main issue is that 
> encoding is done in an ad-hoc manner, sometimes in the PDFont subclasses, 
> sometimes elsewhere. For example TTFGlyph2D does its own decoding, and this 
> code is copy & pasted into PDTrueTypeFont. Likewise, PDFont handles CMaps and 
> Encodings despite the fact that these two encoding methods are mutually 
> exclusive. The end result is that the process of reading Encodings/CMaps is 
> often following rules which are completely invalid for that font type but 
> mostly work by luck.
> Phase 1
> - Refactor PDFont subclasses to remove setXXX methods which allow the object 
> to be corrupted. Proper use of inheritance can remove all cases where public 
> setXXX methods are used during font loading.
> - Clean up TTF loading and loadTTF in anticipation of Unicode TTF 
> embedding. FontBox's TrueTypeFont class is externally mutable via setXXX 
> methods used only by TTFParser: these can be made package-private.
> - the Encoding class and EncodingManager could do with some cleaning up prior 
> to further refactoring.
> - PDSimpleFont does not do anything; its functionality should be moved into 
> its superclass, PDFont.
> - PDFont#determineEncoding() loads CMaps when only Encodings are applicable, 
> and vice versa. Loading needs to be pushed down into the appropriate 
> subclasses. As a starting point, the relevant code should at least be copied 
> into the relevant subclasses ready for further refactoring.
> - TTFGlyph2D does its own decoding of char codes, rather than using the 
> font's #encode method (fair enough because #encode is broken) and there's a 
> copy and pasted version of the same code in PDTrueTypeFont - we need to 
> consolidate this code into PDTrueTypeFont where it belongs.
> Phase 2
> - Refactor loading of CMaps and Encodings from font dictionaries. This will 
> involve changes to PDFont and its subclasses to delegate loading to 
> subclasses where it can be properly encapsulated.
> - May need to alter the class hierarchy w.r.t. CIDFont to facilitate this, as 
> CIDFont isn't really a PDFont: its parent Type0 font is responsible for its 
> CMap. We'll see.
> Phase 3
> - Refactor the decoding of character codes by PDFont and its subclasses. This 
> will involve replacing the #getCodeFromArray, #encode and #encodeToCID 
> methods.
> - Fix decoding of content stream character codes in PDFStreamEngine, using 
> the newly refactored PDFont and using the current font's CMap to determine 
> the code width.
> Phase 4
> - Add support for generating embedded TTFs with Unicode
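
As an illustration of the Phase 3 item about using the CMap to determine the 
code width, here is a simplified sketch. CodeWidthSketch and Range are 
hypothetical names, not PDFBox or FontBox classes, and the matching is 
deliberately reduced to a numeric comparison of the whole code rather than the 
per-byte matching a real CMap codespace uses. Falling back to a single byte 
when nothing matches is also just an assumption made for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not PDFBox or FontBox code.
final class CodeWidthSketch {

    /** One codespace range: codes of exactly `length` bytes between low and high inclusive. */
    static final class Range {
        final int length;
        final long low;
        final long high;

        Range(int length, long low, long high) {
            this.length = length;
            this.low = low;
            this.high = high;
        }

        boolean matches(long code, int byteCount) {
            return byteCount == length && code >= low && code <= high;
        }
    }

    /** Splits a byte string into character codes, choosing each code's width from the ranges. */
    static List<Long> readCodes(byte[] data, List<Range> ranges) {
        List<Long> codes = new ArrayList<>();
        int pos = 0;
        while (pos < data.length) {
            long code = 0;
            int matched = 0;
            // Try 1, 2, 3, then 4 bytes until some codespace range accepts the code.
            for (int n = 1; n <= 4 && pos + n <= data.length && matched == 0; n++) {
                code = (code << 8) | (data[pos + n - 1] & 0xFF);
                for (Range r : ranges) {
                    if (r.matches(code, n)) {
                        matched = n;
                        break;
                    }
                }
            }
            if (matched == 0) {
                // Assumed fallback: consume one byte so the loop always advances.
                code = data[pos] & 0xFF;
                matched = 1;
            }
            codes.add(code);
            pos += matched;
        }
        return codes;
    }
}
```

Given a one-byte range 0x00-0x80 and a two-byte range 0x8140-0xFFFC, the bytes 
41 81 40 42 would be split into three codes (0x41, 0x8140, 0x42) instead of 
four single-byte codes.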



--
This message was sent by Atlassian JIRA
(v6.2#6252)
