[jira] [Updated] (PDFBOX-2149) Font Refactoring

John Hewson (JIRA) Fri, 29 Aug 2014 19:53:26 -0700

     [ https://issues.apache.org/jira/browse/PDFBOX-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


John Hewson updated PDFBOX-2149:
--------------------------------

    Description: 
To fix bugs such as PDFBOX-2140 and to enable Unicode TTF embedding we need to 
sort out long-standing font/text encoding issues. The main issue is that 
encoding is done in an ad-hoc manner, sometimes in the PDFont subclasses, 
sometimes elsewhere. For example, TTFGlyph2D does its own decoding, and this 
code is copied and pasted into PDTrueTypeFont. Likewise, PDFont handles both 
CMaps and Encodings despite the fact that these two encoding methods are 
mutually exclusive. The end result is that the process of reading 
Encodings/CMaps often follows rules which are completely invalid for that font 
type but mostly work by luck.

Phase 1

- Refactor PDFont subclasses to remove setXXX methods which allow the object to 
be corrupted. Proper use of inheritance can remove all cases where public 
setXXX methods are used during font loading.

- Clean up TTF loading and loadTTF in anticipation of Unicode TTF 
embedding. FontBox's TrueTypeFont class is externally mutable via setXXX 
methods used only by TTFParser; these can be made package-private.

- The Encoding class and EncodingManager could do with some cleaning up prior 
to further refactoring.

- PDSimpleFont does not do anything; its functionality should be moved into its 
superclass, PDFont.

- PDFont#determineEncoding() loads CMaps when only Encodings are applicable, 
and vice versa. Loading needs to be pushed down into the appropriate 
subclasses. As a starting point, the relevant code should at least be copied 
into those subclasses, ready for further refactoring.

- TTFGlyph2D does its own decoding of char codes rather than using the font's 
#encode method (fair enough, because #encode is broken), and there's a 
copied-and-pasted version of the same code in PDTrueTypeFont. We need to 
consolidate this code into PDTrueTypeFont, where it belongs.
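
As a rough illustration of where Phase 1 is heading (this is not PDFBox code; 
FontDict, SketchFont, SketchSimpleFont and SketchCompositeFont are made-up 
names, and the "BuiltIn" fallback is an assumption for the sketch), a font can 
be fully initialised in its constructor, with each subclass loading only the 
kind of mapping that applies to it and no public setXXX methods left over:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a PDF font dictionary; not a PDFBox class. */
final class FontDict {
    private final Map<String, String> entries;

    FontDict(Map<String, String> entries) {
        // Defensive copy so the dictionary cannot change after loading.
        this.entries = Collections.unmodifiableMap(new HashMap<>(entries));
    }

    String get(String key) {
        return entries.get(key);
    }
}

/** Hypothetical base class: no setters, all state is fixed at construction. */
abstract class SketchFont {
    private final String baseFontName;

    protected SketchFont(FontDict dict) {
        this.baseFontName = dict.get("BaseFont");
    }

    final String getBaseFontName() {
        return baseFontName;
    }

    /** Each subclass reports the one mapping kind it actually loaded. */
    abstract String describeCodeMapping();
}

/** Simple fonts interpret /Encoding as an Encoding and never touch CMaps. */
final class SketchSimpleFont extends SketchFont {
    private final String encodingName;

    SketchSimpleFont(FontDict dict) {
        super(dict);
        String name = dict.get("Encoding");
        // Assumption for this sketch: use a placeholder when /Encoding is absent.
        this.encodingName = (name != null) ? name : "BuiltIn";
    }

    @Override
    String describeCodeMapping() {
        return "Encoding:" + encodingName;
    }
}

/** Composite fonts interpret /Encoding as a CMap name and nothing else. */
final class SketchCompositeFont extends SketchFont {
    private final String cmapName;

    SketchCompositeFont(FontDict dict) {
        super(dict);
        String name = dict.get("Encoding");
        if (name == null) {
            throw new IllegalArgumentException("composite font needs a CMap");
        }
        this.cmapName = name;
    }

    @Override
    String describeCodeMapping() {
        return "CMap:" + cmapName;
    }
}
```

Because every field is final, a half-loaded or later-corrupted font object 
cannot exist in this design, which is the property the setXXX removal is after.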

Phase 2

- Refactor loading of CMaps and Encodings from font dictionaries. This will 
involve changes to PDFont and its subclasses to delegate loading to subclasses, 
where it can be properly encapsulated.

- May need to alter the class hierarchy w.r.t. CIDFont to facilitate this, as 
CIDFont isn't really a PDFont - its parent Type0 font is responsible for its 
CMap. We'll see.
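
One possible shape for this, sketched with made-up classes (SketchCMap, 
SketchCidFont and SketchType0Font are hypothetical, and mapping unmapped codes 
to CID 0 is a simplification for the sketch): the Type0 font owns the CMap and 
holds its descendant by composition, so the CIDFont only ever sees CIDs.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical CMap: maps character codes to CIDs. */
final class SketchCMap {
    private final Map<Integer, Integer> codeToCid = new HashMap<>();

    SketchCMap map(int code, int cid) {
        codeToCid.put(code, cid);
        return this;
    }

    int toCid(int code) {
        Integer cid = codeToCid.get(code);
        // Simplification for this sketch: unmapped codes go to CID 0.
        return (cid != null) ? cid : 0;
    }
}

/** Hypothetical descendant font: knows about CIDs only, never about codes. */
final class SketchCidFont {
    private final Map<Integer, Integer> widths = new HashMap<>();
    private final int defaultWidth;

    SketchCidFont(int defaultWidth) {
        this.defaultWidth = defaultWidth;
    }

    SketchCidFont width(int cid, int width) {
        widths.put(cid, width);
        return this;
    }

    int getWidthForCid(int cid) {
        Integer width = widths.get(cid);
        return (width != null) ? width : defaultWidth;
    }
}

/** Hypothetical Type0 font: owns the CMap and delegates to its descendant. */
final class SketchType0Font {
    private final SketchCMap cmap;
    private final SketchCidFont descendant;

    SketchType0Font(SketchCMap cmap, SketchCidFont descendant) {
        this.cmap = cmap;
        this.descendant = descendant;
    }

    int getWidth(int code) {
        // code -> CID happens here; CID -> width happens in the descendant.
        return descendant.getWidthForCid(cmap.toCid(code));
    }
}
```

Whether the real hierarchy ends up looking like this is exactly the open 
question above; the sketch only shows that the CMap lookup does not need to 
live in the CIDFont.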

Phase 3

- Refactor the decoding of character codes by PDFont and its subclasses. This 
will involve replacing the #getCodeFromArray, #encode and #encodeToCID methods.

- Fix decoding of content stream character codes in PDFStreamEngine, using the 
newly refactored PDFont and using the current font's CMap to determine the code 
width.
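
To make the "code width" point concrete, here is a minimal sketch of splitting 
a string operand into character codes using a CMap's codespace ranges. The 
classes CodespaceRange and CodeReader are hypothetical, and consuming a single 
byte when nothing matches is an assumption of the sketch rather than settled 
behaviour:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical codespace range: per-byte lower and upper bounds. */
final class CodespaceRange {
    private final int[] low;
    private final int[] high;

    CodespaceRange(int[] low, int[] high) {
        if (low.length != high.length) {
            throw new IllegalArgumentException("bounds must have equal length");
        }
        this.low = low.clone();
        this.high = high.clone();
    }

    int length() {
        return low.length;
    }

    /** True if the len bytes at offset fall inside this range, byte by byte. */
    boolean matches(byte[] data, int offset, int len) {
        if (len != low.length) {
            return false;
        }
        for (int i = 0; i < len; i++) {
            int b = data[offset + i] & 0xFF;
            if (b < low[i] || b > high[i]) {
                return false;
            }
        }
        return true;
    }
}

/** Hypothetical reader that lets the codespace ranges decide each code's width. */
final class CodeReader {
    private final List<CodespaceRange> ranges;
    private final int maxLength;

    CodeReader(List<CodespaceRange> ranges) {
        this.ranges = new ArrayList<>(ranges);
        int max = 1;
        for (CodespaceRange range : this.ranges) {
            max = Math.max(max, range.length());
        }
        this.maxLength = max;
    }

    List<Integer> readCodes(byte[] data) {
        List<Integer> codes = new ArrayList<>();
        int pos = 0;
        while (pos < data.length) {
            int len = codeLengthAt(data, pos);
            int code = 0;
            for (int i = 0; i < len; i++) {
                code = (code << 8) | (data[pos + i] & 0xFF);
            }
            codes.add(code);
            pos += len;
        }
        return codes;
    }

    private int codeLengthAt(byte[] data, int pos) {
        int limit = Math.min(maxLength, data.length - pos);
        // Try the shortest candidate first and grow until some range matches.
        for (int len = 1; len <= limit; len++) {
            for (CodespaceRange range : ranges) {
                if (range.matches(data, pos, len)) {
                    return len;
                }
            }
        }
        return 1; // Assumption for this sketch: skip one byte when nothing matches.
    }
}
```

With a one-byte range <00>..<80> and a two-byte range <8140>..<9FFC>, the 
bytes 41 81 40 42 split into the three codes 0x41, 0x8140 and 0x42, rather than 
four one-byte codes or two two-byte codes.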


  was:
To fix bugs such as PDFBOX-2140 and to enable Unicode TTF embedding we need to 
sort out long-standing font/text encoding issues. The main issue is that 
encoding is done in an ad-hoc manner, sometimes in the PDFont subclasses, 
sometimes elsewhere. For example, TTFGlyph2D does its own decoding, and this 
code is copied and pasted into PDTrueTypeFont. Likewise, PDFont handles both 
CMaps and Encodings despite the fact that these two encoding methods are 
mutually exclusive. The end result is that the process of reading 
Encodings/CMaps often follows rules which are completely invalid for that font 
type but mostly work by luck.

Phase 1

- Refactor PDFont subclasses to remove setXXX methods which allow the object to 
be corrupted. Proper use of inheritance can remove all cases where public 
setXXX methods are used during font loading.

- Clean up TTF loading and loadTTF in anticipation of Unicode TTF 
embedding. FontBox's TrueTypeFont class is externally mutable via setXXX 
methods used only by TTFParser; these can be made package-private.

- The Encoding class and EncodingManager could do with some cleaning up prior 
to further refactoring.

- PDSimpleFont does not do anything; its functionality should be moved into its 
superclass, PDFont.

- PDFont#determineEncoding() loads CMaps when only Encodings are applicable, 
and vice versa. Loading needs to be pushed down into the appropriate 
subclasses. As a starting point, the relevant code should at least be copied 
into those subclasses, ready for further refactoring.

- TTFGlyph2D does its own decoding of char codes rather than using the font's 
#encode method (fair enough, because #encode is broken), and there's a 
copied-and-pasted version of the same code in PDTrueTypeFont. We need to 
consolidate this code into PDTrueTypeFont, where it belongs.

Phase 2

- Refactor loading of CMaps and Encodings from font dictionaries. This will 
involve changes to PDFont and its subclasses to delegate loading to subclasses, 
where it can be properly encapsulated.

- May need to alter the class hierarchy w.r.t. CIDFont to facilitate this, as 
CIDFont isn't really a PDFont - its parent Type0 font is responsible for its 
CMap. We'll see.

Phase 3

- Refactor the decoding of character codes by PDFont and its subclasses. This 
will involve replacing the #getCodeFromArray, #encode and #encodeToCID methods.

- Fix decoding of content stream character codes in PDFStreamEngine, using the 
newly refactored PDFont and using the current font's CMap to determine the code 
width.

Phase 4

- Add support for generating embedded TTFs with Unicode


> Font Refactoring
> ----------------
>
>                 Key: PDFBOX-2149
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2149
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: FontBox, PDModel
>    Affects Versions: 2.0.0
>            Reporter: John Hewson
>            Assignee: John Hewson
>             Fix For: 2.0.0
>
>         Attachments: 000039.pdf, 000467.pdf



--
This message was sent by Atlassian JIRA
(v6.2#6252)
