[ https://issues.apache.org/jira/browse/PDFBOX-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107826#comment-14107826 ]
John Hewson edited comment on PDFBOX-2262 at 8/23/14 2:56 AM:
--------------------------------------------------------------

My latest commits tackle the multi-byte CMap problem, which PDFBox did not previously handle correctly. Combined with my earlier changes, this had led to bad behaviour, because the new code was correct while existing code relied on the old, buggy behaviour. As I'd already planned this as part of PDFBOX-2149, I took the time to finally refactor CMaps and Encodings, in particular CMaps with variable-length character codes.

Hopefully you'll find the new code very easy to understand (there are 457 fewer lines :)). Where once we had:

{code}
int codeLength;
for (int i = 0; i < string.length; i += codeLength) {
    // Decode the value to a Unicode character
    codeLength = 1;
    String unicode = font.encode(string, i, codeLength);
    int[] charCodes;
    if (unicode == null && i + 1 < string.length) {
        // maybe a multibyte encoding
        codeLength++;
        unicode = font.encode(string, i, codeLength);
        charCodes = new int[] { font.getCodeFromArray(string, i, codeLength) };
    } else {
        charCodes = new int[] { font.getCodeFromArray(string, i, codeLength) };
    }
    ...
{code}

We now have:

{code}
InputStream in = new ByteArrayInputStream(string);
while (in.available() > 0) {
    int code = font.readCode(in);
    String unicode = font.toUnicode(code);
    ...
{code}

Hopefully I didn't break too much in the process. The exceptions on the following files should now be fixed:

PDFBOX-1283.pdf <== still has rendering issues
PDFBOX-1421.pdf <== still has rendering issues
PDFBOX-1422.pdf
FOP-2252.pdf
freesanstest.pdf

None of the other test files with rendering issues are affected; they're still buggy, and I'll take a look at them soon.
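To illustrate why reading from a stream suits variable-length codes, here is a minimal, self-contained sketch. It is not PDFBox's implementation: the class name VariableLengthCodes, the static readCode and readAll helpers, and the codespace rule (bytes 0x00-0x7F are complete one-byte codes, bytes 0x80-0xFF begin a two-byte code) are all assumptions made for this example. A real CMap declares its own codespace ranges.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class VariableLengthCodes {

    // Hypothetical codespace, assumed for illustration only:
    // 0x00-0x7F is a one-byte code, 0x80-0xFF starts a two-byte code.
    static int readCode(InputStream in) throws IOException {
        int first = in.read();
        if (first < 0) {
            throw new EOFException("No more codes to read");
        }
        if (first < 0x80) {
            return first;
        }
        int second = in.read();
        if (second < 0) {
            throw new IOException("Truncated two-byte code");
        }
        return (first << 8) | second;
    }

    // Reads every code in the string; the stream tracks the position,
    // so the caller never has to guess a code length and retry.
    static List<Integer> readAll(byte[] string) throws IOException {
        List<Integer> codes = new ArrayList<>();
        InputStream in = new ByteArrayInputStream(string);
        while (in.available() > 0) {
            codes.add(readCode(in));
        }
        return codes;
    }

    public static void main(String[] args) throws IOException {
        byte[] string = { 0x41, (byte) 0x81, 0x40, 0x42 };
        for (int code : readAll(string)) {
            System.out.printf("0x%X%n", code);
        }
    }
}
```

With the sample bytes above, the middle two bytes are consumed together as a single code. The point of the design is that the code length is decided in one place, inside the reader, rather than by trying a one-byte decode and falling back to two bytes when it yields null.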
> Remove usage of AWT fonts
> -------------------------
>
>                 Key: PDFBOX-2262
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2262
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: PDModel, Rendering
>    Affects Versions: 2.0.0
>            Reporter: John Hewson
>            Assignee: John Hewson
>         Attachments: ELVIA-Reiserucktritt-Vollschutz.pdf-1.png,
> FreeSansTest.pdf, PDFBOX-1094-094730.pdf-1.png, PDFBOX-1770.pdf-1.png,
> bugzilla886049.pdf, bugzilla886049.pdf-1.png
>
>
> We're still using AWT fonts to render the "standard 14" built-in fonts, which
> causes rendering problems and encoding issues (see PDFBOX-2140). We're also
> using AWT for some fallback fonts.
>
> Removal of these AWT fonts isn't too difficult, we need to load the fonts
> using the existing PDFFontManager mechanism which has recently been added.
> All missing TrueType fonts loaded from disk have been using SystemFontManager
> for a number of weeks now.
>
> We should ship some sensible default fonts with PDFBox, such as the
> Liberation fonts (see PDFBOX-2169, PDFBOX-2263), in case PDFFontManager can't
> find anything suitable, rather than falling back to the default TTF font, but
> by default we'll probe the system for suitable fonts.

--
This message was sent by Atlassian JIRA
(v6.2#6252)