bogdiuk opened a new pull request, #189: URL: https://github.com/apache/pdfbox/pull/189
On one machine with 792 fonts (~800MB), scanning takes around 5.5s; after these changes the time is down to 0.7s (without checksumming) or 2.2s (with checksums).

In this PR, TTF parsers have an "only headers" mode where each table reads as little information as possible:

* in this mode, the whole file is not read into a `byte[]`; the parser uses `RandomAccessRead` directly, because most of the file is skipped
* only the 5 tables needed for `FSFontInfo` are read (`name`, `head`, `OS/2`, `CFF `, `gcid`)
* table parsers finish as soon as they have all the data they need
* checksumming is skipped because it is now faster to simply re-parse the file (gated with the `pdfbox.fontcache.skipchecksums` system property, for backward compatibility): checksumming 800MB takes 1.5s and parsing headers takes only 0.7s

Additional fixes:

* fixed a memory leak: the `RandomAccessRead` passed to the `TrueTypeCollection` constructor was never freed
* `NamingTable`: use a sorted list instead of a multilevel `HashMap`, and delay-load Strings (for non-"only headers" mode)
* `TTFSubsetter`: avoid bytes->string->bytes conversion
* streamline I/O: replace `readByte()` with `read(array)`
* consolidate `read(buf, offset, len)` loops into `readNBytes()` (allows underflow) and `readExact()` (throwing)

Breaking changes:

* `NameRecord.getString()` is now package-private and lazy, renamed to `getStringLazy()`
* new abstract method `TTFDataStream.getSubReader()`
* the `.pdfbox.cache` file has dashes instead of checksums (if the `pdfbox.fontcache.skipchecksums` property is set)

Only tested with the 3.0 branch: all tests pass, and the resulting file is the same (except for the checksums field, of course).

Should I break it into smaller commits?
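To illustrate the "only headers" idea and the two read helpers, here is a minimal sketch. It is not the code from this PR: it uses a plain `java.io.InputStream` rather than PDFBox's `RandomAccessRead` or `TTFDataStream`, and the class `HeaderOnlyScanSketch` and the method `scanDirectory` are hypothetical names. The helpers `readNBytes` and `readExact` only mirror the semantics described above (short read allowed vs. throwing). The table directory layout (a 12-byte header followed by 16-byte records) is the standard sfnt layout.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not PDFBox code.
class HeaderOnlyScanSketch {
    // The five tables the PR description lists as sufficient for FSFontInfo.
    static final Set<String> WANTED =
            new HashSet<>(Arrays.asList("name", "head", "OS/2", "CFF ", "gcid"));

    // Reads up to len bytes; returns the number actually read (may be short at EOF).
    static int readNBytes(InputStream in, byte[] buf, int off, int len) throws IOException {
        int total = 0;
        while (total < len) {
            int n = in.read(buf, off + total, len - total);
            if (n < 0) {
                break; // end of stream: underflow is allowed here
            }
            total += n;
        }
        return total;
    }

    // Reads exactly len bytes or throws EOFException.
    static void readExact(InputStream in, byte[] buf, int off, int len) throws IOException {
        int n = readNBytes(in, buf, off, len);
        if (n != len) {
            throw new EOFException("expected " + len + " bytes, got " + n);
        }
    }

    // Scans the sfnt table directory and returns tag -> {offset, length}
    // for the wanted tables only; all other records are ignored.
    static Map<String, long[]> scanDirectory(InputStream in) throws IOException {
        byte[] header = new byte[12]; // sfntVersion(4), numTables(2), 3 search fields(6)
        readExact(in, header, 0, header.length);
        int numTables = ((header[4] & 0xFF) << 8) | (header[5] & 0xFF);

        byte[] dir = new byte[16 * numTables]; // tag(4), checksum(4), offset(4), length(4)
        readExact(in, dir, 0, dir.length);     // one bulk read instead of per-byte reads
        ByteBuffer bb = ByteBuffer.wrap(dir);  // big-endian by default, as in font files

        Map<String, long[]> found = new LinkedHashMap<>();
        for (int i = 0; i < numTables; i++) {
            int base = i * 16;
            String tag = new String(dir, base, 4, StandardCharsets.ISO_8859_1);
            if (WANTED.contains(tag)) {
                long offset = bb.getInt(base + 8) & 0xFFFFFFFFL;  // unsigned 32-bit
                long length = bb.getInt(base + 12) & 0xFFFFFFFFL; // unsigned 32-bit
                found.put(tag, new long[] {offset, length});
            }
        }
        return found;
    }
}
```

With a random-access source, a caller could then seek straight to each returned offset and parse only the fields it needs, which is presumably why most of the file can be skipped in this mode.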
