This is the plucker & palmfontconv unicode patch. List of relevant changes:

Palmfontconv:
- topalmtext has a parameter -r used to select the ranges of unicode
  characters to include. Use it like this:
      topalmtext -r0000-017f:0400-04f9:2010-2030 ...
  to select Latin1, Latin-A, Cyrillic and some useful punctuation in
  the font. Range 0000-00ff is added by default. Parameter -u (from the
  previous patch) is gone. See also the topluckergray-unicode script.
- The font format has changed a bit, but remains backward compatible
  with official plucker releases (though NOT with my previous patch).
  It now supports discontinuous glyph ranges, with the first 256
  characters in a continuous range for backward compatibility and
  optimization.
- Consequently, fontversion is changed to 3.
- New fonts do work with old plucker, but if they are big, old plucker
  tries to allocate too much memory and might fail doing so.
- You can have only about 5400 glyphs in a gray font.

Plucker:
- usese8bitChars is not being fiddled with anymore, with the exception
  of grayfont.c, where it is necessary.
- Uses binary search for new fonts.

How to test it:
Get the sample text and fonts from
http://kassiopeia.juls.savba.sk/~garabik/plucker/
The fonts are rather big, but that is good for demonstration purposes.
Use the topluckergray-unicode script to generate your own suitable
fonts with only the glyphs you need.

Comparison between different encodings and encoding schemes:
I took the novel Three Musketeers by A. Dumas, in French, and its
translation into Russian. French was chosen deliberately, because among
European languages it seems to make the most use of diacritics, and
Russian is a typical example of a language with a non-Latin script.

"increase" in the following tables is the penalty one gets for using
unicode instead of the legacy 8-bit encoding, i.e. the size relative to
the same row in the legacy encoding.
"uni" encoding means the plucker format with UNICODE functions for
unicode chars. This is a bit unfair, since in the current (+my patch)
parser implementation 1 byte is added to each unicode character (meant
as a fallback when unicode fonts are not available), so the results
could be made better.
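
For anyone who wants to repeat the raw and compressed parts of this
comparison on their own text, here is a minimal Python sketch. It is not
the procedure used for the tables: the function names (increase_pct,
size_penalty) are made up for illustration, it uses Python's zlib module
rather than gzip or the plucker distiller, and it does not produce pdb
files, so the absolute sizes will differ from the ones reported here.

```python
import zlib


def increase_pct(new_size: int, old_size: int) -> float:
    """Penalty in percent for going from old_size to new_size bytes."""
    return 100.0 * (new_size - old_size) / old_size


def size_penalty(text: str, legacy_encoding: str) -> dict:
    """Compare a legacy 8-bit encoding of `text` with UTF-8.

    Sizes are measured raw and after zlib compression (level 9).
    This is only an approximation of the gzip rows in the tables.
    """
    legacy = text.encode(legacy_encoding)
    utf8 = text.encode("utf-8")
    legacy_z = zlib.compress(legacy, 9)
    utf8_z = zlib.compress(utf8, 9)
    return {
        "raw_legacy": len(legacy),
        "raw_utf8": len(utf8),
        "raw_increase_pct": increase_pct(len(utf8), len(legacy)),
        "zlib_legacy": len(legacy_z),
        "zlib_utf8": len(utf8_z),
        "zlib_increase_pct": increase_pct(len(utf8_z), len(legacy_z)),
    }


if __name__ == "__main__":
    # Tiny stand-in sample; a real test would read a whole novel.
    sample = "Привет, мир! " * 200
    for key, value in size_penalty(sample, "cp1251").items():
        print(key, value)
```

The same increase_pct formula applied to the sizes in the tables
reproduces the "increase" column, e.g. 1471835 -> 2463173 bytes for the
uncompressed Russian text.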

Conclusion:
We all know that zlib compression is much better than doc. Even for
diacritic-rich French, the disadvantages of using unicode are just a
few percent. For Russian, the original text does not double in size
when going to utf-8, as one would expect, but grows only by 67%,
because of spaces and punctuation (it would be worse for languages
using neither spaces nor ascii punctuation, like Japanese). Compression
smooths the penalty a bit, but not that much. However, the best results
are with a utf-8 encoded plucker document (which is just an example,
since it is not readable by plucker), where the penalty is 6% for doc
compression and 23% for zlib (zlib is of course much better in absolute
numbers). So perhaps, if document size is a big issue, it would make
sense to use utf-8 for plucker documents, tag them as such and
implement support for it in plucker.

russian text:
type      encoding  size (bytes)  increase
original  cp1251    1471835
gzipped   cp1251     555947
original  utf-8     2463173       +67.4%
gzipped   utf-8      651722       +17.2%
pdb doc   uni       1734806       +45.3%
pdb zlib  uni        999639       +63.0%
pdb doc   cp1251    1194075
pdb zlib  cp1251     613449
pdb doc   utf-8     1263083       +5.8%
pdb zlib  utf-8      753543       +22.8%

french text:
type      encoding  size (bytes)  increase
original  iso1      1415790
gzipped   iso1       517698
original  utf-8     1445719       +2.1%
gzipped   utf-8      522078       +0.8%
pdb doc   iso1       836411
pdb zlib  iso1       592018
pdb doc   uni        865120       +3.4%
pdb zlib  uni        614324       +3.8%
pdb doc   utf-8      851363       +1.8%
pdb zlib  utf-8      599377       +1.2%

-- 
 -----------------------------------------------------------
| Radovan Garabík http://melkor.dnp.fmph.uniba.sk/~garabik/ |
| __..--^^^--..__    garabik @ melkor.dnp.fmph.uniba.sk     |
 -----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!

plucker.diff.gz
Description: Binary data
palmfontconv.diff.gz
Description: Binary data
