This is the Unicode patch for plucker & palmfontconv.
List of relevant changes:

Palmfontconv:

topalmtext has a new parameter, -r, which selects the ranges of Unicode
characters to include. Use it like this:
topalmtext -r0000-017f:0400-04f9:2010-2030 ...
to select Latin-1, Latin Extended-A, Cyrillic and some useful punctuation
for the font.
Range 0000-00ff is added by default.
Parameter -u (from the previous patch) is gone.
See also the topluckergray-unicode script.
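
To make the -r syntax concrete, here is a minimal Python sketch of how such a
range specification could be parsed. It is not the topalmtext code: the
function name parse_ranges is hypothetical, and merging overlapping or
adjacent ranges is an assumption about sensible behaviour, not something the
patch is known to do. Only the "hex-hex:hex-hex" syntax and the default
0000-00ff range come from the description above.

```python
def parse_ranges(spec, default=((0x0000, 0x00FF),)):
    """Parse 'AAAA-BBBB:CCCC-DDDD' into sorted, merged (first, last) pairs.

    The default range 0000-00ff is always included, as described above.
    Merging of overlapping/adjacent ranges is an assumption of this sketch.
    """
    ranges = list(default)
    for part in spec.split(":"):
        first, last = (int(x, 16) for x in part.split("-"))
        if first > last:
            raise ValueError(f"reversed range: {part}")
        ranges.append((first, last))

    ranges.sort()
    merged = [ranges[0]]
    for first, last in ranges[1:]:
        prev_first, prev_last = merged[-1]
        if first <= prev_last + 1:
            # Overlapping or directly adjacent: extend the previous range.
            merged[-1] = (prev_first, max(prev_last, last))
        else:
            merged.append((first, last))
    return merged


for first, last in parse_ranges("0000-017f:0400-04f9:2010-2030"):
    print(f"{first:04x}-{last:04x}")
```

With the example from above, the default 0000-00ff range is absorbed into
0000-017f, so three ranges remain.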

- The font format has changed a bit, but remains backward compatible
with official plucker releases (though NOT with my previous patch).
It now supports discontinuous glyph ranges, with the first 256 characters
in one continuous range for backward compatibility and optimization.
- Consequently, fontversion is changed to 3.
- New fonts do work with old plucker, but if they are big, old
plucker tries to allocate too much memory and might fail while doing so.
- You can have only about 5400 glyphs in a gray font.


Plucker:

- usese8bitChars is not being fiddled with anymore, with the exception of
grayfont.c, where it is necessary.
- uses binary search for new fonts
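
The idea of a binary search over discontinuous glyph ranges can be sketched
in Python as follows. This only illustrates the technique; it is not the
plucker code, and it says nothing about the real on-disk layout of version 3
fonts. The names build_index and glyph_index and the in-memory tuples are
hypothetical. The only parts taken from the description above are that the
ranges are discontinuous and that the first 256 characters sit in one
continuous range, which allows a direct lookup for them.

```python
from bisect import bisect_right


def build_index(ranges):
    """Build lookup tables from sorted, non-overlapping (first, last) pairs.

    offsets[i] is the glyph index of the first code point of range i,
    assuming glyphs are stored range after range without gaps.
    """
    starts, lasts, offsets = [], [], []
    total = 0
    for first, last in ranges:
        starts.append(first)
        lasts.append(last)
        offsets.append(total)
        total += last - first + 1
    return starts, lasts, offsets


def glyph_index(code, index):
    """Return the glyph index for a code point, or None if it is not in the font."""
    starts, lasts, offsets = index
    # Fast path: the first 256 characters form one continuous range
    # starting at 0, so the glyph index equals the code point.
    if code < 256 and starts and starts[0] == 0 and lasts[0] >= 255:
        return code
    # Binary search for the last range whose start is <= code.
    i = bisect_right(starts, code) - 1
    if i < 0 or code > lasts[i]:
        return None
    return offsets[i] + (code - starts[i])


index = build_index([(0x0000, 0x017F), (0x0400, 0x04F9), (0x2010, 0x2030)])
print(glyph_index(0x0041, index))  # Latin capital A, direct lookup
print(glyph_index(0x0410, index))  # Cyrillic capital A, found by binary search
print(glyph_index(0x0300, index))  # not in any range
```

The search cost grows with the number of ranges, not with the number of
glyphs, which is why keeping the ranges sorted matters more than their size.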


How to test it:
Get the sample text and fonts from
http://kassiopeia.juls.savba.sk/~garabik/plucker/
The fonts are rather big, but that is good for demonstration purposes.
Use the topluckergray-unicode script to generate your own suitable fonts
with only the needed glyphs.


Comparison between different encodings and encoding schemes:

I took the novel The Three Musketeers by A. Dumas, in French, and its
translation into Russian. French was chosen deliberately because, of the
European languages, it seems to make the most use of diacritics, and Russian
is a typical example of a language with a non-Latin script.
"increase" in the following tables is the penalty one gets for using Unicode
instead of a legacy 8-bit encoding.
"uni" encoding means plucker format with UNICODE functions for Unicode chars.
This is a bit unfair, since in the current (+my patch) parser implementation
1 byte is added to each Unicode character (meant as a fallback when
Unicode fonts are not available), so the results could be made better.

Conclusion: we all know that zlib compression is much better than doc.
Even for diacritic-rich French, the penalty for using Unicode
is just a few percent.
For Russian, the original text does not double in size when going to utf-8,
as one would expect, but grows only by 67%, because spaces and punctuation
still take one byte each (it would be worse for languages using neither
spaces nor ASCII punctuation, like Japanese).
Compression smooths the penalty a bit, but not that much. However,
the best Unicode results are with a utf-8 encoded plucker document (which is
just an example, since it is not readable by plucker), where the penalty is 6%
for doc compression and 23% for zlib (zlib is of course much better
in absolute numbers). So perhaps, if document size is a big issue, it would
make sense to use utf-8 for plucker documents, tag them as such
and implement support for it in plucker.
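
The reason Russian text grows by less than 100% in utf-8 can be checked with
a few lines of Python: Cyrillic letters take two bytes, while spaces and
ASCII punctuation keep taking one. The helper name utf8_penalty and the
sample sentence are made up for illustration (the sentence is not taken from
the novel), so the printed figure will differ from the 67% measured on the
full text.

```python
def utf8_penalty(text, legacy_encoding):
    """Return the size increase, in percent, of UTF-8 over a legacy 8-bit encoding."""
    legacy_size = len(text.encode(legacy_encoding))
    utf8_size = len(text.encode("utf-8"))
    return (utf8_size - legacy_size) / legacy_size * 100


# Made-up sample sentence; any Russian text with spaces and ASCII
# punctuation shows the same effect.
sample = "Д'Артаньян вошёл в комнату, и все замолчали."
print(f"{utf8_penalty(sample, 'cp1251'):.1f}%")
```

A text made only of Cyrillic letters would give exactly 100%, and pure ASCII
text would give 0%; real text lands in between.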


russian text:
type    encoding   size   increase
original cp1251  1471835
gzipped  cp1251   555947
original  utf-8  2463173 +67.4%
gzipped   utf-8   651722 +17.2%
pdb doc     uni  1734806 +45.3%
pdb zlib    uni   999639 +63.0%
pdb doc  cp1251  1194075 
pdb zlib cp1251   613449
pdb doc   utf-8  1263083 +5.8%
pdb zlib  utf-8   753543 +22.8%

french text:
type    encoding   size   increase
original   iso1  1415790
gzipped    iso1   517698
original  utf-8  1445719 +2.1%
gzipped   utf-8   522078 +0.8%
pdb doc    iso1   836411 
pdb zlib   iso1   592018
pdb doc     uni   865120 +3.4%
pdb zlib    uni   614324 +3.8%
pdb doc   utf-8   851363 +1.8%
pdb zlib  utf-8   599377 +1.2%
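
For anyone who wants to recheck the "increase" column: it is the relative
growth of each Unicode row over the legacy-encoded row of the same type.
A short Python sketch, with sizes copied from the tables above (the function
name increase is only for illustration):

```python
def increase(size, baseline):
    """Relative growth of size over baseline, in percent, rounded to one decimal."""
    return round((size - baseline) / baseline * 100, 1)


# Russian, original text: utf-8 against cp1251.
print(increase(2463173, 1471835))
# Russian, pdb doc: utf-8 against cp1251.
print(increase(1263083, 1194075))
# French, pdb zlib: utf-8 against iso1.
print(increase(599377, 592018))
```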


-- 
 -----------------------------------------------------------
| Radovan Garabík http://melkor.dnp.fmph.uniba.sk/~garabik/ |
| __..--^^^--..__    garabik @ melkor.dnp.fmph.uniba.sk     |
 -----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!

Attachment: plucker.diff.gz
Description: Binary data

Attachment: palmfontconv.diff.gz
Description: Binary data
