On 9/1/06, Rich Felker <[EMAIL PROTECTED]> wrote:
> IMO the answer is common sense. Languages that have a low information
> per character density (lots of letters/marks per word, especially
> Indic) should be in 2-byte range and those with high information
> density (especially ideographic) should be in 3-byte range. If it
> weren't for so many legacy Latin blocks near the beginning of the
> character set, most or all scripts for low-density languages could
> have fit in the 2-byte range.
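
For reference, a minimal Python sketch of where a few scripts land in
UTF-8 today. The sample characters and the helper name utf8_len are
only illustrative choices; the boundaries themselves (2 bytes up to
U+07FF, 3 bytes from U+0800 through U+FFFF) come from the UTF-8
definition.

```python
# Illustrative sample characters, one per script (arbitrary picks).
samples = {
    "Latin a (U+0061)": "a",
    "Latin e-acute (U+00E9)": "\u00e9",
    "Devanagari ka (U+0915)": "\u0915",
    "Hiragana a (U+3042)": "\u3042",
    "CJK ideograph (U+4E2D)": "\u4e2d",
}


def utf8_len(ch):
    """Return the number of bytes UTF-8 uses to encode the string ch."""
    return len(ch.encode("utf-8"))


for label, ch in samples.items():
    print(label, "->", utf8_len(ch), "byte(s)")
```

Devanagari, Hiragana and the CJK ideograph all come out at 3 bytes,
while accented Latin fits in 2.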

Once you compress the data with a decent compression scheme, you may
as well store the data by writing out the full Unicode name (e.g.
"LATIN CAPITAL LETTER OU"); the final result will be about the same
size. Furthermore, you can fit a decent-sized novel on a floppy
uncompressed and a decent-sized library on a DVD uncompressed. The
only application I've seen where text data size was really crucial was
text messaging. Hence, common sense tells _me_ that we should put
scripts used by heavily text-messaging cultures in the 2-byte range;
that is, Latin, Hiragana and Katakana.
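
A rough way to try the compression claim yourself, sketched in Python
with the standard zlib and unicodedata modules. The helper names
as_names and sizes, the one-name-per-line layout and the sample text
are assumptions made for illustration. A repeated sentence compresses
far better than real prose, so substitute your own text before drawing
conclusions.

```python
import unicodedata
import zlib


def as_names(text):
    """Spell out every character by its Unicode name, one name per line."""
    # Characters without a name (e.g. control characters) fall back to U+XXXX.
    return "\n".join(
        unicodedata.name(ch, "U+%04X" % ord(ch)) for ch in text
    )


def sizes(text):
    """Return raw and zlib-compressed byte counts for both representations."""
    raw = text.encode("utf-8")
    named = as_names(text).encode("ascii")
    return {
        "utf8": len(raw),
        "names": len(named),
        "utf8_zlib": len(zlib.compress(raw, 9)),
        "names_zlib": len(zlib.compress(named, 9)),
    }


# Placeholder sample; replace with a real document for a fair comparison.
sample = "The quick brown fox jumps over the lazy dog. " * 200
result = sizes(sample)
for key, value in result.items():
    print(key, value)
```

Comparing utf8_zlib with names_zlib on real text shows how much of the
verbose naming the compressor actually absorbs.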

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
