On 9/1/06, Rich Felker <[EMAIL PROTECTED]> wrote:
> IMO the answer is common sense. Languages that have a low information per character density (lots of letters/marks per word, especially Indic) should be in 2-byte range and those with high information density (especially ideographic) should be in 3-byte range. If it weren't for so many legacy Latin blocks near the beginning of the character set, most or all scripts for low-density languages could have fit in the 2-byte range.
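For anyone following along, here is a rough sketch of where a few scripts currently land in UTF-8. The sample characters and words are my own picks for illustration, not anything from the message above; the byte counts follow from the UTF-8 ranges (U+0080 to U+07FF takes 2 bytes, U+0800 to U+FFFF takes 3).

```python
# Illustrative samples only (chosen for this sketch, not taken from the thread).
SAMPLES = {
    "Latin A": "\u0041",
    "Latin e-acute": "\u00e9",
    "Cyrillic zhe": "\u0436",
    "Devanagari ka": "\u0915",
    "Hiragana a": "\u3042",
    "CJK ideograph": "\u4e2d",
}

# A Hindi greeting (6 code points) and a Chinese greeting (2 code points).
HINDI_WORD = "\u0928\u092e\u0938\u094d\u0924\u0947"
CHINESE_WORD = "\u4f60\u597d"


def utf8_len(text: str) -> int:
    """Return the number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


if __name__ == "__main__":
    for label, ch in SAMPLES.items():
        print(f"{label:15} U+{ord(ch):04X} -> {utf8_len(ch)} byte(s)")
    print("Hindi word:  ", utf8_len(HINDI_WORD), "bytes")
    print("Chinese word:", utf8_len(CHINESE_WORD), "bytes")
```

Both Devanagari and the ideographs cost 3 bytes per code point today, so the Indic word ends up several times larger than the ideographic one for a comparable greeting, which is the imbalance being described.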
Once you compress the data with a decent compression scheme, you might as well have stored each character by writing out its full Unicode name (e.g. "LATIN CAPITAL LETTER OU"); the final result will be about the same size. Furthermore, you can fit a decent-sized novel on a floppy uncompressed and a decent-sized library on a DVD uncompressed. The only application I've seen where text data size was really crucial was text messaging. Hence, common sense tells _me_ that we should put the scripts used by heavily text-messaging cultures in the 2-byte range; that is, Latin, Hiragana and Katakana.

-- 
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
