On Mon, Sep 04, 2006 at 11:44:26PM -0500, David Starner wrote:
> On 9/1/06, Rich Felker <[EMAIL PROTECTED]> wrote:
> >IMO the answer is common sense. Languages that have a low information
> >per character density (lots of letters/marks per word, especially
> >Indic) should be in 2-byte range and those with high information
> >density (especially ideographic) should be in 3-byte range. If it
> >weren't for so many legacy Latin blocks near the beginning of the
> >character set, most or all scripts for low-density languages could
> >have fit in the 2-byte range.
>
> Once you compress the data with a decent compression scheme, you may
> as well store the data by writing out the full Unicode name (e.g.
> "LATIN CAPITAL LETTER OU"); the final result will be about the same
> size.

With some compression methods this is true, particularly bz2.

> Furthermore, you can fit a decent sized novel on a floppy
> uncompressed and a decent sized library on a DVD uncompressed.

Yet somehow the firefox source code is still 36 megs (bz2), and god
only knows how large OOO is. Imagine now if all the variable and
function names were written in Hindi or Thai... It would be an
interesting test to transliterate the Latin letters to Devanagari and
see how much the compressed tarball size goes up.

> The
> only application I've seen where text data size was really crucial was
> text messaging. Hence, common sense tells _me_ that we should put
> scripts used by heavily text-messaging cultures in the 2-byte range;
> that is, Latin, Hiragana and Katakana.

ROTFL! :)

In all seriousness, though, unless you're dealing with image, music,
or movie files, text weighs in quite heavy in size. It's true that in
html 75-90% of the size is usually tags (in ASCII) but that's due to
incompetence of the web designers and their inability to use CSS
correctly, not anything fundamental. If you're making a website
without fluff and with lots of information, text size will be the
dominant factor in traffic. It's quite unfortunate that native
language text is 3 to 6(*) times larger in countries where bandwidth
is very expensive.

Rich

(*) 6 because a large number of characters in Indic scripts will have
the virama (a combining character) attached to them to remove the
inherent vowel and attach them into clusters.

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
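
For anyone who wants to check the byte counts and try the transliteration
test mentioned above, here is a rough Python sketch. It is only an
illustration under stated assumptions: utf8_len, compare_sizes and the
TO_DEVA table are names invented here, and the table is a made-up
one-to-one mapping of a-z onto Devanagari consonants, not a real
transliteration scheme. The sample "source code" line is likewise invented.

```python
import bz2


def utf8_len(text: str) -> int:
    """Return the number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# ASSUMPTION: a made-up one-to-one mapping of a-z onto the 26 code points
# U+0915..U+092E (Devanagari consonants). It only serves to move every
# lowercase Latin letter into the 3-byte UTF-8 range; it is not a real
# transliteration scheme.
TO_DEVA = {ord("a") + i: 0x0915 + i for i in range(26)}


def compare_sizes(text: str) -> dict:
    """Raw and bz2-compressed byte sizes before and after the mapping."""
    latin = text.encode("utf-8")
    deva = text.translate(TO_DEVA).encode("utf-8")
    return {
        "raw": (len(latin), len(deva)),
        "bz2": (len(bz2.compress(latin)), len(bz2.compress(deva))),
    }


if __name__ == "__main__":
    samples = [
        ("k", "Latin letter"),
        ("\u00e9", "Latin-1 letter (e acute)"),
        ("\u0915", "Devanagari KA"),
        ("\u0915\u094d", "Devanagari KA + virama"),
        ("\u6f22", "CJK ideograph"),
    ]
    for sample, label in samples:
        print(f"{label}: {utf8_len(sample)} bytes")

    # Invented stand-in for a source tree; substitute real files to get
    # numbers that mean something.
    text = "static int parse_header(struct buffer *buf, size_t len);\n" * 200
    print(compare_sizes(text))
```

The consonant-plus-virama case is where the factor of 6 in the footnote
comes from: two code points in the 3-byte range where ASCII would spend one
byte. The bz2 figures depend heavily on the input, so the sketch prints them
without claiming what the ratio will be; running compare_sizes over a real
tarball's contents would be the meaningful version of the experiment.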
