On Mon, Sep 04, 2006 at 11:44:26PM -0500, David Starner wrote:
> On 9/1/06, Rich Felker <[EMAIL PROTECTED]> wrote:
> >IMO the answer is common sense. Languages that have a low information
> >per character density (lots of letters/marks per word, especially
> >Indic) should be in 2-byte range and those with high information
> >density (especially ideographic) should be in 3-byte range. If it
> >weren't for so many legacy Latin blocks near the beginning of the
> >character set, most or all scripts for low-density languages could
> >have fit in the 2-byte range.
> 
> Once you compress the data with a decent compression scheme, you may
> as well store the data by writing out the full Unicode name (e.g.
> "LATIN CAPITAL LETTER OU"); the final result will be about the same
> size.

With some compression methods this is true, particularly bz2.
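
A rough way to check this claim is to spell a text out as Unicode names and compress both forms. This is only a sketch: the helper names `spell_out` and `compare` are made up here, the sample text is an arbitrary placeholder, and a real test would use a large real-world corpus.

```python
import bz2
import unicodedata

def spell_out(text):
    """Replace each character with its Unicode name, one name per line.
    Characters without a name (e.g. newline) fall back to U+XXXX notation."""
    return "\n".join(unicodedata.name(ch, f"U+{ord(ch):04X}") for ch in text)

def compare(text):
    """Return raw and bz2-compressed sizes for UTF-8 and spelled-out forms."""
    raw = text.encode("utf-8")
    named = spell_out(text).encode("ascii")
    return {
        "utf8": len(raw),
        "names": len(named),
        "utf8_bz2": len(bz2.compress(raw)),
        "names_bz2": len(bz2.compress(named)),
    }

# Placeholder sample; repeated text compresses unrealistically well.
sample = "Once you compress the data with a decent compression scheme. " * 200
result = compare(sample)
print(result)
```

How close the two compressed sizes end up depends on the compressor and on the text, so the numbers from a toy sample like this should not be read as a general result.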

> Furthermore, you can fit a decent sized novel on a floppy
> uncompressed and a decent sized library on a DVD uncompressed.

Yet somehow the Firefox source code is still 36 megs (bz2), and god
only knows how large OOo is. Imagine now if all the variable and
function names were written in Hindi or Thai... It would be an
interesting test to transliterate the Latin letters to Devanagari and
see how much the compressed tarball size goes up.
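
A minimal sketch of that test, under loud assumptions: the mapping below is not a real transliteration, just an arbitrary one-to-one remap of the 52 ASCII letters onto 52 consecutive Devanagari code points starting at U+0905, and `source` is a placeholder standing in for a real source tarball.

```python
import bz2
import string

# Arbitrary one-to-one mapping (an assumption, not a linguistic
# transliteration): a-z then A-Z -> U+0905 .. U+0938.
LETTERS = string.ascii_lowercase + string.ascii_uppercase
TO_DEVANAGARI = {ord(c): 0x0905 + i for i, c in enumerate(LETTERS)}

def remap(text):
    """Replace every ASCII letter with its assigned Devanagari code point."""
    return text.translate(TO_DEVANAGARI)

def sizes(text):
    """Return (UTF-8 size, bz2-compressed size) in bytes."""
    raw = text.encode("utf-8")
    return len(raw), len(bz2.compress(raw))

# Placeholder input; substitute the contents of real source files.
source = "int parse_header(struct buffer *buf, size_t length);\n" * 500

latin_raw, latin_bz2 = sizes(source)
deva_raw, deva_bz2 = sizes(remap(source))
print("raw bytes:  ", latin_raw, "->", deva_raw)
print("bz2 bytes:  ", latin_bz2, "->", deva_bz2)
```

Because the mapping is one-to-one, both texts carry the same information; any difference in compressed size comes from the encoding, which is the effect in question.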

> The
> only application I've seen where text data size was really crucial was
> text messaging. Hence, common sense tells _me_ that we should put
> scripts used by heavily text-messaging cultures in the 2-byte range;
> that is, Latin, Hiragana and Katakana.

ROTFL! :)

In all seriousness, though, unless you're dealing with image, music,
or movie files, text accounts for a large share of the data. It's true
that in HTML 75-90% of the size is usually tags (in ASCII), but that's
due to the incompetence of web designers and their inability to use CSS
correctly, not anything fundamental. If you're making a website
without fluff and with lots of information, text size will be the
dominant factor in traffic. It's quite unfortunate that native-language
text is 3 to 6(*) times larger in countries where bandwidth
is very expensive.

Rich


(*) 6 because a large number of characters in Indic scripts will have
the virama (a combining character) attached to them to remove the
inherent vowel and join them into clusters. In UTF-8 the consonant and
the virama each take 3 bytes.
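
The byte counts behind the footnote can be checked directly. This is just an illustrative sketch; the helper `utf8_len` and the choice of KA and SSA as sample letters are mine.

```python
def utf8_len(text):
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

KA = "\u0915"      # DEVANAGARI LETTER KA
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA
SSA = "\u0937"     # DEVANAGARI LETTER SSA

half_ka = KA + VIRAMA     # KA with its inherent vowel removed
kssa = KA + VIRAMA + SSA  # the conjunct cluster "kssa"

for label, s in [("Latin k", "k"), ("KA", KA),
                 ("KA+VIRAMA", half_ka), ("KA+VIRAMA+SSA", kssa)]:
    print(f"{label:15} code points={len(s)} utf8 bytes={utf8_len(s)}")
# Latin k         code points=1 utf8 bytes=1
# KA              code points=1 utf8 bytes=3
# KA+VIRAMA       code points=2 utf8 bytes=6
# KA+VIRAMA+SSA   code points=3 utf8 bytes=9
```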


--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
