Mark J. Reed wrote:

Unicode per se doesn't do anything to file sizes; it's all in how you
encode it.

Yes. And basically there are two common ways to encode it: UTF-8 and UTF-16 (or similar variants requiring >= 2 bytes per character).

The UTF-8 encoding is not so attractive in locales that make
heavy use of characters which require several bytes to encode therein, or
relatively little use of characters in the ASCII range;

UTF-8 is fine for languages like German, Polish, Norwegian, Spanish, French, ..., where >= 90% of the characters in a text are 7-bit ASCII.

but that's why
there are other encoding schemes like SCSU which get you Unicode
compatibility while not taking up much more space than the locale's native charset.

These make sense for languages like Japanese, Korean, Chinese, etc., where you need more than one byte per character anyway.

But Russian, Greek, Hebrew, Arabic, Armenian and Georgian would work fine with one
byte per character, yet both kinds of encoding that I can think of make
this two bytes per character.  So for these I see file sizes doubled.  Or am I
missing something?  Anyway, it will be necessary to specify the encoding of Unicode in
some way, which could possibly even allow specifying some non-Unicode charsets.
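
To put rough numbers on the size question, here is a minimal Python sketch that
measures how many bytes a short sample takes under different encodings.  The
sample strings and the helper name encoded_sizes are illustrative assumptions,
and KOI8-R and Latin-1 merely stand in for "the locale's native charset".

```python
def encoded_sizes(text, encodings):
    """Return {encoding name: number of bytes} for text under each encoding."""
    return {enc: len(text.encode(enc)) for enc in encodings}

# Hypothetical sample texts, chosen only for illustration.
russian = "Привет, мир"     # 11 characters, 9 of them Cyrillic letters
german = "Grüße aus Köln"   # 14 characters, 3 of them outside ASCII

# "utf-16-le" is used so that no byte order mark is counted.
ru = encoded_sizes(russian, ["koi8_r", "utf-8", "utf-16-le"])
de = encoded_sizes(german, ["latin-1", "utf-8", "utf-16-le"])

for name, sizes in (("Russian", ru), ("German", de)):
    print(name, sizes)
```

For these samples the Russian text grows from 11 bytes in KOI8-R to 20 in UTF-8
and 22 in UTF-16, while the German text grows from 14 bytes in Latin-1 to 17 in
UTF-8.  Spaces and punctuation stay at one byte in UTF-8, so real Cyrillic text
ends up somewhat below a strict doubling.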

IMHO the OS should provide a standard way to specify such a charset as a file
attribute, but usually it does not and it won't in the future, so the charset
is only known when the file comes through the network and has a MIME header.
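
For the network case, the charset label can be read from the MIME header with
Python's standard email package.  This is only a sketch: the function name
charset_from_headers and the sample header are hypothetical, and a plain local
file has no such header to inspect.

```python
from email import message_from_string

def charset_from_headers(raw_message, default=None):
    """Return the charset parameter of the Content-Type header, or default."""
    msg = message_from_string(raw_message)
    # get_content_charset() lowercases the value and returns the
    # fallback when no charset parameter is present.
    return msg.get_content_charset(default)

# Hypothetical message, for illustration only.
sample = "Content-Type: text/plain; charset=KOI8-R\n\nbody\n"
print(charset_from_headers(sample))
```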

Best regards,

Karl


