Re: Latin-1-characters

Karl Brodowsky Mon, 15 Mar 2004 15:14:54 -0800

Mark J. Reed wrote:

Unicode per se doesn't do anything to file sizes; it's all in how you
encode it.


Yes.  And basically there are two common ways to encode it: UTF-8 and UTF-16
(or similar variants requiring >= 2 bytes per character).

The UTF-8 encoding is not so attractive in locales that make
heavy use of characters which require several bytes to encode therein, or
relatively little use of characters in the ASCII range;


UTF-8 is fine for languages like German, Polish, Norwegian, Spanish, French, ...
in which >= 90% of the characters in a text are 7-bit ASCII.
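
To make the size argument concrete, here is a minimal sketch that counts the bytes
needed for a short German sample in three encodings.  The sample string and the
helper name encoded_sizes are made up for illustration; only the standard
str.encode call is used.

```python
# Compare encoded sizes of a mostly-ASCII German sample.
# The sample text and the helper name are illustrative assumptions.

def encoded_sizes(text, encodings):
    """Return {encoding name: number of bytes needed to encode text}."""
    return {name: len(text.encode(name)) for name in encodings}

german = "Grüße aus Köln"  # 14 characters, 3 of them outside ASCII
sizes = encoded_sizes(german, ["latin-1", "utf-8", "utf-16-le"])

for name, size in sizes.items():
    print(f"{name:10} {size} bytes")
```

Each of the three non-ASCII letters costs one extra byte in UTF-8, so the overhead
over Latin-1 stays small, while UTF-16 doubles the size of the whole string.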

but that's why there are other encoding schemes like SCSU which get you Unicode compatibility while not taking up much more space than the locale's native charset.


These make sense for languages like Japanese, Korean, Chinese etc, where you need
more than one byte per character anyway.

But Russian, Greek, Hebrew, Arabic, Armenian and Georgian would work fine with one
byte per character in a native charset.  Both kinds of encoding that I can think of
(UTF-8 and UTF-16) make this two bytes per character, so for these I see file sizes
doubled.  Or am I missing something?  Anyway, it will be necessary to specify the
encoding of Unicode in some way, and that mechanism could possibly even allow
specifying some non-Unicode charsets.
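
The doubling can be checked with a small sketch.  The sample words and the choice
of native charsets (KOI8-R, ISO-8859-7, ISO-8859-8) are assumptions picked for
illustration; other single-byte charsets for these scripts would give the same
counts.

```python
# Byte counts for short samples in a single-byte native charset,
# UTF-8 and UTF-16.  Samples and charset choices are illustrative.

samples = {
    "Russian": ("Привет", "koi8-r"),
    "Greek": ("ΑΘΗΝΑ", "iso8859-7"),
    "Hebrew": ("שלום", "iso8859-8"),
}

def byte_counts(text, native):
    """Return (native, UTF-8, UTF-16) sizes in bytes for text."""
    return (
        len(text.encode(native)),
        len(text.encode("utf-8")),
        len(text.encode("utf-16-le")),  # -le variant: no byte order mark
    )

counts = {lang: byte_counts(text, native)
          for lang, (text, native) in samples.items()}

for lang, (native, utf8, utf16) in counts.items():
    print(f"{lang:8} native={native} utf-8={utf8} utf-16={utf16}")
```

For these samples both Unicode encodings need exactly twice the bytes of the
native charset.  Real text also contains spaces, digits and ASCII punctuation,
which stay at one byte in UTF-8, so the factor there is somewhat below two.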

IMHO the OS should provide a standard way to specify such a charset as a file
attribute, but usually it does not, and it won't in the future, unless the file comes
through the network and has a MIME header.
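
As a sketch of the MIME case, the charset parameter of a Content-Type header can
be read with Python's standard email package and then used to decode the bytes.
The header value and the payload here are invented for illustration.

```python
# Read the charset from a MIME Content-Type header and use it to
# decode a byte payload.  Header value and payload are made up.
from email.message import Message

header = Message()
header["Content-Type"] = "text/plain; charset=KOI8-R"

charset = header.get_content_charset()  # charset parameter, lowercased

payload = "Привет".encode("koi8-r")     # bytes as they would arrive
text = payload.decode(charset)

print(charset, text)
```

Without such a header the receiver has to guess or assume the charset, which is
the situation for ordinary files on disk.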

Best regards,

Karl
