Last Wednesday at 20:36, Bruno Haible wrote:

>> Or even
>> worse, what if administrator provides some dirs for the user in an
>> encoding different from the one user wants to use?
>>
>> Eg. imagine having a global "/Müsik" in ISO-8859-1, and user desires
>> to use UTF-8 or ISO-8859-5.
>
> For this directory to be useful for different users, the files that it
> contains have to be in the same encoding. (If a user put the titles or
> lyrics of a song there in ISO-8859-5, and another user wants to see them
> in his UTF-8 locale, there will be a mess.) So a requirement for using
> a common directory is _anyway_ that all users are in locales with the
> same encoding.

Yeah, with the difference that in just ONE of those encodings all
users will be able to use whatever characters they wish, provided
everybody knows that file names use that encoding. 

That's what I was arguing anyway: the encoding of file names should be
per-system, not per-user, and the most suitable encoding for that
per-system role is UTF-8.

> All that you say about the file names is also valid for the file contents.
> A lot of them are in plain text, and filenames are easily converted into
> plain text. But all POSIX compliant applications have their interpretation
> of plain text guided by LC_CTYPE et al.

Indeed.  That's the big problem with external metadata: it is not
commonly transferred along with the data itself.

> However, when you recommend to an application author that his application
> should consider all filenames as being UTF-8, this is not an improvement.
> It is a no-op for the UTF-8 users but breaks the world of the EUC-JP and
> KOI8-R users.

You're right that it's not an improvement for them, but what are we
going to do with filenames extracted from, e.g., a tar file?  If I send
them a tar file with UTF-8 (or KOI8-R) encoded filenames, they're
going to see a mess (or get their terminal to hang).
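The mismatch can be sketched in a few lines of Python (my own
hypothetical illustration, not something from the original mail): a
filename written as UTF-8 bytes by the sender, then interpreted under a
KOI8-R locale by the recipient, silently turns into mojibake.

```python
# Hypothetical sketch of the filename-encoding mismatch described above.

# The sender stores a filename in the archive as UTF-8 bytes.
name_bytes = "Müsik".encode("utf-8")        # b'M\xc3\xbcsik'

# The recipient's tools interpret those bytes under a KOI8-R locale.
# KOI8-R assigns a character to every byte value, so the decode
# "succeeds" silently, producing garbage instead of an error.
as_seen_by_recipient = name_bytes.decode("koi8_r")

print(as_seen_by_recipient)   # garbled, not "Müsik"
```

Note that nothing fails loudly: the recipient simply sees six wrong
characters, which is exactly why the breakage is so easy to ship around
in archives.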

Yeah, that's solving the wrong problem (I want metadata attached to
everything :), but the brokenness is already there, and standardising on
one encoding (if it's suitable for everybody) is still a step forward:
we'll break some things in the process, but improve others.  After a
while, we'll be there.


Recommending that LC_CTYPE be set to a UTF-8 locale is an attempt to
lower the brokenness rate once the switch is finally made.  I don't
know if such a switch will ever really happen (compatibility, the large
amount of already-encoded data making the transition costly, etc.), but
I can at least dream about it ;-)
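For what it's worth, a program honouring that recommendation would
typically not hard-code UTF-8 but ask the C library which charset
LC_CTYPE selects.  A minimal sketch using Python's standard locale
module (my illustration; nl_langinfo is POSIX-only, so this assumes a
Unix-like system):

```python
import locale

# Adopt the locale settings from the environment (LANG / LC_CTYPE / LC_ALL).
locale.setlocale(locale.LC_CTYPE, "")

# nl_langinfo(CODESET) reports the charset of the current LC_CTYPE
# locale, e.g. "UTF-8", "KOI8-R" or "EUC-JP".
codeset = locale.nl_langinfo(locale.CODESET)
print("filenames on this system are interpreted as", codeset)
```

If everyone's codeset here came back "UTF-8", the per-system encoding
argued for above would already be in place.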

Cheers,
Danilo

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
