>> The only thing that's not
>> possible in legacy codepage locales is handling text from other languages
>> that need characters not present in your codepage.
>
> You say it's not possible??? Just launch firefox/opera/konqueror/whatever
> modern browser with a legacy locale and see whether it displays all
> foreign letters. It _does_, though you believe it's "not possible".
>
> But let's reverse the whole story. I write a homepage in Hungarian, using
> either latin2 or utf8 charset. Someone who lives in West Europe, America,
> Asia, the Christmas Island... anywhere else happens to visit this page.
> It's not only important for him, it's also important for me that my
> accents get displayed correctly there, under a locale unknown to me. And
> luckily that's how all good browsers work. I can't see why you're
> reasoning that this shouldn't (or mustn't?) work.

Correct me if I'm wrong, but isn't the web server supposed to tell the
client which charset is used: Latin2 or UTF-8? This can be done with the
HTTP Content-Type header or the <meta> element in HTML. The client
(browser) renders the page using the charset declared by the server (if
possible), regardless of which locale the receiving user's environment
is set to. Some browsers try to guess the charset if the server doesn't
specify one, but the "correct" solution is to configure the server (or
the HTML document) properly.

In the (X)HTML document:
<meta http-equiv="content-type" content="text/html; charset=utf-8" />

HTTP header from the server:
Content-Type: text/html; charset=UTF-8

And an XML example:
<?xml version="1.0" encoding="utf-8" ?>
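
On the client side, honoring the declared charset can look roughly like
the following minimal Python sketch (the URL is a placeholder, and
falling back to UTF-8 when nothing is declared is my own assumption,
not something the standards require):

import urllib.request

# Decode the fetched page with the charset the server declared in the
# Content-Type header; real browsers have more elaborate fallbacks.
with urllib.request.urlopen("http://example.com/") as resp:
    charset = resp.headers.get_content_charset() or "utf-8"
    text = resp.read().decode(charset)
    print(charset, len(text))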

It's the same with MIME - the sender tells the receiver which charset was
used. Since the charset of the strings is declared, it's easy for the
receiving program to know how to handle them. If the sender doesn't
specify a charset, some/most/all sending clients assume that the content
is written in the charset of the locale set in the sender's environment
(which may be wrong if text is inserted through pipes).
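
For example, Python's standard email library tags each part with the
charset you give it (just a sketch; the Hungarian sample text is only
there to show non-ASCII content):

from email.mime.text import MIMEText

# The library writes a Content-Type header carrying the charset and
# picks a suitable Content-Transfer-Encoding for the body.
msg = MIMEText("árvíztűrő tükörfúrógép", _charset="utf-8")
print(msg["Content-Type"])   # text/plain; charset="utf-8"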

These solutions work because the content is tagged with the charset used.
Guessing should only be a last resort for untagged content (which
shouldn't exist in the first place); the basic rule is that if a string
isn't tagged, there's no reliable way to know which charset was used
when writing it. (Analysing the byte patterns may still give a good
guess.) If the receiving end instead interprets the data according to
the receiving user's locale, things will go terribly wrong.
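
Such a guess can be as simple as the following sketch; the ISO 8859-2
fallback is only an example of a legacy codepage, picked because the
quoted page was Hungarian:

def guess_decode(raw: bytes) -> str:
    # Valid UTF-8 is easy to recognise: legacy 8-bit text almost never
    # decodes cleanly as UTF-8 by accident.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-2")

print(guess_decode("Kérem szépen".encode("utf-8")))
print(guess_decode("Kérem szépen".encode("iso-8859-2")))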

The receiving end must also have a font that covers the characters
needed for rendering, but that's another issue.

Every program that reads input should make sure it knows which encoding
is used and report an error for any input that's malformed. The data is
then tagged with the right charset before being sent to the receiver,
which then knows how to render it. If the receiver can only render text
in the charset of its own locale - their loss.
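
A minimal sketch of that kind of strict input handling (the file name
and the expected encoding are assumptions for the example):

import sys

def read_tagged_input(path: str, encoding: str = "utf-8") -> str:
    # errors="strict" is the point: malformed bytes raise an error
    # instead of being silently replaced or passed through.
    try:
        with open(path, encoding=encoding, errors="strict") as f:
            return f.read()
    except UnicodeDecodeError as exc:
        sys.exit(f"{path}: not valid {encoding}: {exc}")

text = read_tagged_input("input.txt")   # hypothetical file name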

I'm just a beginner, so I might be completely wrong. If so - please
educate me.

Sincerely,
Fredrik
