>> The only thing that's not possible in legacy codepage locales is handling
>> text from other languages that need characters not present in your codepage.
>
> You say it's not possible??? Just launch firefox/opera/konqueror/whatever
> modern browser with a legacy locale and see whether it displays all
> foreign letters. It _does_, though you believe it's "not possible".
>
> But let's reverse the whole story. I write a homepage in Hungarian, using
> either latin2 or utf8 charset. Someone who lives in West Europe, America,
> Asia, the Christmas Island... anywhere else happens to visit this page.
> It's not only important for him, it's also important for me that my
> accents get displayed correctly there, under a locale unknown to me. And
> luckily that's how all good browsers work. I can't see why you're
> reasoning that this shouldn't (or mustn't?) work.
Correct me if I'm wrong, but isn't the web server supposed to tell the client
which charset is used: Latin2 or UTF-8? This can also be done with the <meta>
element in HTML. The client (browser) renders the page using the charset
suggested by the server (if possible), regardless of what locale the receiving
user has his/her environment set to. Some browsers might try to guess the
charset if the server doesn't specify it, but the "correct" solution is to
configure the server (or the HTML document) right.

In the (X)HTML document:

  <meta http-equiv="content-type" content="text/html; charset=utf-8" />

HTTP header from the server:

  Content-Type: text/html; charset=UTF-8

And an XML example:

  <?xml version="1.0" encoding="utf-8" ?>

It's the same with MIME - the sender tells the receiver what charset it used.
Since the charset for the string is specified, it's easy for the program to
know how to handle that string. If the sender doesn't specify a charset, some
(most? all?) sending clients assume that the content is written using the
locale set in the sender's environment (which might be wrong when text is
inserted through pipes).

These solutions work because the content is tagged with the charset used.
Guessing just isn't an option (except as a last resort for untagged content,
which shouldn't exist); the basic rule is that if a string isn't tagged,
there's no way to know for certain what charset was used when writing it.
(Analysing the bytes might result in a good guess, though.) If the receiving
end instead displays data based on the receiving user's locale, things will go
terribly wrong. The receiving end must also have a font which supports the
characters needed for rendering, but that's another issue.

Every program that inputs data should make sure that it knows what encoding is
used and give an error for any input that's malformed. The data is then tagged
with the right charset before being sent to the receiver, who will know what
to use to render it (see the sketch below). If the receiver can only render
using the charset of its own locale - their loss.

I'm just a beginner, so I might be completely wrong. If so - please educate
me.
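To make that last point concrete, here is a minimal Python sketch (my own
illustration, not code from any existing program; the helper name
read_tagged_text and the fallback-to-locale policy are just assumptions) of
the "validate on input, tag on output" idea: decode using the declared
charset, refuse malformed bytes, and label the outgoing text explicitly so the
receiver never has to guess.

  #!/usr/bin/env python3
  # Hypothetical sketch (not from the original post): validate input in a
  # declared charset, refuse malformed bytes, then emit the text tagged
  # as UTF-8 so the receiver never has to guess.
  import locale
  import sys
  from email.mime.text import MIMEText

  def read_tagged_text(raw, declared=None):
      """Decode raw bytes using the declared charset; fall back to the
      locale's charset only when nothing was declared.  Malformed input
      raises UnicodeDecodeError instead of being silently guessed at."""
      charset = declared or locale.getpreferredencoding(False)
      return raw.decode(charset, "strict")

  if __name__ == "__main__":
      raw = sys.stdin.buffer.read()
      try:
          # e.g. declared="iso-8859-2" if the source was tagged as latin2
          text = read_tagged_text(raw)
      except UnicodeDecodeError as exc:
          sys.exit("input is not valid in the assumed charset: %s" % exc)

      # Outgoing data gets an explicit label: MIMEText writes
      # "Content-Type: text/plain; charset=utf-8" into the headers for us.
      msg = MIMEText(text, "plain", "utf-8")
      sys.stdout.write(msg.as_string())

The fallback to the locale charset is only a policy choice; the essential
part is that nothing malformed gets passed along, and nothing gets passed
along untagged.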
Sincerely,
Fredrik
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/