On Wed, Mar 28, 2007 at 05:57:35PM -0400, SrinTuar wrote:

> The regex library can ask the locale what encoding things are in, just
> like everybody else
The locale tells you which encoding your system uses _by default_. This is
not necessarily the same as the encoding of the data you're currently
working with.

> Even then, the user and app programmer should not have to care what
> encoding is being used.

For the user: you're perfectly right.

For the programmer: how would you write a browser or a mail client if you
completely ignored the charset information in the MIME header? How would
you write a console mp3 id3v2 editor if you completely ignored the
console's charset or the charset used within the id3v2 tags? How would you
write a database frontend if you completely ignored the local charset as
well as the charset used in the database? (Someone inserts some data,
someone else queries it and receives different letters...)

If you're the programmer, you can only ignore the locale in the simplest
situations, e.g. when appending two strings that you know are encoded in
the same charset, when determining the extension part of a filename, etc.
For more complicated operations you must know how your data is encoded,
no matter what programming language you use.

> > There _are_ many character sets out there, and it's _your_ job, the
> > programmer's job to tell the compiler/interpreter how to handle your
> > bytes and to hide all these charset issues from the users. Therefore
> > you have to be aware of the technical issues and have to be able to
> > handle them.
>
> If that was true then the vast majority of programs would not be i18n'd.

That's false. Check, for example, the bind_textdomain_codeset call. In
Gtk+-2 apps you call it with a UTF-8 argument. This happens because you
_know_ that you'll need this data encoded in UTF-8. In most other
applications you omit this call because there you _know_ you need the data
in the encoding of the current locale. In both cases it's important that
you _know_ what encoding is used.
(By "knowing the charset" I don't necessarily mean one particular fixed
charset known in advance; a dynamic one, such as "the charset set by our
locale" or "the charset named in that variable", is a perfect choice too.)

> Luckily, there is a way to support utf-8 without having to really
> worry about it:
> Just think in bytes!

Seems we have a different concept of "thinking". For example, when you
write a Gtk+2 application, you of course _work_ with bytes, but at a
higher level of abstraction you know that a UTF-8 encoding is used there
and hence you _think_ in characters.

> I wish perl would let me do that- it works so well in C.

I have already written this twice. Just in case you haven't seen it, I'll
write it for the third time: Perl _lets_ you think/work in bytes. Just
ignore everything related to UTF-8. Just never set the utf8 mode. You'll
be back in the world of bytes. It's that simple!

> Hrm, I think Java needs to be fixed.

Sure. Just alter the specifications. 99% of the existing programs will
work incorrectly and would need to be fixed according to the new "fixed"
language definition. It's so simple, isn't it? :-)

> Their internal utf-16 mandate was a mistake, imo.

That was not utf-16 but ucs-2 at that time, and imo it was a perfectly
reasonable decision in those days.

> They should store strings in whatever the locale says they are in.

Oh yes... Surely everyone would be perfectly happy if his software weren't
able to handle characters that don't fit in his locale. Just because
someone still uses an iso-8859-1 charset, he surely wants his browser to
display question marks instead of foreign accented letters and kanjis,
right?

> (and the locale should always say utf-8)

Should, but doesn't. It's your choice whether you want your application to
work everywhere, or only under utf-8 locales.

> Normally, you should not have to ever convert strings between
> encodings. It's just not your problem, plus it introduces a ton of
> potential headaches.
> Just assume your input is in the encoding it's supposed to be in.

Ha-ha-ha. Do you know what makes my head ache? Seeing accented characters
of my mother tongue displayed incorrectly. The only way they can be
displayed correctly is if you _know_ the encoding used in each file, each
data stream, each string, etc. If you don't know their encoding, it's
hopeless to display them correctly.

I admit that in an ideal world everything would be encoded in UTF-8. Just
don't forget: our world is not ideal. My browser has to display web pages
encoded in Windows-1250 correctly. My e-mail client has to display
messages encoded in iso-8859-2 correctly. And so on...

-- 
Egmont

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
