On Wed, Mar 28, 2007 at 05:57:35PM -0400, SrinTuar wrote:

> The regex library can ask the locale what encoding things are in, just
> like everybody else
The locale tells you which encoding your system uses _by default_. This is
not necessarily the same as the encoding of the data you're currently
working with.

> Even then, the user and app programmer should not have to care what
> encoding is being used.

For the user: you're perfectly right.

For the programmer: how would you write a browser or a mail client if you
completely ignored the charset information in the MIME header? How would
you write a console mp3 id3v2 editor if you completely ignored the
console's charset or the charset used within the id3v2 tags? How would you
write a database frontend if you completely ignored the local charset as
well as the charset used in the database? (Someone inserts some data,
someone else queries it and receives different letters...)

If you're the programmer, you can only ignore the locale in the simplest
situations, e.g. when appending two strings that you know are encoded in
the same charset, when determining the extension part of a filename, etc.
For more complicated operations you must know how your data is encoded,
no matter what programming language you use.

> > There _are_ many character sets out there, and it's _your_ job, the
> > programmer's job to tell the compiler/interpreter how to handle your
> > bytes and to hide all these charset issues from the users. Therefore
> > you have to be aware of the technical issues and have to be able to
> > handle them.
>
> If that was true then the vast majority of programs would not be i18n'd.

That's false. Check, for example, the bind_textdomain_codeset call. In
Gtk+-2 apps you call it with a UTF-8 argument. This happens because you
_know_ that you'll need this data encoded in UTF-8. In most other
applications you omit this call because there you _know_ you need the data
in the encoding of the current locale. In both cases it's important that
you _know_ what encoding is used.
(By "knowing the charset" I don't necessarily mean one particular fixed
charset known in advance; a dynamic one, such as "the charset set by our
locale" or "the charset named in that variable", is a perfect choice too.)

> Luckily, there is a way to support utf-8 without having to really
> worry about it:
> Just think in bytes!

Seems we have a different concept of "thinking". For example, when you
write a Gtk+2 application, you of course _work_ with bytes, but at a
higher level of abstraction you know that a UTF-8 encoding is used there
and hence you _think_ in characters.

> I wish perl would let me do that- it works so well in C.

I have already written this twice. Just in case you haven't seen it, I'll
write it for the third time: Perl _lets_ you think/work in bytes. Just
ignore everything related to UTF-8. Just never set the utf8 mode. You'll
be back in the world of bytes. It's that simple!

> Hrm, I think Java needs to be fixed.

Sure. Just alter the specifications. 99% of the existing programs will
work incorrectly and would need to be fixed according to the new "fixed"
language definition. It's so simple, isn't it? :-)

> Their internal utf-16 mandate was a mistake, imo.

That was not utf-16 but ucs-2 at that time, and imo it was a perfectly
reasonable decision in those days.

> They should store strings in whatever the locale says they are in.

Oh yes... Surely everyone would be perfectly happy if his software weren't
able to handle characters that don't fit in his locale. Just because
someone still uses an iso-8859-1 charset, he surely wants his browser to
display question marks instead of foreign accented letters and kanjis,
right?

> (and the locale should always say utf-8)

Should, but doesn't. It's your choice whether you want your application to
work everywhere, or only under utf-8 locales.

> Normally, you should not have to ever convert strings between
> encodings. It's just not your problem, plus it introduces a ton of
> potential headaches.
> Just assume your input is in the encoding it's supposed to be in.

Ha-ha-ha. Do you know what makes my head ache? Seeing accented characters
of my mother tongue displayed incorrectly. The only way they can be
displayed correctly is if you _know_ the encoding used in each file, each
data stream, each string, etc. If you don't know their encoding, it's
hopeless to display them correctly.

I admit that in an ideal world everything would be encoded in UTF-8. Just
don't forget: our world is not ideal. My browser has to display web pages
encoded in Windows-1250 correctly. My e-mail client has to display
messages encoded in iso-8859-2 correctly. And so on...

-- 
Egmont

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
