2007/3/28, Egmont Koblinger <[EMAIL PROTECTED]>:
> Why is it not so simple? I just want to know some basic information:
> Does it match or not. What range of bytes in the string was matched.
Seems you didn't understand. It depends on how to interpret the byte
sequence above. If it stands in UTF-8 then it means "...) in KOI8-R and so
I wasn't talking about encoding detection really, though...
The regex library can ask the locale what encoding things are in, just
like everybody else.
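
A minimal sketch of what "asking the locale" could look like in C,
assuming a POSIX system. The helper name locale_codeset is made up for
illustration; setlocale() and nl_langinfo(CODESET) are the standard calls.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>

/* Hypothetical helper: name of the character encoding of the user's locale. */
const char *locale_codeset(void)
{
    /* Adopt the environment's locale (LANG, LC_ALL, ...) instead of the
     * default "C" locale. If this fails, the "C" locale stays in effect. */
    setlocale(LC_CTYPE, "");

    /* The codeset name, e.g. "UTF-8" or "KOI8-R"; exact spelling varies by system. */
    const char *codeset = nl_langinfo(CODESET);
    assert(codeset != NULL); /* defensive check only */
    return codeset;
}
```

A library could call something like this once at start-up and pick its
byte-level or multibyte code path from the result.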
Even then, the user and app programmer should not have to care what
encoding is being used.
In an ideal world where no more than one character set (and one
representation) is used, a developer could expect the same from any
programming language or development environment. But our world is not ideal.
There _are_ many character sets out there, and it's _your_ job, the
programmer's job to tell the compiler/interpreter how to handle your bytes
and to hide all these charset issues from the users. Therefore you have to
be aware of the technical issues and have to be able to handle them.
If that were true, then the vast majority of programs would not be i18n'd.
Luckily, there is a way to support utf-8 without having to really
worry about it:
Just think in bytes! I wish Perl would let me do that; it works so well in C.
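
To illustrate the byte-oriented approach, here is a rough sketch of a
substring search that reports the matched byte range. The helper name
find_bytes is hypothetical. It relies on the fact that UTF-8 is
self-synchronizing: a valid UTF-8 needle can only match a valid UTF-8
haystack on character boundaries, so plain strstr() is enough.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: byte-wise substring search.
 * On a match, stores the half-open byte range [*start, *end) and returns 1.
 * Returns 0 if the needle does not occur. */
int find_bytes(const char *haystack, const char *needle,
               size_t *start, size_t *end)
{
    assert(haystack != NULL && needle != NULL);

    /* strstr compares raw bytes; it neither knows nor cares about the encoding. */
    const char *hit = strstr(haystack, needle);
    if (hit == NULL)
        return 0;

    *start = (size_t)(hit - haystack);
    *end = *start + strlen(needle);
    return 1;
}
```

For example, searching the UTF-8 bytes of "naïve café" for "café" yields
the byte range 7 to 12, even though the match covers only four characters.
The same code works unchanged for any ASCII-compatible single-byte
encoding, as long as haystack and needle use the same one.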
There are many ways to solve charset problems, and which one to choose
depends on the goals of your software too. If you only handle _texts_ then
probably the best approach is to convert every string, as soon as it arrives
at your application, to some Unicode representation (UTF-8 for Perl, "String"
(which uses UTF-16) for Java and so on).
Hrm, I think Java needs to be fixed. Their internal utf-16 mandate was
a mistake, imo.
They should store strings in whatever the locale says they are in.
(and the locale should always say utf-8)
Normally, you should never have to convert strings between
encodings. It's just not your problem, plus it introduces a ton of
potential headaches.
Just assume your input is in the encoding it's supposed to be in.
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/