On Wed, Mar 28, 2007 at 12:44:44PM -0400, SrinTuar wrote:
> >It's not so simple. Suppose you have a byte sequence (decimal) 65 195 129
> >66. (This is the beginning of the Hungarian alphabet AÁB... encoded in
> >UTF-8).
>
> Why is it not so simple? I just want to know some basic information:
> Does it match or not. What range of bytes in the string was matched.
It seems you didn't understand. It depends on how you interpret the byte
sequence above. If it is UTF-8, then it means "AÁB" and hence it matches the
regexp "A followed by a letter followed by B". However, the same byte
sequence encodes "A├üB" in CP437 ("A" followed by a vertical+right frame
element, followed by u with diaeresis, followed by "B"), and it encodes
"Aц│B" (Cyrillic "tse" and a vertical frame element) in KOI8-R, and so on.
In these latter cases it does not match the same regexp. See? Whether it
matches or not _does_ depend on the character set that you use. It's not a
flaw in perl that it can't decide; it is impossible in principle to decide
unless you know the charset.
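To make this concrete, here is a minimal sketch. It is in Python rather than
Perl, purely for illustration. The pattern A[^\W\d_]B is just one assumed way
of spelling "A followed by a letter followed by B", and the names raw,
LETTER_BETWEEN and matches are made up for this example.

```python
import re

# The four bytes from the example above: decimal 65 195 129 66.
raw = bytes([65, 195, 129, 66])

# "A, then exactly one letter, then B". [^\W\d_] is "a word character that
# is neither a digit nor an underscore", which is roughly "a letter".
LETTER_BETWEEN = re.compile(r"A[^\W\d_]B")


def matches(data: bytes, charset: str) -> bool:
    """Decode the bytes with the given charset, then test the regexp."""
    text = data.decode(charset)
    return LETTER_BETWEEN.search(text) is not None


# Same bytes, same regexp, three different charset assumptions.
results = {cs: matches(raw, cs) for cs in ("utf-8", "cp437", "koi8_r")}
print(results)
```

Only the UTF-8 interpretation matches. The bytes and the regexp never change;
the only thing that changes is the charset you tell the decoder to use.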
> I don't care what the regex library does under the covers, and I
> shouldnt have to care...
From a user's point of view, that is a reasonable expectation of any program:
it should _just work_ without requiring any charset knowledge from the user.
In an ideal world where only one character set (and one representation) is
used, a developer could expect the same from any programming language or
development environment. But our world is not ideal. There _are_ many
character sets out there, and it's _your_ job, the programmer's job, to tell
the compiler/interpreter how to handle your bytes and to hide all these
charset issues from the users. Therefore you have to be aware of the
technical issues and be able to handle them.

Having a variable in your code that stores a sequence of bytes, without being
able to tell which encoding is used, is just like having a variable that
stores people's height without knowing whether it's measured in centimeters,
meters or feet. The actions you can take are very limited (e.g. you can add
two of these to calculate how tall they'd be if one stood on top of the
other, although the answer would also lack a unit), and there are plenty of
questions you cannot answer.

There are many ways to solve charset problems, and which one to choose
depends on the goals of your software too. If you only handle _text_, then
probably the best approach is to convert every string, as soon as it arrives
at your application, to some Unicode representation (UTF-8 for Perl, "String"
(which uses UTF-16) for Java and so on), use this representation inside your
application, and convert again (if necessary) on output. If you must be able
to handle arbitrary byte sequences, then (as Rich pointed out) you should
keep the array of bytes, but you might need to adjust a few of the functions
that handle them; regexp matching, for instance, might be harder in this
case (what does a dot (any character) mean for raw bytes?).
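
The "convert on the way in, convert on the way out" approach could be
sketched like this (again in Python, only as an illustration). The helper
names to_internal and to_external are hypothetical, and ISO-8859-2 as the
output charset is merely an assumption for the example.

```python
def to_internal(raw: bytes, charset: str) -> str:
    """Input boundary: bytes become Unicode text; the caller names the charset."""
    return raw.decode(charset)


def to_external(text: str, charset: str) -> bytes:
    """Output boundary: Unicode text becomes bytes again for the outside world."""
    return text.encode(charset)


incoming = bytes([65, 195, 129, 66])            # arrives as UTF-8
text = to_internal(incoming, "utf-8")           # 3 characters, not 4 bytes
lowered = text.lower()                          # operates on characters
outgoing = to_external(lowered, "iso-8859-2")   # assumed Latin-2 consumer
```

Inside the application nothing has to know which charset the data arrived in
or will leave in; that knowledge lives only at the two boundaries.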
> If it knows how to match "Á" to ".", then I dont have to know how it
> goes about doing so.
Recently you asked why perl didn't simply work with bytes. Now you talk about
the letter "Á". But you seem to forget one very important step: how should
perl know that your sequence of bytes represents "Á" and not some other
letter(s)? It's _your_ job to tell perl that; there's no way it could figure
it out on its own. And this is where all this utf8 magic comes into play.
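
A small sketch of what "." ends up meaning with and without that step
(Python again, for illustration only; the variable names are made up):

```python
import re

raw = bytes([65, 195, 129, 66])   # decimal 65 195 129 66

# On raw bytes, "." can only mean "any one byte". The letter takes two
# bytes in UTF-8, so a single dot is not enough and two dots are needed.
bytes_one_dot = re.fullmatch(rb"A.B", raw) is not None
bytes_two_dots = re.fullmatch(rb"A..B", raw) is not None

# Once the bytes are declared to be UTF-8, "." means "any one character".
utf8_one_dot = re.fullmatch(r"A.B", raw.decode("utf-8")) is not None

# Declared to be CP437 instead, the same bytes are four characters.
cp437_one_dot = re.fullmatch(r"A.B", raw.decode("cp437")) is not None
```

The regex engine can match "Á" against "." only after it has been told that
bytes 195 129 are one character, and it is the decode step that tells it.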
--
Egmont
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/