Re: perl unicode support

Egmont Koblinger Wed, 28 Mar 2007 06:04:36 -0800

On Tue, Mar 27, 2007 at 01:51:59PM -0400, ＳｒｉｎＴｕａｒ wrote:

> I'm not quite sure how "thinking in characters" helps an application,
> in general. I'd be interested if you had a concrete example...


I don't have a concrete example. It's just a level of abstraction you have
in your mind. When you are coding, you are not just randomly hitting your
keyboard (see infinite monkeys vs. Shakespeare): you have something in
mind, you give your variables a meaning, you have an intent with your
code... That is what I meant by "thinking in characters". Probably all your
"if" branches, all your pointer increments and everything else happen
because you know that you are handling a _character_, and you write your
code accordingly. In most cases you can't write good code if you don't know
what kind of data you're dealing with. For example, it's impossible to
implement a regexp matching routine if you have no idea what encoding is
being used.

> It's probably advisable to use a library regex engine rather than to
> re-write custom regex engines all the time.

Sure.

> Once you have a regex library that handles codepoints, the code that uses
> it doesn't have to care about them in particular.

It's not so simple. Suppose you have the byte sequence (in decimal) 65 195
129 66. (This is the beginning of the Hungarian alphabet, AÁB..., encoded in
UTF-8.) Suppose you want to test whether it matches the regexp 65 46 66
("A.B"). Does it match? It depends. If the byte sequence really denotes AÁB
(i.e. it is encoded in UTF-8), then it does, because "." matches the single
character Á. If it has different semantics (a different character sequence
in some other 8-bit encoding), then it doesn't, because there are two
characters between A and B. How would perl overcome this problem if it had
no Unicode support?
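
To make the example concrete, here is a small sketch. It is written in
Python rather than Perl, purely for illustration, because Python keeps
byte strings and character strings as separate types. The variable names
are my own; only the four byte values and the "A.B" pattern come from the
example above.

```python
import re

# The four bytes from the example: "A", the two-byte UTF-8 form of "Á", "B".
data = bytes([65, 195, 129, 66])

# Byte semantics: "." consumes exactly one byte, but two bytes sit between
# A and B, so the whole sequence does not match.
byte_match = re.fullmatch(rb"A.B", data)

# Character semantics: decode as UTF-8 first. The result has three
# characters, and "." consumes the single character in the middle.
text = data.decode("utf-8")
char_match = re.fullmatch(r"A.B", text)

# The same bytes read as ISO-8859-1 give four characters, so no match.
latin1_text = data.decode("latin-1")
latin1_match = re.fullmatch(r"A.B", latin1_text)

print(byte_match is not None)    # False
print(char_match is not None)    # True
print(latin1_match is not None)  # False
```

The bytes never change; only the declared interpretation does, and that
alone decides whether the pattern matches.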

You have to make sure that the string to test and the regexp itself are
encoded in the same charset, and that this is also the charset the regexp
library routine expects. Otherwise things will go plain wrong sooner or
later. In some languages regexp matching is done via functions, so you
might have an 8-bit match() and a Unicode-aware mbmatch() side by side.
Remember that in perl regexp matching is part of the language itself: the
=~ and !~ operators do it. Offering mb=~ and mb!~ counterparts as built-in
operators would be IMHO terribly disgusting. If the operator itself remains
the same, then it is the string and regexp objects (the operands of that
operator) that have to carry the information the matching operator depends
on.
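
As a rough analogy (again in Python, not Perl, and not a description of
how perl implements it), a single matching entry point can serve both
worlds when the operands themselves say what they are. The matches()
helper below is hypothetical, introduced only for this sketch.

```python
import re

raw = bytes([65, 195, 129, 66])   # undecoded bytes
text = raw.decode("utf-8")        # the same data as characters


def matches(pattern, subject):
    """Hypothetical helper: one entry point for byte and character data.

    The types of the operands, not the name of the function, decide
    whether "." means one byte or one character.
    """
    return re.fullmatch(pattern, subject) is not None


bytes_result = matches(rb"A.B", raw)   # byte pattern on bytes: no match
text_result = matches("A.B", text)     # text pattern on text: match

# Mixing the two kinds of operand is refused instead of silently guessing.
try:
    matches("A.B", raw)
    mixed_error = None
except TypeError as exc:
    mixed_error = exc

print(bytes_result, text_result, type(mixed_error).__name__)
```

No second mbmatch()-style function is needed here, which is the design
argument made above for keeping one operator and letting the strings carry
the information.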

> The problem is that as soon as you use a library routine that is utf-8
> aware, it sets the utf-8 flag on a string and problems start to result. If
> there was no utf-8 flag on the scalar strings to be set, then you could
> stay in byte world all the time, while still using unicode functionality
> where you needed it.

As I've already said, there's absolutely nothing preventing you from _not_
using the Unicode features of Perl at all. But then I'm just curious how you
would match accented characters against regexps, for example.


-- 
Egmont

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
