On Wed, Mar 28, 2007 at 07:49:57PM +0200, Egmont Koblinger wrote:
> matches or not _does_ depend on the character set that you use. It's not
> perl's flaw that it couldn't decide, it's impossible to decide in theory
> unless you know the charset.

It is perl's flaw. The LC_CTYPE category of the locale determines the
charset. This is how all sane languages work.
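
The lookup being described here can be sketched in a few lines of Python
(used purely for illustration; a C program would call setlocale(3) and
nl_langinfo(3) the same way on a POSIX system):

```python
import locale

# Adopt whatever locale the environment requests (LANG/LC_CTYPE/LC_ALL).
locale.setlocale(locale.LC_CTYPE, "")

# nl_langinfo(CODESET) names the charset the locale implies -- e.g.
# "UTF-8" in a modern UTF-8 locale, "ANSI_X3.4-1968" in the C locale.
codeset = locale.nl_langinfo(locale.CODESET)
print("LC_CTYPE charset:", codeset)
```

So the charset is not unknowable: any program that asks the locale gets an
answer.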

> > I don't care what the regex library does under the covers, and I
> > shouldn't have to care...
> 
> From a user's point of view, it's a reasonable expectation of any program:
> they should _just work_ without requiring any charset knowledge from me.
> 
> In an ideal world where no more than one character set (and one
> representation) is used, a developer could expect the same from any
> programming language or development environment. But our world is not ideal.
> There _are_ many character sets out there, and it's _your_ job, the
> programmer's job to tell the compiler/interpreter how to handle your bytes
> and to hide all these charset issues from the users. Therefore you have to
> be aware of the technical issues and have to be able to handle them.

I don't have to be aware of it in any other language. It just works.
Perl is being unnecessarily difficult here.

> Having a variable in your code that stores a sequence of bytes, without
> being able to tell what encoding is used there, is just like having a
> variable that stores the height of people without knowing whether it's
> measured in cm or meters or feet... The actions you may take are very
> limited (e.g. you can add two of them to calculate how tall they'd be if
> one stood on top of the other, though the answer would also lack a unit),
> but there are plenty of questions you cannot answer.

Nonsense. As long as all the length variables are in the SAME unit,
your program has absolutely no reason to care whatsoever exactly what
that unit is. Any unit is just as good as long as it's consistent. The
same goes for character sets. There is a well-defined native character
encoding, which should be UTF-8 on any modern system. When importing
data from foreign encodings, it should be converted. This is just the
same as if you stored all your lengths in a database. As long as
they're all consistent (e.g. all in meters) then you don't have to
grossly increase complexity and redundancy by storing a unit with each
value. Instead, you just convert foreign values when they're input,
and assume all local data is already in the correct form. The same
applies to character encoding.

> your application, and convert (if necessary) when you output them. If you
> must be able to handle arbitrary byte sequences, then (as Rich pointed out)
> you should keep the array of bytes but you might need to adjust a few
> functions that handle them, e.g. regexp matching might be a harder job
> in this case (for instance, what does a dot (any character) mean?).

Regex matching is also easier with bytes, even if your bytes represent
multibyte characters. An implementation that converts to UTF-32 or
similar is larger, slower, clunkier, and more error-prone.
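
Why literal matching on raw UTF-8 bytes is safe can be demonstrated with
Python's byte-string regexes (illustration only; the sample text is made up).
UTF-8 is self-synchronizing, so a multibyte literal can never match starting
in the middle of another character:

```python
import re

text = "xÁyÁz".encode("utf-8")           # b"x\xc3\x81y\xc3\x81z"
pat  = re.compile("Á".encode("utf-8"))   # the literal, as raw bytes

# Both occurrences are found without decoding anything:
print([m.start() for m in pat.finditer(text)])  # -> [1, 4]

# The one construct whose meaning changes is ".": in byte mode it
# matches a single *byte*, so it can land inside a multibyte character --
# the "what does a dot mean" question from the quoted text.
print(re.findall(b".", "Á".encode("utf-8")))    # two one-byte matches
```

Only the handful of constructs that count characters (".", character
classes, quantifier bounds) need any multibyte awareness; plain literals
work on bytes as-is.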

> > If it knows how to match "Á" to ".", then I don't have to know how it
> > goes about doing so.
> 
> Recently you asked why perl didn't just simply work with bytes. Now you talk
> about the "Á" letter. But you seem to forget about one very important step:
> how should perl know that your sequence of bytes represents "Á" and not some
> other letter(s)?

Because the system defines this as part of LC_CTYPE.
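
The ambiguity being raised, and why a charset resolves it, fits in a few
lines (Python for illustration; the two encodings are just examples):

```python
raw = b"\xc3\x81"                      # some bytes, from somewhere

print(raw.decode("utf-8"))             # Á       -- one character
print(repr(raw.decode("iso-8859-1")))  # 'Ã\x81' -- two characters

# The bytes alone cannot say which reading is correct; the charset
# named by the LC_CTYPE locale category is what disambiguates them.
```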

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
