On Thu, Mar 29, 2007 at 12:23:23PM -0400, SrinTuar wrote:

Hi,

> I just think that routines such as "regex" or "NFD" should be able to
> assume that the strings they are passed match the encoding of the
> current locale

This might be a reasonable decision when you design a new programming
language. When "hacking" Unicode support into an existing 8-bit programming
language, this approach would have broken backwards compatibility and caused
_lots_ of old Perl code to malfunction when running under a UTF-8 locale. It
would be as if the C folks had altered the language (its core libraries, to
be precise) so that strlen() counted characters. They didn't: strlen() still
counts bytes, and you need different functions if you want to count
characters.
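
Perl ended up with the same split. As a minimal sketch (the variable names
are mine, and it assumes the core Encode module), length() counts bytes on an
undecoded string and counts characters only after the string has been
explicitly decoded:

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 encoding of "AÁB": four bytes, three characters.
my $bytes = "\x41\xC3\x81\x42";

# Explicit decoding turns the byte string into a character string.
my $chars = decode('UTF-8', $bytes);

printf "undecoded: %d\n", length($bytes);   # counts bytes
printf "decoded:   %d\n", length($chars);   # counts characters
```

Old code that never decodes anything keeps seeing byte counts, which is
exactly the backwards compatibility I mean.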

> or failing that ask the programmer to explicitly qualify them as one of
> its supported encodings. I do not think the strings should have built in
> machinery that does this work behind the scenes implicitly.

If you are free to choose the character set you use, you need to tell the
regexp matching function which charset that is. (It's a reasonable decision
that the default is the charset of the current locale, but it has to be
overridable.) I think there are basically two ways to reach this goal.

1st: strings are just byte sequences, and you may pass the charset
information as external data.

2nd: strings are either forced to a fixed encoding (UTF-8 in Gtk+, UTF-16 in
Java) or carry meta-information about their encoding (utf8 flag in Perl).

If I understand you, you'd prefer the 1st solution. In my experience,
_usually_ the 2nd is the cleaner way, which is likely to lead to better
pieces of software and fewer bugs. An exception is when you want to display
strings that might contain invalid byte sequences and at the same time you
must keep those byte sequences intact. This may be the case for text
editors, file managers etc. I think this is only a small minority of
software.
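
To illustrate that exception, here is a small sketch. It assumes the core
Encode module with its default handling of malformed input, and the variable
names are made up for the example. Decoding replaces the invalid byte with
U+FFFD, so the original bytes can no longer be recovered from the decoded
string:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The byte 0xFF can never occur in valid UTF-8.
my $raw = "A\xFFB";

# With no CHECK argument, decode() substitutes U+FFFD for malformed input.
my $text = decode('UTF-8', $raw);

# Encoding the decoded string again does not give back the original bytes.
my $round_trip = encode('UTF-8', $text);

print $round_trip eq $raw ? "bytes preserved\n" : "bytes changed\n";
```

A file manager that must rename such a file can't work from $text alone; it
has to keep $raw around as well.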

Using the 1st approach, I still can't see how you'd imagine Perl working.
Let's go back to my earlier example. Suppose Perl reads a file's content
into a variable. This file contains 4 bytes, namely: 65 195 129 66. Then
you do this:

      print "Hooray\n" if $filecontents =~ m/A.B/;

Should it print Hooray or not if you run this program under a UTF-8 locale?

On one hand, when running with a Latin1 locale it didn't print it: there the
bytes 195 129 are two characters, and "." matches only one. So it mustn't
print Hooray, otherwise you break backwards compatibility.

On the other hand, we just encoded the string "AÁB" in UTF-8 since nowadays
we use UTF-8 everywhere, and of course everyone expects AÁB to match A.B.
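
Both expectations can be reproduced with the same four bytes. The sketch
below is only an illustration: it assumes the core Encode module, pack()
stands in for reading the file, and $as_bytes / $as_chars are names I made
up. The first match treats the data as bytes, the way old code does; the
second decodes it as UTF-8 first.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for the file's content: the bytes 65 195 129 66.
my $filecontents = pack('C*', 65, 195, 129, 66);

# Byte semantics: two bytes sit between "A" and "B", so "." cannot cover them.
my $as_bytes = ($filecontents =~ m/A.B/) ? 1 : 0;

# Character semantics: after decoding, 195 129 is the single character "Á".
my $decoded  = decode('UTF-8', $filecontents);
my $as_chars = ($decoded =~ m/A.B/) ? 1 : 0;

print "bytes: $as_bytes, characters: $as_chars\n";   # bytes: 0, characters: 1
```

The regexp is identical in both cases; only what the string is taken to mean
differs.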

How would you design Perl's Unicode support to overcome this contradiction?



-- 
Egmont

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
