2007/3/29, Egmont Koblinger <[EMAIL PROTECTED]>:
This might be a reasonable decision when you design a new programming
language. When "hacking" Unicode support into an existing 8-bit programming
language, this approach would have broken backwards compatibility and caused
_lots_ of old Perl code to malfunction when running under a UTF-8 locale. Just
as if the C folks had altered the language (actually its core libraries, to be
precise) so that strlen() counted characters. No, they kept strlen()
counting bytes; you need different functions if you want to count
characters.

Umm... bad example:
strlen() is supposed to count bytes. Nobody cares about the number of
Unicode codepoints, because that is **almost never** useful
information. It's about as informative as the parity of the string.
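
For what it's worth, here's a minimal Perl sketch of that distinction, a
byte count versus a codepoint count, using the four bytes 65 195 129 66
that come up later in this thread (assuming the core Encode module):

      use strict;
      use warnings;
      use Encode qw(decode);

      my $bytes = "\x41\xC3\x81\x42";        # the raw bytes 65 195 129 66
      my $chars = decode('UTF-8', $bytes);   # interpreted as UTF-8 this is "AÁB"

      print length($bytes), "\n";   # 4 -- byte length of the undecoded string
      print length($chars), "\n";   # 3 -- codepoint count after decoding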


If I understand you, you'd prefer the 1st solution. In my experience,
_usually_ the 2nd is the cleaner way, which is likely to lead to better
software and fewer bugs. An exception is when you want to display strings
that might contain invalid byte sequences and at the same time must keep
those byte sequences. This may be the case for text editors, file managers,
etc. I think this is only a small minority of software.

I would argue that it is the correct solution for all software, even software
that might be trivially simplified by having some things built into the
language.

It maintains the correct balance:
Think about it when it matters, don't think about it when it doesn't.


Using the 1st approach I still can't see how you'd imagine Perl to work.
Let's go back to my earlier example. Suppose Perl reads a file's contents
into a variable. This file contained 4 bytes, namely: 65 195 129 66. Then
you do this:

      print "Hooray\n" if $filecontents =~ m/A.B/;

Should it print Hooray or not if you run this program under a UTF-8 locale?

On one hand, when running with a Latin-1 locale it didn't print it. So it
mustn't print Hooray, otherwise you break backwards compatibility.

On the other hand, we just encoded the string "AÁB" in UTF-8 since nowadays
we use UTF-8 everywhere, and of course everyone expects AÁB to match A.B.

How would you design Perl's Unicode support to overcome this contradiction?

Under a Latin-1 locale it should not print.
Under a UTF-8 locale it should print.

If a person inputs invalid Latin-1, while telling everyone to expect
Latin-1, this is a perfectly acceptable case of garbage in resulting
in garbage out.

There is no contradiction, nor is there any backwards compatibility issue.
If someone opened such an unqualified UTF-8 file in a text editor while
in a Latin-1 environment, it should show up as the binary trash that
it is in that context. I don't see how this can be construed as a
compatibility problem in any way.
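
To make that concrete, here is a minimal sketch of the two outcomes, making
the decoding decision explicit in the code rather than switching on the
locale (assuming the core Encode module; $filecontents holds the raw bytes
65 195 129 66 from the example above):

      use strict;
      use warnings;
      use Encode qw(decode);

      my $filecontents = "\x41\xC3\x81\x42";   # raw bytes: 65 195 129 66

      # Treated as raw bytes (the Latin-1 view): two bytes sit between A and B,
      # so /A.B/ does not match and nothing is printed.
      print "Hooray\n" if $filecontents =~ m/A.B/;

      # Explicitly decoded as UTF-8: the string is the three characters "AÁB",
      # so /A.B/ matches and Hooray is printed.
      my $text = decode('UTF-8', $filecontents);
      print "Hooray\n" if $text =~ m/A.B/;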
