2007/3/29, Egmont Koblinger <[EMAIL PROTECTED]>:
This might be a reasonable decision when you design a new programming language. When "hacking" Unicode support into an existing 8-bit programming language, this approach would have broken backwards compatibility and caused _lots_ of old Perl code to malfunction when running under a UTF-8 locale. Just as if the C folks altered the language (actually its core libraries, to be precise) so that strlen() counted characters. No, they kept strlen() counting bytes; you need different functions if you want to count characters.
Umm... bad example: strlen() is supposed to count bytes. Nobody cares about the number of Unicode codepoints, because that is **almost never** useful information. It's about as informative as the parity of the string.
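The byte/codepoint distinction is easy to see in a language that exposes both kinds of string; here is a small Python illustration (Python chosen purely for brevity, not because the thread is about Python):

```python
# "AÁB" is three codepoints, but Á (U+00C1) encodes as two bytes in
# UTF-8, so the byte length and the codepoint count disagree.
s = "A\u00c1B"
encoded = s.encode("utf-8")

print(len(encoded))  # byte length, what C's strlen() would report: 4
print(len(s))        # codepoint count: 3
```

As the reply argues, the first number is the one that matters for buffers, file sizes and wire formats; the second rarely tells you anything useful on its own.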
If I understand you, you'd prefer the 1st solution. In my experience, the 2nd is _usually_ the cleaner way, likely to lead to better software and fewer bugs. An exception is when you want to display strings that might contain invalid byte sequences while at the same time preserving those byte sequences. This may be the case for text editors, file managers etc. I think this is only a small minority of software.
I would argue that it is the correct solution for all software, even software that might be trivially simplified by having some things built into the language. It maintains the correct balance: think about it when it matters, don't think about it when it doesn't.
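Even the editor/file-manager exception mentioned above fits a decode-at-the-boundary design if the decoder can round-trip invalid bytes. As one concrete (Python-specific, purely illustrative) technique, the "surrogateescape" error handler smuggles undecodable bytes through the character domain and restores them on encode:

```python
# 0xFF can never appear in valid UTF-8; surrogateescape maps it to a
# lone surrogate codepoint on decode and back to the original byte
# on encode, so nothing is lost.
raw = b"A\xffB"
text = raw.decode("utf-8", errors="surrogateescape")
roundtrip = text.encode("utf-8", errors="surrogateescape")
assert roundtrip == raw  # the invalid byte survived the round trip
```

So "must preserve invalid byte sequences" does not by itself force byte-oriented string handling throughout the program.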
Using the 1st approach I still can't see how you'd imagine Perl to work. Let's go back to my earlier example. Suppose Perl reads a file's content into a variable. This file contained 4 bytes, namely: 65 195 129 66. Then you do this: print "Hooray\n" if $filecontents =~ m/A.B/; Should it print Hooray or not if you run this program under a UTF-8 locale? On one hand, when running with a Latin1 locale it didn't print it, so it mustn't print Hooray, otherwise you break backwards compatibility. On the other hand, we just encoded the string "AÁB" in UTF-8 (since nowadays we use UTF-8 everywhere), and of course everyone expects AÁB to match A.B. How would you design Perl's Unicode support to overcome this contradiction?
Under a latin-1 locale it should not print. Under a utf-8 locale it should print. If a person inputs invalid latin-1 while telling everyone to expect latin-1, this is a perfectly acceptable case of garbage in resulting in garbage out. There is no contradiction, nor is there any backwards compatibility issue. If someone opened such an unqualified utf-8 file in a text editor while in a latin-1 environment, it should show up as the binary trash that it is in that context. I don't see how this can be construed as a compatibility problem in any way.
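Both behaviours can be reproduced side by side in any language that separates byte strings from character strings; here is a Python sketch of the 4-byte example (an analogy only, not a claim about Perl's actual internals):

```python
import re

data = bytes([65, 195, 129, 66])  # the file content from the example

# Byte semantics (the Latin1 view): '.' matches exactly one byte, but
# two bytes (195, 129) sit between A and B, so the pattern fails.
assert re.search(rb"A.B", data) is None

# Character semantics (the UTF-8 view): decode first, so the byte pair
# 195 129 becomes the single character Á and "AÁB" matches A.B.
assert re.search(r"A.B", data.decode("utf-8")) is not None
```

The match outcome is decided by which view of the data the program has at match time, which is exactly the decode-at-the-boundary point being argued.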