On Thu, Mar 29, 2007 at 12:23:23PM -0400, SrinTuar wrote:

Hi,

> I just think that routines such as "regex" or "NFD" should be able to
> assume that the strings they are passed match the encoding of the
> current locale

This might be a reasonable decision when you design a new programming
language. When "hacking" Unicode support into an existing 8-bit programming
language, this approach would have broken backwards compatibility and
caused _lots_ of old Perl code to malfunction when running under a UTF-8
locale. It would be just as if the C folks had altered the language (its
core libraries, to be precise) so that strlen() counted characters. They
didn't: strlen() still counts bytes, and you need different functions if
you want to count characters.

> or failing that ask the programmer to explicitly qualify them as one of
> its supported encodings. I do not think the strings should have built in
> machinery that does this work behind the scenes implicitly.

If you have the freedom to choose the character set you use, you need to
tell the regexp matching function which charset that is. (Defaulting to the
charset of the current locale is reasonable, but it has to be overridable.)

I think there are basically two ways to reach this goal.

1st: strings are just byte sequences, and you pass the charset information
as external data.

2nd: strings are either forced to a fixed encoding (UTF-8 in Gtk+, UTF-16
in Java) or carry meta-information about their encoding (the utf8 flag in
Perl).

If I understand you correctly, you'd prefer the 1st solution. In my
experience, the 2nd is _usually_ the cleaner way, and it is likely to lead
to better software with fewer bugs. The exception is when you want to
display strings that might contain invalid byte sequences and at the same
time must preserve those byte sequences. This may be the case for text
editors, file managers etc. I think this is only a small minority of
software.

With the 1st approach I still can't see how you imagine Perl would work.
Let's go back to my earlier example.
Suppose Perl reads a file's contents into a variable. The file contains
4 bytes, namely: 65 195 129 66. Then you do this:

    print "Hooray\n" if $filecontents =~ m/A.B/;

Should it print Hooray or not if you run this program under a UTF-8 locale?

On one hand, when running under a Latin1 locale it didn't print it, because
there "." matches only one of the two bytes between A and B. So it mustn't
print Hooray, otherwise you break backwards compatibility.

On the other hand, we have just encoded the string "AÁB" in UTF-8, since
nowadays we use UTF-8 everywhere, and of course everyone expects AÁB to
match A.B.

How would you design Perl's Unicode support to overcome this contradiction?

-- 
Egmont

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
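The example above can be tried directly. What follows is only a minimal sketch, not a statement of how Perl ought to behave. It assumes the four bytes are built with pack() rather than read from a file, and the helper name matches_a_dot_b is made up for illustration. decode() comes from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The four bytes from the example: 65 195 129 66.
# Assumption: built in memory here instead of being read from a file.
my $filecontents = pack('C*', 65, 195, 129, 66);

# Hypothetical helper: returns 1 if the string matches A.B, else 0.
sub matches_a_dot_b {
    my ($string) = @_;
    return $string =~ m/A.B/ ? 1 : 0;
}

# Byte semantics: "." consumes one byte (195), then "B" fails against 129.
my $byte_len = length($filecontents);
my $as_bytes = matches_a_dot_b($filecontents);

# Character semantics: after decoding, bytes 195 129 become the single
# character U+00C1, which "." consumes in one step.
my $decoded  = decode('UTF-8', $filecontents);
my $char_len = length($decoded);
my $as_chars = matches_a_dot_b($decoded);

printf "as bytes: length %d, match: %s\n", $byte_len, $as_bytes ? 'yes' : 'no';
printf "as chars: length %d, match: %s\n", $char_len, $as_chars ? 'yes' : 'no';
# prints:
#   as bytes: length 4, match: no
#   as chars: length 3, match: yes
```

The byte string is what a program sees when it treats its input as raw octets, as old code under a Latin1 locale does. The decoded string is what it sees once the input is interpreted as UTF-8. The same pattern gives different answers for the two, which is exactly the contradiction in question.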