On Thu, Mar 29, 2007 at 07:15:37PM +0200, Egmont Koblinger wrote:
> > or failing that ask the programmer to explicitly qualify them as one of
> > its supported encodings. I do not think the strings should have built in
> > machinery that does this work behind the scenes implicitly.
>
> If you have the freedom of choosing the character set you use, you need to

You don't. An application should assume that there is no such freedom; the
character encoding is dictated by the user or the host implementation, and
should on all modern systems be UTF-8 (but don't assume this). Any text
that's encoded with another scheme needs to be treated as non-text (binary)
data (i.e. not suitable for use with regex). It could be converted (to the
dictated encoding) or left as binary data, depending on the application.

> tell the regexp matching function what charset you use. (It's a reasonable
> decision that the default is the charset of the current locale, but it has
> to be overridable.) There are basically two ways I think to reach this goal.

You can get by just fine without it being overridable. For instance, mutt
does just fine using the POSIX regex routines, which do not have any way of
specifying a character encoding.
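
For illustration only (a sketch, not mutt's actual code, and assuming a
multibyte-aware POSIX regex implementation such as glibc's): regcomp() and
regexec() take no charset argument at all; they simply interpret both the
pattern and the text according to the current LC_CTYPE locale.

    #include <locale.h>
    #include <regex.h>
    #include <stdio.h>

    int main(void)
    {
        /* "AÁB" encoded in UTF-8: bytes 65 195 129 66. */
        const char *text = "A\xC3\x81" "B";
        regex_t re;

        /* Adopt whatever encoding the user's environment dictates. */
        setlocale(LC_ALL, "");

        if (regcomp(&re, "A.B", REG_EXTENDED) != 0)
            return 1;

        /* Under a UTF-8 locale '.' matches the single character Á, so
         * this prints Hooray; under a single-byte locale the same four
         * bytes do not match A.B at all. */
        if (regexec(&re, text, 0, NULL, 0) == 0)
            printf("Hooray\n");

        regfree(&re);
        return 0;
    }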

> 1st: strings are just byte sequences, and you may pass the charset
> information as external data.
>
> 2nd: strings are either forced to a fixed encoding (UTF-8 in Gtk+, UCS-16 in
> Java) or carry meta-information about their encoding (utf8 flag in Perl).

Of these (neither of which is necessary), #1 is the more unix-like and #2
is the mac/windows approach. Unix has a strong history of intentionally
NOT assigning types to data files etc., but instead treating everything as
streams of bytes. This leads to very powerful combinations of tools where
the same data (byte sequence) is interpreted in different ways by different
tools/contexts. I am a mathematician, and I must say it's comparable to
what we do when we allow ourselves to think of an operator on a linear
space both as a map between linear spaces and as an element of a larger
linear space of operators (and possibly also in many other ways) at the
same time.

On the other hand, DOS/Windows/Mac have a strong history of assigning
fixed types to data files. On DOS/Windows it's mostly just extensions, but
Mac goes much farther with the 'resource fork', not only typing the file
but also associating it with a creating application. This sort of
mechanism is, in my opinion, deceptively convenient to ignorant new users,
but it also fosters an unsophisticated, uneducated, less powerful way of
thinking about data.

Of course in either case there are ways to override things and get around
the limitations. Even on unix, files tend to have suffixes to identify the
'type' a user will most often want to consider the file as, and likewise
on Mac you can edit the resource forks or ignore them. Still, I think the
approach you take says a lot about your system philosophy.

> Using the 1st approach I still can't see how you'd imagine Perl to work.
> Let's go back to my earlier example. Suppose perl reads a file's content
> into a variable. This file contained 4 bytes, namely: 65 195 129 66. Then
> you do this:
>
> print "Hooray\n" if $filecontents =~ m/A.B/;
>
> Should it print Hooray or not if you run this program under a UTF-8 locale?

Of course.

> On one hand, when running with a Latin1 locale it didn't print it. So it
> mustn't print Hooray otherwise you break backwards compatibility.

No, the program still does the same thing if run in a Latin-1 locale,
regardless of your perl version. There's no reason to believe that text
processing code should behave byte-identically under different locales.

> On the other hand, we just encoded the string "AÁB" in UTF-8 since nowadays
> we use UTF-8 everywhere, and of course everyone expects AÁB to match A.B.

So you need to make your data and your locale consistent. If you want to
set the locale to UTF-8, the string "AÁB" needs to be in UTF-8. If you
want to use the legacy Latin-1 data, your locale needs to be set to
something Latin-1-based.
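
For illustration only (again just a sketch, assuming a glibc-style
iconv(3) and ISO-8859-1 as the legacy encoding): converting the legacy
bytes to the encoding the locale dictates, before any matching is done,
might look like this.

    #include <iconv.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* "AÁB" as legacy Latin-1 data: bytes 65 193 66. */
        char latin1[] = "A\xC1" "B";
        char utf8[16];
        char *in = latin1, *out = utf8;
        size_t inleft = strlen(latin1), outleft = sizeof utf8 - 1;
        iconv_t cd;

        cd = iconv_open("UTF-8", "ISO-8859-1");
        if (cd == (iconv_t)-1)
            return 1;

        /* Convert the data to the encoding the environment dictates. */
        if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1)
            return 1;
        *out = '\0';
        iconv_close(cd);

        /* utf8 now holds bytes 65 195 129 66, the form that A.B (or
         * Perl's m/A.B/) will match under a UTF-8 locale. */
        printf("%s\n", utf8);
        return 0;
    }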

> How would you design Perl's Unicode support to overcome this contradiction?

I don't see any contradiction here. The code does exactly what it's
supposed to in either case, as long as your locale and data are
consistent.

Rich

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
