On Wed, Mar 28, 2007 at 02:35:32PM -0400, Rich Felker wrote:

> > matches or not _does_ depend on the character set that you use. It's not
> > perl's flaw that it couldn't decide, it's impossible to decide in theory
> > unless you know the charset.
>
> It is perl's flaw. The LC_CTYPE category of the locale determines the
> charset. This is how all sane languages work.

LC_CTYPE determines the system charset. This is the charset used by default
when reading from or writing to a terminal or text files; it is the charset
you expect messages coming from glibc to be encoded in; and so on.

But this is not necessarily the charset you want your application to work
with. Think of Gtk+-2, for example: internally it always uses UTF-8, no
matter what your locale is. So it _has_ to tell every external regexp routine
(if it uses any) to work with UTF-8, not with the charset implied by
LC_CTYPE. And think of any web browser, mail client and so on: they have to
cope with the charset that a particular web page or message uses, yet again
independently of the system locale.

So, to stay with our example of a fictional regexp matching library: if this
library insists on assuming that the strings are encoded according to
LC_CTYPE, then it's quite hard to use it correctly in such circumstances.
(You might need to write a wrapper that alters the locale temporarily -- but
could you tell me how to find a locale that uses one particular charset?)
If the charset the regexp library expects _defaults_ to LC_CTYPE but is
overridable, then it's much better. And for libraries such as glib2/gtk2,
which force the use of UTF-8 internally, it's of course perfectly okay to
implement a UTF-8-only regexp matching function.

> I don't have to be aware of it in any other language. It just works.

Show me your code that you think "just works" and I'll show you where you're
wrong. :-)

> Perl is being unnecessarily difficult here.

You forget one very important thing: compatibility. In the old days Perl
used 8-bit strings, and many people wrote many Perl programs that handled
8-bit (most likely iso-8859-1) data. These programs must continue to work
correctly with newer Perls. This implies that Perl mustn't assume the UTF-8
charset for the data flows (even if your locale says so), since in that case
it would produce different output.

> Nonsense.
> As long as all the length variables are in the SAME unit,
> your program has absolutely no reason to care whatsoever exactly what
> that unit it. Any unit is just as good as long as it's consistent.

If you don't know what unit is used, then you're unable to answer questions
such as whether that man is most likely healthy, or whether he's extremely
tall or extremely short.

If you don't know what unit is used, how do you fill in your structures from
an external data source? What if you are supposed to store cm but the data
arrives in inches? How would you know that you need to convert? What if
multiple external data sources use different units? If you ignore the whole
problem you'll end up with different units in your database, where even
adding two numbers doesn't make any sense - just as it doesn't make any sense
to simply concatenate two byte sequences that represent text in different
encodings.

I guess you've heard several stories about million (billion?) dollar
projects failing due to such stupid mistakes - one developer sending the data
in centimeters, the other expecting it to arrive in inches.

-- 
Egmont

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
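
As a small illustration of the two claims in the message (that whether a
regexp matches depends on the charset, and that concatenating byte sequences
in different encodings is meaningless), here is a minimal sketch in Python
rather than Perl. The sample bytes, the pattern and all variable names are
assumptions chosen for the example; they do not come from the thread.

```python
import re

# Two bytes as they might arrive from a file or socket, with no charset label.
raw = b"\xc3\xa9"

# Pattern meaning "exactly one character".
one_char = re.compile(r"^.$")

as_utf8 = raw.decode("utf-8")         # one character: e with acute accent
as_latin1 = raw.decode("iso-8859-1")  # two characters: A with tilde, copyright sign

# The same bytes match or fail depending only on the charset we assumed.
matches_utf8 = one_char.match(as_utf8) is not None
matches_latin1 = one_char.match(as_latin1) is not None
print(matches_utf8, matches_latin1)

# Concatenating text encoded in two different charsets gives a byte string
# that is not valid UTF-8 and decodes to the wrong text as ISO-8859-1.
mixed = "\u00e9".encode("utf-8") + "\u00e9".encode("iso-8859-1")
try:
    mixed.decode("utf-8")
    mixed_is_valid_utf8 = True
except UnicodeDecodeError:
    mixed_is_valid_utf8 = False
print(mixed_is_valid_utf8, len(mixed.decode("iso-8859-1")))
```

Nothing in the bytes themselves says which decoding is the right one; that
information has to come from outside (the locale, a protocol header, or the
application's own convention).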