And after reading much of the earlier discussion, I must say that, while I love UTF-8 dearly, it's usually the wrong abstraction level to be working at for most text-processing jobs. Ordinary people not steeped in systems programming and C culture just want to think in graphemes, and so that's what the standard dialect of Perl 6 will default to. A small nudge will push it into supporting the graphemes of a particular human language. The people who want to think more like a computer will also find ways to do that. It's just not the sweet spot we're aiming for.
Very interesting, though I must admit I'm sad to hear that.

Over the years I have come to see the sweet spot for string processing as this definition of a UTF-8 "string": a null-terminated series of bytes, some of which are parts of valid UTF-8 sequences, and others of which are treated as individual binary values. Effectively, it's a bare-minimum update of what we had with ASCII. The only time I want to depart from this paradigm is when I have to; in general, I want to avoid conversion and keep my strings in this format as much as possible. (This is much like the way tools such as readline, vim, and curses handle UTF-8 strings.)

Most code should not have to care which parts are valid and which are not. If it calls a function that requires a specific level of validation, that function should be free to complain when the input doesn't meet it. But I don't see a reason why a plain "print" should ever need to care or complain about what it's printing. All it has to do is catenate bytes and dump them out; it shouldn't act as a bouncer deciding what is kosher to print. (A rough sketch of this byte-pump approach is appended below.)

I think a regex engine should, for example, match one binary byte to a "." the same way it would match a valid sequence of Unicode characters and combining characters as a single grapheme. This is a best effort to work with the string as provided, and someone who does not want such behavior simply would not run regexes over such strings. (Also sketched below.)

When a program needs to take in data from various encodings, it should be the program's job to convert that data into its locale's native encoding (by reading MIME headers or whatever mechanism applies). I don't think a programming language should have built-ins that track the status of a string; that strikes me as an attempt to DWIM (do what I mean) rather than DWIS (do what I say). Taking that trend to its logical conclusion, I would not want every scalar value to track every possible kind of validation that has happened to a string: "UTF-8, validated NFD, Turkish + Korean". If someone wants to do language-specific case folding, they can either default to the locale's language and encoding or specify which ones they want to use. If someone wants to make sure their string is valid UTF-8 in NFKC, they can pass it to a validation routine such as Unicode::Normalize::NFKC. But the input and output of that routine should be a plain old scalar, with no special knowledge of what has happened to it.

This minimal approach is much like what happens in C/C++, and I don't see any reason why a scripting language should do more than it is asked to, potentially doing the wrong thing despite its best intentions. Admittedly, in perl 5 these are trivial annoyances with readily available workarounds, and from your post I guess I can assume that perl 6 will be similar.

On a separate topic: Java seems to have a much worse problem. Forcing conversion to UTF-16 causes you to lose information, since UTF-16 cannot represent all the possible invalid UTF-8 sequences. It forces you to treat your strings as binary blobs and lose access to all the functions that operate on strings, and/or to take a performance hit for a conversion that isn't actually needed. (If the design goal of Java was to force UTF-16 on the world, it is unlikely to succeed, as UTF-8 has largely usurped its place.)
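To make the byte-pump idea concrete, here is a rough perl 5 sketch (not a proposal for perl 6's API, and the details are only illustrative): no encoding layer is put on STDIN or STDOUT, and print passes every byte through whether or not it forms valid UTF-8.

    #!/usr/bin/perl
    # Bytes in, bytes out: nothing is decoded or validated, and print is
    # just a byte pump. Invalid UTF-8 sequences pass through untouched.
    use strict;
    use warnings;

    binmode STDIN,  ':raw';    # no encoding layer on input
    binmode STDOUT, ':raw';    # ...or on output

    my $data = do { local $/; <STDIN> };   # slurp whatever bytes arrive

    print "copy: ", $data;     # catenate and dump; no complaints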
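And this is roughly what the two regex views look like in perl 5 today. It makes you choose per string between the byte view and the character view rather than mixing them in one string as I'd like, so treat it as an approximation of the behaviour described above, not an exact match.

    #!/usr/bin/perl
    # Two views of a string in perl 5: bytes versus characters/graphemes.
    use strict;
    use warnings;

    # Byte view: an undecoded string containing a lone 0xFF, not valid UTF-8.
    my $bytes = "a\xFFb";
    my $byte_matches = () = $bytes =~ /./sg;   # 3: '.' matches one byte each
    print "byte view: $byte_matches matches\n";

    # Character view: "e" + COMBINING ACUTE ACCENT is one grapheme cluster.
    my $chars = "e\x{0301}b";
    my $cp_matches = () = $chars =~ /./sg;     # 3 code points
    my $g_matches  = () = $chars =~ /\X/g;     # 2 graphemes (\X = cluster)
    print "char view: $cp_matches code points, $g_matches graphemes\n";

Getting the mixed behaviour, raw bytes and graphemes living in the same string, is exactly the part perl 5 doesn't hand you for free.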
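The normalization case is already close to what I want in perl 5: Unicode::Normalize gives you plain functions whose input and output are ordinary scalars, with no hidden "already normalized" state attached. A small sketch:

    #!/usr/bin/perl
    # Plain old scalar in, plain old scalar out: normalization as an
    # explicit call, with no bookkeeping attached to the value.
    use strict;
    use warnings;
    use Unicode::Normalize qw(NFKC);

    my $decomposed = "e\x{0301}";        # e + COMBINING ACUTE ACCENT
    my $composed   = NFKC($decomposed);  # U+00E9, returned as an ordinary scalar

    printf "before: %d code points, after NFKC: %d code point(s)\n",
        length $decomposed, length $composed;

    # Nothing remembers that $composed was normalized; calling NFKC again
    # is harmless, since normalization is idempotent.
    $composed = NFKC($composed);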
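As for the UTF-16 point, here is the information loss simulated in perl 5 rather than Java. This is only an analogy: Encode's decode/encode stands in for what a UTF-16-based runtime has to do on input, but the effect is the same: once a byte string containing an invalid UTF-8 sequence is forced through decoding, the original bytes cannot be recovered.

    #!/usr/bin/perl
    # Forcing invalid UTF-8 through a decode/encode round trip loses data:
    # the malformed bytes are replaced with U+FFFD and never come back.
    use strict;
    use warnings;
    use Encode qw(decode encode);

    my $original = "a\xFF\xFEb";               # not valid UTF-8
    my $chars    = decode('UTF-8', $original); # invalid bytes -> U+FFFD
    my $round    = encode('UTF-8', $chars);

    if ($round eq $original) {
        print "round trip preserved the bytes\n";
    } else {
        print "the original bytes were lost\n";   # this is what happens
    }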
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/