On Thu, Mar 29, 2007 at 11:53:01AM -0700, Larry Wall wrote:
> : I think a regex engine should, for example, match one binary byte to a
> : "." the same way it would match a valid sequence of unicode characters
> : and composing characters as a singe grapheme. This is a best effort to
> : work with the string as provided, and someone who does not want such
> : behavior would not run regex's over such strings.
>
> How can it possibly know whether to match a binary byte or a grapheme
> if you've mixed UTF-8 and binary in the same string?
I agree that SrinTuar’s idea of matching . to a byte is insane. While
an NFA/DFA is sometimes a nice tool even with binary data, using regex
character syntax for it is maybe a bit dubious. And surely, like you
said, the two should not be mixed in the same string.

With that in mind, though, I think your emphasis on graphemes is also a
bit misplaced. The idea of a “grapheme” as the fundamental unit of
editing, instead of a character, is pretty much only appropriate when
writing Latin-, Greek-, and Cyrillic-based languages in NFD. In most
Indian scripts, whole syllables get counted as “graphemes” for visual
presentation, yet users still expect to be able to edit, search, etc.
individual characters. Even if you’re just considering a “grapheme” to
be a base character followed by a sequence of combining marks
(Mn/Me/Cf), it’s inappropriate for Tibetan, where letters stack
vertically (via combining forms of class Mn) and yet each is considered
a letter for the purposes of editing, character counting, etc. A
similar situation applies to Hangul Jamo.

IMO, a regex pattern to match whole graphemes could be useful, but I
suspect character matching is almost always what’s wanted, except for
NFD with European scripts.

> it might be. And null termination has turned out to be a terrible
> workaround (in security terms as well as efficiency) for not knowing

Null termination is not the security problem. Broken languages that
DON'T use null-termination are the security problem, particularly when
they are mixed with C.

> the length. C's head-in-the-sand approach to string processing is
> directly responsible for many of the security breaks on the net.

No, the incompetence of people writing C code is what’s directly
responsible for them. C’s approach might be indirectly responsible, for
being difficult or something, but certainly not directly. There are
examples of real-world C programs which are absolutely secure, such as
vsftpd.
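To make the grapheme-versus-character point above concrete, here is a
rough sketch in Python (not a language used elsewhere in this thread).
The helper naive_clusters is hypothetical and implements only the
simple "base character followed by Mn/Me/Cf" rule mentioned above, not
the full Unicode grapheme cluster algorithm. It shows that this rule
treats a stacked Tibetan sequence as a single unit, even though it
consists of three separately editable letters.

```python
import unicodedata

# Assumption: "cluster" means the naive rule from the message, i.e. one
# base character followed by any run of Mn/Me/Cf characters.
COMBINING_CATEGORIES = {"Mn", "Me", "Cf"}


def naive_clusters(text):
    """Group each base character with the combining characters after it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in COMBINING_CATEGORIES:
            clusters[-1] += ch      # attach to the preceding base character
        else:
            clusters.append(ch)     # start a new cluster
    return clusters


# Latin: U+00E9 decomposed by NFD into "e" + U+0301 COMBINING ACUTE ACCENT.
latin = unicodedata.normalize("NFD", "\u00e9")

# Tibetan: SA + SUBJOINED GA + SUBJOINED RA (the subjoined forms are Mn).
tibetan = "\u0f66\u0f92\u0fb2"

print(len(latin), len(naive_clusters(latin)))      # 2 code points, 1 cluster
print(len(tibetan), len(naive_clusters(tibetan)))  # 3 code points, 1 cluster
```

For the Latin case, one cluster is what a user would call one letter.
For the Tibetan case, the single cluster hides three letters, which is
why matching on characters is usually the more useful default there.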
> It's just my gut-level feeling that traditional world of C, Unix,
> locales, etc. simply does not provide appropriate abstractions to deal
> with internationalization. Yes, you can get there if you throw enough
> libraries and random functions and macros and pipes and filters at it,
> but the basic abstractions leak like a seive. It's time to clean it
> all up.

Mutt works right without any of that.. It’s as close as you’ll find to
the pinnacle of correct C application coding.

> I don't think it's Perl 6's place to force either utf-8 or utf-16 or
> utf-whatever on anyone. If the abstractions are sane and properly
> encapsulated, the implementors can do whatever makes sense behind
> the scenes, and that very likely means different things in different
> contexts.

But the corner case of handling “text” data with malformed sequences in
it will be very difficult and painful, no? With C and byte strings it’s
very easy..

> I try hard not to be a linguistic imperialist (when I try at all). :-)

☺ ☻ ☺ ☻ (happy multiracial smileys)

> Anyway, if anyone wants to give me specific feedback on the current
> design of Perl 6, that'd be cool. Though perl6-language@perl.org would
> probably be a better forum for that.

The only feedback I’d like to give is to ask that, if the nasty warning
messages are kept, they be applied to characters in the range 128-255
as well, not just characters >255. Also.. is there a clean way to deal
with the issue (aside from just disabling warnings) on a perl build
without PerlIO (and thus no working binmode)?

Finally, I must admit I’m not at all a Perl fan, so maybe take what I
say with a grain of salt. I just wish Perl scripts I obtain from others
would work more comfortably, without making me think about the
nonstandard (compared to the rest of a unix system) treatment they’re
giving to character encoding.

Rich

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/