On Thu, Mar 29, 2007 at 11:53:01AM -0700, Larry Wall wrote:
> : I think a regex engine should, for example, match one binary byte to a
> : "." the same way it would match a valid sequence of unicode characters
> : and composing characters as a single grapheme. This is a best effort to
> : work with the string as provided, and someone who does not want such
> : behavior would not run regex's over such strings.
> 
> How can it possibly know whether to match a binary byte or a grapheme
> if you've mixed UTF-8 and binary in the same string?

I agree that SrinTuar’s idea of matching . to a byte is insane. While
NFA/DFA is sometimes a nice tool even with binary data, using regex
character syntax for it is maybe a bit dubious. And surely, like you
said, they should not be mixed in the same string.
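
To make the ambiguity concrete, here is a minimal sketch in Python (used
only for illustration; it is not from this thread). The same two bytes
are one character or two opaque bytes depending on which kind of
pattern is applied, and nothing in the data itself says which is meant.

```python
import re

raw = b"\xc3\xa9"  # valid UTF-8 for U+00E9, or simply two arbitrary bytes

# Bytes pattern: "." consumes exactly one byte.
as_bytes = re.findall(rb".", raw, re.DOTALL)

# Text pattern: "." consumes one code point of the decoded string.
as_text = re.findall(r".", raw.decode("utf-8"), re.DOTALL)

print(len(as_bytes), len(as_text))
```

The caller has to pick one view for the whole string up front, which is
why mixing the two in one string leaves the engine with nothing to go on.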

With that in mind, though, I think your emphasis on graphemes is also
a bit misplaced. The idea of a “grapheme” as the fundamental unit of
editing, instead of a character, is pretty much only appropriate when
writing Latin, Greek, and Cyrillic based languages with NFD. In most
Indian scripts, whole syllables get counted as “graphemes” for visual
presentation, yet users still expect to be able to edit, search, etc.
individual characters.

Even if you’re just considering a “grapheme” to be a base character
followed by a sequence of combining marks (Mn/Me/Cf), it’s
inappropriate for Tibetan where letters stack vertically (via
combining forms of class Mn) and yet each is considered a letter for
the purposes of editing, character counting, etc. A similar situation
applies for Hangul Jamo.
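
A rough sketch of that "base plus combining marks" definition, in
Python for illustration. The names naive_clusters and MARK_CATEGORIES
are hypothetical, and this is deliberately the simplistic rule
described above, not the full Unicode segmentation algorithm. It shows
a Tibetan stack being swallowed as one unit even though it is two
letters.

```python
import unicodedata

# Assumption: the general categories named in the paragraph above.
MARK_CATEGORIES = {"Mn", "Me", "Cf"}

def naive_clusters(text):
    """Group each base character with the marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in MARK_CATEGORIES:
            clusters[-1] += ch      # attach the mark to the previous base
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

latin = "e\u0301"          # e + COMBINING ACUTE ACCENT (NFD)
tibetan = "\u0f66\u0f90"   # TIBETAN LETTER SA + SUBJOINED LETTER KA (Mn)

print(len(naive_clusters(latin)), len(latin))
print(len(naive_clusters(tibetan)), len(tibetan))
```

Both strings come out as a single cluster of two code points. For the
Latin case that matches what a user would call one letter; for the
Tibetan case it hides a letter the user expects to edit separately.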

IMO, a regex pattern to match whole graphemes could be useful, but I
suspect character matching is almost always what’s wanted except for
NFD with European scripts.
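
As a small illustration of the NFD case (Python again, purely as an
example of a character-matching engine): "." matches one character, so
a decomposed accented letter needs either normalization first or a
pattern that explicitly allows trailing marks.

```python
import re
import unicodedata

decomposed = "e\u0301"                               # NFD form, two characters
composed = unicodedata.normalize("NFC", decomposed)  # U+00E9, one character

# Character matching: "." is exactly one character.
dot_on_nfd = re.fullmatch(r".", decomposed)
dot_on_nfc = re.fullmatch(r".", composed)

# An explicit "base plus acute accent" pattern covers the NFD form.
base_plus_mark = re.fullmatch(r".\u0301*", decomposed)

print(dot_on_nfd is None, dot_on_nfc is not None, base_plus_mark is not None)
```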

> it might be.  And null termination has turned out to be a terrible
> workaround (in security terms as well as efficiency) for not knowing

Null termination is not the security problem. Broken languages that
DON'T use null-termination are the security problem, particularly
mixing them with C.
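
One way to read that claim is the embedded-NUL mismatch: a
length-counted string may legally contain a zero byte, and the moment
it crosses into C code the two sides disagree about where it ends. A
minimal sketch using Python's ctypes (the payload is made up for
illustration):

```python
import ctypes

counted = b"user\x00admin"   # fine in a length-counted string

# Copy the bytes into a C char buffer (a terminating NUL is appended).
buf = ctypes.create_string_buffer(counted)

c_view = buf.value                    # NUL-terminated view, as C sees it
full_view = buf.raw[:len(counted)]    # what the counted side stored

print(c_view, full_view)
```

The counted side validated ten bytes; the C side sees four. Any check
done on one view and acted on through the other is a potential hole.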

> the length.  C's head-in-the-sand approach to string processing is
> directly responsible for many of the security breaks on the net.

No, the incompetence of people writing C code is what’s directly
responsible for them. C’s approach might be indirectly responsible,
for being difficult or something, but certainly not directly. There
are examples of real-world C programs which are absolutely secure,
such as vsftpd.

> It's just my gut-level feeling that traditional world of C, Unix,
> locales, etc. simply does not provide appropriate abstractions to deal
> with internationalization.  Yes, you can get there if you throw enough
> libraries and random functions and macros and pipes and filters at it,
> but the basic abstractions leak like a sieve.  It's time to clean it
> all up.

Mutt works right without any of that. It’s as close as you’ll find to
the pinnacle of correct C application coding.

> I don't think it's Perl 6's place to force either utf-8 or utf-16 or
> utf-whatever on anyone.  If the abstractions are sane and properly
> encapsulated, the implementors can do whatever makes sense behind
> the scenes, and that very likely means different things in different
> contexts.

But the corner-case of handling “text” data with malformed sequences
in it will be very difficult and painful, no? With C and byte strings
it’s very easy.
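
A sketch of the contrast, in Python for illustration (the sample data is
invented). With a byte string, malformed sequences are just bytes and
ordinary operations keep working; with a strict text abstraction the
same input is rejected unless the bad bytes are smuggled through
somehow.

```python
data = b"caf\xc3\xa9 \xff\xfe tail"   # valid UTF-8, then two invalid bytes

# Byte-string view: splitting and searching work regardless of validity.
fields = data.split(b" ")
has_tail = data.endswith(b"tail")

# Text view: strict decoding refuses the input outright.
try:
    data.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# One escape hatch: carry the bad bytes through as lone surrogates.
text = data.decode("utf-8", errors="surrogateescape")
round_trip = text.encode("utf-8", errors="surrogateescape")

print(strict_ok, round_trip == data)
```

The round trip does preserve the original bytes, but every piece of
code that touches the decoded text has to know about the convention.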

> I try hard not to be a linguistic imperialist (when I try at all).  :-)

☺ ☻ ☺ ☻    (happy multiracial smileys)

> Anyway, if anyone wants to give me specific feedback on the current
> design of Perl 6, that'd be cool.  Though perl6-language@perl.org would
> probably be a better forum for that.

The only feedback I’d like to give is to ask that if the nasty warning
messages are kept, they should be applied to characters in the range
128-255 as well, not just characters >255.
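
Why the 128-255 range matters can be sketched as follows (Python for
illustration, not Perl's actual code path; the assumption is that
characters up to 255 are written out as single Latin-1 bytes with no
warning, while only characters above 255 trigger one):

```python
low = "\u00e9"    # a character in the range 128-255
high = "\u263a"   # a character above 255

low_bytes = low.encode("latin-1")   # fits in one byte, so nothing complains

try:
    high.encode("latin-1")
    high_fits = True
except UnicodeEncodeError:
    high_fits = False               # this is the case that gets noticed

# On a UTF-8 system the silently written byte is itself malformed.
try:
    low_bytes.decode("utf-8")
    low_is_valid_utf8 = True
except UnicodeDecodeError:
    low_is_valid_utf8 = False

print(low_bytes, high_fits, low_is_valid_utf8)
```

So the quiet case is the one that corrupts output on a UTF-8 system,
which is why the warning ought to cover it too.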

Also, is there a clean way to deal with the issue (aside from just
disabling warnings) on a perl build without PerlIO (and thus no
working binmode)?

Finally, I must admit I’m not at all a Perl fan, so maybe take what I
say with a grain of salt. I just wish Perl scripts I obtain from
others would work more comfortably without making me have to think
about the nonstandard (compared to the rest of a unix system)
treatment they’re giving to character encoding.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
