On Oct 26, 2013, at 5:39 , Bjoern Hoehrmann <[email protected]> wrote:
> * Norbert Lindenberg wrote:
>> On Oct 25, 2013, at 18:35 , Jason Orendorff <[email protected]>
>> wrote:
>>
>>> UTF-16 is designed so that you can search based on code units
>>> alone, without computing boundaries. RegExp searches fall in this
>>> category.
>>
>> Not if the RegExp is case insensitive, or uses a character class, or
>> ".", or a quantifier - these all require looking at code points rather
>> than UTF-16 code units in order to support the full Unicode character
>> set.
>
> If you have a regular expression over an alphabet like "Unicode scalar
> values" it is easy to turn it into an equivalent regular expression over
> an alphabet like "UTF-16 code units". I have written a Perl module that
> does it for UTF-8, <http://search.cpan.org/dist/Unicode-SetAutomaton/>;
> Russ Cox's http://swtch.com/~rsc/regexp/regexp3.html#step3 is a popular
> implementation. In effect it is still as though the implementation used
> Unicode scalar values, but that would be true of any implementation. It
> is much harder to implement something like this for other encodings like
> UTF-7 and Punycode.
>
> It is useful to keep in mind features like character classes are just
> syntactic sugar and can be decomposed into regular expression primitives
> like a choice listing each member of the character class as literal. The
> `.` is just a large character class, and flags like //i just transform
> parts of an expression where /a/i becomes something more like /a|A/.

OK, if Jason's comment was meant to say that RegExp searches specified
based on code points can be implemented through an equivalent search based
on code units, then that's correct. I was assuming that we're discussing
API design, and requiring developers to provide those equivalent UTF-16
based regular expressions (as the current RegExp design does) is a recipe
for breakage whenever supplementary characters are involved.
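
To illustrate the kind of breakage meant here, a minimal sketch (not from
the original thread). It assumes an engine that offers a code point mode
through the `u` flag, which was only standardized later in ES2015; the
variable names are made up for illustration, and the hand-written pattern
is just one possible code-unit equivalent of "any single code point".

```javascript
// U+1F600 is a supplementary character: one code point, but two
// UTF-16 code units (a surrogate pair).
const s = "\uD83D\uDE00";
console.log(s.length); // 2

// Code unit semantics: "." matches exactly one code unit, so the
// pattern "exactly one character" fails on a supplementary character.
const unitDot = /^.$/;
console.log(unitDot.test(s)); // false

// What a developer has to write by hand to get "one code point" out of
// a code-unit based engine: a surrogate pair, or else any one code unit.
const handWritten = /^(?:[\uD800-\uDBFF][\uDC00-\uDFFF]|[\s\S])$/;
console.log(handWritten.test(s)); // true

// Assumption: code point mode via the `u` flag. Here the engine does
// the equivalent translation itself.
const codePointDot = /^.$/u;
console.log(codePointDot.test(s)); // true
```

The engine can always perform this translation internally, as Bjoern
describes; the sketch only shows what happens when the translation is left
to the developer.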

Norbert

