* Norbert Lindenberg wrote:
>On Oct 25, 2013, at 18:35 , Jason Orendorff <[email protected]> wrote:
>
>> UTF-16 is designed so that you can search based on code units
>> alone, without computing boundaries. RegExp searches fall in this
>> category.
>
>Not if the RegExp is case insensitive, or uses a character class, or ".", or a
>quantifier - these all require looking at code points rather than UTF-16 code
>units in order to support the full Unicode character set.

If you have a regular expression over an alphabet like "Unicode scalar
values", it is easy to turn it into an equivalent regular expression over
an alphabet like "UTF-16 code units". I have written a Perl module that
does it for UTF-8, <http://search.cpan.org/dist/Unicode-SetAutomaton/>;
Russ Cox's <http://swtch.com/~rsc/regexp/regexp3.html#step3> describes a
popular implementation. In effect it is still as though the implementation
used Unicode scalar values, but that would be true of any implementation.
It is much harder to implement something like this for other encodings
like UTF-7 and Punycode.

It is useful to keep in mind that features like character classes are just
syntactic sugar: they can be decomposed into regular expression primitives,
such as a choice that lists each member of the character class as a
literal. The `.` is just a large character class, and flags like //i just
transform parts of an expression, so that /a/i becomes something more like
/a|A/.
-- 
Björn Höhrmann · mailto:[email protected] · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss
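
To illustrate the decomposition described in the message above, here is a minimal sketch in Python. It is not taken from Unicode-SetAutomaton or from Russ Cox's code; it only shows the idea for a single range of Unicode scalar values, which is split into alternatives over UTF-16 code units. All function names (`utf16_alternatives`, `to_pattern`, `utf16_units`, `matches`) are hypothetical and chosen for this example.

```python
def utf16_alternatives(lo, hi):
    """Split the scalar-value range [lo, hi] into alternatives over
    UTF-16 code units.

    Each alternative is a tuple of inclusive (first, last) code-unit
    ranges: one range for BMP code points, two ranges (high surrogate,
    low surrogate) for supplementary code points.
    """
    alts = []
    # BMP: one code unit per scalar value. D800..DFFF are surrogates,
    # not scalar values, so they are skipped.
    for a, b in ((lo, min(hi, 0xD7FF)), (max(lo, 0xE000), min(hi, 0xFFFF))):
        if a <= b:
            alts.append(((a, b),))
    # Supplementary planes: two code units per scalar value.
    a, b = max(lo, 0x10000), min(hi, 0x10FFFF)
    if a <= b:
        a -= 0x10000
        b -= 0x10000
        h1, l1 = 0xD800 + (a >> 10), 0xDC00 + (a & 0x3FF)
        h2, l2 = 0xD800 + (b >> 10), 0xDC00 + (b & 0x3FF)
        if h1 == h2:
            alts.append(((h1, h1), (l1, l2)))
        else:
            if l1 > 0xDC00:
                # First high surrogate is only partially covered.
                alts.append(((h1, h1), (l1, 0xDFFF)))
                h1 += 1
            tail = None
            if l2 < 0xDFFF:
                # Last high surrogate is only partially covered.
                tail = ((h2, h2), (0xDC00, l2))
                h2 -= 1
            if h1 <= h2:
                # Fully covered high surrogates accept any low surrogate.
                alts.append(((h1, h2), (0xDC00, 0xDFFF)))
            if tail is not None:
                alts.append(tail)
    return alts


def to_pattern(alts):
    """Render alternatives as a regular expression over code units,
    written with \\uXXXX escapes."""
    def unit_class(a, b):
        return f"\\u{a:04X}" if a == b else f"[\\u{a:04X}-\\u{b:04X}]"
    return "|".join("".join(unit_class(a, b) for a, b in alt) for alt in alts)


def utf16_units(cp):
    """Return the UTF-16 code units of a single scalar value."""
    data = chr(cp).encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def matches(alts, units):
    """Check whether a code-unit sequence matches one of the alternatives."""
    return any(
        len(alt) == len(units)
        and all(a <= u <= b for (a, b), u in zip(alt, units))
        for alt in alts
    )
```

For example, `to_pattern(utf16_alternatives(0x1F600, 0x1F64F))` yields `\uD83D[\uDE00-\uDE4F]`, and the full range `utf16_alternatives(0, 0x10FFFF)` gives the three alternatives one would expect for `.` when line terminators are ignored. A character class with several ranges is then just the union of the alternatives for each range.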

