Re: Working with grapheme clusters

Bjoern Hoehrmann Sat, 26 Oct 2013 05:39:46 -0700

* Norbert Lindenberg wrote:
>On Oct 25, 2013, at 18:35 , Jason Orendorff <[email protected]> wrote:
>
>> UTF-16 is designed so that you can search based on code units
>> alone, without computing boundaries. RegExp searches fall in this
>> category.
>
>Not if the RegExp is case insensitive, or uses a character class, or ".", or a
>quantifier - these all require looking at code points rather than UTF-16 code
>units in order to support the full Unicode character set.


If you have a regular expression over an alphabet like "Unicode scalar
values" it is easy to turn it into an equivalent regular expression over
an alphabet like "UTF-16 code units". I have written a Perl module that
does it for UTF-8, <http://search.cpan.org/dist/Unicode-SetAutomaton/>;
Russ Cox's http://swtch.com/~rsc/regexp/regexp3.html#step3 is a popular
implementation. In effect it is still as though the implementation used
Unicode scalar values, but that would be true of any implementation. It
is much harder to implement something like this for other encodings like
UTF-7 and Punycode.
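
To illustrate the kind of rewriting meant here, the following is a minimal
sketch in Python, not the code from either implementation linked above. It
turns one inclusive range of Unicode scalar values into a list of
alternatives over UTF-16 code units. The names `to_surrogates`,
`utf16_ranges`, `encode_utf16` and `matches` are made up for this example.

```python
def to_surrogates(cp):
    """Split a supplementary code point into its (high, low) surrogate pair."""
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def utf16_ranges(lo, hi):
    """Rewrite the scalar value range [lo, hi] as alternatives over code units.

    Each alternative is a tuple of inclusive (min, max) code unit ranges that
    must match in sequence: one range for BMP characters, two for surrogate
    pairs. Surrogate code points themselves are not scalar values and are
    skipped.
    """
    alts = []
    if lo <= 0xFFFF:
        bmp_hi = min(hi, 0xFFFF)
        if lo < 0xD800:
            alts.append(((lo, min(bmp_hi, 0xD7FF)),))
        if bmp_hi > 0xDFFF:
            alts.append(((max(lo, 0xE000), bmp_hi),))
    if hi >= 0x10000:
        h1, l1 = to_surrogates(max(lo, 0x10000))
        h2, l2 = to_surrogates(hi)
        if h1 == h2:
            alts.append(((h1, h1), (l1, l2)))
        else:
            tail = None
            if l1 > 0xDC00:  # partial block under the first high surrogate
                alts.append(((h1, h1), (l1, 0xDFFF)))
                h1 += 1
            if l2 < 0xDFFF:  # partial block under the last high surrogate
                tail = ((h2, h2), (0xDC00, l2))
                h2 -= 1
            if h1 <= h2:     # full blocks in between
                alts.append(((h1, h2), (0xDC00, 0xDFFF)))
            if tail is not None:
                alts.append(tail)
    return alts


def encode_utf16(cp):
    """Encode one scalar value as a list of UTF-16 code units."""
    return [cp] if cp < 0x10000 else list(to_surrogates(cp))


def matches(alts, units):
    """Check a code unit sequence against the alternatives, unit by unit."""
    return any(
        len(seq) == len(units)
        and all(a <= u <= b for (a, b), u in zip(seq, units))
        for seq in alts
    )
```

For example, `utf16_ranges(0x1F600, 0x1F64F)` yields a single alternative,
the high surrogate 0xD83D followed by a low surrogate in 0xDE00..0xDE4F, so
a matcher that only ever looks at code units accepts exactly the same
strings as one that decodes scalar values first.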

It is useful to keep in mind that features like character classes are
just syntactic sugar and can be decomposed into regular expression
primitives, such as an alternation listing each member of the character
class as a literal. The `.` is just a large character class, and flags
like //i just transform parts of an expression, so that /a/i becomes
something more like /a|A/.
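
The decomposition can be sketched in a few lines of Python. This is only an
illustration: the helper names `desugar_class` and `desugar_ignore_case` are
made up, and the case handling relies on Python's `str.lower()` and
`str.upper()`, which is an assumption and not the case folding a RegExp
engine is required to use.

```python
import re


def desugar_class(members):
    """Rewrite a character class as an alternation of literals."""
    return "(?:" + "|".join(re.escape(m) for m in sorted(set(members))) + ")"


def desugar_ignore_case(literal):
    """Rewrite a literal so it matches case-insensitively without a flag."""
    parts = []
    for ch in literal:
        # Collect the character together with its lower and upper case forms.
        variants = sorted({ch, ch.lower(), ch.upper()})
        if len(variants) > 1:
            parts.append(desugar_class(variants))
        else:
            parts.append(re.escape(ch))
    return "".join(parts)
```

With these helpers, `desugar_class("abc")` gives `(?:a|b|c)` and
`desugar_ignore_case("a")` gives `(?:A|a)`; neither result contains a
character class or depends on a flag.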
-- 
Björn Höhrmann · mailto:[email protected] · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 
