Re: Full Unicode based on UTF-16 proposal

Steven L. Sun, 18 Mar 2012 07:15:55 -0700

Steven Levithan wrote:

* \s == [\x09-\x0D] -- Java, PCRE, Ruby, Python (default).
* \s == [\x09-\x0D\p{Z}] -- ES-current, .NET, Perl, Python (with (?u)).

Oops. My ASCII-only version of \s is obviously missing space \x20 and no-break space \xA0 (which are included in Unicode's \p{Z}).
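As a quick check of that correction, here is a minimal sketch in Python 3 (used only as a convenient test bed, not as one of the engines listed above). The names `posted`, `amended`, `samples` and `coverage` are labels invented for this sketch; the two character classes are the one as posted and the same class with \x20 and \xA0 added.

```python
import re

# The class as originally posted, and with the two omitted characters added.
# Both variable names are labels chosen for this sketch.
posted = re.compile(r"[\x09-\x0D]")
amended = re.compile(r"[\x09-\x0D\x20\xA0]")

# A few whitespace characters to probe with.
samples = {"tab": "\t", "space": "\x20", "no-break space": "\xa0"}


def coverage(pattern):
    """Return the names of the sample characters the pattern matches."""
    return [name for name, ch in samples.items() if pattern.fullmatch(ch)]


posted_hits = coverage(posted)
amended_hits = coverage(amended)
print(posted_hits)
print(amended_hits)
```

The posted class only catches the tab here; the amended class catches all three samples.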


Erik Corry wrote:

Steven Levithan wrote:

[:alnum:] in Perl, PCRE, Ruby, Tcl, POSIX/GNU BRE/ERE, etc. matches only
[A-Za-z0-9]. Making it Unicode-based in ES would be confusing.

This would be pretty useless and is not true in perl. I tried the following:


perl -e "use utf8; print 'æ' =~ /[[:alnum:]]/ . \"\n\";"

and it prints 1, indicating a match.

***<Updating my mental notes>*** Roger that. Online docs (including the Perl-specific page you linked to earlier) typically list [:alnum:] as [A-Za-z0-9], but I've just done some quick testing and it seems that regex packages supporting [:alnum:] give it at least three different meanings:


* [A-Za-z0-9]
* [\p{Ll}\p{Lu}\p{Lt}\p{Nd}]
* [\p{Ll}\p{Lu}\p{Lt}\p{Nd}\p{Nl}]
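To see how the three definitions diverge, here is a rough sketch in Python 3. Python's `re` module has no \p{..} syntax, so this uses `unicodedata.category` to look up general categories instead; that substitution, and all function and variable names, are assumptions of the sketch rather than how any of the tested regex packages work internally.

```python
import string
import unicodedata

# Definition 1: [A-Za-z0-9]
ASCII_ALNUM = set(string.ascii_letters + string.digits)
# Definition 2: [\p{Ll}\p{Lu}\p{Lt}\p{Nd}]
CASED_AND_DIGITS = {"Ll", "Lu", "Lt", "Nd"}
# Definition 3: definition 2 plus \p{Nl} (letter numbers)
CASED_DIGITS_AND_NL = CASED_AND_DIGITS | {"Nl"}


def alnum_ascii(ch):
    return ch in ASCII_ALNUM


def alnum_unicode(ch):
    return unicodedata.category(ch) in CASED_AND_DIGITS


def alnum_unicode_nl(ch):
    return unicodedata.category(ch) in CASED_DIGITS_AND_NL


# 'a', U+00E6 (ae ligature, Ll), U+0663 (Arabic-Indic digit three, Nd),
# U+2167 (Roman numeral eight, Nl)
samples = ["a", "\u00e6", "\u0663", "\u2167"]
table = {
    ch: (alnum_ascii(ch), alnum_unicode(ch), alnum_unicode_nl(ch))
    for ch in samples
}
for ch, row in table.items():
    print(f"U+{ord(ch):04X}", row)
```

Only the plain ASCII letter satisfies all three definitions, and the Roman numeral satisfies only the last one, so the same [:alnum:] pattern can give different answers depending on which definition a package picked.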

Note that although Java doesn't support POSIX character class syntax, it too supports alnum via \p{Alnum}. Java's alnum matches only [A-Za-z0-9].

Anyway, this is probably all moot, unless someone wants to officially propose POSIX character classes for ES RegExp. ...In which case I'll be happy to state about a half-dozen reasons to not do so. :)


Erik Corry wrote:

OK, I'm convinced that /u should make \d, \b and \w Unicode aware.


w00t!
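
For a feel of what Unicode-aware \d, \b and \w change in practice, here is an analogy in Python 3, not ES code. It assumes Python 3's `re`, where str patterns are Unicode-aware by default and `re.ASCII` selects the ASCII-only behavior; the proposed /u flag may differ in detail. The sample strings and variable names are made up for illustration.

```python
import re

words_text = "na\u00efve caf\u00e9"   # "naive cafe" with i-diaeresis and e-acute
digit_text = "\u0663"                 # Arabic-Indic digit three

# \w and \b: whole words versus fragments cut at non-ASCII letters.
unicode_words = re.findall(r"\b\w+\b", words_text)
ascii_words = re.findall(r"\b\w+\b", words_text, flags=re.ASCII)

# \d: non-ASCII decimal digits are matched only in Unicode mode.
unicode_digits = re.findall(r"\d", digit_text)
ascii_digits = re.findall(r"\d", digit_text, flags=re.ASCII)

print(unicode_words, ascii_words)
print(unicode_digits, ascii_digits)
```

In ASCII-only mode the accented letters count as non-word characters, so \b finds boundaries in the middle of each word and \w+ returns fragments.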

--Steven Levithan


_______________________________________________
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss
