Re: Full Unicode based on UTF-16 proposal

Steven L. Sat, 17 Mar 2012 17:09:00 -0700

Eric Corry wrote:

I further objected because I think the /u flag would be better used as an
ASCII/Unicode mode switcher for \d\w\b. My proposal for this is based on
Python's re.UNICODE or (?u) flag, which does the same thing except that it
also covers \s (which is already Unicode-based in ES).


I am rather skeptical about treating \d like this.  I think "any digit
including rods and roman characters but not decimal points/commas"
http://en.wikipedia.org/wiki/Numerals_in_Unicode#Counting-rod_numerals
would be needed much less often than the digits 0-9, so I think
hijacking \d for this case is poor use of name space.  The \d escape
in perl does not cover other Unicode numerals, and even with the
[:name:] syntax there appears to be no way to get the Unicode numerals:
http://search.cpan.org/~flora/perl-5.14.2/pod/perlrecharclass.pod#POSIX_Character_Classes
This suggests to me that it's not very useful.

I know from experience that it's common for Arabic speakers to want to match both 0-9 and Arabic-Indic digits. The same seems true for Hindi/Devanagari digits, and probably others. Even if it wasn't often useful, IMO this change is necessary for congruity with Unicode-enabled \w and \b (I'll get to that), and would likely never be detrimental since /u would be opt-in and it's easy to explicitly use [0-9] when that's what you want.

For the record, I am proposing that /\d/u be equivalent to /\p{Nd}/, not /\p{N}/. I.e., it should not match any Unicode number, but rather any Unicode decimal digit (see http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5Cp%7BNd%7D for the list). And as Norbert noted, that is in fact what Perl's \d matches.
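
To make the Nd-versus-N distinction concrete, here is a small sketch. It assumes an engine with Unicode property escapes (\p{...} together with the u flag, which were standardized well after this message was written), and it uses them only as stand-ins for the proposed /\d/u. The sample strings are illustrative choices, not part of the proposal.

```javascript
// Stand-ins for the two candidate definitions of a Unicode-aware \d.
// Assumption: \p{...} property escapes are available (ES2018+).
const decimalDigit = /^\p{Nd}+$/u; // what is proposed for /\d/u
const anyNumber = /^\p{N}+$/u;     // the broader class, NOT proposed

const samples = {
  ascii: "2012",
  arabicIndic: "\u0663\u0664\u0665", // ARABIC-INDIC DIGITS THREE..FIVE
  devanagari: "\u0967\u0968\u0969",  // DEVANAGARI DIGITS ONE..THREE
  romanNumeral: "\u2167",            // ROMAN NUMERAL EIGHT (category Nl)
  fraction: "\u00BD",                // VULGAR FRACTION ONE HALF (category No)
};

// Record, for each sample, whether each class matches the whole string.
const results = {};
for (const [name, text] of Object.entries(samples)) {
  results[name] = { Nd: decimalDigit.test(text), N: anyNumber.test(text) };
}
console.log(results);
```

The decimal digits of every script match both classes, while the Roman numeral and the fraction match only \p{N}, which is the kind of over-matching the \p{Nd} definition avoids.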


Comparison with other regex flavors:

* \w == [A-Za-z0-9_] -- ES-current, Java, PCRE, Ruby, Python (default).
* \w == [\p{L}\p{Nd}_] -- .NET, Perl, Python (with (?u)).

* \b matches between ASCII \w\W -- ES-current, PCRE, Ruby, Python (default).
* \b matches between Unicode \w\W -- Java, .NET, Perl, Python (with (?u)).

* \d == [0-9] -- ES-current, Java, PCRE, Ruby, Python (default).
* \d == \p{Nd} -- .NET, Perl, Python (with (?u)).

* \s == [\x09-\x0D\x20] -- Java, PCRE, Ruby, Python (default).
* \s == [\x09-\x0D\p{Z}] -- ES-current, .NET, Perl, Python (with (?u)).

Note that Java's \w and \b are inconsistent: \w is ASCII-only while \b is Unicode-based.

Unicode-based \w and \b are incredibly useful, and it is very common for users to sometimes want them to be Unicode-based--thus, an opt-in flag offers the best of both worlds. In fact, I'd go so far as to say they are broken without Unicode support. Consider, e.g., /a\b/.test('naïve'), which currently returns true.
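
The 'naïve' case can be reproduced and contrasted with a Unicode-aware boundary. The sketch below is only an emulation: it assumes an engine with lookbehind and \p{...} property escapes (both later additions to the language), and the names WORD and unicodeB are made up for illustration.

```javascript
// Current behavior: \b is ASCII-based, so it sees a boundary between
// "a" (a \w character) and "ï" (not a \w character).
const asciiBoundary = /a\b/.test("na\u00EFve"); // "naïve"

// Hypothetical Unicode-aware boundary, emulated with lookarounds over
// [\p{L}\p{Nd}_]. Assumption: lookbehind and \p{...} are supported.
const WORD = "[\\p{L}\\p{Nd}_]";
const unicodeB = `(?:(?<!${WORD})(?=${WORD})|(?<=${WORD})(?!${WORD}))`;
const unicodeBoundary = new RegExp(`a${unicodeB}`, "u").test("na\u00EFve");

console.log(asciiBoundary, unicodeBoundary); // true false
```

With the Unicode-aware definition, "ï" counts as a word character, so no boundary is reported in the middle of the word.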

Unicode-based \d would not only help international users/apps, it is also important because otherwise Unicode-based \w and \b would have to use [\p{L}0-9_] rather than [\p{L}\p{Nd}_], which breaks portability with .NET, Perl, Python, and Java. If, conversely, Unicode-enabled \w and \b used [\p{L}\p{Nd}_] but \d used [0-9], then among other consequences (including user confusion), [^\W\d_] could not be used equivalently to \p{L}.
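
The [^\W\d_] point can be checked with a rough emulation. This sketch assumes \p{...} property escapes are available and models "a word character that is neither a digit nor an underscore" with a negative lookahead. The two scenarios and the names WORD_CLASS, consistent, and inconsistent are hypothetical, chosen only to mirror the paragraph above.

```javascript
// Emulate "[^\W\d_]" (a word character that is neither a digit nor "_")
// under two hypothetical definitions of a Unicode-enabled \w and \d.
// Assumption: \p{...} property escapes are available (ES2018+).
const WORD_CLASS = "[\\p{L}\\p{Nd}_]"; // Unicode \w in both scenarios

// Scenario 1: \d is \p{Nd}, consistent with \w.
const consistent = new RegExp(`^(?![\\p{Nd}_])${WORD_CLASS}$`, "u");
// Scenario 2: \d stays [0-9] while \w uses \p{Nd}.
const inconsistent = new RegExp(`^(?![0-9_])${WORD_CLASS}$`, "u");
const letter = /^\p{L}$/u;

// Probes: two letters, an ASCII digit, "_", and ARABIC-INDIC DIGIT THREE.
const probes = ["a", "\u00EF", "7", "_", "\u0663"];
const rows = probes.map((ch) => ({
  ch,
  letter: letter.test(ch),
  consistent: consistent.test(ch),
  inconsistent: inconsistent.test(ch),
}));
console.log(rows);
```

In scenario 1 the emulated class agrees with \p{L} on every probe. In scenario 2 it also accepts the Arabic-Indic digit, so it is no longer a usable substitute for \p{L}.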

And instead of changing the meaning of \w, which will be confusing, I
think that [:alnum:] as in perl would work fine.

[:alnum:] in Perl, PCRE, Ruby, Tcl, POSIX/GNU BRE/ERE, etc. matches only [A-Za-z0-9]. Making it Unicode-based in ES would be confusing. It also works only within character classes. IMO, the POSIX-style [[:name:]] syntax is clumsy and confusing, not to mention backward incompatible. It would potentially also be confusing if ES supports only [:alnum:] without adding the rest of the (not-very-useful) POSIX regex class names.

\b is a little tougher.  The Unicode rewrite would be
(?:(?<![:alnum:])(?=[:alnum:])|(?<=[:alnum:])(?![:alnum:])) which is
obviously too verbose.  But if we take \b for this then the ASCII
version has to be written as
(?:(?<!\w)(?=\w)|(?<=\w)(?!\w)) which is also more than a little
annoying.  However, often you don't need that if you have negative
lookbehind because you can write something like

/(?<!\w)word(?!\w)/    // Negative look-behind for a \w at the start and
                       // negative look-ahead for a \w at the end.

which isn't _too_ bad, even if it is much worse than

/\bword\b/

I've already started to explain above why I think Unicode-based \b is important and useful. I'll just add the footnote that relying on lookbehind would in all likelihood perform less efficiently than \b (depending on implementation optimizations).
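
For reference, the ASCII lookaround rewrite of \b quoted above can be compared against the built-in \b. This is a sketch that assumes an engine with lookbehind support (not available in ES at the time of writing). It checks equivalence on one sample string only and says nothing about relative performance.

```javascript
// Compare the built-in \b with the lookaround rewrite quoted above.
// Assumption: the engine supports lookbehind (ES2018+).
const builtIn = /\b/g;
const rewritten = /(?:(?<!\w)(?=\w)|(?<=\w)(?!\w))/g;

const text = "a word, reworded";
// Mark every boundary position with "|" so the two can be compared.
const markedBuiltIn = text.replace(builtIn, "|");
const markedRewritten = text.replace(rewritten, "|");
console.log(markedBuiltIn);
console.log(markedBuiltIn === markedRewritten);

// The shorter form: a whole-word match using a negative lookbehind and
// a negative lookahead instead of \b on either side.
const wholeWord = /(?<!\w)word(?!\w)/;
console.log(wholeWord.test(text), wholeWord.test("reworded"));
```

Both forms mark the same positions on this input, and the shorter whole-word pattern accepts "word" on its own while rejecting it inside "reworded".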

Indeed. My response was rushed and poorly formed. My apologies.


Gratefully accepted with the hope that my next rushed and poorly
formed response will also be forgiven!


Consider it done. ;-P

--Steven Levithan


_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss