Eric Corry wrote:

I further objected because I think the /u flag would be better used as an
ASCII/Unicode mode switcher for \d\w\b. My proposal for this is based on
Python's re.UNICODE or (?u) flag, which does the same thing except that it
also covers \s (which is already Unicode-based in ES).

I am rather skeptical about treating \d like this.  I think "any digit
including rods and roman characters but not decimal points/commas"
http://en.wikipedia.org/wiki/Numerals_in_Unicode#Counting-rod_numerals
would be needed much less often than the digits 0-9, so I think
hijacking \d for this case is poor use of name space.  The \d escape
in perl does not cover other Unicode numerals, and even with the
[:name:] syntax there appears to be no way to get the Unicode
numerals: http://search.cpan.org/~flora/perl-5.14.2/pod/perlrecharclass.pod#POSIX_Character_Classes
This suggests to me that it's not very useful.

I know from experience that it's common for Arabic speakers to want to match both 0-9 and Arabic-Indic digits. The same seems true for Hindi/Devanagari digits, and probably others. Even if it wasn't often useful, IMO this change is necessary for congruity with Unicode-enabled \w and \b (I'll get to that), and would likely never be detrimental since /u would be opt-in and it's easy to explicitly use [0-9] when that's what you want.

For the record, I am proposing that /\d/u be equivalent to /\p{Nd}/, not /\p{N}/. I.e., it should not match any Unicode number, but rather any Unicode decimal digit (see http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5Cp%7BNd%7D for the list). And as Norbert noted, that is in fact what Perl's \d matches.
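To make the \p{Nd} distinction concrete, here is a minimal sketch. It assumes an engine that supports Unicode property escapes (a feature standardized well after this thread), and it uses /\p{Nd}/u only as a stand-in for the proposed meaning of /\d/u. The sample strings and the `classify` helper are hypothetical names chosen for illustration.

```javascript
// Assumption: \p{Nd} under /u stands in for the *proposed* /\d/u semantics.
const asciiDigits = /^\d+$/;         // current ES: [0-9] only
const unicodeDigits = /^\p{Nd}+$/u;  // any Unicode decimal digit

const samples = {
  ascii: "2012",
  arabicIndic: "\u0662\u0660\u0661\u0662", // Arabic-Indic digits 2 0 1 2
  devanagari: "\u0968\u0966\u0967\u0968",  // Devanagari digits 2 0 1 2
  roman: "\u2167",                         // ROMAN NUMERAL EIGHT (category Nl, not Nd)
};

// Hypothetical helper: report how each pattern treats a string.
function classify(s) {
  return { ascii: asciiDigits.test(s), unicode: unicodeDigits.test(s) };
}

for (const [name, s] of Object.entries(samples)) {
  console.log(name, classify(s));
}
```

The Roman numeral is rejected by both patterns, which is the point of choosing \p{Nd} over \p{N}.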

Comparison with other regex flavors:

* \w == [A-Za-z0-9_] -- ES-current, Java, PCRE, Ruby, Python (default).
* \w == [\p{L}\p{Nd}_] -- .NET, Perl, Python (with (?u)).

* \b matches between ASCII \w\W -- ES-current, PCRE, Ruby, Python (default).
* \b matches between Unicode \w\W -- Java, .NET, Perl, Python (with (?u)).

* \d == [0-9] -- ES-current, Java, PCRE, Ruby, Python (default).
* \d == \p{Nd} -- .NET, Perl, Python (with (?u)).

* \s == [\x09-\x0D\x20] -- Java, PCRE, Ruby, Python (default).
* \s == [\x09-\x0D\p{Z}] -- ES-current, .NET, Perl, Python (with (?u)).

Note that Java's \w and \b are inconsistent.

Unicode-based \w and \b are incredibly useful, and users often want them to be Unicode-based; an opt-in flag therefore offers the best of both worlds. In fact, I'd go so far as to say they are broken without Unicode support. Consider, e.g., /a\b/.test('naïve'), which currently returns true because ï is not treated as a word character.
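The 'naïve' case can be checked directly. The sketch below also includes a rough emulation of what a Unicode-aware \b would do in this one position, written as a negative lookahead over [\p{L}\p{Nd}_]. That emulation is an assumption for illustration (it relies on property escapes, which are not part of the proposal being discussed), and the variable names are hypothetical.

```javascript
// Current ES behavior: "ï" (U+00EF) is not in \w, so \b matches between "a" and "ï".
const asciiBoundary = /a\b/;

// Hypothetical emulation of a Unicode-aware \b right after "a":
// since "a" is always a word character, a boundary exists there only
// if the next character is NOT in [\p{L}\p{Nd}_].
const unicodeBoundary = /a(?![\p{L}\p{Nd}_])/u;

const word = "na\u00EFve"; // "naïve" with a precomposed ï

console.log(asciiBoundary.test(word));      // true
console.log(unicodeBoundary.test(word));    // false
console.log(unicodeBoundary.test("pizza")); // true
```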

Unicode-based \d would not only help international users/apps, it is also important because otherwise Unicode-based \w and \b would have to use [\p{L}0-9_] rather than [\p{L}\p{Nd}_], which breaks portability with .NET, Perl, Python, and Java. If, conversely, Unicode-enabled \w and \b used [\p{L}\p{Nd}_] but \d used [0-9], then among other consequences (including user confusion), [^\W\d_] could not be used equivalently to \p{L}.
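As a sanity check on the [^\W\d_] idiom, this small sketch confirms that in today's ASCII-only mode it is exactly equivalent to [A-Za-z] across every UTF-16 code unit. The argument above is that the same idiom can only stay equivalent to \p{L} in a Unicode mode if \w and \d agree on what a digit is. Variable names are illustrative.

```javascript
// In current (ASCII-based) ES, \w is [A-Za-z0-9_] and \d is [0-9],
// so "word char that is neither a digit nor underscore" means "letter".
const asciiIdiom = /^[^\W\d_]$/;
const asciiLetters = /^[A-Za-z]$/;

let mismatches = 0;
let letters = 0;
for (let cu = 0; cu <= 0xFFFF; cu++) {
  const ch = String.fromCharCode(cu);
  const viaIdiom = asciiIdiom.test(ch);
  if (viaIdiom !== asciiLetters.test(ch)) mismatches++;
  if (viaIdiom) letters++;
}

console.log(mismatches, letters); // 0 52
```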

And instead of changing the meaning of \w, which will be confusing, I
think that [:alnum:] as in perl would work fine.

[:alnum:] in Perl, PCRE, Ruby, Tcl, POSIX/GNU BRE/ERE, etc. matches only [A-Za-z0-9]. Making it Unicode-based in ES would be confusing. It also works only within character classes. IMO, the POSIX-style [[:name:]] syntax is clumsy and confusing, not to mention backward incompatible. It would potentially also be confusing if ES supports only [:alnum:] without adding the rest of the (not-very-useful) POSIX regex class names.

\b is a little tougher.  The Unicode rewrite would be
(?:(?<![:alnum:])(?=[:alnum:])|(?<=[:alnum:])(?![:alnum:])) which is
obviously too verbose.  But if we take \b for this then the ASCII
version has to be written as
(?:(?<!\w)(?=\w)|(?<=\w)(?!\w)) which is also more than a little
annoying.  However, often you don't need that if you have negative
lookbehind because you can write something
like

/(?<!\w)word(?!\w)/    // Negative look-behind for a \w at the start and
                       // negative look-ahead for a \w at the end.

which isn't _too_ bad, even if it is much worse than

/\bword\b/

I've already started to explain above why I think Unicode-based \b is important and useful. I'll just add the footnote that relying on lookbehind would in all likelihood perform less efficiently than \b (depending on implementation optimizations).
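For reference, the lookaround rewrites discussed above can be compared against \b by collecting the positions where each one matches. This is only a sketch: it assumes an engine with lookbehind, property escapes, and String.prototype.matchAll, none of which ES had at the time of this thread, and the [\p{L}\p{Nd}_] class stands in for the proposed Unicode-aware \w. It says nothing about relative performance.

```javascript
// ASCII \b and its lookaround spelling should find identical positions.
const nativeB = /\b/g;
const asciiLookaround = /(?:(?<!\w)(?=\w)|(?<=\w)(?!\w))/g;

// Hypothetical Unicode-aware boundary, with [\p{L}\p{Nd}_] in place of \w.
const unicodeLookaround =
  /(?:(?<![\p{L}\p{Nd}_])(?=[\p{L}\p{Nd}_])|(?<=[\p{L}\p{Nd}_])(?![\p{L}\p{Nd}_]))/gu;

// Collect the index of every (zero-length) match of a global regex.
function boundaries(re, s) {
  return [...s.matchAll(re)].map((m) => m.index);
}

const text = "na\u00EFve caf\u00E9"; // "naïve café"

console.log(boundaries(nativeB, text));           // [0, 2, 3, 5, 6, 9]
console.log(boundaries(asciiLookaround, text));   // [0, 2, 3, 5, 6, 9]
console.log(boundaries(unicodeLookaround, text)); // [0, 5, 6, 10]
```

The ASCII versions report spurious boundaries around ï and miss the one after é, while the Unicode-aware version finds only the edges of the two words.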

Indeed. My response was rushed and poorly formed. My apologies.

Gratefully accepted with the hope that my next rushed and poorly
formed response will also be forgiven!

Consider it done. ;-P

--Steven Levithan


_______________________________________________
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss
