Eric Corry wrote:
I further objected because I think the /u flag would be better used as an
ASCII/Unicode mode switcher for \d\w\b. My proposal for this is based on
Python's re.UNICODE or (?u) flag, which does the same thing except that it
also covers \s (which is already Unicode-based in ES).
I am rather skeptical about treating \d like this. I think "any digit
including rods and roman characters but not decimal points/commas"
http://en.wikipedia.org/wiki/Numerals_in_Unicode#Counting-rod_numerals
would be needed much less often than the digits 0-9, so I think
hijacking \d for this case is a poor use of namespace. The \d escape
in Perl does not cover other Unicode numerals, and even with the
[:name:] syntax there appears to be no way to get the Unicode
numerals:
http://search.cpan.org/~flora/perl-5.14.2/pod/perlrecharclass.pod#POSIX_Character_Classes
This suggests to me that it's not very useful.
I know from experience that it's common for Arabic speakers to want to match
both 0-9 and Arabic-Indic digits. The same seems true for Hindi/Devanagari
digits, and probably others. Even if it wasn't often useful, IMO this change
is necessary for congruity with Unicode-enabled \w and \b (I'll get to
that), and would likely never be detrimental since /u would be opt-in and
it's easy to explicitly use [0-9] when that's what you want.
For the record, I am proposing that /\d/u be equivalent to /\p{Nd}/, not
/\p{N}/. I.e., it should not match any Unicode number, but rather any
Unicode decimal digit (see
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5Cp%7BNd%7D for the
list). And as Norbert noted, that is in fact what Perl's \d matches.
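To make the Nd-versus-N distinction concrete, here is a minimal sketch. It assumes an engine with ES2018 Unicode property escapes; the u flag is used only to enable \p{...} and does not implement the /u behavior proposed in this thread. The names (samples, classify) are made up for illustration.

```javascript
// Assumption: ES2018+ engine. The u flag only enables \p{...} here.
const asciiDigit = /^\d$/;         // ES-current \d, i.e. [0-9]
const decimalDigit = /^\p{Nd}$/u;  // any Unicode decimal digit
const anyNumber = /^\p{N}$/u;      // any Unicode number (Nd, Nl, No)

// Hypothetical sample characters, written as escapes to avoid encoding issues.
const samples = {
  ascii: '7',
  arabicIndic: '\u0663',    // ARABIC-INDIC DIGIT THREE (Nd)
  devanagari: '\u0969',     // DEVANAGARI DIGIT THREE (Nd)
  romanNumeral: '\u2163',   // ROMAN NUMERAL FOUR (Nl)
  countingRod: '\u{1D360}', // COUNTING ROD UNIT DIGIT ONE (No)
};

// Report which of the three classes a single character falls into.
function classify(ch) {
  return {
    d: asciiDigit.test(ch),
    nd: decimalDigit.test(ch),
    n: anyNumber.test(ch),
  };
}

for (const [name, ch] of Object.entries(samples)) {
  console.log(name, classify(ch));
}
```

Under these assumptions, the Arabic-Indic and Devanagari digits match \p{Nd} but not ES-current \d, while the Roman numeral and counting rod match only \p{N}.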
Comparison with other regex flavors:
* \w == [A-Za-z0-9_] -- ES-current, Java, PCRE, Ruby, Python (default).
* \w == [\p{L}\p{Nd}_] -- .NET, Perl, Python (with (?u)).
* \b matches between ASCII \w\W -- ES-current, PCRE, Ruby, Python (default).
* \b matches between Unicode \w\W -- Java, .NET, Perl, Python (with (?u)).
* \d == [0-9] -- ES-current, Java, PCRE, Ruby, Python (default).
* \d == \p{Nd} -- .NET, Perl, Python (with (?u)).
* \s == [\x09-\x0D] -- Java, PCRE, Ruby, Python (default).
* \s == [\x09-\x0D\p{Z}] -- ES-current, .NET, Perl, Python (with (?u)).
Note that Java's \w and \b are inconsistent.
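The ES-current rows of the list can be checked directly. This is only a small sketch of today's behavior (ASCII-only \w, Unicode-aware \s); the property names in the checks object are invented for the example.

```javascript
// ES-current behavior without any flags.
const checks = {
  // \w is ASCII-only: plain "e" matches, precomposed "é" (U+00E9) does not.
  wordMatchesAsciiLetter: /^\w$/.test('e'),
  wordMatchesAccentedLetter: /^\w$/.test('\u00E9'),
  // \s is already Unicode-aware: NO-BREAK SPACE and EM SPACE both match.
  spaceMatchesNbsp: /^\s$/.test('\u00A0'),
  spaceMatchesEmSpace: /^\s$/.test('\u2003'),
};

console.log(checks);
```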
Unicode-based \w and \b are incredibly useful, and users commonly want
them to be Unicode-based in some cases and ASCII-based in others; an opt-in
flag therefore offers the best of both worlds. In fact, I'd go so far as to
say they are broken without Unicode support. Consider, e.g.,
/a\b/.test('naïve'), which currently returns true.
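A rough sketch of the difference, assuming ES2018 property escapes. Since "a" is itself a word character, a word boundary after it just means "not followed by a word character", so a Unicode-aware boundary can be approximated with a negative lookahead on [\p{L}\p{Nd}_]. That class is the one discussed in this thread, not something the u flag provides on its own.

```javascript
// "naïve" spelled with the precomposed U+00EF so the result does not
// depend on how the source file is normalized.
const word = 'na\u00EFve';

// ES-current: \b is defined in terms of ASCII \w, so the "a"/"ï" position
// counts as a word boundary.
const asciiBoundary = /a\b/;

// Approximation of a Unicode-aware boundary after "a" (assumed word class:
// [\p{L}\p{Nd}_]). The u flag is needed only for \p{...}.
const unicodeBoundary = /a(?![\p{L}\p{Nd}_])/u;

console.log(asciiBoundary.test(word));   // true
console.log(unicodeBoundary.test(word)); // false
console.log(unicodeBoundary.test('a b')); // true
```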
Unicode-based \d would not only help international users/apps, it is also
important because otherwise Unicode-based \w and \b would have to use
[\p{L}0-9_] rather than [\p{L}\p{Nd}_], which breaks portability with .NET,
Perl, Python, and Java. If, conversely, Unicode-enabled \w and \b used
[\p{L}\p{Nd}_] but \d used [0-9], then among other consequences (including
user confusion), [^\W\d_] could not be used equivalently to \p{L}.
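The [^\W\d_] point can be simulated today. The sketch below assumes ES2018 property escapes and spells out the proposed classes by hand: "word character minus digits minus underscore" is written as a negative lookahead followed by the word class. All names are illustrative.

```javascript
// Assumed Unicode word class from this thread: [\p{L}\p{Nd}_].
const unicodeWord = '[\\p{L}\\p{Nd}_]';

// Emulates [^\W\d_] when \d is Unicode-based (\p{Nd}).
const lettersConsistent = new RegExp(`^(?![\\p{Nd}_])${unicodeWord}$`, 'u');

// Emulates [^\W\d_] when \w is Unicode-based but \d stays [0-9].
const lettersMixed = new RegExp(`^(?![0-9_])${unicodeWord}$`, 'u');

// Reference: what \p{L} itself matches.
const letter = /^\p{L}$/u;

// "a", "é", ARABIC-INDIC DIGIT THREE, "5", "_"
const chars = ['a', '\u00E9', '\u0663', '5', '_'];

for (const ch of chars) {
  console.log(
    JSON.stringify(ch),
    letter.test(ch),
    lettersConsistent.test(ch),
    lettersMixed.test(ch)
  );
}
```

With a Unicode-based \d the emulated class agrees with \p{L} on every sample; with an ASCII \d it also accepts the Arabic-Indic digit, which is not a letter.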
And instead of changing the meaning of \w, which will be confusing, I
think that [:alnum:] as in perl would work fine.
[:alnum:] in Perl, PCRE, Ruby, Tcl, POSIX/GNU BRE/ERE, etc. matches only
[A-Za-z0-9]. Making it Unicode-based in ES would be confusing. It also works
only within character classes. IMO, the POSIX-style [[:name:]] syntax is
clumsy and confusing, not to mention backward incompatible. It would
potentially also be confusing if ES supports only [:alnum:] without adding
the rest of the (not-very-useful) POSIX regex class names.
\b is a little tougher. The Unicode rewrite would be
(?:(?<![:alnum:])(?=[:alnum:])|(?<=[:alnum:])(?![:alnum:])) which is
obviously too verbose. But if we take \b for this then the ASCII
version has to be written as
(?:(?<!\w)(?=\w)|(?<=\w)(?!\w)) which is also more than a little
annoying. However, often you don't need that if you have negative
lookbehind, because you can write something like
/(?<!\w)word(?!\w)/ // Negative look-behind for a \w at the start and
negative look-ahead for a \w at the end.
which isn't _too_ bad, even if it is much worse than
/\bword\b/
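For what it's worth, the two spellings can be compared side by side in an engine that has lookbehind (ES2018 or later; it was not available in ES when this thread was written). This is only a sketch over a handful of made-up inputs, and the equivalence relies on the pattern starting and ending with a word character.

```javascript
const withBoundary = /\bword\b/;
const withLookaround = /(?<!\w)word(?!\w)/; // assumes lookbehind support

// Hypothetical test inputs.
const inputs = ['a word here', 'swordfish', 'words', 'word', 'pass-word!'];

// Collect the result of both patterns for each input.
const results = inputs.map((s) => ({
  input: s,
  boundary: withBoundary.test(s),
  lookaround: withLookaround.test(s),
}));

console.log(results);
```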
I've already started to explain above why I think Unicode-based \b is
important and useful. I'll just add the footnote that relying on lookbehind
would in all likelihood perform less efficiently than \b (depending on
implementation optimizations).
Indeed. My response was rushed and poorly formed. My apologies.
Gratefully accepted with the hope that my next rushed and poorly
formed response will also be forgiven!
Consider it done. ;-P
--Steven Levithan