2012/3/18 Steven L. <steves_l...@hotmail.com>:
> Eric Corry wrote:
>
>>> I further objected because I think the /u flag would be better used as an
>>> ASCII/Unicode mode switcher for \d\w\b. My proposal for this is based on
>>> Python's re.UNICODE or (?u) flag, which does the same thing except that it
>>> also covers \s (which is already Unicode-based in ES).
>>
>>
>> I am rather skeptical about treating \d like this. I think "any digit
>> including rods and roman characters but not decimal points/commas"
>> http://en.wikipedia.org/wiki/Numerals_in_Unicode#Counting-rod_numerals
>> would be needed much less often than the digits 0-9, so I think
>> hijacking \d for this case is poor use of name space. The \d escape
>> in perl does not cover other Unicode numerals, and even with the
>> [:name:] syntax there appears to be no way to get the Unicode
>> numerals:
>> http://search.cpan.org/~flora/perl-5.14.2/pod/perlrecharclass.pod#POSIX_Character_Classes
>> This suggests to me that it's not very useful.
>
>
> I know from experience that it's common for Arabic speakers to want to match
> both 0-9 and Arabic-Indic digits. The same seems true for Hindi/Devanagari
> digits, and probably others. Even if it wasn't often useful, IMO this change
> is necessary for congruity with Unicode-enabled \w and \b (I'll get to
> that), and would likely never be detrimental since /u would be opt-in and
> it's easy to explicitly use [0-9] when that's what you want.
>
> For the record, I am proposing that /\d/u be equivalent to /\p{Nd}/, not
> /\p{N}/. I.e., it should not match any Unicode number, but rather any
> Unicode decimal digit (see
> http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5Cp%7BNd%7D for the
> list). And as Norbert noted, that is in fact what Perl's \d matches.
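
To make the \p{Nd} versus \p{N} distinction above concrete, here is a minimal sketch. It assumes an engine that supports the u flag together with \p{...} property escapes, which postdate this thread; the explicit /\p{Nd}/u pattern only stands in for what /\d/u is proposed to mean. The names asciiDigits, decimalDigits, anyNumber, samples and classify are hypothetical, chosen for illustration.

```javascript
// Hypothetical names; explicit property escapes stand in for the proposed /\d/u.
const asciiDigits = /^[0-9]+$/;       // what \d matches today
const decimalDigits = /^\p{Nd}+$/u;   // Unicode decimal digits only (the proposal)
const anyNumber = /^\p{N}+$/u;        // every Unicode number: broader than the proposal

const samples = {
  ascii: '2012',
  arabicIndic: '\u0662\u0660\u0661\u0662', // Arabic-Indic digits two, zero, one, two
  devanagari: '\u0968\u0966\u0967\u0968',  // Devanagari digits two, zero, one, two
  romanEight: '\u2167',                    // ROMAN NUMERAL EIGHT, general category Nl
};

// Report which of the three digit definitions accepts the whole string.
function classify(text) {
  return {
    ascii: asciiDigits.test(text),
    decimal: decimalDigits.test(text),
    number: anyNumber.test(text),
  };
}

for (const [name, text] of Object.entries(samples)) {
  console.log(name, classify(text));
}
```

Under these assumptions the Arabic-Indic and Devanagari samples are accepted by the \p{Nd} pattern but not by [0-9], while the Roman numeral is accepted only by \p{N}.
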
Ah, that makes much more sense.

> Comparison with other regex flavors:
>
> * \w == [A-Za-z0-9_] -- ES-current, Java, PCRE, Ruby, Python (default).
> * \w == [\p{L}\p{Nd}_] -- .NET, Perl, Python (with (?u)).
>
> * \b matches between ASCII \w\W -- ES-current, PCRE, Ruby, Python (default).
> * \b matches between Unicode \w\W -- Java, .NET, Perl, Python (with (?u)).
>
> * \d == [0-9] -- ES-current, Java, PCRE, Ruby, Python (default).
> * \d == \p{Nd} -- .NET, Perl, Python (with (?u)).
>
> * \s == [\x09-\x0D] -- Java, PCRE, Ruby, Python (default).
> * \s == [\x09-\x0D\p{Z}] -- ES-current, .NET, Perl, Python (with (?u)).
>
> Note that Java's \w and \b are inconsistent.
>
> Unicode-based \w and \b are incredibly useful, and it is very common for
> users to sometimes want them to be Unicode-based--thus, an opt-in flag
> offers the best of both worlds. In fact, I'd go so far as to say they are
> broken without Unicode support. Consider, e.g., /a\b/.test('naïve'), which
> currently returns true.
>
> Unicode-based \d would not only help international users/apps, it is also
> important because otherwise Unicode-based \w and \b would have to use
> [\p{L}0-9_] rather than [\p{L}\p{Nd}_], which breaks portability with .NET,
> Perl, Python, and Java. If, conversely, Unicode-enabled \w and \b used
> [\p{L}\p{Nd}_] but \d used [0-9], then among other consequences (including
> user confusion), [^\W\d_] could not be used equivalently to \p{L}.
>
>
>> And instead of changing the meaning of \w, which will be confusing, I
>> think that [:alnum:] as in perl would work fine.
>
>
> [:alnum:] in Perl, PCRE, Ruby, Tcl, POSIX/GNU BRE/ERE, etc. matches only
> [A-Za-z0-9]. Making it Unicode-based in ES would be confusing.

This would be pretty useless and is not true in perl. I tried the following:

    perl -e "use utf8; print 'æ' =~ /[[:alnum:]]/ . \"\n\";"

and it prints 1, indicating a match.

> It also works only within character classes.
> IMO, the POSIX-style [[:name:]] syntax is
> clumsy and confusing, not to mention backward incompatible. It would
> potentially also be confusing if ES supports only [:alnum:] without adding
> the rest of the (not-very-useful) POSIX regex class names.

The implication was to add the rest too. Seeing things like the
regexp at the bottom of this page
http://inimino.org/~inimino/blog/javascript_cset is an indication to
me that there is a demand.

>> \b is a little tougher. The Unicode rewrite would be
>> (?:(?<![:alnum:])(?=[:alnum:])|(?<=[:alnum:])(?![:alnum:])) which is
>> obviously too verbose. But if we take \b for this then the ASCII
>> version has to be written as
>> (?:(?<!\w)(?=\w)|(?<=\w)(?!\w)) which is also more than a little
>> annoying. However, often you don't need that if you have negative
>> lookbehind because you can write something like
>>
>> /(?<!\w)word(?!\w)/ // Negative look-behind for a \w and negative
>>                     // look-ahead for \w at the end.
>>
>> which isn't _too_ bad, even if it is much worse than
>>
>> /\bword\b/
>
>
> I've already started to explain above why I think Unicode-based \b is
> important and useful. I'll just add the footnote that relying on lookbehind
> would in all likelihood perform less efficiently than \b (depending on
> implementation optimizations).

OK, I'm convinced that /u should make \d, \b and \w Unicode aware. I
don't think the performance will be much different between a
lookbehind and a \b though.

-- 
Erik Corry
_______________________________________________
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss
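
For reference, a minimal sketch of the lookaround rewrite of \b discussed in this message, using the [\p{L}\p{Nd}_] word class from the comparison list in place of [:alnum:]. It assumes an engine with lookbehind and \p{...} property escapes under the u flag, neither of which existed when this thread was written. WORD, UNICODE_BOUNDARY, asciiBoundary and unicodeBoundary are hypothetical names used only for illustration.

```javascript
// Hypothetical names; the word class follows the [\p{L}\p{Nd}_] definition quoted in the thread.
const WORD = '[\\p{L}\\p{Nd}_]';

// A boundary is a position with a word character on exactly one side.
const UNICODE_BOUNDARY =
  `(?:(?<!${WORD})(?=${WORD})|(?<=${WORD})(?!${WORD}))`;

const asciiBoundary = /a\b/;                                    // ASCII-based \b
const unicodeBoundary = new RegExp(`a${UNICODE_BOUNDARY}`, 'u'); // lookaround rewrite

const word = 'na\u00EFve'; // "naïve" with a precomposed i-diaeresis

console.log(asciiBoundary.test(word));   // true: the ï is not in [A-Za-z0-9_]
console.log(unicodeBoundary.test(word)); // false: the ï is a letter, so no boundary after "a"
```

The ASCII \b reports a boundary in the middle of the word, while the Unicode-aware rewrite does not, which is the behaviour the thread argues an opt-in flag should provide.
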