Re: Full Unicode based on UTF-16 proposal

Erik Corry Sat, 17 Mar 2012 17:45:10 -0700

2012/3/18 Steven L. <steves_l...@hotmail.com>:
> Erik Corry wrote:
>
>>> I further objected because I think the /u flag would be better used as an
>>> ASCII/Unicode mode switcher for \d\w\b. My proposal for this is based on
>>> Python's re.UNICODE or (?u) flag, which does the same thing except that it
>>> also covers \s (which is already Unicode-based in ES).
>>
>>
>> I am rather skeptical about treating \d like this.  I think "any digit
>> including rods and roman characters but not decimal points/commas"
>> http://en.wikipedia.org/wiki/Numerals_in_Unicode#Counting-rod_numerals
>> would be needed much less often than the digits 0-9, so I think
>> hijacking \d for this case is a poor use of the namespace.  The \d escape
>> in perl does not cover other Unicode numerals, and even with the
>> [:name:] syntax there appears to be no way to get the Unicode
>> numerals:
>> http://search.cpan.org/~flora/perl-5.14.2/pod/perlrecharclass.pod#POSIX_Character_Classes
>>  This suggests to me that it's not very useful.
>
>
> I know from experience that it's common for Arabic speakers to want to match
> both 0-9 and Arabic-Indic digits. The same seems true for Hindi/Devanagari
> digits, and probably others. Even if it wasn't often useful, IMO this change
> is necessary for congruity with Unicode-enabled \w and \b (I'll get to
> that), and would likely never be detrimental since /u would be opt-in and
> it's easy to explicitly use [0-9] when that's what you want.
>
> For the record, I am proposing that /\d/u be equivalent to /\p{Nd}/, not
> /\p{N}/. I.e., it should not match any Unicode number, but rather any
> Unicode decimal digit (see
> http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5Cp%7BNd%7D for the
> list). And as Norbert noted, that is in fact what Perl's \d matches.


Ah, that makes much more sense.
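
To make the Nd versus N distinction concrete, here is a rough sketch. It
assumes an engine that accepts \p{...} property escapes under a u flag,
which current ES does not have, so the syntax is an assumption for
illustration and not part of the proposal text. The sample names and the
classify helper are made up for this example.

```javascript
// Sketch only: \p{Nd} (decimal digits) versus \p{N} (every kind of number).
// Assumes \p{...} property escapes are available with the u flag.
const decimalDigit = /^\p{Nd}$/u;
const anyNumber = /^\p{N}$/u;

// Hypothetical sample characters, one per case discussed in the thread.
const samples = {
  asciiDigit: '7',            // U+0037, category Nd
  arabicIndicDigit: '\u0663', // U+0663 ARABIC-INDIC DIGIT THREE, category Nd
  devanagariDigit: '\u0969',  // U+0969 DEVANAGARI DIGIT THREE, category Nd
  romanNumeral: '\u2163',     // U+2163 ROMAN NUMERAL FOUR, category Nl
  countingRod: '\u{1D360}',   // U+1D360 COUNTING ROD UNIT DIGIT ONE, category No
};

// Report whether a single character is a decimal digit and/or any number.
function classify(ch) {
  return { nd: decimalDigit.test(ch), n: anyNumber.test(ch) };
}

for (const [name, ch] of Object.entries(samples)) {
  console.log(name, classify(ch));
}
```

With \d defined as \p{Nd}, the rods and Roman numerals I was worried about
stay out, while the Arabic-Indic and Devanagari digits come in.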

> Comparison with other regex flavors:
>
> * \w == [A-Za-z0-9_] -- ES-current, Java, PCRE, Ruby, Python (default).
> * \w == [\p{L}\p{Nd}_] -- .NET, Perl, Python (with (?u)).
>
> * \b matches between ASCII \w\W -- ES-current, PCRE, Ruby, Python (default).
> * \b matches between Unicode \w\W -- Java, .NET, Perl, Python (with (?u)).
>
> * \d == [0-9] -- ES-current, Java, PCRE, Ruby, Python (default).
> * \d == \p{Nd} -- .NET, Perl, Python (with (?u)).
>
> * \s == [\x09-\x0D] -- Java, PCRE, Ruby, Python (default).
> * \s == [\x09-\x0D\p{Z}] -- ES-current, .NET, Perl, Python (with (?u)).
>
> Note that Java's \w and \b are inconsistent.
>
> Unicode-based \w and \b are incredibly useful, and it is very common for
> users to sometimes want them to be Unicode-based--thus, an opt-in flag
> offers the best of both worlds. In fact, I'd go so far as to say they are
> broken without Unicode support. Consider, e.g., /a\b/.test('naïve'), which
> currently returns true.
>
> Unicode-based \d would not only help international users/apps, it is also
> important because otherwise Unicode-based \w and \b would have to use
> [\p{L}0-9_] rather than [\p{L}\p{Nd}_], which breaks portability with .NET,
> Perl, Python, and Java. If, conversely, Unicode-enabled \w and \b used
> [\p{L}\p{Nd}_] but \d used [0-9], then among other consequences (including
> user confusion), [^\W\d_] could not be used equivalently to \p{L}.
>
>
>> And instead of changing the meaning of \w, which will be confusing, I
>> think that [:alnum:] as in perl would work fine.
>
>
> [:alnum:] in Perl, PCRE, Ruby, Tcl, POSIX/GNU BRE/ERE, etc. matches only
> [A-Za-z0-9]. Making it Unicode-based in ES would be confusing. It also works

This would be pretty useless and is not true in perl.  I tried the following:

perl -e "use utf8; print 'æ' =~ /[[:alnum:]]/ . \"\n\";"

and it prints 1, indicating a match.

> only within character classes. IMO, the POSIX-style [[:name:]] syntax is
> clumsy and confusing, not to mention backward incompatible. It would
> potentially also be confusing if ES supports only [:alnum:] without adding
> the rest of the (not-very-useful) POSIX regex class names.

The implication was to add the rest too.  Seeing things like the
regexp at the bottom of this page
http://inimino.org/~inimino/blog/javascript_cset is an indication to
me that there is a demand.

>> \b is a little tougher.  The Unicode rewrite would be
>> (?:(?<![:alnum:])(?=[:alnum:])|(?<=[:alnum:])(?![:alnum:])) which is
>> obviously too verbose.  But if we take \b for this then the ASCII
>> version has to be written as
>> (?:(?<!\w)(?=\w)|(?<=\w)(?!\w)) which is also more than a little
>> annoying.  However, often you don't need that if you have negative
>> lookbehind because you can write something
>> like
>>
>> /(?<!\w)word(?!\w)/    // Negative look-behind for a \w and negative
>> look-ahead for \w at the end.
>>
>> which isn't _too_ bad, even if it is much worse than
>>
>> /\bword\b/
>
>
> I've already started to explain above why I think Unicode-based \b is
> important and useful. I'll just add the footnote that relying on lookbehind
> would in all likelihood perform less efficiently than \b (depending on
> implementation optimizations).

OK, I'm convinced that /u should make \d, \b and \w Unicode aware.  I
don't think the performance will be much different between a
lookbehind and a \b though.
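
For reference, a small sketch of the two spellings side by side. It
assumes an engine with lookbehind and with \p{...} escapes under a u
flag, neither of which current ES has, so treat it as illustration only.
The variable names and the sameResult helper are invented for this
example, and the [\p{L}\p{Nd}_] class stands in for what \w would mean
under the proposed /u.

```javascript
// Sketch only: \b around a literal versus the lookaround spelling.
const withBoundary = /\bword\b/;
const withLookaround = /(?<!\w)word(?!\w)/;

// ASCII boundary after "a" versus a hypothetical Unicode-aware one,
// using [\p{L}\p{Nd}_] as the word class proposed for /u.
const asciiBoundaryAfterA = /a\b/;
const unicodeBoundaryAfterA = /a(?![\p{L}\p{Nd}_])/u;

// True when both spellings give the same answer for the input.
function sameResult(input) {
  return withBoundary.test(input) === withLookaround.test(input);
}

const inputs = ['a word here', 'swordfish', 'word', 'words', 'password'];
for (const input of inputs) {
  console.log(JSON.stringify(input), withBoundary.test(input), sameResult(input));
}

// 'naïve' written with an escape so the source encoding does not matter.
const naive = 'na\u00EFve';
console.log(asciiBoundaryAfterA.test(naive));   // ASCII \b sees a boundary before the ï
console.log(unicodeBoundaryAfterA.test(naive)); // Unicode-aware class does not
```

The lookaround form only stands in for \b here because the literal starts
and ends with a word character; the general rewrite is the longer
alternation quoted above.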

-- 
Erik Corry
_______________________________________________
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss
