[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

Vlastimil Brom Mon, 20 Sep 2010 16:56:13 -0700

Vlastimil Brom <vlastimil.b...@gmail.com> added the comment:

I like the idea of a general "new" flag that introduces the reasonable, 
backwards-incompatible behaviour; one doesn't have to remember a list of 
non-standard flags to get these features.


While I recognise that the module probably can't work correctly with wide 
unicode characters on a narrow python build (py 2.7, win XP in this case), I 
noticed a difference from re in this regard (it might be due to the absence of 
the wide unicode literal in the latter).

re.findall(u"\\U00010337", u"a\U00010337bc")
[]
re.findall(u"(?i)\\U00010337", u"a\U00010337bc")
[]
regex.findall(u"\\U00010337", u"a\U00010337bc")
[]
regex.findall(u"(?i)\\U00010337", u"a\U00010337bc")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Python27\lib\regex.py", line 203, in findall
    return _compile(pattern, flags).findall(string, pos, endpos,
  File "C:\Python27\lib\regex.py", line 310, in _compile
    parsed = parsed.optimise(info)
  File "C:\Python27\lib\_regex_core.py", line 1735, in optimise
    if self.is_case_sensitive(info):
  File "C:\Python27\lib\_regex_core.py", line 1727, in is_case_sensitive
    return char_type(self.value).lower() != char_type(self.value).upper()
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

I.e. re fails to match this pattern (as it actually looks for "U00010337"); 
regex doesn't recognise the wide unicode character as a surrogate pair either, 
but it also raises an error from the narrow unichr. I am not sure whether or 
how it should be fixed, but the difference based on the i-flag seems unusual.

Of course it would be nice if surrogate pairs were interpreted, but I can 
imagine that it would open a whole can of worms, as this is not thoroughly 
supported in the builtin unicode type either (len, indices, slicing).

I am trying to make wide unicode characters somehow usable in my app, mainly 
with hacks like an extended unichr:
("\U"+hex(67)[2:].zfill(8)).decode("unicode-escape")
or likewise for ord:
surrog_ord = (ord(first) - 0xD800) * 0x400 + (ord(second) - 0xDC00) + 0x10000
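
The ord arithmetic above, and its inverse, can be sketched on plain integers so 
that it does not depend on how the build handles unichr. This is only an 
illustration: the function and constant names are hypothetical and are not part 
of re or regex.

```python
# Sketch only: these helper names are hypothetical, not part of re or regex.
HIGH_START = 0xD800            # first high (lead) surrogate
LOW_START = 0xDC00             # first low (trail) surrogate
SUPPLEMENTARY_START = 0x10000  # first code point outside the BMP


def surrogates_to_codepoint(high, low):
    """Combine two UTF-16 code units (ints) into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return (high - HIGH_START) * 0x400 + (low - LOW_START) + SUPPLEMENTARY_START


def codepoint_to_surrogates(codepoint):
    """Split a code point above U+FFFF into (high, low) code units."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = codepoint - SUPPLEMENTARY_START
    # Upper 10 bits go to the high surrogate, lower 10 bits to the low one.
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)
```

For U+10337 this gives the pair (0xD800, 0xDF37), and feeding that pair back in 
returns 0x10337.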

Actually, using regex, one can work around some of these limitations of len, 
index or slice by using a list form of the string containing surrogates:

regex.findall(ur"(?s)(?:\p{inHighSurrogates}\p{inLowSurrogates})|.", u"ab𐌷𐌸𐌹cd")
[u'a', u'b', u'\U00010337', u'\U00010338', u'\U00010339', u'c', u'd']

but apparently things like wide unicode literals or character sets (even 
extending the shorthands like \w etc.) are much more complicated.
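
The list form above could be wrapped into small len and slice helpers. The 
sketch below is an assumption-laden illustration: it uses the standard re module 
with explicit surrogate ranges instead of the \p{inHighSurrogates} and 
\p{inLowSurrogates} properties, and the helper names are made up for this 
example. On a wide build a real supplementary character is already one unit and 
is simply matched by the "." branch.

```python
import re

# Explicit ranges stand in for \p{inHighSurrogates}\p{inLowSurrogates};
# a surrogate pair is tried first, any other single unit second.
_CHARACTER = re.compile(u"[\ud800-\udbff][\udc00-\udfff]|.", re.DOTALL)


def split_characters(text):
    """Return a list of characters, keeping surrogate pairs together."""
    return _CHARACTER.findall(text)


def char_len(text):
    """Length counted in characters rather than UTF-16 code units."""
    return len(split_characters(text))


def char_slice(text, start, stop):
    """Slice by character position without cutting a pair in half."""
    return u"".join(split_characters(text)[start:stop])
```

With u"ab\ud800\udf37cd" as input, char_len counts five characters and 
char_slice(text, 2, 3) returns the whole pair rather than half of it.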

regards,
   vbr

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue2636>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
