[Python-3000] Regular expressions, py3k and unicode

Antoine Pitrou Sat, 28 Jun 2008 13:45:55 -0700

Hello,

Several posters (including a certain GvR) on the bug tracker (*) have been
baffled by an apparent bug where the re.IGNORECASE flag didn't give
case-insensitive matching for non-ASCII characters. It turns out that, even
though the pattern was a str object and Py3k is supposed to be
Unicode-friendly, you still need to supply the re.UNICODE flag if you want the
re module to use Unicode-aware case-insensitive matching.
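
To make the difference concrete, here is a minimal sketch of the two behaviours.
One assumption: on interpreters where str patterns already follow Unicode rules
by default, the re.ASCII flag (not mentioned above) is used to reproduce the
ASCII-only case folding described in the tracker.

```python
import re

pattern = 'Á'   # U+00C1, LATIN CAPITAL LETTER A WITH ACUTE
subject = 'á'   # U+00E1, LATIN SMALL LETTER A WITH ACUTE

# ASCII-only case folding: only a-z / A-Z are treated as cased letters.
# Assumption: re.ASCII is what forces this mode on interpreters where
# str patterns are Unicode-aware by default.
ascii_only = re.match(pattern, subject, re.IGNORECASE | re.ASCII)

# Unicode-aware case folding, requested explicitly with re.UNICODE.
unicode_aware = re.match(pattern, subject, re.IGNORECASE | re.UNICODE)

print("ASCII-only folding matched:", ascii_only is not None)
print("Unicode-aware folding matched:", unicode_aware is not None)
```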


Wouldn't it be more natural for the re.UNICODE flag to be implied by default,
at least when the pattern is a str object rather than a bytes object?

(*) http://bugs.python.org/issue2834


Another question in the same vein: is it normal that we can match a bytes object
with a str pattern and vice versa?

 import re

 # str pattern matched against a bytes subject
 pat = re.compile('Á', re.IGNORECASE | re.UNICODE)
 pat.match('á'.encode('latin1'))
 # gives <_sre.SRE_Match object at 0xb7c66c60>

 # bytes pattern matched against a str subject
 pat = re.compile('Á'.encode('latin1'), re.IGNORECASE | re.UNICODE)
 pat.match('á')
 # gives <_sre.SRE_Match object at 0xb7c66c60>
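
As a rough sketch of the stricter alternative this question hints at, a small
wrapper could refuse to mix the two types. The name checked_match is
hypothetical and not part of the re module, and the sketch assumes the pattern
is passed as a plain str or bytes object rather than a compiled pattern.

```python
import re

def checked_match(pattern, subject, flags=0):
    """Hypothetical helper (not part of re): only match str against str
    or bytes against bytes, and raise TypeError otherwise."""
    if isinstance(pattern, str) != isinstance(subject, str):
        raise TypeError(
            "cannot mix %s pattern with %s subject"
            % (type(pattern).__name__, type(subject).__name__)
        )
    return re.match(pattern, subject, flags)

# str pattern against str subject: allowed.
print(checked_match('Á', 'á', re.IGNORECASE | re.UNICODE) is not None)

# str pattern against bytes subject: rejected before re is called.
try:
    checked_match('Á', 'á'.encode('latin1'), re.IGNORECASE | re.UNICODE)
except TypeError as exc:
    print("rejected:", exc)
```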

Regards

Antoine.


