On Sat, Jun 28, 2008 at 1:45 PM, Antoine Pitrou <[EMAIL PROTECTED]> wrote:
> Several posters (including a certain GvR) in the bug tracker (*) have been
> baffled by an apparent bug where the re.IGNORECASE flag didn't imply
> case-insensitivity for non-ASCII characters. It turns out that, although the
> pattern was a string object and although Py3k is supposed to be
> unicode-friendly, you still need to supply the re.UNICODE flag if you want the
> re module to use unicode-aware case-insensitive matching.
>
> Wouldn't it be more natural that, at least when the pattern is a str object
> rather than a bytes object, the re.UNICODE flag be implied by default?

+1

> (*) http://bugs.python.org/issue2834
>
>
> Another question in the same vein: is it normal that we can match a bytes
> object with an str pattern and vice-versa?
>
> pat = re.compile('Á', re.IGNORECASE | re.UNICODE)
> pat.match('á'.encode('latin1'))
> # gives <_sre.SRE_Match object at 0xb7c66c60>
>
> pat = re.compile('Á'.encode('latin1'), re.IGNORECASE | re.UNICODE)
> pat.match('á')
> # gives <_sre.SRE_Match object at 0xb7c66c60>

This made sense in 2.x where text could be represented by str or unicode. It
makes a lot less sense now, and I suspect it can cause widespread confusion.
Forbidding this would also be another step in the direction we're already
taking of never allowing implicit conversion between str and bytes.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000
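
For anyone trying the two cases from this thread on a current Python 3
interpreter, here is a minimal sketch of how they behave there. It assumes
released Python 3 semantics, which postdate this message: str patterns are
Unicode-aware by default, re.ASCII opts out of that, and matching across
str and bytes raises TypeError. The helper name mixing_is_rejected is made
up for illustration and is not part of the re module.

```python
import re

# str pattern: case-insensitive matching is Unicode-aware without re.UNICODE.
pat = re.compile('Á', re.IGNORECASE)
unicode_match = pat.match('á') is not None

# re.ASCII limits case-insensitive matching to ASCII letters,
# so 'Á' and 'á' are treated as unrelated characters.
ascii_match = re.compile('Á', re.IGNORECASE | re.ASCII).match('á') is not None


def mixing_is_rejected(pattern, subject):
    """Return True if matching `subject` against `pattern` raises TypeError.

    Hypothetical helper, used only to probe str/bytes mixing.
    """
    try:
        re.compile(pattern, re.IGNORECASE).match(subject)
    except TypeError:
        return True
    return False


# The two combinations quoted in the thread: str pattern on bytes,
# and bytes pattern on str.
str_on_bytes = mixing_is_rejected('Á', 'á'.encode('latin1'))
bytes_on_str = mixing_is_rejected('Á'.encode('latin1'), 'á')

print(unicode_match, ascii_match, str_on_bytes, bytes_on_str)
```

Under those assumptions, the first case matches without any extra flag, and
both mixed str/bytes calls are refused rather than silently converted.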