New submission from Vlastimil Brom <vlastimil.b...@gmail.com>:

Hi,
I just noticed some behaviour of the re.LOCALE flag that I can't understand. I 
first reported it against the new regex implementation, which, however, only 
mimics the standard library re in this case:
http://code.google.com/p/mrab-regex-hg/issues/detail?id=6
I couldn't find anything relevant in the tracker other than some older, 
already fixed issues; I'm sorry if I missed something.
I thought the search pattern (?L)\w would match any of the respective 
string.letters for the current locale (and possibly also [0-9_]).

However, the locale doesn't seem to be reflected in the expected way.

>>> unicode_BMP = " " + "".join(unichr(i)for i in range(1, 0x10000))
>>> import locale
>>> locale.setlocale(locale.LC_ALL, "")
'Czech_Czech Republic.1250'
>>> import re
>>> print("".join(re.findall(r"(?L)\w", unicode_BMP)))
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyzŠŒŽšœžŸ£¥ª¯³µ¹º¼¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ
>>> locale.setlocale(locale.LC_ALL, "Greek")
'Greek_Greece.1253'
>>> print("".join(re.findall(r"(?L)\w", unicode_BMP)))
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyzƒ¢²³µ¸¹º¼¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþ
>>> 

>>> import string
>>> locale.setlocale(locale.LC_ALL, "")
'Czech_Czech Republic.1250'
>>> print unicode(string.letters, "windows-1250")
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzŠŚŤŽŹšśťžźŁĄŞŻłµąşĽľżŔÁÂĂÄĹĆÇČÉĘËĚÍÎĎĐŃŇÓÔŐÖŘŮÚŰÜÝŢßŕáâăäĺćçčéęëěíîďđńňóôőöřůúűüýţ
>>> locale.setlocale(locale.LC_ALL, "Greek")
'Greek_Greece.1253'
>>> print unicode(string.letters, "windows-1253")
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzƒΆµΈΉΊΌΎΏΐΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩΪΫάέήίΰαβγδεζηθικλμνξοπρςστυφχψωϊϋόύώ
>>> 

It seems that the letter set nearest to the result of the re/regex LOCALE flag 
might be ASCII or the US locale:

>>> locale.setlocale(locale.LC_ALL, "US")
'English_United States.1252'
>>> print unicode(string.letters, "windows-1252")
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzƒŠŒŽšœžŸªµºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ
>>> 

However, there are some differences too, namely in the range between z and À:
re (?L)\w, Czech locale:
zŠŒŽšœžŸ£¥ª¯³µ¹º¼¾¿À
re (?L)\w, Greek locale:
zƒ¢²³µ¸¹º¼¾¿À
string.letters, US locale:
zƒŠŒŽšœžŸªµºÀ
(as displayed in the Tkinter IDLE shell)
(in either case, there are some items one wouldn't consider usual word 
characters, cf. ¿)

I am not sure whether other issues are involved (such as encoding or display 
peculiarities in Tkinter), but re matching with the LOCALE flag doesn't 
reflect locale.setlocale(...) in a transparent way.
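
To take display effects such as the Tkinter ones out of the picture, the 
matched characters can be listed by code point instead of by glyph. This is 
only a minimal sketch in Python 3; code_points is a hypothetical helper name, 
not part of re or locale, and the sample string is just three characters taken 
from the comparison above.

```python
def code_points(text):
    """Return the code point of every character in `text` as 'U+XXXX'."""
    return ["U+%04X" % ord(ch) for ch in text]

# Sample input only: 'z', INVERTED QUESTION MARK, LATIN CAPITAL A WITH GRAVE.
# In a real session one would pass "".join(re.findall(...)) instead.
sample = "z\u00bf\u00c0"
print(" ".join(code_points(sample)))
```

Comparing such listings between the re results and string.letters would show 
whether two outputs really contain the same characters or only look alike in 
the shell.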

Is it supposed to work this way, and is there another way to get the 
locale-aware matching one might expect according to:
http://docs.python.org/library/re.html#re.LOCALE
"""
Make \w, \W, \b, \B, \s and \S dependent on the current locale.
"""


Using Python 2.7.1, 32-bit; Windows 7 Home Premium 64-bit, Czech.

In Python 3.1.3 as well as 3.2 the result is the same (with appropriately 
modified code): ...
>>> import locale
>>> locale.setlocale(locale.LC_ALL, "")
'Czech_Czech Republic.1250'
>>> import re
>>> print("".join(re.findall(r"(?L)\w", unicode_BMP)))
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyzŠŒŽšœžŸ£¥ª¯³µ¹º¼¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ
>>> 

However, string.letters is no longer available in Python 3, so the same 
comparison cannot be made there.
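
A rough stand-in for string.letters could be built in Python 3 by asking re 
which of the 256 byte values count as letters under the current locale. This 
is a sketch under assumptions: locale_letters is a hypothetical name, and 
"word characters minus digits and underscore" is only an approximation of what 
Python 2 put into string.letters.

```python
import re

def locale_letters():
    """Approximate Python 2's string.letters for the current locale, as bytes."""
    all_bytes = bytes(range(256))
    # [^\W\d_] means: word characters, excluding digits and the underscore.
    # The bytes pattern with (?L) classifies each byte via the current locale.
    return b"".join(re.findall(rb"(?L)[^\W\d_]", all_bytes))

letters = locale_letters()
# Show the byte values in hex so the result does not depend on the display.
print(" ".join("%02X" % b for b in letters))
```

After a locale.setlocale(...) call, the returned bytes could be decoded with 
the matching code page (for example "cp1250") to compare against the Python 2 
output shown above.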

Regards,
    Vlastimil Brom

----------
components: Regular Expressions, Unicode
messages: 132826
nosy: vbr
priority: normal
severity: normal
status: open
title: re.LOCALE doesn't reflect locale.setlocale(...)
type: behavior
versions: Python 2.7, Python 3.1, Python 3.2

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue11744>
_______________________________________