Vlastimil Brom <[email protected]> added the comment:
I just noticed some rather strange behaviour when matching character sets or
alternations that mix some more "advanced" unicode characters with "simpler"
ones in the same search pattern. The "advanced" ones seem to be ignored and are
not matched, whereas the original re engine matches all of them. (win XPh
SP3 Czech, Python 2.7; regex issue2636-20100414)
>>> print u"".join(regex.findall(u".", u"eèéêëēěė"))
eèéêëēěė
>>> print u"".join(regex.findall(u"[eèéêëēěė]", u"eèéêëēěė"))
eèéêëē
>>> print u"".join(regex.findall(u"e|è|é|ê|ë|ē|ě|ė", u"eèéêëēěė"))
eèéêëē
>>> print u"".join(re.findall(u"[eèéêëēěė]", u"eèéêëēěė"))
eèéêëēěė
>>> print u"".join(re.findall(u"e|è|é|ê|ë|ē|ě|ė", u"eèéêëēěė"))
eèéêëēěė
Even stranger, if the pattern contains only these "higher" unicode characters,
everything works ok:
>>> print u"".join(regex.findall(u"ē|ě|ė", u"eèéêëēěė"))
ēěė
>>> print u"".join(regex.findall(u"[ēěė]", u"eèéêëēěė"))
ēěė
The characters in question are some accented Latin letters (here in ascending
codepoints), but characters from other scripts can be affected as well.
>>> print regex.findall(u".", u"eèéêëēěė")
[u'e', u'\xe8', u'\xe9', u'\xea', u'\xeb', u'\u0113', u'\u011b', u'\u0117']
The threshold isn't obvious to me. At first I thought that the characters
represented as unicode escapes (\uXXXX) were problematic, whereas those with
hexadecimal escapes (\xXX) were ok; however, ē (u'\u0113') seems ok too.
(Python 3.1 behaves identically:
>>> regex.findall("[eèéêëēěė]", "eèéêëēěė")
['e', 'è', 'é', 'ê', 'ë', 'ē']
>>> regex.findall("[ēěė]", "eèéêëēěė")
['ē', 'ě', 'ė']
)
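To narrow down the threshold, something like the following might help. It is
only a minimal sketch in Python 3 syntax: the helper name unmatched() and the
SAMPLE constant are made up for illustration, and the character class is built
by plain concatenation, which assumes the sample contains no set
metacharacters. It lists every character of the sample, with its codepoint,
that a given findall function fails to match.

```python
import re

SAMPLE = "eèéêëēěė"  # the same test string as in the sessions above


def unmatched(findall, chars=SAMPLE):
    """Return (char, hex codepoint) pairs that the engine fails to match.

    `findall` is any function with the signature of re.findall.
    Assumes `chars` holds no character-class metacharacters such as ']' or '^'.
    """
    char_class = "[" + chars + "]"
    found = set(findall(char_class, chars))
    return [(c, hex(ord(c))) for c in chars if c not in found]


# The stdlib engine matches every character, so nothing is reported.
print(unmatched(re.findall))

# The third-party regex module is optional; skip it when it is not installed.
try:
    import regex
except ImportError:
    regex = None

if regex is not None:
    print(unmatched(regex.findall))
```

With the build reported here, the second call would presumably list ě and ė
together with their codepoints; an empty list means nothing was skipped.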
vbr
----------
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue2636>
_______________________________________