On 28 July 2013 10:12, Bináris <[email protected]> wrote:
> Hi,
>
> \b in a regex treats letter "é" (which is a correct Hungarian letter) as a
> word boundary.
> Can I prevent this behaviour with some kind of settings?
>
Simple ASCII:
>>> re.findall(r".+?\b", "bla bla bla")
['bla', ' ', 'bla', ' ', 'bla']

Incorrect:
- no re.UNICODE flag, bytestring
>>> re.findall(r".+?\b", "bléa bléa bléa")
['bl', '\xc3\xa9', 'a', ' ', 'bl', '\xc3\xa9', 'a', ' ', 'bl', '\xc3\xa9', 'a']
- no re.UNICODE flag, unicode string
>>> re.findall(r".+?\b", u"bléa bléa bléa")
[u'bl', u'\xe9', u'a', u' ', u'bl', u'\xe9', u'a', u' ', u'bl', u'\xe9', u'a']
- re.UNICODE flag, bytestring
>>> re.findall(r".+?\b", "bléa bléa bléa", re.UNICODE)
['bl\xc3', '\xa9', 'a', ' ', 'bl\xc3', '\xa9', 'a', ' ', 'bl\xc3', '\xa9', 'a']

Correct:
- both re.UNICODE and using a unicode string
>>> re.findall(r".+?\b", u"bléa bléa bléa", re.UNICODE)
[u'bl\xe9a', u' ', u'bl\xe9a', u' ', u'bl\xe9a']

\b is defined in terms of \w, and in Python 2 \w only matches [a-zA-Z0-9_]
unless re.UNICODE is set, so you need both the flag and a unicode string.

Hope this helps!
Merlijn
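P.S. The session above is Python 2. As a rough sketch of how the same check might look on Python 3 (an assumption on my part, since the original question was about Python 2): there, str patterns are Unicode-aware by default, re.ASCII restores the ASCII-only behaviour, and bytes patterns are always ASCII-only. The names `text` and `pattern` are just illustrative.

```python
import re

text = "bléa bléa bléa"
pattern = r".+?\b"

# Python 3 str patterns are Unicode-aware by default, so "é" counts as a
# word character and \b no longer splits the word.
unicode_aware = re.findall(pattern, text)
print(unicode_aware)  # ['bléa', ' ', 'bléa', ' ', 'bléa']

# re.ASCII restricts \w (and therefore \b) to [a-zA-Z0-9_], which
# reproduces the "incorrect" split around "é".
ascii_only = re.findall(pattern, text, re.ASCII)
print(ascii_only)  # ['bl', 'é', 'a', ' ', 'bl', 'é', 'a', ' ', 'bl', 'é', 'a']

# Bytes patterns are ASCII-only, so the two UTF-8 bytes of "é" are treated
# as non-word bytes. Decode to str before matching to avoid this.
as_bytes = re.findall(pattern.encode("ascii"), text.encode("utf-8"))
print(as_bytes)
```

So on Python 3 the fix is mostly to make sure you match against str, not bytes, and not to pass re.ASCII.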
_______________________________________________
Pywikipedia-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l
