Re: unicode categories -- regex

Martin v. Löwis Sat, 22 Sep 2007 10:34:38 -0700

> So how do i include this information in regular pattern search? Any
> ideas?


At the moment, you have to generate a character class for this yourself,
e.g.

py> import re, sys, unicodedata
py> chars = [unichr(i) for i in range(sys.maxunicode + 1)]
py> chars = [c for c in chars if unicodedata.category(c)=='Po']
py> expr = u'[\\' + u'\\'.join(chars)+"]"
py> expr = re.compile(expr)
py> expr.match(u"#")
<_sre.SRE_Match object at 0xb7ce1d40>
py> expr.match(u"a")
py> expr.match(u"\u05be")
<_sre.SRE_Match object at 0xb7ce1d78>

Creating this expression is fairly expensive, since it scans every code
point. Once compiled, however, it has a compact representation in
memory, and matching it is efficient.
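
Because the cost is all in building the class, it may be worth building
it once per category and reusing the compiled object. The following is a
rough Python 3 sketch of that idea, not part of the session above:
category_pattern is a hypothetical helper name, and collapsing
consecutive code points into ranges is an added assumption meant only to
keep the pattern string short.

```python
import functools
import re
import sys
import unicodedata


@functools.lru_cache(maxsize=None)
def category_pattern(category):
    """Compile a character class matching every code point in `category`.

    Hypothetical helper: the result is cached, so the expensive scan
    runs only once per category name.
    """
    # Scan the whole code point range; this is the expensive step.
    codepoints = [i for i in range(sys.maxunicode + 1)
                  if unicodedata.category(chr(i)) == category]
    if not codepoints:
        raise ValueError("no code points in category %r" % category)

    # Collapse runs of consecutive code points into (start, end) pairs.
    ranges = []
    start = prev = codepoints[0]
    for cp in codepoints[1:]:
        if cp != prev + 1:
            ranges.append((start, prev))
            start = cp
        prev = cp
    ranges.append((start, prev))

    # Escape each endpoint so punctuation such as "\\" or "-" stays literal.
    parts = []
    for lo, hi in ranges:
        if lo == hi:
            parts.append(re.escape(chr(lo)))
        else:
            parts.append(re.escape(chr(lo)) + "-" + re.escape(chr(hi)))
    return re.compile("[" + "".join(parts) + "]")


po = category_pattern("Po")
print(po.match("#") is not None)
print(po.match("a") is not None)
```

Category assignments can change between Unicode versions, so exactly
which characters match depends on the Unicode database shipped with the
interpreter in use.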

Contributions to support categories directly in re are welcome. Look
at the relevant Unicode recommendation (UTS #18, Unicode Regular
Expressions) on how to do that.

HTH,
Martin
-- 
http://mail.python.org/mailman/listinfo/python-list
