Steve Moran <s...@uw.edu> added the comment:

Forgive me if this is just a stupid oversight.
I'm a linguist and use UTF-8 for "special" characters in linguistic data. This often includes sequences of several Unicode characters that are composed as one grapheme. For example, í̵ (if it's displaying correctly for you) is LATIN SMALL LETTER I WITH STROKE \u0268 combined with COMBINING ACUTE ACCENT \u0301. E.g. a word I'm parsing:

jí̵-e-gɨ

I was pretty excited to find out that this regex library implements the grapheme match \X (equivalent to \P{M}\p{M}*). For the above example I needed to evaluate which sequences of characters can occur across syllable boundaries (here the hyphen "-"), so I'm aiming for:

í̵-e
e-g

Just when I thought regex couldn't get any better, you awesome developers implemented an overlapped=True flag for findall and finditer.

Python 3.1.2 (r312:79147, May 19 2010, 11:50:28)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-46)] on linux2
>>> import regex
>>> s = "jí̵-e-gɨ"
>>> s
'jí̵-e-gɨ'
>>> m = regex.compile("(\X)(-)(\X)")
>>> m.findall(s, overlapped=False)
[('í̵', '-', 'e')]

But these results are weird to me:

>>> m.findall(s, overlapped=True)
[('í̵', '-', 'e'), ('í̵', '-', 'e'), ('e', '-', 'g'), ('e', '-', 'g'), ('e', '-', 'g')]

Why the extra matches? At first I figured this had something to do with the overlapping match of the grapheme, since it's multiple characters. So I tried it without the grapheme match:

>>> m = regex.compile("(.)(-)(.)")
>>> s2 = "a-b-cd-e-f"
>>> m.findall(s2, overlapped=False)
[('a', '-', 'b'), ('d', '-', 'e')]

That's right. But with overlap...

>>> m.findall(s2, overlapped=True)
[('a', '-', 'b'), ('b', '-', 'c'), ('b', '-', 'c'), ('d', '-', 'e'), ('d', '-', 'e'), ('d', '-', 'e'), ('e', '-', 'f'), ('e', '-', 'f')]

Those 'extra' matches are confusing me. 2x b-c, 3x d-e, 2x e-f? Or even more simply:

>>> s2 = "a-b-c"
>>> m.findall(s2, overlapped=False)
[('a', '-', 'b')]
>>> m.findall(s2, overlapped=True)
[('a', '-', 'b'), ('b', '-', 'c'), ('b', '-', 'c')]

Thanks!
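For reference, the result being aimed for (each pair across a hyphen reported exactly once, overlaps included) can be sketched with the standard library alone. This is only an illustration of the expected output, not how the regex module works internally. The helper names grapheme_clusters and boundary_pairs are made up for this sketch, and grouping by general category "M" only approximates \P{M}\p{M}*; it is not a full implementation of \X.

```python
import unicodedata


def grapheme_clusters(text):
    """Split text into base-plus-marks clusters, approximating \\P{M}\\p{M}*."""
    clusters = []
    for ch in text:
        # General categories Mn, Mc and Me all start with "M".
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # attach the mark to the preceding base
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


def boundary_pairs(text, boundary="-"):
    """Return each (left, boundary, right) triple once, overlaps included."""
    c = grapheme_clusters(text)
    return [
        (c[i - 1], c[i], c[i + 1])
        for i in range(1, len(c) - 1)
        if c[i] == boundary
    ]


# The word from the report, written with explicit escapes:
# U+0268 (i with stroke) followed by U+0301 (combining acute accent).
word = "j\u0268\u0301-e-g\u0268"
print(boundary_pairs(word))
print(boundary_pairs("a-b-cd-e-f"))
```

Because the clusters are built first and the hyphens are then visited one by one, each boundary yields at most one triple, which is the deduplicated list the overlapped=True calls above were expected to return.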
----------
nosy: +stiv
type: feature request -> behavior

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue2636>
_______________________________________