Steve Moran <s...@uw.edu> added the comment:

Forgive me if this is just a stupid oversight. 

I'm a linguist and use UTF-8 for "special" characters for linguistics data. 
This often includes sequences of several Unicode code points that are composed as 
one grapheme. For example the í̵ (if it's displaying correctly for you) is a 
LATIN SMALL LETTER I WITH STROKE \u0268 combined with COMBINING ACUTE ACCENT 
\u0301. E.g. a word I'm parsing:

jí̵-e-gɨ
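
To make the composition concrete, here is a minimal sketch using only the
standard unicodedata module. It spells the grapheme out with explicit escapes
(the variable name `grapheme` is just for illustration) and shows that it is
two code points, the second of which is a combining mark:

```python
import unicodedata

# The grapheme from the word above, written as explicit code points.
grapheme = "\u0268\u0301"

# One visible character, but two code points.
length = len(grapheme)

# Official Unicode names of each code point.
names = [unicodedata.name(ch) for ch in grapheme]

# General categories: "Ll" is a lowercase letter, "Mn" a nonspacing mark.
categories = [unicodedata.category(ch) for ch in grapheme]

print(length, names, categories)
```

This is why a plain "." in a pattern sees the base letter and the accent as
separate characters, while \X is meant to treat them as one unit.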

I was pretty excited to find out that this regex library implements the 
grapheme match \X (equivalent to \P{M}\p{M}*). For the above example I needed 
to evaluate which sequences of characters can occur across syllable boundaries 
(here the hyphen "-"), so I'm aiming for:

í̵-e
e-g
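
As a sketch of the result I'm after, independent of any regex engine: the code
below groups each non-mark code point with the marks that follow it (the same
idea as \P{M}\p{M}*, approximated here with unicodedata categories) and then
collects the units on either side of each hyphen. The helper names `clusters`
and `boundary_pairs` are hypothetical, not part of the regex library.

```python
import unicodedata

def clusters(text):
    """Group each code point with the combining marks that follow it."""
    units = []
    for ch in text:
        # Categories Mn, Mc and Me all start with "M" (marks).
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch
        else:
            units.append(ch)
    return units

def boundary_pairs(word, sep="-"):
    """Return (left, right) units around every separator inside the word."""
    units = clusters(word)
    return [
        (units[i - 1], units[i + 1])
        for i, unit in enumerate(units)
        if unit == sep and 0 < i < len(units) - 1
    ]

# The example word, with the grapheme written as explicit code points.
word = "j\u0268\u0301-e-g\u0268"
pairs = boundary_pairs(word)
print(pairs)
```

Each boundary appears exactly once in `pairs`, which is what I would expect an
overlapped search to give as well.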

Just when I thought regex couldn't get any better, you awesome developers 
implemented an overlapped=True flag for findall and finditer. 

Python 3.1.2 (r312:79147, May 19 2010, 11:50:28) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-46)] on linux2
>>> import regex
>>> s = "jí̵-e-gɨ"
>>> s
'jí̵-e-gɨ'
>>> m = regex.compile("(\X)(-)(\X)")
>>> m.findall(s, overlapped=False)
[('í̵', '-', 'e')]

But these results are weird to me:

>>> m.findall(s, overlapped=True)
[('í̵', '-', 'e'), ('í̵', '-', 'e'), ('e', '-', 'g'), ('e', '-', 'g'), ('e', '-', 'g')]

Why the extra matches? At first I figured this had something to do with the 
overlapping match of the grapheme, since it's multiple characters. So I tried 
it without the grapheme match:

>>> m = regex.compile("(.)(-)(.)")
>>> s2 = "a-b-cd-e-f"
>>> m.findall(s2, overlapped=False)
[('a', '-', 'b'), ('d', '-', 'e')]

That's right. But with overlap...

>>> m.findall(s2, overlapped=True)
[('a', '-', 'b'), ('b', '-', 'c'), ('b', '-', 'c'), ('d', '-', 'e'), ('d', '-', 'e'), ('d', '-', 'e'), ('e', '-', 'f'), ('e', '-', 'f')]

Those 'extra' matches are confusing me. 2x b-c, 3x d-e, 2x e-f? Or even more 
simply:

>>> s2 = "a-b-c"
>>> m.findall(s2, overlapped=False)
[('a', '-', 'b')]
>>> m.findall(s2, overlapped=True)
[('a', '-', 'b'), ('b', '-', 'c'), ('b', '-', 'c')]
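
For comparison, here is a minimal sketch of the output I expected, using the
standard re module rather than regex. It relies on the usual lookahead trick
(the groups sit inside a zero-width (?=...), so the scan advances one character
at a time); this is a different technique from the overlapped=True flag and
says nothing about how that flag is implemented. The name `overlapping` is
just for illustration.

```python
import re

# Capture groups inside a lookahead: the match itself is empty, so the
# search restarts at the next character and matches may overlap.
overlapping = re.compile(r"(?=(.)(-)(.))")

short = overlapping.findall("a-b-c")
# [('a', '-', 'b'), ('b', '-', 'c')]

longer = overlapping.findall("a-b-cd-e-f")
# [('a', '-', 'b'), ('b', '-', 'c'), ('d', '-', 'e'), ('e', '-', 'f')]

print(short)
print(longer)
```

Here every overlapping match is reported once, with no repeats.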

Thanks!

----------
nosy: +stiv
type: feature request -> behavior

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue2636>
_______________________________________