[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

Alec Koumjian Sun, 10 Jul 2011 22:21:00 -0700

Alec Koumjian <[email protected]> added the comment:

I apologize if this is the wrong place for this message. I did not see the link 
to a separate list.


First let me explain what I am trying to accomplish. I would like to be able to 
take an unknown regular expression that contains both named and unnamed groups 
and tag their location in the original string where a match was found. Take the 
following deliberately simple example:

>>> a_string = r"This is a demo sentence."
>>> pattern = r"(?<a_thing>\w+) (\w+) (?<another_thing>\w+)"
>>> m = regex.search(pattern, a_string)

What I want is a way to insert named/numbered tags into the original string, so 
that it looks something like this:

r"<a_thing>This</a_thing> <2>is</2> <another_thing>a</another_thing> demo 
sentence."

The syntax doesn't have to be exactly like that, but you get the idea. I have 
inserted the names and/or indices of the groups into the original string, 
around the span that the groups occupy. 

This task is exceedingly difficult with the current implementation, unless I am 
missing something obvious. We could call the groups by index, the groups as a 
tuple, or the groupdict:

>>> m.group(1)
'This'
>>> m.groups()
('This', 'is', 'a')
>>> m.groupdict()
{'another_thing': 'a', 'a_thing': 'This'}

If all I wanted was to tag the groups by index, it would be a simple function. 
I would be able to call m.spans() for each index from 1 to len(m.groups()) 
and insert the <> and </> tags around the right indices.
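
For reference, a minimal sketch of that index-only version might look like the 
following. It assumes the standard library re module rather than the regex 
module, and plain numbered groups that are neither nested nor overlapping. The 
helper name tag_groups_by_index is hypothetical and is not part of either 
module.

```python
import re

def tag_groups_by_index(m):
    """Wrap each numbered group's span in <n>...</n> tags (hypothetical helper).

    Assumes the groups do not nest or overlap.
    """
    s = m.string
    # Collect (start, end, index) for every group that took part in the match;
    # a group that did not participate reports a start of -1.
    spans = [(m.start(i), m.end(i), i)
             for i in range(1, m.re.groups + 1)
             if m.start(i) != -1]
    out = []
    pos = 0
    for start, end, i in sorted(spans):
        out.append(s[pos:start])                          # text before the group
        out.append("<%d>%s</%d>" % (i, s[start:end], i))  # the tagged group
        pos = end
    out.append(s[pos:])                                   # text after the last group
    return "".join(out)

a_string = "This is a demo sentence."
m = re.search(r"(\w+) (\w+) (\w+)", a_string)
tagged = tag_groups_by_index(m)
print(tagged)
```

This only labels groups by number; it does nothing about recovering the names.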

The hard part is working out how to get the spans of the named groups. Do any 
of you have a suggestion?

It would make more sense, from my perspective, if each group were an object that 
had its own .span property. It would work like this with the above example:

>>> first = m.group(1)
>>> first.name()
'a_thing'
>>> second = m.group(2)
>>> second.name()
None
>>>

You could still call .spans() on the Match object itself, but it would query 
its children group objects for the data. Overall I think this would be a much 
more Pythonic approach, especially given that you have added subscripting and 
key lookup.

So instead of this:
>>> m['a_thing']
'This'
>>> type(m['a_thing'])
<type 'str'>

You could have:
>>> m['a_thing']
'This'
>>> type(m['a_thing'])
<'regex.Match.Group object'>

With the noted benefit of this:
>>> m['a_thing'].span()
(0, 4)
>>> m['a_thing'].index()
1
>>>

Maybe I'm missing a major point or functionality here, but I've been poring 
over the docs and don't currently think what I'm trying to achieve is possible.

Thank you for taking the time to read all this.

-Alec

----------
nosy: +akoumjian
versions:  -Python 3.3

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue2636>
_______________________________________