> It would be useful (and increasingly more common) to be able to match
> qr|<\s*(\w+)([^>]*)>| to qr|<\s*/\1\s*>|, and handle the case where those
> can nest as well. Something like
>
> <list> match this with
> <list>
> </list> not this but
> </list> this.
I suspect this is going to need a ?[ and ?] of its own. I've been
thinking about this since your email on the subject yesterday, and I
don't see how either RFC 145 or this alternative method could support
it, since there are two delimiters, > and </, which are paired
asymmetrically, and neither approach pays any attention to what's
contained inside the tag. So <tag> would itself be matched as "< matches
>".
What if we added special XML/HTML-parsing ?< and ?> operators?
Unfortunately, as Richard notes, ?> is already taken (it introduces the
non-backtracking subexpression), but I will use it for the examples to
make things symmetrical.
?< = opening tag (with name specified)
?> = closing tag (matches based on nesting)
Your example would simply be:
/(?<list)[\s\w]*(?<list)[\s\w]*(?>)[\s\w]*(?>)/;
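To make "matches based on nesting" concrete, here is a rough sketch of the pairing rule I have in mind. It is written in Python rather than as a regex extension, and match_nested is a made-up helper name, not an existing API. It only counts depth for one tag name and ignores everything else:

```python
import re

# Matches an opening tag like <list> or a closing tag like </list>.
# Group 1 is "/" for a closing tag, empty otherwise; group 2 is the tag name.
TAG = re.compile(r"<\s*(/?)\s*(\w+)[^>]*>")

def match_nested(text, name):
    """Return the text from the first <name> through its paired </name>,
    or None if that tag is never closed. Hypothetical helper for illustration."""
    depth = 0
    start = None
    for m in TAG.finditer(text):
        if m.group(2) != name:
            continue                  # some other tag: ignore it
        if m.group(1) == "":          # opening tag
            if depth == 0:
                start = m.start()     # remember where the outermost tag began
            depth += 1
        elif depth > 0:               # closing tag that has something to close
            depth -= 1
            if depth == 0:
                return text[start:m.end()]
    return None
```

Given "<list> a <list> b </list> c </list> d", this pairs the first <list> with the last </list> rather than the first one, which is the behavior the example above asks for.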
What makes me nervous about this is that ?< and ?> seem special-cased.
They are, but then again XML and HTML are also pervasive. So a
special case for something like this might not be any stranger than
having a special case for sin() and cos() - they're extremely important
operations.
The other thing that this doesn't handle is tags with no closing
counterpart, like:
<br>
Perhaps for these the easiest thing is to tell people not to use ?< and
?>:
/(?<p)[\s\w]*(?:<br>)[\s\w]*(?>)/;
Would match
<p>
Some stuff<br>
</p>
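One way to picture how unclosed tags interact with the nesting rule is a stack that simply skips them. This is only a sketch in Python. The VOID set is my own illustrative assumption, not a complete list from any spec, and is_balanced is a hypothetical name:

```python
import re

# Group 1 is "/" for a closing tag; group 2 is the tag name.
TAG = re.compile(r"<\s*(/?)\s*(\w+)[^>]*>")

# Assumed, illustrative set of tags that never get a closing counterpart.
VOID = {"br", "hr", "img"}

def is_balanced(text):
    """Return True if every non-void opening tag is closed in nesting order."""
    stack = []
    for m in TAG.finditer(text):
        closing = m.group(1) == "/"
        name = m.group(2).lower()
        if name in VOID:
            continue                  # <br> and friends never go on the stack
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False              # closing tag with no matching opener
    return not stack                  # leftover openers mean unbalanced
```

With this, the <p> ... <br> ... </p> text above is balanced, because <br> is never pushed and so never needs a partner.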
Finally, tags which take arguments:
<div align="center">Stuff</div>
Would require some type of "this is optional" syntax:
/(?<div[^>]*)Stuff(?>)/
Perhaps only the first word specified is taken as the tag name? That is
how XML and HTML define the tag name anyway.
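The "first word is the tag name" rule is easy to sketch using the opening-tag pattern quoted at the top of this message. Again this is Python for illustration only, and split_open_tag is a hypothetical helper:

```python
import re

# Same shape as the quoted qr|<\s*(\w+)([^>]*)>|: group 1 is the first
# word (the tag name), group 2 is whatever follows it before the ">".
OPEN_TAG = re.compile(r"<\s*(\w+)([^>]*)>")

def split_open_tag(tag):
    """Return (name, rest) for an opening tag, or None if it is not one."""
    m = OPEN_TAG.fullmatch(tag)
    if m is None:
        return None
    return m.group(1), m.group(2).strip()
```

So <div align="center"> yields the name div with align="center" left over as arguments, and the closing tag only ever needs to be compared against the name.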
-Nate