> It would be useful (and increasingly more common) to be able to match
> qr|<\s*(\w+)([^>]*)>| to qr|<\s*/\1\s*>|, and handle the case where those
> can nest as well.  Something like
> 
> <list>    match this with
>    <list>
>    </list>   not this but
> </list>   this.

I suspect this is going to need a ?[ and ?] of its own. I've been
thinking about this since your email on the subject yesterday, and I
don't see how either RFC 145 or this alternative method could support
it. The opening and closing forms, <tag> and </tag>, are paired
asymmetrically, and neither approach pays any attention to what's
contained inside the tag. So <tag> would be matched by itself, as
"< matches >".

What if we added special XML/HTML-parsing ?< and ?> operators?
Unfortunately, as Richard notes, ?> is already taken, but I will use it
for the examples to make things symmetrical.

   ?<  =  opening tag (with name specified)
   ?>  =  closing tag (matches based on nesting)

Your example would simply be:

   /(?<list)[\s\w]*(?<list)[\s\w]*(?>)[\s\w]*(?>)/;
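For comparison, here is roughly the bookkeeping those operators would have to do, written out by hand in plain Perl 5. This is only a sketch under my own assumptions: match_nested is a made-up name, and it ignores comments, self-closing tags and quoted ">" characters inside attributes.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text from the first <$name ...> through its
# own closing </$name>, counting nested tags of the same name along the way.
sub match_nested {
    my ($text, $name) = @_;
    my ($depth, $start) = (0, undef);

    # Visit every opening or closing tag called $name, in document order.
    while ($text =~ m{<\s*(/?)\s*\Q$name\E\b[^>]*>}g) {
        if ($1 eq '') {                      # opening tag
            $start = $-[0] if $depth == 0;   # remember the outermost open
            $depth++;
        }
        elsif ($depth > 0) {                 # closing tag with an open pending
            $depth--;
            return substr($text, $start, $+[0] - $start) if $depth == 0;
        }
    }
    return;    # the tags never balanced
}
```

The depth counter is what pairs the outer <list> with the last </list> rather than the first one, which is the behaviour the (?>) in the pattern above is meant to give.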

What makes me nervous about this is that ?< and ?> seem special-cased.
They are, but then again XML and HTML are pervasive. So a special case
for something like this might not be any stranger than having a
special case for sin() and cos() - they're extremely important
operations.

The other thing that this doesn't handle is tags with no closing
counterpart, like:

   <br>

Perhaps for these the easiest thing is to tell people not to use ?< and
?>, and to match the tag literally instead:

   /(?<p)[\s\w]*(?:<br>)[\s\w]*(?>)/;

Would match

   <p>
      Some stuff<br>
   </p>
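Done by hand, the same idea amounts to keeping a stack of open tag names and skipping the ones that never close. A rough sketch, where tags_balanced and the %unpaired list are my own inventions for illustration and not part of any proposal:

```perl
use strict;
use warnings;

# Tags assumed to have no closing counterpart (illustrative list only).
my %unpaired = map { $_ => 1 } qw(br hr img);

# Hypothetical helper: true if every paired tag in $text is closed in order.
sub tags_balanced {
    my ($text) = @_;
    my @open;    # stack of tag names still waiting for their close

    while ($text =~ m{<\s*(/?)\s*(\w+)[^>]*>}g) {
        my ($closing, $name) = ($1, lc $2);
        next if $unpaired{$name};            # <br> and friends never nest
        if ($closing) {
            # A close must match the most recently opened tag.
            return 0 unless @open && pop(@open) eq $name;
        }
        else {
            push @open, $name;
        }
    }
    return @open ? 0 : 1;    # anything left on the stack was never closed
}
```

With this, the <p> ... <br> ... </p> text above counts as balanced, because the <br> is never pushed onto the stack.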

Finally, tags which take attributes:

   <div align="center">Stuff</div>

Would require some type of "this is optional" syntax for everything
after the tag name:

   /(?<div[^>]*)Stuff(?>)/

Perhaps only the first word specified is taken as the tag name? That is
how XML and HTML define it anyway: the name is the first token after
the <.
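To make the "first word is the tag name" rule concrete, here is a small sketch that reuses the <\s*(\w+)([^>]*)> pattern from the quoted message. The name split_tag is hypothetical, and the attribute text is returned raw, not parsed.

```perl
use strict;
use warnings;

# Hypothetical helper: split an opening tag into its name and raw attributes.
sub split_tag {
    my ($tag) = @_;
    # Name = first run of word characters after '<'; everything else up to
    # the closing '>' is treated as attribute text.
    return unless $tag =~ m{^<\s*(\w+)\s*([^>]*)>$};
    return ($1, $2);
}

my ($name, $attrs) = split_tag('<div align="center">');
print "name=$name attrs=$attrs\n";
```

Only $name would take part in the nesting match; whatever follows it is carried along but plays no part in pairing the tag with its close.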

-Nate
