Hi all,

I'm looking at some pretty complex regexps at the moment for parsing HTML:
stripping out some attributes, getting the values of others, etc.

The simple fact is that all these...

<A HREF=foo.php>
<A HREF='foo.php'>
<A HREF="foo.php">
<A HREF='foo.php' TARGET="something">
<A HREF=foo.php TARGET=something>
<A HREF = "foo.php">
<A HREF = 'foo.php'>

... and many more are *valid* HTML mark-up.  It starts to make the task look
rather daunting -- especially if I want to get it right, rather than taking

I'm pretty sure I need to start looking at a state engine or HTML parser
which can identify a tag, and seek out all the attributes of it, according
to strictly valid HTML (preferably 4.01 strict), rather than working with

As it turns out, I'm only looking to work with <A> at this stage, so I don't
think it'll be a massive state engine, but I'd like to explore any other
options or existing libraries of code before considering such a beast.
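
To make the "existing library" option concrete, here is a rough sketch
rather than a finished solution. It assumes Python and its standard
html.parser module purely for illustration; the names AnchorCollector and
collect_anchors are made up for this example. The parser does the
tokenising, so the quoting and spacing variants above all come out the same.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect the attributes of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names in lowercase and
        # hands over attribute values with any surrounding quotes removed.
        if tag == "a":
            self.anchors.append(dict(attrs))


def collect_anchors(markup):
    """Return one dict of attributes per <a> tag found in markup."""
    parser = AnchorCollector()
    parser.feed(markup)
    parser.close()
    return parser.anchors


if __name__ == "__main__":
    samples = [
        "<A HREF=foo.php>",
        "<A HREF='foo.php'>",
        '<A HREF="foo.php">',
        "<A HREF='foo.php' TARGET=\"something\">",
        "<A HREF=foo.php TARGET=something>",
        '<A HREF = "foo.php">',
        "<A HREF = 'foo.php'>",
    ]
    for sample in samples:
        print(sample, "->", collect_anchors(sample))
```

Note that this parser is lenient and does not validate against HTML 4.01
strict, so it only covers the "find the tag and its attributes" half of the
problem.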

Any links / articles / code / whatever are warmly welcome :)

Justin French

