Hi all, I'm looking at some pretty complex regexps at the moment for parsing HTML: stripping out some attributes, getting the values of others, etc.
The simple fact is that all of these...

    <A HREF=foo.php>
    <A HREF='foo.php'>
    <A HREF="foo.php">
    <A HREF='foo.php' TARGET="something">
    <A HREF=foo.php TARGET=something>
    <A HREF = "foo.php">
    <A HREF = 'foo.php'>

...and many more are *valid* HTML mark-up. That makes the task look rather daunting, especially if I want to get it right rather than take shortcuts.

I'm pretty sure I need to start looking at a state engine or HTML parser that can identify a tag and seek out all of its attributes according to strictly valid HTML (preferably 4.01 Strict), rather than working with regexps.

As it turns out, I'm only looking to work with <A> at this stage, so I don't think it'll be a massive state engine, but I'd like to explore any other options or existing libraries of code before considering such a beast.

Any links / articles / code / whatever warmly welcome :)

Justin French

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
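
For illustration only, here is a rough sketch of the parser-based approach, written in Python rather than PHP and using the standard library's html.parser module. It is a tolerant parser, not a strict HTML 4.01 validator, so treat it as a starting point under that assumption. The names AnchorCollector and anchor_attributes are made up for this example.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect the attributes of every <A> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.anchors = []  # one dict of attributes per <a> tag seen

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names and strips the
        # surrounding quotes from values, so every quoting style ends up
        # in the same shape here.
        if tag == "a":
            self.anchors.append(dict(attrs))


def anchor_attributes(html):
    """Return a list of attribute dicts, one per <a> tag in the input."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.anchors


# The tag variants from the post, with differing quoting and spacing.
samples = [
    "<A HREF=foo.php>",
    "<A HREF='foo.php'>",
    '<A HREF="foo.php">',
    "<A HREF='foo.php' TARGET=\"something\">",
    "<A HREF=foo.php TARGET=something>",
    '<A HREF = "foo.php">',
    "<A HREF = 'foo.php'>",
]

for sample in samples:
    print(sample, "->", anchor_attributes(sample))
```

Because the parser normalizes quoting and whitespace itself, stripping or reading an attribute becomes a dictionary operation instead of a pattern match.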