Ryan Parman <[EMAIL PROTECTED]> Thu, 10 Apr 2008 09:05:47
As someone with a background in parsing RSS/Atom, I can say from years of experience that RSS is only occasionally XML and that you typically find far more HTML in a feed than XML. And parsing HTML can be a bitch.

Big snip.

Whoa! That's enough to put one off even starting on parsing and reading uF, which would make uF all a bit pointless. Oh dear. :(

I suspect, though, that this Gordian knot can be cut. It seems quite likely that any page marked up with uF is clean enough that HTML-Tidy won't remove too many of the uF-marked elements. If that's the case, then fetch HTML -> HTML-Tidy -> XML parsing is going to get 99% of the job done and successfully extract the uF-marked data. But that HTML-Tidy step is going to be indispensable; XML parsing just plain won't work without it. And the shortcut that folds even that step into the parse is DOMDocument->loadHTML($html), which is effectively doing the same thing.
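
As a rough illustration of the idea that a forgiving parser can still recover uF-marked data from sloppy markup, here is a minimal sketch. It is written in Python rather than PHP, and it uses the standard library's lenient html.parser in place of HTML-Tidy, so it is a stand-in for the pipeline described above, not that pipeline itself. The sample markup, the class name ClassTextCollector and the function extract are all made up for the example.

```python
from html.parser import HTMLParser

# Hypothetical tag soup: an unclosed <p>, an unquoted attribute value,
# an unclosed <a>, and no closing </html>.
SLOPPY_HTML = """
<html><body>
<p>Contact:
<div class="vcard"><span class=fn>Julian Bond</span>
<a class="url" href="http://www.voidstar.com/">weblog
</div>
</body>
"""


class ClassTextCollector(HTMLParser):
    """Record the first text that follows any start tag carrying a wanted class."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.pending = None  # wanted class name still waiting for its text
        self.found = []      # (class name, text) pairs, in document order

    def handle_starttag(self, tag, attrs):
        # A class attribute may hold several space-separated names.
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.wanted:
                self.pending = name
                break

    def handle_data(self, data):
        text = data.strip()
        if self.pending and text:
            self.found.append((self.pending, text))
            self.pending = None


def extract(html, wanted=("fn", "url")):
    """Return (class name, text) pairs for the wanted classes (illustrative only)."""
    parser = ClassTextCollector(wanted)
    parser.feed(html)
    parser.close()
    return parser.found


if __name__ == "__main__":
    print(extract(SLOPPY_HTML))
```

This deliberately ignores nesting and close tags, which is why the unclosed elements do not trip it up; a real uF parser would need the element tree that the HTML-Tidy or loadHTML step provides.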

It would be interesting to do some interop testing and see just how bad a web page has to be before the uF starts getting missed.

And a uF validator would come in handy there.

--
Julian Bond  E&MSN: julian_bond at voidstar.com  M: +44 (0)77 5907 2173
Webmaster:          http://www.ecademy.com/      T: +44 (0)192 0412 433
Personal WebLog:    http://www.voidstar.com/     skype:julian.bond?chat
                           Tastes Like Milk
_______________________________________________
microformats-discuss mailing list
microformats-discuss@microformats.org
http://microformats.org/mailman/listinfo/microformats-discuss
