As someone with a background in parsing RSS/Atom, I can say from years of experience that RSS is only occasionally XML and that you typically find far more HTML in a feed than XML. And parsing HTML can be a bitch.

If an XML parser conforms to the XML spec, it will fail in a number of cases -- most notably on ill-formed XML. Standard, validated HTML 4.01 is ill-formed XML. libxml/libxml2, DOM, and SimpleXML are all conforming, and I believe that SAX is too, although I've never used it. I don't know about XPath. Therefore, parsing data from anything other than perfectly-formed XHTML (served with an XML-friendly MIME type, as per RFC 3023) is expected to fail. Frequently.
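To make the failure concrete, here is a minimal sketch (assuming PHP 5.1+ with the DOM extension enabled; the sample markup and variable names are made up for illustration). It feeds the same valid HTML 4.01 fragment to a strict XML parse and to libxml2's lenient HTML mode, the same loadHTML() call used in the quoted example further down. The lenient mode is forgiving, but it should not be mistaken for a browser-grade HTML parser.

```php
<?php
// Valid HTML 4.01: <br> and <p> need no closing tags, which XML forbids.
$html = '<html><head><title>t</title></head>'
      . '<body><p>one<br>two<p>three</body></html>';

// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Strict XML parse: expected to fail on the unclosed tags.
$xml = new DOMDocument();
$asXml = $xml->loadXML($html);
$xmlErrors = count(libxml_get_errors());
libxml_clear_errors();

// libxml2's HTML mode: tolerates the same input.
$doc = new DOMDocument();
$asHtml = $doc->loadHTML($html);
libxml_clear_errors();

echo 'loadXML:  ', $asXml ? 'ok' : "failed ($xmlErrors errors)", PHP_EOL;
echo 'loadHTML: ', $asHtml ? 'ok' : 'failed', PHP_EOL;
```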

You could attempt to go another route and use regular expressions, but regex can be slow, and certain recent versions of PCRE in PHP5 were buggy, causing PHP segmentation faults (i.e. the PHP executable crashes) in complex PCRE expressions.

In SimplePie, we use a hybrid wherein we do some up-front checking and fixing with string parsing, then pass the RSS string through libxml-based parsing functions, then use PCRE regexes on various XML-Array nodes to pull out specific bits of data. It's the most reliable method that we could come up with for a syntax (RSS) that at least *attempts* to be XML. The problem is that *most* of the world's web pages aren't XML or even trying to be XML... they're straight-up, old-skool HTML. And you will absolutely run into problems.
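A rough sketch of that three-stage shape follows. This is not SimplePie's actual code: the function name first_image_per_item(), the sample feed, and the choice of "first image in each description" as the thing to extract are all hypothetical, and a real feed needs far more fix-ups than this. It assumes RSS 2.0 without namespaces and the SimpleXML extension.

```php
<?php
// Hypothetical helper for illustration only; not part of SimplePie.
function first_image_per_item($rss)
{
    // 1) String-level fix-up: drop anything (BOM, whitespace) before the
    //    first '<', since an XML declaration must start the document.
    $start = strpos($rss, '<');
    if ($start === false) {
        return array();
    }
    $rss = substr($rss, $start);

    // 2) libxml-based parse; errors are collected rather than printed.
    libxml_use_internal_errors(true);
    $feed = simplexml_load_string($rss);
    libxml_clear_errors();
    if ($feed === false) {
        return array();
    }

    // 3) PCRE over the (usually HTML) text inside each item's description.
    $images = array();
    $items = $feed->xpath('//item');
    foreach ($items ? $items : array() as $item) {
        $html = (string) $item->description;
        if (preg_match('/<img\b[^>]*\bsrc\s*=\s*["\']([^"\']+)["\']/i', $html, $m)) {
            $images[(string) $item->title] = $m[1];
        }
    }
    return $images;
}

// Made-up feed: BOM and whitespace before the declaration, escaped HTML inside.
$sample = "\xEF\xBB\xBF\n  <?xml version=\"1.0\"?><rss version=\"2.0\"><channel>"
        . "<title>Demo</title><item><title>First</title>"
        . "<description>&lt;p&gt;Hi &lt;img src=\"http://example.com/a.png\"&gt;&lt;/p&gt;</description>"
        . "</item></channel></rss>";

print_r(first_image_per_item($sample));
```

The regex in step 3 is exactly the kind of "good enough" shortcut that breaks on unquoted attributes or odd markup, which is the point of the paragraphs that follow.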

"But we can do it in web browsers!" What do web browsers have that PHP developers don't? An HTML parser. As far as I know there are no HTML parsers written for PHP (or any other language that I'm aware of). One of the other SimplePie devs (Geoffrey Sneddon) had gotten together with another developer and attempted to start writing one for PHP 5.2, but gave up shortly after reading a number of specification documents that need to be read and understood before being able to do this properly.

"But can't we just hack something together that's 'good enough?'" If you want to support it, sure. But expect bug reports and feature requests -- LOTS of them. Then you'll go and re-write stuff to make it better and more compliant, and people will begin complaining because some behavior changed, and they're mad about that. Oh, and let's not forget how people will bitch you out because "it doesn't work like [insert web browser here]," and that you must be some sort of "stupid, lazy developer who can't get it right." Granted, these people are complete morons, but after you've put tons of your time and energy into this project to make it as good as possible, stuff like this can get demoralizing. And after you get tired of the verbal abuse and working on the project after a few months, you'll start getting lots and lots of complaints and requests for somebody else to take over the project -- but nobody else has taken the time to read through the relevant spec docs like you have, and it'll take them a really long time to get up to speed. Long enough, in fact, that the project may never get picked back up.

_________________________________________________

I said all of that to make these points:

1) Parsing HTML is hard -- especially when the only tools available are for another language (XML). If you need to screw something in, but screwdrivers don't exist, do you use a hammer? An elegantly folded paperclip? A combination of both?

2) *Reliably* parsing microformats out of *most* (X)HTML with object-oriented PHP 5.x is going to be a big project. If you're diligent about commenting your code so that others can understand what's going on, I'd expect a PHP5 library to be at least 1 megabyte. You'll need to account for an unprecedented number of completely idiotic markup faults.

3) If you want to attempt a project like this, get a team of people together. You could probably start with 1-2 people who can evaluate the needs of a project like this, and write some initial code. Open up to the community early to start accepting feedback. Once this project gets rolling, I'd expect no less than 5-6 people working on it to make any notable progress in a reasonable timeframe of 1-2 years. (It's an open source project, remember? Evenings and weekends, baby!) Break the project down into modules and assign them to different developers. Those developers should be prepared to read several specification documents in order to understand the correct way to do things. Oh, and create an automated unit testing suite. It'll save you tons of time in testing.


--
Ryan Parman
<http://ryanparman.com>




On Apr 10, 2008, at 6:01 AM, Ciaran McNulty wrote:
On Thu, Apr 10, 2008 at 1:40 PM, Mark Ng <[EMAIL PROTECTED]> wrote:
XFN itself is fairly easy to deal with by just throwing pages through
tidy and using DOM/SAX/XPath, surely? I made a rudimentary parser to
do this some time ago.  The code is a little ugly to publish, but I
don't mind sharing privately.

Here's a *very* hacky code example from when I just wanted to check my
'me' links - I include it here just to demonstrate how simple XFN can
be and hopefully it's apparent how easy it would be to work up into a
nice objecty system for spidering:

<?php

$url = 'http://ciaranmcnulty.com/';
if($html = @file_get_contents($url)){
        $dom = new DomDocument();
        if(@$dom->loadHtml($html)){
                $xpath = new DomXpath($dom);
                if($nodes = $xpath->query("//a[contains(concat(' ', normalize-space(@rel), ' '), ' me ')]")){
                        foreach($nodes as $node){
                                echo $node->getAttribute('href'), PHP_EOL;
                        }
                }
        }
        else{ echo 'Could not parse HTML', PHP_EOL; }
}
else{  echo 'Could not fetch file', PHP_EOL; }
?>
_______________________________________________
microformats-discuss mailing list
microformats-discuss@microformats.org
http://microformats.org/mailman/listinfo/microformats-discuss
