Re: [uf-discuss] Parsing XFN in PHP

Ryan Parman Thu, 10 Apr 2008 09:14:58 -0700

As someone with a background in parsing RSS/Atom, I can say from years of experience that RSS is only occasionally XML and that you typically find far more HTML in a feed than XML. And parsing HTML can be a bitch.

If an XML parser is conforming to the XML spec, it'll fail on a number of cases -- most notably, ill-formed XML. Standard, validated HTML 4.01 is ill-formed XML. libxml/libxml2, DOM, and SimpleXML are all conforming, and I believe that SAX is too, although I've never used it before. I don't know about XPath. Therefore, parsing data from anything other than perfectly-formed XHTML (served with an XML-friendly MIME type, as per RFC 3023) is expected to fail. Frequently.
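
A minimal sketch of that failure mode, assuming PHP 5 or later with the DOM extension enabled. The sample markup is hypothetical: an HTML 4.01-style fragment with an unclosed <p> and a bare <br>. It feeds the same string to the strict XML loader and to the lenient HTML loader that the DOM extension also exposes, and reports what each one does.

```php
<?php
// Hypothetical sample: HTML 4.01-style markup with an unclosed <p> and a void <br>.
$html = '<html><head><title>t</title></head><body><p>One<br><p>Two</body></html>';

// Collect libxml parse errors in memory instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Strict XML parse: the tag mismatches make this ill-formed XML.
$asXml = new DOMDocument();
$xmlOk = $asXml->loadXML($html);
$xmlErrors = count(libxml_get_errors());
libxml_clear_errors();

// libxml's HTML parser accepts the same input and repairs the tree.
$asHtml = new DOMDocument();
$htmlOk = $asHtml->loadHTML($html);
libxml_clear_errors();

$paragraphs = $htmlOk ? $asHtml->getElementsByTagName('p')->length : 0;

echo 'loadXML succeeded: ', var_export($xmlOk, true), PHP_EOL;
echo 'loadXML error count: ', $xmlErrors, PHP_EOL;
echo 'loadHTML succeeded: ', var_export($htmlOk, true), PHP_EOL;
echo 'paragraphs seen by loadHTML: ', $paragraphs, PHP_EOL;
```

The lenient loader is not a browser-grade HTML parser, so the repaired tree may differ from what a browser would build for the same input.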

You could attempt to go another route and use regular expressions, but regex can be slow, and certain recent versions of PCRE in PHP5 were buggy, causing PHP segmentation faults (i.e. the PHP executable crashes) in complex PCRE expressions.

In SimplePie, we use a hybrid wherein we do some up-front checking and fixing with string parsing, then pass the RSS string through libxml-based parsing functions, then use PCRE regexes on various XML-Array nodes to pull out specific bits of data. It's the most reliable method that we could come up with for a syntax (RSS) that at least *attempts* to be XML. The problem is that *most* of the world's web pages aren't XML or even trying to be XML... they're straight-up, old-skool HTML. And you will absolutely run into problems.
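
The three stages described above can be sketched roughly as follows. This is not SimplePie's actual code: the function name extract_item_links, the specific string fixes (stripping a UTF-8 BOM and leading whitespace), and the href regex are all assumptions chosen only to illustrate the shape of the pipeline.

```php
<?php
// Hypothetical helper illustrating the hybrid approach; not SimplePie's code.
function extract_item_links(string $feed): array
{
    // 1) Up-front string fixes: drop a UTF-8 BOM and leading whitespace,
    //    which would otherwise put junk in front of the XML declaration.
    $feed = ltrim(preg_replace('/^\xEF\xBB\xBF/', '', $feed));

    // 2) libxml-based parse, with errors collected rather than printed.
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($feed);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if ($xml === false) {
        return array(); // still not well-formed after the fixes
    }

    // 3) PCRE on individual node values to pull out specific bits of data,
    //    here the href values inside each item's escaped HTML description.
    $links = array();
    foreach ($xml->xpath('//item/description') as $description) {
        if (preg_match_all('/href=(["\'])(.*?)\1/i', (string) $description, $m)) {
            $links = array_merge($links, $m[2]);
        }
    }
    return $links;
}

// Hypothetical feed: BOM and spaces before the XML declaration, HTML in the description.
$feed = "\xEF\xBB\xBF  " . '<?xml version="1.0"?><rss version="2.0"><channel><item>'
      . '<description>&lt;a href="http://example.com/a"&gt;A&lt;/a&gt;</description>'
      . '</item></channel></rss>';

print_r(extract_item_links($feed));
```

The regex in step 3 only ever sees one node's text at a time, which keeps the patterns small compared with running a single expression over the whole document.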

"But we can do it in web browsers!" What do web browsers have that PHP developers don't? An HTML parser. As far as I know there are no HTML parsers written for PHP (or any other language that I'm aware of). One of the other SimplePie devs (Geoffrey Sneddon) had gotten together with another developer and attempted to start writing one for PHP 5.2, but gave up shortly after reading a number of specification documents that need to be read and understood before being able to do this properly.

"But can't we just hack something together that's 'good enough?'" If you want to support it, sure. But expect bug reports and feature requests -- LOTS of them. Then you'll go and re-write stuff to make it better and more compliant, and people will begin complaining because some behavior changed, and they're mad about that. Oh, and let's not forget how people will bitch you out because "it doesn't work like [insert web browser here]," and that you must be some sort of "stupid, lazy developer who can't get it right." Granted, these people are complete morons, but after you've put tons of your time and energy into this project to make it as good as possible, stuff like this can get demoralizing. And after a few months, once you've gotten tired of the verbal abuse and of working on the project, you'll start getting lots and lots of complaints and requests for somebody else to take over the project -- but nobody else has taken the time to read through the relevant spec docs like you have, and it'll take them a really long time to get up to speed. Long enough, in fact, that the project may never get picked back up.


_________________________________________________

I said all of that to make these points:

1) Parsing HTML is hard -- especially when the only tools available are for another language (XML). If you need to screw something in, but screwdrivers don't exist, do you use a hammer? An elegantly folded paperclip? A combination of both?

2) *Reliably* parsing microformats out of *most* (X)HTML with object-oriented PHP 5.x is going to be a big project. If you're diligent about commenting your code so that others can understand what's going on, I'd expect a PHP5 library to be at least 1 megabyte. You'll need to account for an unprecedented number of completely idiotic markup faults.

3) If you want to attempt a project like this, get a team of people together. You could probably start with 1-2 people who can evaluate the needs of a project like this, and write some initial code. Open up to the community early to start accepting feedback. Once this project gets rolling, I'd expect no less than 5-6 people working on it to make any notable progress in a reasonable timeframe of 1-2 years. (It's an open source project, remember? Evenings and weekends, baby!) Break the project down into modules and assign them to different developers. Those developers should be prepared to read several specification documents in order to understand the correct way to do things. Oh, and create an automated unit testing suite. It'll save you tons of time in testing.



--
Ryan Parman
<http://ryanparman.com>




On Apr 10, 2008, at 6:01 AM, Ciaran McNulty wrote:

On Thu, Apr 10, 2008 at 1:40 PM, Mark Ng <[EMAIL PROTECTED]> wrote:

XFN itself is fairly easy to deal with by just throwing pages through
tidy and using DOM/SAX/XPath, surely? I made a rudimentary parser to
do this some time ago.  The code is a little ugly to publish, but I
don't mind sharing privately.


Here's a *very* hacky code example from when I just wanted to check my
'me' links - I include it here just to demonstrate how simple XFN can
be and hopefully it's apparent how easy it would be to work up into a
nice objecty system for spidering:

<?php

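$url = 'http://ciaranmcnulty.com/';
// Fetch the page; @ suppresses warnings so failure is reported by the else branch.
if($html = @file_get_contents($url)){
        $dom = new DOMDocument();
        // loadHTML() uses libxml's forgiving HTML parser rather than the strict XML one.
        if(@$dom->loadHTML($html)){
                $xpath = new DOMXPath($dom);
                // Match <a> elements whose space-separated rel value contains the token "me".
                if($nodes = $xpath->query("//a[contains(concat(' ', normalize-space(@rel), ' '),' me ')]")){
                        foreach($nodes as $node){
                                echo $node->getAttribute('href'), PHP_EOL;
                        }
                }
        }
        else{ echo 'Could not parse HTML', PHP_EOL; }
}
else{  echo 'Could not fetch file', PHP_EOL; }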
?>
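
Worked up into a reusable function, the same approach might look like the sketch below. It is an illustration only: the name xfn_links, the lower-casing of rel tokens, and the sample markup are assumptions, and it groups every rel token rather than just 'me'.

```php
<?php
// Hypothetical helper: maps each rel token to the hrefs of the links carrying it.
function xfn_links(string $html): array
{
    if ($html === '') {
        return array(); // loadHTML() rejects an empty string
    }

    // Parse leniently and keep libxml's complaints out of the output.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return array();
    }

    $result = array();
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//a[@rel][@href]') as $a) {
        // rel is a whitespace-separated token list; split it in PHP instead of XPath.
        $rel = strtolower(trim($a->getAttribute('rel')));
        $tokens = preg_split('/\s+/', $rel, -1, PREG_SPLIT_NO_EMPTY);
        foreach ($tokens as $token) {
            $result[$token][] = $a->getAttribute('href');
        }
    }
    return $result;
}

// Hypothetical markup for illustration.
$sample = '<p><a rel="me friend" href="http://a.example/">a</a> '
        . '<a rel=" ME " href="http://b.example/">b</a> '
        . '<a href="http://c.example/">c</a></p>';

print_r(xfn_links($sample));
```

Splitting the rel value in PHP avoids the concat/normalize-space idiom, at the cost of visiting every link that has a rel attribute.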
_______________________________________________
microformats-discuss mailing list
microformats-discuss@microformats.org
http://microformats.org/mailman/listinfo/microformats-discuss

