Tom,

> It seems like what is missing is a module that provides a
> regular-expression style language for matching against tags. It would
> make screen scraping tasks almost trivial. Anyone know of a module like
> this?
> What's your favorite HTML parsing module?

XML::Twig is the grep for XML (and it bundles xml_grep(1)). With its new parse_html() option, XML::Twig will use HTML::TreeBuilder for you to convert HTML to its internal representation of XML, protecting you from HTML::TreeBuilder's interface. You can then make regex-like, XPath-like queries on the HTML document and let its pattern engine walk the tree looking for twigs that match your query.

http://search.cpan.org/search?query=XML-Twig&mode=dist

which references

<<The XML::Twig page is at http://www.xmltwig.com/xmltwig/ It includes the development version of the module, a slightly better version of the documentation, examples, a tutorial and a: Processing XML efficiently with Perl and XML::Twig: http://www.xmltwig.com/xmltwig/tutorial/index.html >>

which has a useful summary at http://www.xmltwig.com/xmltwig/quick_ref.html [but read the tutorial first].

It can work in either a stream/callback-handler mode or a parse-then-search mode, can work as an XML-aware sed (with an in-place option!), can preserve or change encoding, etc. A very Perl-friendly way to deal with XML.

CAVEAT -- I haven't tried this new HTML-happy mode yet; I've wished for it in the past, when XML::Twig rejected HTML that wasn't well-formed XHTML. Now with this new option, it probably accepts anything H:TB does and pretends it read a conformant XHTML document. I've got to try this too.

-- 
Bill / n1vux
Not speaking for the firm
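
For illustration, here is a minimal sketch of the two modes mentioned above (parse-then-search and callback handlers) applied to sloppy HTML via parse_html(). It assumes XML::Twig and HTML::TreeBuilder are both installed; the sample HTML and the variable names ($html, @hrefs, @seen) are made up for the example, and real-world pages may well behave differently.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample input: deliberately not well-formed XHTML
# (the <p> and <li> elements are never closed).
my $html = <<'HTML';
<html><head><title>Links</title></head>
<body>
<p>Some links:
<ul>
<li><a href="http://www.xmltwig.com/xmltwig/">XML::Twig page</a>
<li><a href="http://search.cpan.org/">CPAN search</a>
</ul>
</body></html>
HTML

# Mode 1: parse-then-search. Build the whole tree, then query it
# with an XPath-like expression.
my $twig = XML::Twig->new;
$twig->parse_html($html);    # conversion is done by HTML::TreeBuilder
my @hrefs = map { $_->att('href') } $twig->findnodes('//a[@href]');
print "href: $_\n" for @hrefs;

# Mode 2: callback handlers. The handler fires each time an element
# matching the condition has been completely parsed.
my @seen;
my $stream = XML::Twig->new(
    twig_handlers => {
        'a[@href]' => sub {
            my ($t, $elt) = @_;
            push @seen, $elt->text;
        },
    },
);
$stream->parse_html($html);
print "link text: $_\n" for @seen;
```

The handler form is the one to reach for on large documents, since each twig can be processed as soon as it is complete; the findnodes() form is simpler when the whole page fits comfortably in memory.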

