Tom,

> It seems like what is missing is a module that provides a
> regular-expression style language for matching against tags. It would
> make screen scraping tasks almost trivial. Anyone know of a module like
> this?
> What's your favorite HTML parsing module?

XML::Twig is the grep for XML (and it ships with an xml_grep(1) tool).

With its new parse_html() method, XML::Twig will use HTML::TreeBuilder
for you to convert HTML to its internal representation of XML,
protecting you from HTML::TreeBuilder's interface.  You can then make
regex-like, XPath-like queries on the HTML document and let its pattern
engine walk the tree looking for twigs that match your query.
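
As a rough, untested sketch of what that could look like (the HTML
snippet and variable names are made up for illustration, and it assumes
both XML::Twig and HTML::TreeBuilder are installed), parse the HTML and
then query the resulting twig:

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical, sloppy HTML: unclosed <p>, no <html> or <body> wrapper.
my $html = <<'HTML';
<p>Some links:
<a href="http://www.xmltwig.com/xmltwig/">XML::Twig home</a>
<a name="anchor-only">not a link</a>
HTML

my $twig = XML::Twig->new;

# parse_html() hands the string to HTML::TreeBuilder and loads the
# cleaned-up result as if it were XML.
$twig->parse_html($html);

# XPath-like query: every <a> element anywhere in the document,
# keeping only those that actually carry an href attribute.
my @hrefs = map  { $_->att('href') }
            grep { defined $_->att('href') }
            $twig->findnodes('//a');

print "$_\n" for @hrefs;
```

findnodes() is documented as a synonym for get_xpath(); either name
should work here.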

http://search.cpan.org/search?query=XML-Twig&mode=dist
which points to the module's home page:
<<The XML::Twig page is at http://www.xmltwig.com/xmltwig/ It includes
the development version of the module, a slightly better version of the
documentation, examples, a tutorial and a: Processing XML efficiently
with Perl and XML::Twig:
http://www.xmltwig.com/xmltwig/tutorial/index.html >>

That site has a useful summary at
http://www.xmltwig.com/xmltwig/quick_ref.html
[but read the tutorial first].

It can work in either a stream/callback-handler mode or a
parse-then-search mode, can act as an XML-aware sed (with an in-place
option!), can preserve or change encoding, etc.  A very Perl-friendly
way to deal with XML.
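
To make the two modes concrete, here is a minimal sketch on a small
well-formed XML string (the <catalog> data and variable names are
invented for the example; only XML::Twig itself is assumed):

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample document.
my $xml = <<'XML';
<catalog>
  <item id="1"><title>First</title></item>
  <item id="2"><title>Second</title></item>
</catalog>
XML

# Mode 1: stream/callback handlers. The handler fires as soon as each
# <item> has been completely parsed.
my @seen;
my $stream = XML::Twig->new(
    twig_handlers => {
        item => sub {
            my ($twig, $elt) = @_;
            push @seen, $elt->att('id') . ': ' . $elt->first_child_text('title');
            $twig->purge;    # discard what has been handled so far
        },
    },
);
$stream->parse($xml);

# Mode 2: parse the whole document, then search the tree.
my $tree = XML::Twig->new;
$tree->parse($xml);
my @titles = map { $_->text } $tree->findnodes('//title');

print "$_\n" for @seen, @titles;
```

The purge in the handler is what keeps memory flat on large files; in
parse-then-search mode the whole tree stays in memory, which is simpler
but only sensible for documents that fit.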

CAVEAT -- I haven't tried this new HTML-happy mode yet; I've wished for
it in the past, when XML::Twig rejected HTML that wasn't well-formed
XHTML. Now with this new option, it probably accepts anything H::TB
(HTML::TreeBuilder) does and pretends it read a conformant XHTML
document. I've got to try this too.

 -- Bill / n1vux
Not speaking for the firm

_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm
