Dan Boger wrote:
> Maybe something like XML::Twig's get_xpath?

William Ricker wrote:
>> What's your favorite HTML parsing module?
>
> XML::Twig is the grep for XML (and bundles with xml_grep(1)).

Looks like we have several recommendations for XML::Twig. Thanks. I'll
check it out.

> You can make reg-ex-like Xpath-like queries on the HTML document with
> it and let its pattern engine walk the tree looking for twigs that
> match your query.

It's been a few years since I looked at XPath, but I seem to recall that
it was originally inspired by SQL. If that's correct, that doesn't strike
me as being very regex-like.

I also get the impression that it is going to be organized around queries
that span parent-child relationships. So, for example, if I want to find
an A tag contained in a DIV, contained by one or more wildcard tags, all
inside of a BODY tag, it'd be good at that. But if I want to find the
text inside B tags that happens to occur before a comment tag containing
a specific string, it may not be possible to encode that in one query.
But I haven't looked into it yet...

With screen scraping, you're just as likely to get semantic hints from
the ordering of tags as you are from the ancestry. And as with
traditional regular expression techniques for extracting data, you often
want to find the last X that occurs before a Y, or other ordering
relationships, ignoring other aspects of structure in the document.

Dan Boger wrote:
> I certainly use TreeBuilder a lot - not sure what kind of API you're
> looking for?

As I mentioned, the holy grail is perhaps a regular-expression-style
language that not only contains directives for indicating parent-child
relationships, but can also operate on the document as simply a stream
of tags, with of course the added intelligence of recognizing and
normalizing the tags.

mirod wrote:
> If you prefer CSS selectors to XPath, then you can use
> HTML::TreeBuilder::Select...which translate a
> CSS selector into an XPath expression.

Interesting that you should mention CSS selectors, as that is one of the
things that came to mind when thinking of how an RE language for tags
might be constructed.
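
To make the ordering example concrete (text inside B tags that occurs
before a comment containing a specific string), here is a minimal sketch
of the "stream of tags" idea. It is written in Python using only the
standard library's html.parser, not any of the Perl modules discussed in
this thread. The names TagStream and last_b_text_before_comment, the
sample markup, and the "price-block" marker are all made up for
illustration.

```python
from html.parser import HTMLParser


class TagStream(HTMLParser):
    """Flatten an HTML document into an ordered list of (kind, value) events."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # html.parser normalizes tag names to lower case.
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("text", data))

    def handle_comment(self, data):
        self.events.append(("comment", data))


def last_b_text_before_comment(html, marker):
    """Return the text of the last <b> element that closes before the first
    comment containing `marker`, or None if there is no such element or no
    such comment. Ancestry is ignored; only document order matters."""
    parser = TagStream()
    parser.feed(html)
    parser.close()

    last_b = None    # text of the most recently completed <b>
    current = None   # text pieces of the <b> that is currently open, if any
    for kind, value in parser.events:
        if kind == "start" and value == "b":
            current = []
        elif kind == "end" and value == "b" and current is not None:
            last_b = "".join(current)
            current = None
        elif kind == "text" and current is not None:
            current.append(value)
        elif kind == "comment" and marker in value:
            return last_b
    return None


# Hypothetical sample document: the wanted <b> is nested inside <p>, and a
# later <b> follows the marker comment.
sample = """<body><div><b>first</b><p>filler <b>second</b></p></div>
<!-- price-block --><b>third</b></body>"""

print(last_b_text_before_comment(sample, "price-block"))  # prints: second
```

The scan never looks at parent-child structure, which is the point: the
"last X before a Y" condition falls out of a single pass over the
flattened events. A pull-style tokenizer in Perl, such as
HTML::TokeParser, would presumably support the same shape of loop.
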

CSS selectors are a tad limited, though, and I don't think they could
handle the ordering condition I mentioned in my example above.

Thanks everyone for the suggestions. I'll follow up to the list after
I've tried them out.

 -Tom

--
Tom Metro
Venture Logic, Newton, MA, USA
"Enterprise solutions through open source."
Professional Profile: http://tmetro.venturelogic.com/

_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm

