Dan Boger wrote:
> Maybe something like XML::Twig's get_xpath?

William Ricker wrote:
>> What's your favorite HTML parsing module?
> 
> XML::Twig is the grep for XML (and bundles with xml_grep(1)).

Looks like we have several recommendations for XML::Twig. Thanks. I'll 
check it out.


> You can make reg-ex-like Xpath-like queries on the HTML document with
> it and let its pattern engine walk the tree looking for twigs that
> match your query.

It's been a few years since I looked at XPath, but I seem to recall that 
it was originally inspired by SQL. If that's correct, that doesn't 
strike me as being very RegEx-like.

I also get the impression that it is going to be organized around 
queries that span parent-child relationships. So, for example, if I want 
to find an A tag contained in a DIV, contained by one or more wild card 
tags, all inside of a BODY tag, it'd be good at that.

But if I want to find the text inside B tags that happens to occur 
before a comment tag containing a specific string, it may not be 
possible to encode that in one query. But I haven't looked into it yet...

With screen scraping, you're just as likely to get semantic hints from 
the ordering of tags as you are from the ancestry. And as with 
traditional regular expression techniques for extracting data, you often 
want to find the last X that occurs before a Y, or other ordering 
relationships, ignoring other aspects of structure in the document.
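
To make the ordering case concrete, here is a minimal sketch of "the 
last B before a comment containing some string", written against the 
parse events rather than the tree. It uses Python's standard 
html.parser purely for illustration (it is not the API of any of the 
Perl modules mentioned in this thread), and the class name, function 
name and marker string are all hypothetical.

```python
from html.parser import HTMLParser


class LastBoldBeforeComment(HTMLParser):
    """Track the most recent <b> text; capture it when the marker comment shows up."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker      # hypothetical string to look for inside a comment
        self.in_b = False         # are we currently inside a <b> element?
        self.current = []         # text pieces of the <b> being read
        self.last_b = None        # text of the most recently closed <b>
        self.result = None        # answer, set when the marker comment is seen

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <B> and <b> look the same here.
        if tag == "b":
            self.in_b = True
            self.current = []

    def handle_endtag(self, tag):
        if tag == "b" and self.in_b:
            self.in_b = False
            self.last_b = "".join(self.current)

    def handle_data(self, data):
        if self.in_b:
            self.current.append(data)

    def handle_comment(self, data):
        # Only the first matching comment counts; ancestry is ignored entirely.
        if self.result is None and self.marker in data:
            self.result = self.last_b


def last_bold_before_comment(html, marker):
    """Return the text of the last <b> seen before the first comment containing marker."""
    parser = LastBoldBeforeComment(marker)
    parser.feed(html)
    parser.close()
    return parser.result
```

The point of the sketch is that document order is the only thing 
consulted: the B and the comment do not need to share a parent, or 
have any particular structural relationship.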


Dan Boger wrote:
> I certainly use TreeBuilder a lot - not sure what kind of API you're
> looking for?

As I mentioned, the holy grail is perhaps a regular-expression style 
language that contains directives not only for indicating parent-child 
relationships, but that can also operate on the document as simply a 
stream of tags, with of course the added intelligence of recognizing and 
normalizing the tags.
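
As a rough approximation of that idea, one could normalize the 
document into a flat tag stream first and then run an ordinary regular 
expression over it. The sketch below assumes Python's standard 
html.parser and re modules, used only for illustration; the names 
TagStream, normalize and last_b_before are hypothetical, and dropping 
attributes is a simplifying assumption, not a recommendation.

```python
import re
from html import escape
from html.parser import HTMLParser


class TagStream(HTMLParser):
    """Flatten HTML into normalized tokens: lowercase tags, no attributes."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"<{tag}>")          # attributes are dropped on purpose

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        text = " ".join(data.split())           # collapse runs of whitespace
        if text:
            # Re-escape so literal '<' in text cannot be mistaken for a tag.
            self.tokens.append(escape(text, quote=False))

    def handle_comment(self, data):
        self.tokens.append(f"<!--{data.strip()}-->")


def normalize(html):
    """Return the document as one normalized tag-stream string."""
    parser = TagStream()
    parser.feed(html)
    parser.close()
    return "".join(parser.tokens)


def last_b_before(html, marker):
    """Text of the <b> that directly precedes (no other <b> or comment in
    between) a comment containing marker, or None if there is no match."""
    pattern = re.compile(
        r"<b>([^<]*)</b>"                 # a B element with plain text content
        r"(?:(?!<b>|<!--).)*"             # anything except another B or a comment
        r"<!--(?:(?!-->).)*"              # the comment, up to ...
        + re.escape(marker),              # ... the marker string inside it
        re.DOTALL,
    )
    match = pattern.search(normalize(html))
    return match.group(1) if match else None
```

Once the stream is normalized, the ordering condition becomes a plain 
regex problem, at the cost of giving up any notion of nesting.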


mirod wrote:
> If you prefer CSS selectors to XPath, then you can use
> HTML::TreeBuilder::Select...which translate a
> CSS selector into an XPath expression.

Interesting that you should mention CSS selectors, as that is one of the 
things that came to mind when thinking of how an RE language for tags 
might be constructed. Though CSS selectors are a tad limited, and I 
don't think they could handle the ordering condition I mentioned in my 
example above.

Thanks everyone for the suggestions. I'll follow up on the list after 
I've tried them out.

  -Tom

-- 
Tom Metro
Venture Logic, Newton, MA, USA
"Enterprise solutions through open source."
Professional Profile: http://tmetro.venturelogic.com/
 
_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm
