Re: [Boston.pm] HTML parsing

Dan Boger Tue, 20 Mar 2007 08:12:19 -0800

On Mon, Mar 19, 2007 at 02:36:36PM -0400, Tom Metro wrote:
> Modules like HTML::TreeBuilder don't buy you much, as you're still
> left with the task of walking the tree and implementing a state
> machine.  HTML::Element, which is used with HTML::TreeBuilder to
> operate on nodes and traverse the tree, provides methods to test
> parent-child relationships ($h->is_inside('tag'),
> $h->look_down(_tag => 'tag')) and adjacency ($h->left(),
> $h->right()), which should make the job simpler, but in the example
> above they still may be of little help if the two tags you are
> looking for are merely "distant cousins."
> 
> The closest I found to meeting the requirements of my example is
> covered in the "Complex Criteria in Tree Scanning" section of this
> article on using HTML::TreeBuilder:
> http://search.cpan.org/~petek/HTML-Tree-3.23/lib/HTML/Tree/Scanning.pod#Complex_Criteria_in_Tree_Scanning
> 
> where the look_down() method is used in conjunction with criteria
> specified as code references, which can then tease out complex
> relationships among tags. But this is just another way of
> hand-rolling a state machine with slightly cleaner syntax.
> 
> 
> It seems like what is missing is a module that provides a
> regular-expression style language for matching against tags. It would
> make screen scraping tasks almost trivial. Anyone know of a module
> like this?
> 
> What's your favorite HTML parsing module?


I use HTML::TreeBuilder a lot, though I'm not sure what kind of API
you're looking for.  Maybe something like XML::Twig's get_xpath?  Of
course, given the quality of HTML in the wild, it might be difficult to
get a typical page loaded into an XML parser in the first place...
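
For what it's worth, here is a rough sketch of the look_down()-with-a-
code-ref approach for the "distant cousins" case, as I understand it.
The sample markup, the "Prices" heading, and the variable names are all
made up for illustration; it only assumes HTML::TreeBuilder is installed.
It picks out the table cells whose enclosing <div> directly follows a
<div> containing an <h2> that reads "Prices":

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample page: each heading and its table live in separate
# <div>s, so the <h2> and the <td>s are cousins, not parent and child.
my $html = <<'HTML';
<html><body>
<div><h2>Prices</h2></div>
<div><table><tr><td>Widget</td><td>$4.00</td></tr></table></div>
<div><h2>Stock</h2></div>
<div><table><tr><td>Widget</td><td>12</td></tr></table></div>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# look_down() accepts attribute/value pairs and code refs; a node must
# satisfy every criterion to be returned.
my @cells = $tree->look_down(
    _tag => 'td',
    sub {
        my $td = shift;

        # Climb to the enclosing table, then to the <div> wrapping it.
        my $table = $td->look_up(_tag => 'table') or return 0;
        my $div   = $table->parent               or return 0;

        # left() in list context returns all earlier siblings; keep only
        # element nodes (text nodes are plain strings) and take the last.
        my @before = grep { ref } $div->left;
        my $prev   = $before[-1] or return 0;

        # Accept the cell only if that sibling holds the wanted heading.
        return $prev->look_down(
            _tag => 'h2',
            sub { $_[0]->as_text eq 'Prices' },
        ) ? 1 : 0;
    },
);

my @texts = map { $_->as_text } @cells;
print "$_\n" for @texts;

$tree->delete;    # free the tree explicitly
```

It works, but as Tom says, the code ref is still a small hand-written
state machine; an XPath-style query would say the same thing in one line.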

Dan

-- 
Dan Boger
[EMAIL PROTECTED]
 
_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm
