Dan Boger wrote:
> On Mon, Mar 19, 2007 at 02:36:36PM -0400, Tom Metro wrote:
>> Modules like HTML::TreeBuilder don't buy you much, as you're still
>> left with the task of walking the tree and implementing a state
>> machine.  HTML::Element, which is used with HTML::TreeBuilder to
>> operate on nodes and traverse the tree, provides methods to test
>> parent-child relationships ($h->is_inside('tag'), $h->look_down('tag'))
>> and adjacency ($h->left(), $h->right()), which should make the job
>> simpler, but in the example above they still may be of little help if
>> the two tags you are looking for are merely "distant cousins."
>>
>> The closest I found to meeting the requirements of my example is
>> covered in the "Complex Criteria in Tree Scanning" section of this
>> article on using HTML::TreeBuilder:
>> http://search.cpan.org/~petek/HTML-Tree-3.23/lib/HTML/Tree/Scanning.pod#Complex_Criteria_in_Tree_Scanning
>>
>> where the look_down() method is used in conjunction with criteria
>> specified as code references, which can then tease out complex
>> relationships among tags. But this is just another way of hand-rolling
>> a state machine with slightly cleaner syntax.
>>
>>
>> It seems like what is missing is a module that provides a
>> regular-expression style language for matching against tags. It would
>> make screen scraping tasks almost trivial. Anyone know of a module
>> like this?
>>
>> What's your favorite HTML parsing module?
> 
> I certainly use TreeBuilder a lot - not sure what kind of API you're
> looking for?  Maybe something like XML::Twig's get_xpath?  Of course,
> with the quality of HTML in the wild, it might be difficult to get it
> loaded into an XML parser...

Well, you can always use the bastard child of XML::XPath and
HTML::TreeBuilder: HTML::TreeBuilder::XPath, which adds XPath support
(from XML::XPathEngine, a version of XML::XPath without the XML parsing
bit) to HTML::TreeBuilder.

http://search.cpan.org/dist/HTML-TreeBuilder-XPath/
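
To give an idea of what that buys you, here is a minimal sketch. The
HTML snippet, the class names and the XPath expression are all made up
for illustration, and it assumes HTML::TreeBuilder::XPath is installed
from CPAN. It grabs every link nested anywhere inside one particular
div, without walking the tree by hand:

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;   # CPAN module, assumed to be installed

# Made-up HTML, for illustration only.
my $html = <<'HTML';
<html><body>
  <div class="nav"><a href="/home">Home</a></div>
  <div class="news">
    <h2>Headlines</h2>
    <ul>
      <li><a href="/story/1">First story</a></li>
      <li><a href="/story/2">Second story</a></li>
    </ul>
  </div>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);

# Every <a> anywhere below the "news" div, however deeply nested.
my @links = $tree->findnodes('//div[@class="news"]//a');

# Copy out what we need before the tree is freed.
my @found = map { [ $_->attr('href'), $_->as_text ] } @links;
printf "%s => %s\n", @$_ for @found;

$tree->delete;   # release the tree's memory
```

The nodes that come back are ordinary HTML::Element objects, so the
usual methods (attr, as_text, look_down and so on) still work on them.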

If you prefer CSS selectors to XPath, you can use
HTML::TreeBuilder::Select, which amusingly is built on top of
HTML::TreeBuilder::XPath and HTML::Selector::XPath; the latter
translates a CSS selector into an XPath expression.

http://search.cpan.org/dist/HTML-TreeBuilder-Select
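
The idea can be sketched by wiring the two underlying modules together
by hand. This is only an illustration of the principle, not the actual
code of HTML::TreeBuilder::Select. The HTML and the selector are made
up, and it assumes HTML::Selector::XPath and HTML::TreeBuilder::XPath
are both installed:

```perl
use strict;
use warnings;
use HTML::Selector::XPath;      # CSS selector -> XPath translation
use HTML::TreeBuilder::XPath;   # XPath queries on an HTML tree

# Made-up HTML, for illustration only.
my $html = <<'HTML';
<html><body>
  <div class="nav"><a href="/home">Home</a></div>
  <div class="news">
    <ul>
      <li><a href="/story/1">First story</a></li>
      <li><a href="/story/2">Second story</a></li>
    </ul>
  </div>
</body></html>
HTML

# Step 1: translate the (hypothetical) CSS selector into XPath.
my $css   = 'div.news li a';
my $xpath = HTML::Selector::XPath->new($css)->to_xpath;
print "CSS:   $css\nXPath: $xpath\n";

# Step 2: run the generated XPath against the parsed tree.
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);
my @hits = map { $_->attr('href') } $tree->findnodes($xpath);
print "$_\n" for @hits;

$tree->delete;   # release the tree's memory
```

The generated XPath tends to be a lot more verbose than the selector
it came from, which is a good argument for letting a module write it.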

-- 
mirod
 
_______________________________________________
Boston-pm mailing list
http://mail.pm.org/mailman/listinfo/boston-pm
