It's been a few years since I had to write some code to screen scrape 
HTML. In the past I've always used an HTML::Parser subclass. It works, 
but creating a state machine for each fragment of data I want to extract 
is laborious and not very elegant.

So when the need came up today, I took a look on CPAN to see what new 
modules might be out there.

I first ran across HTML::SimpleParse, which looks like it is no longer 
being maintained. (The author includes a note in the man page saying 
that it was created due to a misunderstanding of how HTML::Parser 
works.) In any case, it takes a similar approach to HTML::Parser.

HTML::PullParser and HTML::TokeParser are built on HTML::Parser and just 
provide a different flavor of API at the same level of abstraction as 
HTML::Parser. I've looked at these in the past. You're still left 
examining each tag and creating a state machine to track their 
relationships.
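
To make the state-machine point concrete, here is a minimal sketch using 
HTML::TokeParser. The helper name b_text_before_comment and the sample 
HTML are made up for illustration, and it simply returns the last B 
element seen before the comment, which is a looser test than "immediately 
precedes."

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: returns the text of the last <b>...</b> seen
# before the first comment matching $comment_re, or nothing if the
# comment never shows up.
sub b_text_before_comment {
    my ($html, $comment_re) = @_;
    my $p = HTML::TokeParser->new(\$html)
        or die "Can't create parser";

    my $in_b   = 0;     # state: currently inside a <b> element?
    my $buffer = '';    # text collected for the current <b>
    my $last_b;         # text of the most recently closed <b>

    while (my $token = $p->get_token) {
        my ($type, $value) = @$token;    # $value: tag name or text
        if ($type eq 'S' && $value eq 'b') {
            $in_b   = 1;
            $buffer = '';
        }
        elsif ($type eq 'T' && $in_b) {
            $buffer .= $value;
        }
        elsif ($type eq 'E' && $value eq 'b' && $in_b) {
            $in_b   = 0;
            $last_b = $buffer;
        }
        elsif ($type eq 'C' && $value =~ $comment_re) {
            return $last_b;
        }
    }
    return;
}

# Made-up sample page.
my $html = <<'END_HTML';
<p><b>Widget</b> and <b>Deluxe Gadget</b></p>
<!-- start product photo -->
<img src="gadget.jpg">
END_HTML

my $name = b_text_before_comment($html, qr/start product photo/);
print defined $name ? "$name\n" : "no match\n";
```

Even for this one small fragment there are three pieces of state to keep 
in sync, and each new fragment needs its own variation on the loop.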

HTML::TagParser provides a DOM-style API. Nice, but not particularly 
useful for screen scraping, where you need to do things like "find the 
text inside B tags that immediately precedes the comment matching 
m/start product photo/." The order and relative position of elements 
are important.

WWW::Mechanize, often mentioned for screen scraping tasks, seems to be 
optimized for multiple page interaction with methods specifically for 
dealing with forms, links, and images. It doesn't appear to offer any 
fine-grained HTML parsing capability.

Modules like HTML::TreeBuilder don't buy you much, as you're still left 
with the task of walking the tree and implementing a state machine. 
HTML::Element, which is used with HTML::TreeBuilder to operate on nodes 
and traverse the tree, provides methods to test parent-child 
relationships ($h->is_inside('tag'), $h->look_down(_tag => 'tag')) and 
adjacency ($h->left(), $h->right()), which should make the job simpler, 
but in the example above they may still be of little help if the two 
tags you are looking for are merely "distant cousins."

The closest I found to meeting the requirements of my example is covered 
in the "Complex Criteria in Tree Scanning" section of this article on 
using HTML::TreeBuilder:
http://search.cpan.org/~petek/HTML-Tree-3.23/lib/HTML/Tree/Scanning.pod#Complex_Criteria_in_Tree_Scanning

where the look_down() method is used in conjunction with criteria 
specified as code references, which can then tease out complex 
relationships among tags. But this is just another way of hand-rolling a 
state machine, with slightly cleaner syntax.
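
As a rough sketch of that approach (not code from the article), the 
following uses look_down() with a code reference to find B elements whose 
next sibling is the matching comment. The helper name 
b_text_left_of_comment and the sample HTML are made up, and it only works 
when the B tag and the comment are direct siblings with nothing, not even 
whitespace, between them.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: returns the text of every <b> whose next
# sibling is a comment matching $comment_re.
sub b_text_left_of_comment {
    my ($html, $comment_re) = @_;

    my $tree = HTML::TreeBuilder->new;
    $tree->store_comments(1);    # comments are discarded by default
    $tree->parse($html);
    $tree->eof;

    my @found = $tree->look_down(
        _tag => 'b',
        sub {
            # Scalar context: the nearest right-hand sibling, if any.
            my $next = $_[0]->right;
            # Text siblings are plain strings, not objects, so check
            # ref() before calling methods. Comments are stored as
            # pseudo-elements tagged '~comment' with a 'text' attribute.
            return ref($next)
                && $next->tag eq '~comment'
                && $next->attr('text') =~ $comment_re;
        },
    );

    my @text = map { $_->as_text } @found;
    $tree->delete;    # release the tree
    return @text;
}

# Made-up sample page.
my $html = <<'END_HTML';
<div><b>Widget</b></div>
<div><b>Deluxe Gadget</b><!-- start product photo --><img src="gadget.jpg"></div>
END_HTML

print "$_\n" for b_text_left_of_comment($html, qr/start product photo/);
```

If the comment sat one level up or down from the B tag, the code 
reference would have to walk the tree itself, which is the "distant 
cousins" problem again.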


It seems like what is missing is a module that provides a 
regular-expression style language for matching against tags. It would 
make screen scraping tasks almost trivial. Anyone know of a module like 
this?

What's your favorite HTML parsing module?

  -Tom

-- 
Tom Metro
Venture Logic, Newton, MA, USA
"Enterprise solutions through open source."
Professional Profile: http://tmetro.venturelogic.com/
 
_______________________________________________
Boston-pm mailing list
http://mail.pm.org/mailman/listinfo/boston-pm