[Boston.pm] HTML parsing

Tom Metro Mon, 19 Mar 2007 14:49:14 -0800

It's been a few years since I had to write some code to screen scrape 
HTML. In the past I've always used an HTML::Parser subclass. It works, 
but creating a state machine for each fragment of data I want to extract 
is laborious and not very elegant.

So when the need came up today, I took a look on CPAN to see what new
modules might be out there.

I first ran across HTML::SimpleParse, which looks like it is no longer
being maintained. (The author includes a note in the man page saying
that it was created due to a misunderstanding of how HTML::Parser
works.) In any case, it takes a similar approach to HTML::Parser.

HTML::PullParser and HTML::TokeParser are built on HTML::Parser and just
provide a different flavor of API at the same level of abstraction as
HTML::Parser. I've looked at these in the past. You're still left
examining each tag, and creating a state machine to track their
relationships.
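
To make the state-machine point concrete, here is a minimal sketch using
HTML::TokeParser. The sample markup, the sub name
b_text_before_comment(), and the choice to return the nearest B text
seen before the matching comment are all assumptions made for
illustration, not anything taken from a real page.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample page, invented for this sketch.
my $html = <<'HTML';
<div><b>Ignored</b><p>filler</p></div>
<div><b>Widget 3000</b></div>
<!-- start product photo -->
<img src="widget.jpg">
HTML

# Return the text of the last <b>...</b> seen before the first comment
# matching $comment_re, or undef if no such comment is found.
sub b_text_before_comment {
    my ($html, $comment_re) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "cannot create parser";

    # The hand-rolled state: are we inside <b>, what text has been
    # collected so far, and what did the last completed <b> contain?
    my ($in_b, $buf, $last_b) = (0, '', undef);

    while (my $t = $p->get_token) {
        my $type = $t->[0];
        if    ($type eq 'S' && $t->[1] eq 'b') { $in_b = 1; $buf = ''; }
        elsif ($type eq 'E' && $t->[1] eq 'b') { $in_b = 0; $last_b = $buf; }
        elsif ($type eq 'T' && $in_b)          { $buf .= $t->[1]; }
        elsif ($type eq 'C' && $t->[1] =~ $comment_re) { return $last_b; }
    }
    return;
}

my $name = b_text_before_comment($html, qr/start product photo/);
print defined $name ? "$name\n" : "no match\n";
```

Every new fragment to extract needs its own set of flags and buffers
like these, which is where the tedium comes from.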

HTML::TagParser provides a DOM-style API. Nice, but not particularly
useful for screen scraping, where you need to do things like "find the
text inside B tags that immediately precedes the comment matching
m/start product photo/." Order and the relative position of elements are
important.

WWW::Mechanize, often mentioned for screen scraping tasks, seems to be
optimized for multiple page interaction with methods specifically for
dealing with forms, links, and images. It doesn't appear to offer any
fine-grained HTML parsing capability.

Modules like HTML::TreeBuilder don't buy you much, as you're still left
with the task of walking the tree and implementing a state machine.
HTML::Element, which is used with HTML::TreeBuilder to operate on nodes
and traverse the tree, provides methods to test parent-child
relationships ($h->is_inside('tag'), $h->look_down(_tag => 'tag')) and
adjacency ($h->left(), $h->right()), which should make the job simpler,
but in the example above they still may be of little help if the two
tags you are looking for are merely "distant cousins."
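
For reference, a small sketch of those HTML::Element methods on a
made-up fragment (the markup and variable names are assumptions for
illustration). It works here only because the B and I elements are
direct siblings inside the same DIV:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical fragment in which the elements of interest are siblings.
my $html = '<html><body><div id="item"><b>Widget 3000</b>'
         . '<i>in stock</i></div></body></html>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# look_down() takes attribute/value criteria; _tag selects by tag name.
my $bold = $tree->look_down(_tag => 'b');

# Parent-child test: is the B element somewhere inside a DIV?
my $inside_div = $bold->is_inside('div') ? 'yes' : 'no';

# Adjacency: in scalar context left()/right() return the immediate
# sibling, or undef if there is none.
my $next = $bold->right;
my $prev = $bold->left;

print "text:       ", $bold->as_text, "\n";
print "inside div: $inside_div\n";
print "right:      ", (ref $next ? $next->tag : 'none'), "\n";
print "left:       ", (ref $prev ? $prev->tag : 'none'), "\n";
```

Once the two elements live in different subtrees, neither left() nor
right() nor is_inside() relates them directly.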

The closest I found to meeting the requirements of my example is covered
in the "Complex Criteria in Tree Scanning" section of this article on
using HTML::TreeBuilder:
http://search.cpan.org/~petek/HTML-Tree-3.23/lib/HTML/Tree/Scanning.pod#Complex_Criteria_in_Tree_Scanning

where the look_down() method is used in conjunction with criteria
specified as code references, which can then tease out complex
relationships among tags. But this is just another way of hand rolling a
state machine with a bit cleaner syntax.
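
A rough sketch of that approach applied to the "B before the comment"
example follows. It is not taken from the article. The sample markup and
the sub name find_b_before_comment() are assumptions, and it relies on
look_down() visiting nodes in pre-order (document order) and on
store_comments(1) keeping comments in the tree as '~comment' elements.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical page where the B and the comment are "distant cousins".
my $html = <<'HTML';
<html><body>
<table><tr><td><b>Widget 3000</b></td></tr></table>
<div><!-- start product photo --><img src="widget.jpg"></div>
</body></html>
HTML

# Return the text of the last B element that occurs, in document
# order, before the first comment matching $comment_re (else undef).
sub find_b_before_comment {
    my ($html, $comment_re) = @_;

    my $tree = HTML::TreeBuilder->new;
    $tree->store_comments(1);    # comments are discarded by default
    $tree->parse($html);
    $tree->eof;

    my $last_b;                  # state carried between callback calls
    my $comment = $tree->look_down(sub {
        my $el  = shift;
        my $tag = $el->tag;
        $last_b = $el if $tag eq 'b';
        # A true return value stops the scalar-context search here.
        return $tag eq '~comment' && $el->attr('text') =~ $comment_re;
    });

    my $text = ($comment && $last_b) ? $last_b->as_text : undef;
    $tree->delete;               # free the tree explicitly
    return $text;
}

my $name = find_b_before_comment($html, qr/start product photo/);
print defined $name ? "$name\n" : "no match\n";
```

The $last_b variable captured by the closure is the state machine in
miniature: the syntax is tidier, but the bookkeeping is still manual.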

It seems like what is missing is a module that provides a
regular-expression style language for matching against tags. It would
make screen scraping tasks almost trivial. Anyone know of a module like
this?

What's your favorite HTML parsing module?

-Tom

--
Tom Metro
Venture Logic, Newton, MA, USA
"Enterprise solutions through open source."
Professional Profile: http://tmetro.venturelogic.com/

_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm
