Re: [Boston.pm] HTML parsing

Charlie Reitzel Tue, 20 Mar 2007 07:55:16 -0800

For my sins, I always do this stuff by hand.  Sure, I try to be somewhat 
general purpose and have built up a number of helpful functions to extract 
bits o' text.  Usually it's element text and attribute values.  Fwiw, I 
usually end up not using regular expressions.  I am handy with regexes and 
use them all the time for data/log mangling.  But, in production code, 
reaching for one usually means I haven't thought about the problem quite 
right.
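
To give a flavor of what such helper functions might look like, here is a 
minimal sketch in Java.  It is not the code referred to above: the class 
name HtmlSnip and the methods elementText and attributeValue are 
hypothetical, made up for illustration.  It assumes reasonably tidy markup 
(no '>' inside attribute values, no spaces around '=', no nested elements 
of the same name) and uses plain index searches instead of regular 
expressions.

```java
/** Hypothetical helpers for pulling text out of HTML by hand (illustrative only). */
class HtmlSnip {

    /** Case-insensitive indexOf; returns -1 when the needle is not found. */
    static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) return i;
        }
        return -1;
    }

    /** Index of the first "<tag" at or after 'from' that is a whole tag name (so "b" skips "<body"). */
    static int findOpenTag(String html, String tag, int from) {
        String needle = "<" + tag;
        int i = indexOfIgnoreCase(html, needle, from);
        while (i >= 0) {
            int after = i + needle.length();
            if (after < html.length()) {
                char c = html.charAt(after);
                if (c == '>' || c == '/' || Character.isWhitespace(c)) return i;
            }
            i = indexOfIgnoreCase(html, needle, i + 1);
        }
        return -1;
    }

    /** Trimmed content between the first matching open tag and the next closing tag, or null. */
    public static String elementText(String html, String tag, int from) {
        int open = findOpenTag(html, tag, from);
        if (open < 0) return null;
        int start = html.indexOf('>', open);
        if (start < 0) return null;
        int end = indexOfIgnoreCase(html, "</" + tag + ">", start);
        if (end < 0) return null;
        return html.substring(start + 1, end).trim();
    }

    /** Value of 'attr' on the first matching open tag (quoted or unquoted), or null. */
    public static String attributeValue(String html, String tag, String attr, int from) {
        int open = findOpenTag(html, tag, from);
        if (open < 0) return null;
        int close = html.indexOf('>', open);
        if (close < 0) return null;
        String inside = html.substring(open, close);
        String key = attr + "=";
        int a = indexOfIgnoreCase(inside, key, 0);
        // Skip hits such as "data-href=" when looking for "href=".
        while (a > 0 && !Character.isWhitespace(inside.charAt(a - 1))) {
            a = indexOfIgnoreCase(inside, key, a + 1);
        }
        if (a <= 0) return null;
        int v = a + key.length();
        if (v >= inside.length()) return "";
        char q = inside.charAt(v);
        if (q == '"' || q == '\'') {
            int endQ = inside.indexOf(q, v + 1);
            return endQ < 0 ? null : inside.substring(v + 1, endQ);
        }
        int e = v; // unquoted value runs to the next whitespace or the end of the tag
        while (e < inside.length() && !Character.isWhitespace(inside.charAt(e))) e++;
        return inside.substring(v, e);
    }

    public static void main(String[] args) {
        String page = "<html><body><B class=\"name\">Widget</B>"
                    + "<img src='widget.jpg' alt=photo></body></html>";
        System.out.println(elementText(page, "b", 0));             // Widget
        System.out.println(attributeValue(page, "img", "src", 0)); // widget.jpg
        System.out.println(attributeValue(page, "img", "alt", 0)); // photo
    }
}
```

The 'from' argument lets a caller step through a page by passing in the 
position where the previous match was found, which is one simple way to 
respect document order without building a tree.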


Anyway, I have found this is very quick to implement, runs very fast and is 
reliable.  I give it to other folks who appear to have no problems 
maintaining it.  Lately, I have been doing this stuff in Java.  But, at the 
end of the day, it's just another scripting language!  Of course, you need 
C/C++ for real quality parsing code ... ;->

At 02:36 PM 3/19/2007 -0400, Tom Metro wrote:
>It's been a few years since I had to write some code to screen scrape
>HTML. In the past I've always used an HTML::Parser subclass. It works,
>but creating a state machine for each fragment of data I want to extract
>is laborious and not very elegant.
>
>So when the need came up today, I took a look on CPAN to see what new
>modules might be out there.
>
>I first ran across HTML::SimpleParse, which looks like it is no longer
>being maintained. (The author includes a note in the man page saying
>that it was created due to a misunderstanding of how HTML::Parser
>works.) In any case, it takes a similar approach to HTML::Parser.
>
>HTML::PullParser and HTML::TokeParser are built on HTML::Parser and just
>provide a different flavor of API, but the same level of abstraction as
>HTML::Parser. I've looked at these in the past. You're still left
>examining each tag, and creating a state machine to track their
>relationships.
>
>HTML::TagParser provides a DOM-style API. Nice, but not particularly
>useful for screen scraping, where you need to do things like "find the
>text inside B tags that immediately precedes the comment matching
>m/start product photo/." Order and the relative position of elements are
>important.
>
>WWW::Mechanize, often mentioned for screen scraping tasks, seems to be
>optimized for multiple page interaction with methods specifically for
>dealing with forms, links, and images. It doesn't appear to offer any
>fine-grained HTML parsing capability.
>
>Modules like HTML::TreeBuilder don't buy you much, as you're still left
>with the task of walking the tree and implementing a state machine.
>HTML::Element, which is used with HTML::TreeBuilder to operate on nodes
>and traverse the tree, provides methods to test parent-child
>relationships ($h->is_inside('tag'),$h->look_down('tag')) and adjacency
>($h->left(), $h->right()), which should make the job simpler, but in the
>example above they still may be of little help if the two tags you are
>looking for are merely "distant cousins."
>
>The closest I found to meeting the requirements of my example is covered
>in the "Complex Criteria in Tree Scanning" section of this article on
>using HTML::TreeBuilder:
>http://search.cpan.org/~petek/HTML-Tree-3.23/lib/HTML/Tree/Scanning.pod#Complex_Criteria_in_Tree_Scanning
>
>where the look_down() method is used in conjunction with criteria
>specified as code references, which can then tease out complex
>relationships among tags. But this is just another way of hand rolling a
>state machine with a bit cleaner syntax.
>
>
>It seems like what is missing is a module that provides a
>regular-expression style language for matching against tags. It would
>make screen scraping tasks almost trivial. Anyone know of a module like
>this?
>
>What's your favorite HTML parsing module?
>
>   -Tom
>
>--
>Tom Metro
>Venture Logic, Newton, MA, USA
>"Enterprise solutions through open source."
>Professional Profile: http://tmetro.venturelogic.com/
>
>_______________________________________________
>Boston-pm mailing list
>[email protected]
>http://mail.pm.org/mailman/listinfo/boston-pm
 
