Hello,
I recently started working on a project that needs to extract data from
HTML documents of a known format. I started out using HTML::TokeParser for the
job, and very quickly found that having some awareness of tag nesting would
be very useful. At first, I tried subclassing HTML::TokeParser to do the
job, but it quickly became clear that such functionality sits far better
directly on top of HTML::PullParser itself.
I therefore developed the HTML::PullParser::Nested class. It wholly
encapsulates HTML::PullParser, and provides a compatible constructor,
get_token() and unget_token(). It additionally provides three new methods,
push_nest($token), pop_nest() and eol().
After calling push_nest($token), where $token is some start tag, the class
then keeps track of start and end tags with the same tagname, and behaves
as if the document ended immediately before the corresponding end tag.
pop_nest() jumps to the closing end tag, and moves to the parent nesting
level. Finally, eol() acts like an eof marker for the current level.
get_token() will return undef for the first read at the end of a section,
whereupon eol() will start to return true. Subsequent calls to get_token()
will then raise an error.
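
The nesting idea described above can be illustrated with a minimal sketch. This is not the implementation in HTML::PullParser::Nested; it only uses the documented HTML::TokeParser token format, and read_nested() is a hypothetical helper written for this example. It counts start and end tags that share the tag name of the opening tag, and stops at the matching end tag.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper (an assumption for illustration, not part of any module):
# given a parser whose last-read token was a start tag named $tagname,
# collect every token up to the matching end tag.
sub read_nested {
    my ($parser, $tagname) = @_;
    my $depth = 1;    # the opening start tag has already been read
    my @inner;
    while (my $token = $parser->get_token) {
        my ($type, $name) = @$token;
        if    ($type eq 'S' && $name eq $tagname) { $depth++ }
        elsif ($type eq 'E' && $name eq $tagname) { $depth-- }
        # The matching end tag is consumed here but not returned.
        last if $depth == 0;
        push @inner, $token;
    }
    return \@inner;
}

my $html   = '<div>a<div>b</div>c</div><p>d</p>';
my $parser = HTML::TokeParser->new(\$html) or die "cannot create parser";

my $first = $parser->get_token;                  # ['S', 'div', ...]
my $inner = read_nested($parser, $first->[1]);   # tokens inside the outer <div>

# Concatenate the text tokens found inside the nested section.
my $text = join '', map { $_->[1] } grep { $_->[0] eq 'T' } @$inner;
print "$text\n";
```

The inner <div>...</div> pair raises and lowers the depth counter, so only the outer closing tag ends the section, and the following <p> is left for the caller to read.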
The code, with a test suite designed to show intended behaviour in unusual
cases, is available from:
http://www.cpan.org/authors/id/C/CJ/CJK/HTML-PullParser-Nested-0.02.tar.gz
I'd be interested to hear any comments or thoughts. Also, is there any
objection to this class's usage of the HTML::PullParser::Nested namespace?
I'm not aware of the norms for packages that invade someone else's
namespace.
Regards,
Christopher Key