Hello,

I recently started working on a project that needs to extract data from HTML documents of a known format. I started out using HTML::TokeParser for the job, and very quickly found that some awareness of tag nesting would be very useful. At first, I tried subclassing HTML::TokeParser to do the job, but it soon became clear that such functionality sits far better directly on top of HTML::PullParser itself.

I therefore developed the HTML::PullParser::Nested class. It wholly encapsulates HTML::PullParser, and provides a compatible constructor, get_token() and unget_token(). It additionally provides three new methods, push_nest($token), pop_nest() and eol().

After calling push_nest($token), where $token is some start tag, the class then keeps track of start and end tags with the same tagname, and behaves as if the document ended immediately before the corresponding end tag. pop_nest() jumps to the closing end tag, and moves to the parent nesting level. Finally, eol() acts like an eof marker for the current level. get_token() will return undef for the first read at the end of a section, whereupon eol() will start to return true. Subsequent calls to get_token() will then raise an error.
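
To illustrate the bookkeeping described above, here is a minimal sketch of the depth counting, written against plain HTML::PullParser. It is not the code from HTML::PullParser::Nested: text_inside() is a hypothetical helper, the sample document is made up, and the argspecs are just one reasonable choice.

```perl
use strict;
use warnings;
use HTML::PullParser;

# Hypothetical helper (not part of HTML::PullParser::Nested): collect the
# text inside an element whose start tag has already been read, stopping
# at the matching end tag.
sub text_inside {
    my ($p, $tagname) = @_;
    my $depth = 1;    # the opening start tag has already been consumed
    my $text  = '';
    while (my $token = $p->get_token) {
        my ($event, $value) = @$token;
        if ($event eq 'start' && $value eq $tagname) {
            $depth++;                    # nested tag with the same name
        }
        elsif ($event eq 'end' && $value eq $tagname) {
            last if --$depth == 0;       # reached the matching end tag
        }
        elsif ($event eq 'text') {
            $text .= $value;
        }
    }
    return $text;
}

# Made-up sample document with a nested <div>.
my $html = '<div>a<div>b</div>c</div><p>d</p>';

my $p = HTML::PullParser->new(
    doc   => \$html,
    start => 'event, tagname',
    end   => 'event, tagname',
    text  => 'event, dtext',
) or die "Cannot create parser";

my $first = $p->get_token;                       # the outer <div> start tag
my $inner = text_inside($p, $first->[1]);
print "$inner\n";                                # prints "abc"
```

Only tags with the same name as the opening tag change the depth, so the inner </div> does not end the section, and the parser is left positioned just after the outer </div>, ready to read the <p> that follows.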

The code, together with a test suite designed to show the intended behaviour in unusual cases, is available from: http://www.cpan.org/authors/id/C/CJ/CJK/HTML-PullParser-Nested-0.02.tar.gz

I'd be interested to hear any comments or thoughts. Also, is there any objection to this class's usage of the HTML::PullParser::Nested namespace? I'm not aware of the norms for packages that invade someone else's namespace.

Regards,

Christopher Key
