HTML::PullParser::Nested

Christopher Key Wed, 10 Mar 2010 01:44:20 -0800

Hello,

I recently started working on a project that needs to extract data from known-format HTML documents. I started out using HTML::TokeParser for the job, and very quickly found that having some awareness of tag nesting would be very useful. At first, I tried subclassing HTML::TokeParser to do the job, but it quickly became clear that such functionality sits far better directly on top of HTML::PullParser itself.

I therefore developed the HTML::PullParser::Nested class. It wholly encapsulates HTML::PullParser, and provides a compatible constructor, get_token() and unget_token(). It additionally provides three new methods, push_nest($token), pop_nest() and eol().

After calling push_nest($token), where $token is some start tag, the class then keeps track of start and end tags with the same tagname, and behaves as if the document ended immediately before the corresponding end tag. pop_nest() jumps to the closing end tag, and moves to the parent nesting level. Finally, eol() acts like an eof marker for the current level. get_token() will return undef for the first read at the end of a section, whereupon eol() will start to return true. Subsequent calls to get_token() will then raise an error.
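
To make the behaviour described above concrete, here is a toy model in plain Perl. It is only a sketch under stated assumptions, not the module's actual code: NestedTokens is a hypothetical class, it reads from a pre-tokenized list rather than wrapping HTML::PullParser, it assumes tokens are array refs of the form ['S', tagname], ['E', tagname] or ['T', text], and it assumes well-nested input. unget_token() is left out.

```perl
use strict;
use warnings;

# Toy model only: NestedTokens is a hypothetical stand-in, not the real class.
package NestedTokens;

sub new {
    my ($class, @tokens) = @_;
    # One entry per nesting level; the outermost level has no tag and
    # only ends when the tokens run out.
    my $top = { tag => undef, depth => 0, eol => 0 };
    return bless { tokens => [@tokens], nests => [$top] }, $class;
}

sub eol { return $_[0]{nests}[-1]{eol} }

# Adjust the open-tag count of every level tracking this token's tag name.
sub _count {
    my ($self, $token) = @_;
    my ($type, $name) = @$token;
    return if $type ne 'S' && $type ne 'E';
    my $step = $type eq 'S' ? 1 : -1;
    for my $nest (@{ $self->{nests} }) {
        $nest->{depth} += $step if defined $nest->{tag} && $nest->{tag} eq $name;
    }
    return;
}

sub get_token {
    my ($self) = @_;
    my $nest = $self->{nests}[-1];
    die "get_token() called past the end of the current level\n" if $nest->{eol};
    my $token = $self->{tokens}[0];    # peek, do not consume yet
    my $closes = defined $token && defined $nest->{tag}
        && $token->[0] eq 'E' && $token->[1] eq $nest->{tag}
        && $nest->{depth} == 0;
    if (!defined $token || $closes) {
        $nest->{eol} = 1;              # first read at the end returns undef
        return undef;
    }
    shift @{ $self->{tokens} };
    $self->_count($token);
    return $token;
}

sub push_nest {
    my ($self, $token) = @_;
    die "push_nest() needs a start tag token\n" unless $token->[0] eq 'S';
    push @{ $self->{nests} }, { tag => $token->[1], depth => 0, eol => 0 };
    return;
}

sub pop_nest {
    my ($self) = @_;
    die "pop_nest() called at the outermost level\n" if @{ $self->{nests} } == 1;
    unless ($self->eol) { 1 while defined $self->get_token }   # skip the rest
    pop @{ $self->{nests} };
    # Consume the closing end tag, unless the document simply ran out.
    my $end = shift @{ $self->{tokens} };
    $self->_count($end) if defined $end;
    return;
}

package main;

my $p = NestedTokens->new(
    [ S => 'ul' ], [ S => 'li' ], [ T => 'one' ], [ E => 'li' ],
    [ S => 'li' ], [ T => 'two' ], [ E => 'li' ], [ E => 'ul' ],
    [ T => 'after' ],
);

my @items;
$p->push_nest($p->get_token);              # enter the <ul>
while (defined(my $token = $p->get_token)) {
    push @items, $token->[1] if $token->[0] eq 'T';
}
$p->pop_nest;                              # eol() was true; step past </ul>
my $next = $p->get_token;
print join(',', @items), " then $next->[1]\n";   # prints "one,two then after"
```

The inner loop never has to look for the closing ul tag itself: the level simply appears to run out of tokens, and pop_nest() resumes reading in the parent level just after that tag.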

The code, with a test suite designed to show intended behaviour in unusual cases, is available from:
http://www.cpan.org/authors/id/C/CJ/CJK/HTML-PullParser-Nested-0.02.tar.gz

I'd be interested to hear any comments or thoughts. Also, is there any objection to this class's usage of the HTML::PullParser::Nested namespace? I'm not aware of the norms for packages that invade someone else's namespace.


Regards,

Christopher Key
