"Christian Gilmore" <[EMAIL PROTECTED]> writes: > I found that writing my own parser to fit my specific need was far > and away the fastest thing I could do. It really depends upon your > specific application. HTML::Parser is nice if you want to see the > structure of the document your parsing but is just too slow to use > for wresting particular tags from a document... True. This was the main reason I started work on a new XS based HTML::Parser a week ago. It should make much of the performance argument go away. Still, most of the HTML that I have ever needed to parse or manipulate is regular enough to make perl REs good enough. Since HTML::Parser is XS based now I'm also able to offer many more features without suffering performance. I have attached a message I sent to the <[EMAIL PROTECTED]> mailing list today describing what's new. Regards, Gisle

I am now up to version 2.99_08 of the new HTML::Parser and I think it
is coming along nicely. As you might guess from the version number, I
am aiming for version 3.00 when I think it is ready for general use.
I still encourage people to download it and test it out on various
platforms (at least check that 'make test' says everything is ok).
You can get it from:

    $CPAN/authors/id/GAAS/HTML-Parser-XS-2.99_08.tar.gz

Compatibility with HTML-Parser-2.2x is now perfect as far as I can
tell. I still reserve the right to change the interfaces to all new
features until 3.00-time. There is still no documentation on the new
things, but the following text attempts to explain most of them:

The main new feature is that instead of making a subclass you can
just provide callbacks to be invoked when various elements are
recognised. When one or more direct callbacks are provided, then no
methods will be called.

There is a new 'default' callback that is invoked with the text of
everything that there is no other callback registered for. This
might for instance be used to implement a simple comment stripper by
code like this:

    use HTML::Parser;

    HTML::Parser->new(comment => sub {},                # ignore
                      default => sub { print $_[0] },
                     )->parse_file(shift);

(I actually thought I was very clever when I realized how handy this
would be, but later found out that XML::Parser already had exactly
this feature. :-)

Text handlers get an extra argument that is true if entities are
already expanded in the text string passed. This was needed to
handle <script>, <style>, <xmp>, <plaintext> correctly and in a way
that was backwards compatible. There is also a boolean parser
attribute called $p->decode_text_entities that can be set to let the
parser always internally decode entities (so _you_ can ignore the
issue).

There is a new boolean parser attribute called $p->keep_case that,
when set to a true value, suppresses downcasing of tag and attribute
names.
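
As a rough illustration of the callback-with-'default' dispatch described above, here is a toy in plain Perl. It is not the XS parser's code and does not use its interface: `toy_parse` and the event names `comment`, `tag` and `text` are made up for this sketch, and the regex tokenizer only copes with simple, regular markup.

```perl
use strict;
use warnings;

# Toy dispatcher (hypothetical, NOT HTML::Parser): split the input into
# comments, tags and text with a regex, then hand each piece to the
# callback registered for that event, falling back to 'default' when no
# specific callback was given.
sub toy_parse {
    my ($html, %cb) = @_;
    while ($html =~ /\G(?:(<!--.*?-->)|(<[^>]*>)|([^<]+)|(<))/gs) {
        my ($event, $text) = defined $1 ? (comment => $1)
                           : defined $2 ? (tag     => $2)
                           :              (text    => defined $3 ? $3 : $4);
        my $handler = $cb{$event} || $cb{default};
        $handler->($text) if $handler;
    }
}

# Comment stripper: swallow comments, copy everything else unchanged.
my $out = '';
toy_parse(
    "<p>Hello <!-- secret -->world</p>",
    comment => sub {},                  # ignore
    default => sub { $out .= $_[0] },   # pass through
);
print "$out\n";
```

Run as written, this prints `<p>Hello world</p>`: the comment goes to its own do-nothing callback, while tags and text have no callback of their own and so fall through to 'default'.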

There is a new boolean parser attribute called $p->xml_mode that
makes the parser recognise XML's empty tags, makes processing
instructions be terminated by "?>" (instead of ">"), and implies
$p->keep_case. This should be enough to parse some simple XML
documents.

There is a new parser attribute called $p->bool_attr_val that can be
set to influence the value set for boolean HTML attributes. If you
don't set this value they will (as before) take the attribute key as
value.

There is a new parser attribute called $p->accum. It takes an array
reference as its value. If set, then all parsed stuff will be
accumulated here in the style of HTML::TokeParser. No callbacks will
be invoked. (HTML::TokeParser is in fact implemented based on this
now.)

HTML::Entities::decode is now implemented by XS code. That makes it
a few times faster.

Other things I am thinking about supporting (soon?):

  - keep track of byte counts and line numbers.

  - an attribute that makes the parser never break text, i.e. that
    you can never get two 'text' callbacks in a row. This will have
    to delay text callbacks until some other element is recognised.

  - attributes that control what will enter the 'accum' array

  - report byte positions within the start tag where the attributes
    and their values live. This should be handy when all you want
    to do is remove/add or change some values while keeping
    everything else unchanged.

  - parsing of marked sections; e.g. "<![CDATA[ ... ]]>"

  - utf8 text (affects what bytes entities are expanded into as well
    as the range of numeric entities that will be expanded.)

Is there anything else anybody has wished for?

Regards,
Gisle