Re: Trying not to re-invent the wheel

Gisle Aas Wed, 10 Nov 1999 14:01:35 -0800

"Christian Gilmore" <[EMAIL PROTECTED]> writes:

> I found that writing my own parser to fit my specific need was far
> and away the fastest thing I could do. It really depends upon your
> specific application. HTML::Parser is nice if you want to see the
> structure of the document you're parsing but is just too slow to use
> for wresting particular tags from a document...

True. This was the main reason I started work on a new XS based
HTML::Parser a week ago.  It should make much of the performance
argument go away.  Still, most of the HTML that I have ever needed to
parse or manipulate is regular enough to make perl REs good enough.
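
To illustrate what "regular enough" can mean in practice, here is a minimal
sketch of pulling links out of machine-generated markup with a single
regular expression.  The sample HTML and the variable names are made up for
the example, and the pattern assumes double-quoted href values with no tags
nested inside the <a> element; it is not a general HTML parser.

```perl
use strict;
use warnings;

# Hypothetical sample: generated markup with a predictable shape.
my $html = <<'EOT';
<ul>
<li><a href="/news/1.html">First story</a></li>
<li><a href="/news/2.html">Second story</a></li>
</ul>
EOT

# Collect [href, link text] pairs.  This relies on double-quoted href
# values and plain text inside <a>...</a>.
my @links;
while ($html =~ m{<a\s+href="([^"]*)"[^>]*>([^<]*)</a>}gi) {
    push @links, [$1, $2];
}

print "$_->[0]\t$_->[1]\n" for @links;
```

Once the markup stops being this predictable (unquoted attributes, comments,
nested tags), a real parser is the safer choice.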

Since HTML::Parser is XS based now, I'm also able to offer many more
features without hurting performance.  I have attached a message I
sent to the <[EMAIL PROTECTED]> mailing list today describing what's
new.

Regards,
Gisle

I am now up to version 2.99_08 of the new HTML::Parser and I think it
is coming along nicely.  As you might guess from the version number, I am
aiming to release version 3.00 once I think it is ready for general use.

I still encourage people to download it and test it out on various
platforms (at least check that 'make test' says everything is ok).
You can get it from:

   $CPAN/authors/id/GAAS/HTML-Parser-XS-2.99_08.tar.gz

Compatibility with HTML-Parser-2.2x is now perfect as far as I can tell.
I still reserve the right to change the interfaces to all new features
until 3.00 is released.  There is still no documentation on the new things,
but the following text attempts to explain most of them:

The main new feature is that instead of making a subclass you can just
provide callbacks to be invoked when various elements are recognised.
When one or more direct callbacks are provided, then no methods will
be called.
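
For contrast, the subclass style that callbacks make optional looks roughly
like the sketch below.  The class name TitleGrabber and the hash keys
in_title and title are invented for the example; the start, end and text
method names are the ones the 2.2x subclass interface calls.

```perl
use strict;
use warnings;

package TitleGrabber;            # hypothetical subclass name
use base 'HTML::Parser';

# Called for each start tag; remember when we are inside <title>.
sub start {
    my ($self, $tag) = @_;
    $self->{in_title} = 1 if $tag eq 'title';
}

# Called for each end tag.
sub end {
    my ($self, $tag) = @_;
    $self->{in_title} = 0 if $tag eq 'title';
}

# Called for text; it may arrive in several pieces, so append.
sub text {
    my ($self, $text) = @_;
    $self->{title} .= $text if $self->{in_title};
}

package main;

my $p = TitleGrabber->new;
$p->parse('<html><head><title>Hello, world</title></head>'
        . '<body><p>Hi</p></body></html>');
$p->eof;
print $p->{title}, "\n";
```

With direct callbacks the same logic can live in a few closures, without
declaring a package at all.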

There is a new 'default' callback that is invoked with the text of
everything that there is no other callback registered for.  This might
for instance be used to implement a simple comment stripper by code
like this:

  HTML::Parser->new(comment => sub {}, # ignore
                    default => sub { print $_[0] },
                   )->parse_file(shift);

(I actually thought I was very clever when I realized how handy this
would be, but later found out that XML::Parser already had exactly
this feature. :-)

Text handlers get an extra argument that is true if entities are
already expanded in the text string passed.  This was needed to handle
<script>, <style>, <xmp>, <plaintext> correctly and in a way that was
backwards compatible.  There is also a boolean parser attribute called
$p->decode_text_entities that can be set to let the parser always
internally decode entities (so _you_ can ignore the issue).

There is a new boolean parser attribute called $p->keep_case that, when
set to a true value, suppresses downcasing of tag and attribute names.

There is a new boolean parser attribute called $p->xml_mode that makes
the parser recognise XML's empty tags, makes processing instructions be
terminated by "?>" (instead of ">"), and implies $p->keep_case. This
should be enough to parse some simple XML documents.

There is a new parser attribute called $p->bool_attr_val that can be
set to influence the value set for boolean HTML attributes.  If you
don't set this value they will (as before) take the attribute key as
value.

There is a new parser attribute called $p->accum.  It takes an array
reference as its value.  If set, then all parsed stuff will be
accumulated here in the style of HTML::TokeParser.  No callbacks will
be invoked.  (HTML::TokeParser is in fact implemented based on this
now.)
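
For readers who have not used HTML::TokeParser, here is a small sketch of
the token style in question: each token is an array reference whose first
element names its kind ("S" start tag, "E" end tag, "T" text, and so on).
The sketch uses HTML::TokeParser itself and not $p->accum, whose interface
may still change, and it assumes an HTML::TokeParser whose constructor
accepts a reference to a string; the sample markup is made up.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = '<p class="x">Hi <b>there</b></p>';   # hypothetical input

# Assumption: this HTML::TokeParser accepts a scalar reference as input.
my $tp = HTML::TokeParser->new(\$html)
    or die "cannot set up the token parser";

my @kinds;
while (my $token = $tp->get_token) {
    my ($kind, @rest) = @$token;
    push @kinds, $kind;
    # For "S" and "E" tokens the next element is the tag name;
    # for "T" tokens it is the text itself.
    print "$kind: $rest[0]\n";
}
```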

HTML::Entities::decode is now implemented by XS code.  That makes it a
few times faster.
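
As a reminder of what entity decoding does, here is a minimal sketch using
the exported name decode_entities; the input string is made up for the
example.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Named entities are replaced by the characters they stand for.
my $decoded = decode_entities('AT&amp;T &lt;b&gt; caf&eacute;');

# &eacute; becomes the single character with code 0xE9.
printf "%s (%d characters)\n", $decoded, length $decoded;
```

Callers see the same interface whether the decoding is done in Perl or in
XS; only the speed differs.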

Other things I am thinking about supporting (soon?):

   - keep track of byte counts and line numbers.
   - an attribute that makes the parser never break text, i.e. that
     you can never get two 'text' callbacks in a row.  This will have
     to delay text callbacks until some other element is recognised.
   - attributes that control what will enter the 'accum' array
   - report byte positions within the start tag where the attributes
     and their values live.  This should be handy when all you want to
     do is remove/add or change some values while keeping everything
     else unchanged.
   - parsing of marked sections, e.g. "<![CDATA[ ... ]]>"
   - utf8 text (affects what bytes entities are expanded into as well
     as the range of numeric entities that will be expanded.)

Is there anything else anybody has wished for?

Regards,
Gisle
