Dave Hodgkinson <[EMAIL PROTECTED]> writes:

> Do you have any numbers on speed?

These are some examples:

-------------------------------------------------------------------
#!/usr/bin/perl
$file = "/local/doc/html-spec/html4.0.1/interact/forms.html";
print "Parsing ", -s $file, " bytes\n";

$doc = `cat $file`;

use HTML::Parser ();
use Time::HiRes qw(time);

$before = time;
for (1..10) {
    HTML::Parser->new->parse_file($file);
}
printf "parse_file: %.1f seconds\n", time - $before;

$before = time;
for (1..10) {
    HTML::Parser->new->parse($doc)->eof;
}
printf "parse: %.1f seconds\n", time - $before;
__END__


Prints:

   Parsing 138204 bytes
   parse_file: 7.1 seconds
   parse: 87.3 seconds

when using HTML-Parser-2.25, and

   Parsing 138204 bytes
   parse_file: 2.3 seconds
   parse: 2.1 seconds

when using HTML-Parser-XS-2.99_08.  That is a speedup of about 41(!)
times when parsing from an in-memory string and about 3 times when
using the parse_file method.  It also shows that the old parser was
very bad at handling a large chunk passed to parse() in one go.  The
'parse_file' method feeds the document in small chunks, which is why
it did so much better than 'parse' with the old parser.
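
The chunking point can be sketched as follows: a large in-memory
string is handed to parse() in small pieces, followed by eof().  This
is only a minimal sketch.  The helper name feed_in_chunks and the
512-character default are assumptions made for illustration and are
not part of HTML::Parser; the helper relies on nothing but the parse()
and eof() methods used in the script above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Parser): feed $doc to any
# object that has parse() and eof() methods, $size characters at a
# time.  Returns the number of chunks fed.
sub feed_in_chunks {
    my ($parser, $doc, $size) = @_;
    $size ||= 512;    # assumed chunk size, chosen only for illustration
    my $chunks = 0;
    for (my $pos = 0; $pos < length($doc); $pos += $size) {
        $parser->parse(substr($doc, $pos, $size));
        $chunks++;
    }
    $parser->eof;     # signal end of document so buffered text is flushed
    return $chunks;
}

# Usage, with $doc loaded as in the script above:
#   feed_in_chunks(HTML::Parser->new, $doc);
```

Timing this against a single parse($doc) call should show how much of
the difference comes from chunk size alone.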


-------------------------------------------------------------------
#!/usr/bin/perl
$file = "/local/doc/html-spec/html4.0.1/interact/forms.html";
print "Parsing ", -s $file, " bytes\n";

use HTML::LinkExtor ();
use Time::HiRes qw(time);

$count = 0;
$before = time;
for (1..10) {
    HTML::LinkExtor->new(sub {$count++})->parse_file($file);
}
printf "Found $count links in %.1f seconds\n", time - $before;
__END__

Prints:

   Parsing 138204 bytes
   Found 8770 links in 8.3 seconds

when using HTML-Parser-2.25, and

   Parsing 138204 bytes
   Found 8770 links in 2.0 seconds

when using HTML-Parser-XS-2.99_08.  That is a speedup of about 4 times.


-------------------------------------------------------------------
#!/usr/bin/perl
$file = "/local/doc/html-spec/html4.0.1/interact/forms.html";
print "Parsing ", -s $file, " bytes\n";

use HTML::TokeParser ();
use Time::HiRes qw(time);

$count = 0;
$before = time;
for (1..10) {
    my $p = HTML::TokeParser->new($file);
    while (my $t = $p->get_token) {
        $count++;
    }
}

printf "Processed $count tokens in %.1f seconds\n", time - $before;
__END__

Prints:

   Parsing 138204 bytes
   Processed 80140 tokens in 11.5 seconds

when using HTML-Parser-2.25, and

   Parsing 138204 bytes
   Processed 80140 tokens in 3.3 seconds

when using HTML-Parser-XS-2.99_08.  That is a speedup of about 3.5 times.


-------------------------------------------------------------------
#!/usr/bin/perl
$file = "/local/doc/html-spec/html4.0.1/interact/forms.html";
print "Parsing ", -s $file, " bytes\n";

use HTML::Parser ();
use Time::HiRes qw(time);

$count = 0;
$before = time;

if ($HTML::Parser::VERSION < 2.9) {
    {
        package MyParser;
        require HTML::Entities;
        @ISA=qw(HTML::Parser);
        sub text { my $t = HTML::Entities::decode($_[1]); $main::count++}
    }

    for (1..10) {
        my $p = MyParser->new;
        $p->parse_file($file);
    }
}
else {
    for (1..10) {
        my $p = HTML::Parser->new(text => sub {$count++},
                                  decode_text_entities => 1,
                                 );
        $p->parse_file($file);
    }
}

printf "Processed $count decoded text segments in %.2f seconds\n",
    time - $before;
__END__

Prints:

   Parsing 138204 bytes
   Processed 32430 decoded text segments in 9.32 seconds

when using HTML-Parser-2.25, and

   Parsing 138204 bytes
   Processed 32430 decoded text segments in 0.76 seconds

when using HTML-Parser-XS-2.99_08.  This shows that the new library
also provides some new ways of doing things that are good for speed
(here, entities are decoded inside the parser instead of in a Perl
method).  In this case we get a speedup of about 12 times.
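
The speedup factors quoted so far are simply the old elapsed time
divided by the new one.  A small sketch that recomputes them from the
timings printed above (the table layout and the speedup() helper are
made up for illustration):

```perl
use strict;
use warnings;

# Elapsed seconds reported above: [test, old (2.25), new (2.99_08)].
my @timings = (
    [ 'parse_file',    7.1,  2.3  ],
    [ 'parse',        87.3,  2.1  ],
    [ 'LinkExtor',     8.3,  2.0  ],
    [ 'TokeParser',   11.5,  3.3  ],
    [ 'decoded text',  9.32, 0.76 ],
);

# Speedup factor = old elapsed time / new elapsed time.
sub speedup {
    my ($old, $new) = @_;
    return $old / $new;
}

printf "%-14s %5.1fx\n", $_->[0], speedup($_->[1], $_->[2]) for @timings;
```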


-------------------------------------------------------------------
But the new library loads slightly slower (it has a dynamic C library
to link):

  $ time for i in $(range 50); do perl -MHTML::Parser -le \
      'HTML::Parser->new; print $HTML::Parser::VERSION'; done

Takes 3.6 seconds with version 2.25 and 4.2 seconds with 2.99_08.
That is 17% slower startup.  This shouldn't matter much when you use
it under mod_perl though :-)

['range' is a little tool I have that will print the numbers 1..50 in
this case]
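
For those without such a 'range' tool, the same startup measurement
can be sketched in Perl itself.  This is a hedged sketch, not the
command that produced the numbers above: the helper name time_startup
is hypothetical, and it times the whole loop with Time::HiRes rather
than with the shell's 'time'.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical helper: start a fresh perl $runs times, each time loading
# $module and running $code, and return the total elapsed seconds.
sub time_startup {
    my ($module, $code, $runs) = @_;
    my $before = time();
    for (1 .. $runs) {
        # $^X is the path of the perl binary running this script.
        system($^X, "-M$module", '-e', $code) == 0
            or die "perl -M$module failed: $?";
    }
    return time() - $before;
}

# Usage, mirroring the shell loop above:
#   printf "%.1f seconds\n",
#       time_startup('HTML::Parser', 'HTML::Parser->new', 50);
```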

These tests were made on a SuSE Linux box with enough memory and a
350 MHz Pentium II processor.

I expect the new parser to become a bit faster when I get to the point
where I try to optimize it.  Currently I am just trying to get all new
features implemented.

Regards,
Gisle
