Dave Hodgkinson <[EMAIL PROTECTED]> writes:
> Do you have any numbers on speed?
These are some examples:
-------------------------------------------------------------------
#!/usr/bin/perl
$file = "/local/doc/html-spec/html4.0.1/interact/forms.html";
print "Parsing ", -s $file, " bytes\n";
$doc = `cat $file`;   # slurp the whole document into one string
use HTML::Parser ();
use Time::HiRes qw(time);
$before = time;
for (1..10) {
    HTML::Parser->new->parse_file($file);
}
printf "parse_file: %.1f seconds\n", time - $before;
$before = time;
for (1..10) {
    HTML::Parser->new->parse($doc)->eof;
}
printf "parse: %.1f seconds\n", time - $before;
__END__
Prints:
Parsing 138204 bytes
parse_file: 7.1 seconds
parse: 87.3 seconds
when using HTML-Parser-2.25, and
Parsing 138204 bytes
parse_file: 2.3 seconds
parse: 2.1 seconds
when using HTML-Parser-XS-2.99_08. We get a 41x(!) speedup when parsing
from an in-memory string and a 3x speedup when using the parse_file
method. This also shows that the old parser was very bad at breaking up
large chunks of input; the parse_file method feeds the document to the
parser in small chunks.
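To illustrate that last point, something like this sketch feeds parse()
small chunks by hand, which is roughly what parse_file does internally
(the 512-byte chunk size is just an arbitrary choice for illustration):

#!/usr/bin/perl
use HTML::Parser ();
$file = "/local/doc/html-spec/html4.0.1/interact/forms.html";
open(F, $file) || die "Can't open $file: $!";
$p = HTML::Parser->new;
while (read(F, $buf, 512)) {
    $p->parse($buf);   # hand the parser one small chunk at a time
}
$p->eof;               # signal that there is no more input
close(F);
__END__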
-------------------------------------------------------------------
#!/usr/bin/perl
$file = "/local/doc/html-spec/html4.0.1/interact/forms.html";
print "Parsing ", -s $file, " bytes\n";
use HTML::LinkExtor ();
use Time::HiRes qw(time);
$count = 0;
$before = time;
for (1..10) {
    HTML::LinkExtor->new(sub { $count++ })->parse_file($file);
}
printf "Found $count links in %.1f seconds\n", time - $before;
__END__
Prints:
Parsing 138204 bytes
Found 8770 links in 8.3 seconds
when using HTML-Parser-2.25, and
Parsing 138204 bytes
Found 8770 links in 2.0 seconds
when using HTML-Parser-XS-2.99_08. That is a 4x speedup.
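For reference, a small sketch that collects the URLs instead of just
counting them (this assumes the normal HTML::LinkExtor callback
interface, where the callback is handed the tag name followed by the
link attributes):

#!/usr/bin/perl
use HTML::LinkExtor ();
$file = "/local/doc/html-spec/html4.0.1/interact/forms.html";
@links = ();
$p = HTML::LinkExtor->new(sub {
    my($tag, %attr) = @_;        # only link attributes are passed in
    push(@links, values %attr);  # e.g. the href and src values
});
$p->parse_file($file);
print "$_\n" for @links;
__END__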
-------------------------------------------------------------------
#!/usr/bin/perl
$file = "/local/doc/html-spec/html4.0.1/interact/forms.html";
print "Parsing ", -s $file, " bytes\n";
use HTML::TokeParser ();
use Time::HiRes qw(time);
$count = 0;
$before = time;
for (1..10) {
    my $p = HTML::TokeParser->new($file);
    while (my $t = $p->get_token) {
        $count++;
    }
}
printf "Processed $count tokens in %.1f seconds\n", time - $before;
__END__
Prints:
Parsing 138204 bytes
Processed 80140 tokens in 11.5 seconds
when using HTML-Parser-2.25, and
Parsing 138204 bytes
Processed 80140 tokens in 3.3 seconds
when using HTML-Parser-XS-2.99_08. That is a 3.5x speedup.
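For reference, a small sketch that looks at the tokens instead of just
counting them (this assumes the usual HTML::TokeParser token layout,
where each token is an array reference whose first element is the token
type, e.g. "S" for a start tag followed by the tag name):

#!/usr/bin/perl
use HTML::TokeParser ();
$file = "/local/doc/html-spec/html4.0.1/interact/forms.html";
$p = HTML::TokeParser->new($file) || die "Can't open $file";
while (my $t = $p->get_token) {
    next unless $t->[0] eq "S";  # only look at start tags
    $seen{$t->[1]}++;            # $t->[1] is the tag name
}
print "$_: $seen{$_}\n" for sort keys %seen;
__END__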
-------------------------------------------------------------------
#!/usr/bin/perl
$file = "/local/doc/html-spec/html4.0.1/interact/forms.html";
print "Parsing ", -s $file, " bytes\n";
use HTML::Parser ();
use Time::HiRes qw(time);
$count = 0;
$before = time;
if ($HTML::Parser::VERSION < 2.9) {
    # Old interface: subclass HTML::Parser and decode entities ourselves
    {
        package MyParser;
        require HTML::Entities;
        @ISA = qw(HTML::Parser);
        sub text { my $t = HTML::Entities::decode($_[1]); $main::count++ }
    }
    for (1..10) {
        my $p = MyParser->new;
        $p->parse_file($file);
    }
}
else {
    # New interface: text callback with entity decoding done by the parser
    for (1..10) {
        my $p = HTML::Parser->new(text => sub { $count++ },
                                  decode_text_entities => 1,
                                 );
        $p->parse_file($file);
    }
}
printf "Processed $count decoded text segments in %.2f seconds\n",
    time - $before;
__END__
Prints:
Parsing 138204 bytes
Processed 32430 decoded text segments in 9.32 seconds
when using HTML-Parser-2.25, and
Parsing 138204 bytes
Processed 32430 decoded text segments in 0.76 seconds
when using HTML-Parser-XS-2.99_08. This shows that the new library also
provides some new ways of doing things that are good for speed. Here we
get a 12x speedup.
-------------------------------------------------------------------
But the new library loads slightly slower (it has a dynamic C library
to link):
$ time for i in $(range 50); do \
    perl -MHTML::Parser -le 'HTML::Parser->new; print $HTML::Parser::VERSION'; \
  done
Takes 3.6 seconds with version 2.25 and 4.2 seconds with 2.99_08.
That is a 17% slower startup. This shouldn't matter much when you use
it under mod_perl though :-)
['range' is a little tool I have that will print the numbers 1..50 in
this case]
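To illustrate the mod_perl point (just a sketch of a typical setup):
the module is preloaded once when the server starts, so the extra link
time is not paid per request.

# httpd.conf:  PerlRequire /path/to/startup.pl
# startup.pl:
use HTML::Parser ();   # load the dynamic library once, at server startup
1;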
These tests were made on a SuSE Linux box with enough memory and a
350 MHz Pentium II processor.
I expect the new parser to become a bit faster when I get to the point
where I try to optimize it. Currently I am just trying to get all new
features implemented.
Regards,
Gisle