On November 12, 1998 at 16:25, Jason L Tibbitts III wrote:
> I'm trying to improve the speed at which Wilma indexes. Right now the real
> bottleneck is that we pass every MHonArc-generated page through the
> striphtml program, which is written in Perl. The time to load the Perl
> interpreter tens or hundreds of thousands of times is pretty harsh, and
> occasionally we've seen HTML that the simple regexp-based approach freaks
> out on, causing it to take near infinite time to process.
>
> Does anyone know of any free (i.e. we can incorporate it into something
> under the Artistic License) C code, or a small utility that we can call,
> which will do this?
I have the SGML::StripParser as part of the perlSGML[1] package.
It is in Perl, so the performance is not as good as a C program.
But since it is a module, you can write a Perl program to iterate
through a list of files and use SGML::StripParser on each file to
avoid launching perl for each file separately.
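A minimal sketch of that batching idea, under stated assumptions: the
SGML::StripParser interface is not shown in this message, so the
strip_markup routine below is a hypothetical stand-in (a naive regexp,
for illustration only) and should be replaced by the module's documented
call. The names strip_markup and strip_files are likewise made up for
this example. The point is only that the interpreter starts once and the
loop runs inside it.

```perl
#!/usr/bin/perl
# Sketch: strip markup from many files in a single perl process.
use strict;
use warnings;

# Hypothetical stand-in for the real stripper. Replace the body with a
# call to SGML::StripParser as described in the perlSGML documentation.
sub strip_markup {
    my ($html) = @_;
    $html =~ s/<[^>]*>//g;    # naive tag removal, illustration only
    return $html;
}

# Run the given stripper over each file and return a hash reference
# mapping file name to stripped text. Unreadable files are skipped.
sub strip_files {
    my ($strip, @files) = @_;
    my %text;
    for my $file (@files) {
        my $in;
        unless (open($in, '<', $file)) {
            warn "skipping $file: $!\n";
            next;
        }
        local $/;             # slurp mode: read the whole file at once
        my $html = <$in>;
        close($in);
        $text{$file} = $strip->(defined $html ? $html : '');
    }
    return \%text;
}

# When run as a script, process every file named on the command line.
unless (caller) {
    my $text = strip_files(\&strip_markup, @ARGV);
    print "== $_ ==\n$text->{$_}\n" for sort keys %$text;
}
```

Saved under a hypothetical name such as stripall.pl, it could be run as
"perl stripall.pl *.html", so tens of thousands of pages cost one
interpreter start instead of one per page.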
perlSGML is under the GPL, so if that will not work for you, I
can redistribute it under the Artistic License for your needs.
Another option may be to use James Clark's SP[2] package.
--ewh
[1] <URL:http://www.oac.uci.edu/indiv/ehood/perlSGML.html>
[2] <URL:http://www.jclark.com/sp/index.htm>