Mark Wagner wrote:
I've got a script I'm using to search through a list of Wikipedia
article titles to find ones that match certain patterns.

As-written, if you run it and supply '.*target.*' on standard input,
it will process my test file in 125 seconds.

'.*target.*' is inefficient because the leading .* is greedy: it first
scans to the end of the string and then backtracks until 'target'
matches. Since you don't care *where* 'target' matches, a plain
'target' (with the ^ and $ anchors dropped from the match) would be
more efficient. But this won't matter much in your case, as the target
strings are fairly short.
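As a rough check of that equivalence, here is a minimal sketch. The titles are made-up sample data, not lines from your file, and matching() is just a helper name I picked for the illustration:

```perl
use strict;
use warnings;

# Hypothetical sample titles, not taken from the real data file.
my @titles = ('A target here', 'target', 'no match', 'targets everywhere');

# Return the titles matched by a pattern string, compiled once with qr//.
sub matching {
    my ($pattern) = @_;
    my $re = qr/$pattern/;
    return grep { $_ =~ $re } @titles;
}

my @with_dots = matching('^.*target.*$');   # anchored, with leading/trailing .*
my @plain     = matching('target');         # unanchored substring match

print "same matches\n" if "@with_dots" eq "@plain";
```

Both patterns select the same titles; the second one just gives the regex engine less work to do.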


Make any of the changes
mentioned in the comments, and the time needed will drop to 1.8
seconds.  Why the difference?  Particularly interesting is that it
seems to matter where the regex pattern came from: if it's from
standard input, testing is slow; if it's assigned in the script,
testing is fast.

If it matters, I'm using Perl 5.8.8.

To see the problem I'm having, download
http://download.wikimedia.org/eswiki/20081018/eswiki-20081018-all-titles-in-ns0.gz
(a 4.1-MB file), unzip it, and run the program supplying the name of
the unzipped file.

--------------
binmode STDIN, ":utf8"; # Comment this out to speed things up

while(<STDIN>)
{
    my $lines = 0;
    my $lines2 = 0;
    my $regex;
    $regex = $_;
    chomp $regex;

Since you are not using $_ for anything else, why not just:

while ( my $regex = <STDIN> )
{
    chomp $regex;


    #$regex = '.*target.*'; # Or uncomment this to speed things up
    open INFILE, "<", $ARGV[0];

You should *always* verify that the file opened correctly:

    open INFILE, '<:utf8', $ARGV[0] or die "Cannot open '$ARGV[0]' $!";


    binmode INFILE, ":utf8"; # Or comment this out to speed things up

    while(<INFILE>)
    {
        my $target = $_;
        chomp $target;

Since you are not using $_ for anything else, why not just:

    while ( my $target = <INFILE> )
    {

And you don't really need to chomp() either.
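The reason is that $ in a regex also matches just before a trailing newline, so an anchored pattern still matches an unchomped line. A small sketch, using a made-up title:

```perl
use strict;
use warnings;

# A line as it comes from <INFILE>, newline still attached (sample value).
my $line = "Buenos Aires\n";

# $ matches at the end of the string or just before a final newline.
print "matches without chomp\n" if $line =~ /^Buenos Aires$/;

# \z matches only at the very end of the string, so this one fails.
print "\\z is stricter\n" unless $line =~ /^Buenos Aires\z/;
```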


        $target =~ s/_/ /g;

This is pretty slow; use tr/// instead:

        $target =~ tr/_/ /;
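If you want to measure the difference yourself, the core Benchmark module can compare the two. This is only a sketch: the title string and the sub names with_subst() and with_tr() are made up for the example, and the numbers will vary by machine and Perl version.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical title, chosen only to have several underscores in it.
my $title = 'Some_long_article_title_with_many_underscores';

# Replace underscores with the regex-based substitution operator.
sub with_subst { my $t = shift; $t =~ s/_/ /g; return $t }

# Replace underscores with the character-by-character transliteration operator.
sub with_tr    { my $t = shift; $t =~ tr/_/ /; return $t }

# Run each version for at least one CPU second and print a comparison table.
cmpthese(-1, {
    's///g' => sub { with_subst($title) },
    'tr///' => sub { with_tr($title) },
});
```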


        print "Match\n" if($target =~ /^$regex$/); # Or make this
        # case-insensitive to speed things up, or remove the start and
        # end anchors to speed things up

        $lines = $lines + 1;
        if($lines >= 10000)
        {
            $lines = 0;
            $lines2 += 10000;
            print STDERR "$lines2\r";
        }

Counting and printing out the line numbers is a real time sink.
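If you still want a progress indicator, Perl already keeps a line counter for the last-read filehandle in $. so you don't need two variables of your own. A minimal sketch, where scan_with_progress() is a name invented for the example:

```perl
use strict;
use warnings;

# Read every line from $fh and report progress to STDERR every $every
# lines, using the built-in input line counter $. instead of hand-kept
# counters. Returns the number of progress reports printed.
sub scan_with_progress {
    my ($fh, $every) = @_;
    my $reports = 0;
    while ( my $line = <$fh> ) {
        next if $. % $every;      # skip unless $. is a multiple of $every
        print STDERR "$.\r";
        $reports++;
    }
    return $reports;
}
```

Something like scan_with_progress(\*INFILE, 10000) would give the same output as your counters. Whether it is measurably faster is something you would have to time.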


    }
}
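Putting the suggestions above together, the script might look something like this. It is a sketch, not a tested replacement: count_matches() is a name I made up, I have used a lexical filehandle and qr// to compile the pattern once per input line, and the progress counter is left out.

```perl
use strict;
use warnings;

# Print "Match" for every line on $fh that matches $pattern in full,
# after turning underscores into spaces. Returns the number of matches.
sub count_matches {
    my ($fh, $pattern) = @_;
    my $re    = qr/^$pattern$/;   # compile once, outside the inner loop
    my $count = 0;
    while ( my $target = <$fh> ) {
        $target =~ tr/_/ /;
        # No chomp needed: $ also matches before the trailing newline.
        if ( $target =~ $re ) {
            print "Match\n";
            $count++;
        }
    }
    return $count;
}

# Main loop: runs only when a file name was given on the command line.
if (@ARGV) {
    binmode STDIN, ':utf8';
    while ( my $regex = <STDIN> ) {
        chomp $regex;
        open my $in, '<:utf8', $ARGV[0]
            or die "Cannot open '$ARGV[0]': $!";
        count_matches($in, $regex);
        close $in;
    }
}
```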

If you want real speed, use the grep/egrep program.



John
--
Perl isn't a toolbox, but a small machine shop where you
can special-order certain sorts of tools at low cost and
in short order.                            -- Larry Wall
