Re: regex optimization

John W. Krahn Tue, 05 Jan 2010 09:05:21 -0800

Jeff Peng wrote:

Hello,


Can the code (especially the regex) below be optimized to run faster?

#!/usr/bin/perl
for ($i=0; $i<1000; $i+=1) {

++$i is usually faster than $i+=1. But you are not using the $i variable, so you don't really need it (your Ruby programs don't have it).


for ( 1 .. 1000 ) {

 open HD,"index.html" or die $!;

You are opening the same file one thousand times, so the operating system is probably caching the file in memory and using that cached file for the last 999 reads instead of doing actual disk IO. Your Ruby program doesn't test open() for failure, so the two programs are not equivalent.

 while(<HD>) {
   print $1,"\n" if /href="http:\/\/(.*?)\/.*" target="_blank"/;

There is not much, if anything, you can optimize about that regular expression. Possibly eliminate any backtracking if present. Perhaps try the same regular expression in the second Ruby program?

 }
 close HD;
}


Instead of reading line by line you could just read the whole file:

local $/;
local $\ = "\n";
local @ARGV = ( 'index.html' ) x 1000;
while ( <> ) {
    print $1 while /href="http:\/\/(.*?)\/.*" target="_blank"/g;
    }

The "index.html" file was fetched with:
wget http://www.265.com/Kexue_Jishu/


I ask this because someone posted a question on the ruby-talk list showing
that perl's regex is much faster than ruby's.

[Quote]
#!/usr/bin/ruby
1000.times do

 File.open("index.html").each do |c|
   puts $1 if /href="http:\/\/(.*?)\/.*" target="_blank"/ =~ c
 end
end

time ./test.rb >/tmp/t
elap 6.511 user 6.336 syst 0.136 CPU 99.40%


#!/usr/bin/perl
for ($i=0; $i<1000; $i+=1) {

 open HD,"index.html" or die $!;
 while(<HD>) {
   print $1,"\n" if /href="http:\/\/(.*?)\/.*" target="_blank"/;
 }
 close HD;
}

time ./test.pl >/tmp/t
elap 0.864 user 0.844 syst 0.020 CPU 100.04%

So perl is 7 or 8 times faster here.
[/Quote]


But someone else optimized the ruby code and used ruby's built-in
scan method, which makes the regex run a lot faster.

[Quote]
I get best results in Ruby with:

 regexp = %r{href="http://([^"/]*)/[^"]*"\s+target="_blank"}
 1000.times do
  puts File.read('index.html').scan(regexp)

Does scan() only print out the contents of the capturing parentheses, or the whole line, or the whole pattern? In other words, is the output the same as the other Ruby program? It's obvious that the regular expression is not the same.

 end
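
On the scan() question above, a minimal sketch of how String#scan treats capturing parentheses. The sample line and the host www.example.com are made up for illustration and are not taken from the real index.html. When the pattern contains a group, scan returns one array of captures per match; without a group it returns the whole matched text.

```ruby
# Hypothetical one-line sample, not the real index.html.
html = '<a href="http://www.example.com/page.html" target="_blank">x</a>'

# Same pattern as the quoted Ruby program, with and without the group.
with_group    = %r{href="http://([^"/]*)/[^"]*"\s+target="_blank"}
without_group = %r{href="http://[^"/]*/[^"]*"\s+target="_blank"}

captures = html.scan(with_group)     # one array of captures per match
matches  = html.scan(without_group)  # one string per whole match

p captures   # [["www.example.com"]]
p matches    # ["href=\"http://www.example.com/page.html\" target=\"_blank\""]

# puts flattens nested arrays, so this prints only the captured host,
# one per line, like print $1,"\n" in the Perl version.
puts captures
```

So for a line containing a single link, the scan version should print the same captured host as the $1 versions.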

~/ruby/bench time ruby19 regex.rb > /dev/null
real  0m1.428s
user  0m1.359s
sys  0m0.056s

~/ruby/bench time perl5.10.0 regex.pl > /dev/null
real  0m1.189s
user  0m1.095s
sys  0m0.084s

It's still slower. Perl has regular expression magic beyond my
imagination, though. I heard they take the most "rare" character in the
literal part of the regex (let's say, the colon) and search for it using
machine code, and then work their way backwards to the beginning of the
regexp...

Say what you want, but Perl rocks when it comes to text processing
speed.
[/Quote]


So I'm asking what optimization Perl applies to that regex.
I hope this doesn't disturb anyone, thanks.

Most of the regular expression is literal text, which cannot be optimized. As in the second Ruby program, try changing '\/(.*?)\/.*"' to '\/([^"\/]*)\/[^"]*"'.
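
A small sketch of what that change does, written in Ruby so it can be compared directly with the quoted programs (the pattern syntax is the same in Perl). The two sample lines and the hosts a.example and b.example are invented for illustration. On a line with one link both patterns capture the same host. On a line with two links the greedy .* in the original runs to the last " target="_blank" on the line and has to backtrack to find it, so the two patterns are not strictly equivalent.

```ruby
# Original pattern: lazy host, then greedy .* up to the closing quote.
lazy    = %r{href="http://(.*?)/.*" target="_blank"}
# Suggested pattern: negated character classes cannot run past a quote.
negated = %r{href="http://([^"/]*)/[^"]*" target="_blank"}

# Hypothetical sample lines, not taken from the real index.html.
one_link  = '<a href="http://a.example/x" target="_blank">A</a>'
two_links = '<a href="http://a.example/x" target="_blank">A</a> ' \
            '<a href="http://b.example/y" target="_blank">B</a>'

p one_link.scan(lazy)      # [["a.example"]]
p one_link.scan(negated)   # [["a.example"]]

# The greedy .* swallows the second link, so only one match is found.
p two_links.scan(lazy)     # [["a.example"]]
p two_links.scan(negated)  # [["a.example"], ["b.example"]]
```

Whether the rewrite is measurably faster on the real file would need to be timed; the sketch only shows the difference in matching behavior.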




John
--
The programmer is fighting against the two most
destructive forces in the universe: entropy and
human stupidity.               -- Damian Conway

