Jeff Peng wrote:
Hello,
Hello,
Can the code (especially the regex) below be optimized to run faster?
#!/usr/bin/perl
for ($i=0; $i<1000; $i+=1) {
++$i is usually faster than $i+=1. But you are not using the $i
variable so you don't really need it (your Ruby programs don't have it.)
for ( 1 .. 1000 ) {
open HD,"index.html" or die $!;
You are opening the same file one thousand times so the operating system
is probably caching the file in memory and using that cached file for
the last 999 reads instead of doing actual disk IO. Your Ruby program
doesn't test open() for failure so they are not equivalent.
while(<HD>) {
print $1,"\n" if /href="http:\/\/(.*?)\/.*" target="_blank"/;
There is not much, if anything, you can optimize about that regular
expression. Possibly eliminate any backtracking if present. Perhaps
try the same regular expression in the second Ruby program?
}
close HD;
}
Instead of reading line by line you could just read the whole file:
local $/;
local $\ = "\n";
local @ARGV = ( 'index.html' ) x 1000;
while ( <> ) {
print $1 while /href="http:\/\/(.*?)\/.*" target="_blank"/g;
}
The "index.html" file was fetched with:
wget http://www.265.com/Kexue_Jishu/
I ask because someone posted a question on the ruby-talk list showing
that Perl's regex is much faster than Ruby's.
[Quote]
#!/usr/bin/ruby
1000.times do
File.open("index.html").each do |c|
puts $1 if /href="http:\/\/(.*?)\/.*" target="_blank"/ =~ c
end
end
time ./test.rb >/tmp/t
elap 6.511 user 6.336 syst 0.136 CPU 99.40%
#!/usr/bin/perl
for ($i=0; $i<1000; $i+=1) {
open HD,"index.html" or die $!;
while(<HD>) {
print $1,"\n" if /href="http:\/\/(.*?)\/.*" target="_blank"/;
}
close HD;
}
time ./test.pl >/tmp/t
elap 0.864 user 0.844 syst 0.020 CPU 100.04%
So perl is 7 or 8 times faster here.
[/Quote]
But someone else optimized the Ruby code by using Ruby's built-in
scan method, which makes the regex run a lot faster.
[Quote]
I get best results in Ruby with:
regexp = %r{href="http://([^"/]*)/[^"]*"\s+target="_blank"}
1000.times do
puts File.read('index.html').scan(regexp)
Does scan() only print out the contents of the capturing parentheses or
the whole line or the whole pattern? In other words, is the output the
same as the other Ruby program? It's obvious that the regular
expression is not the same.
end
~/ruby/bench time ruby19 regex.rb > /dev/null
real 0m1.428s
user 0m1.359s
sys 0m0.056s
~/ruby/bench time perl5.10.0 regex.pl > /dev/null
real 0m1.189s
user 0m1.095s
sys 0m0.084s
It's still slower. Perl has regular expression magic beyond my
imagination, though. I heard they take the most "rare" character in the
literal part of the regex (let's say, the colon) and search for it using
machine code, and then work their way backwards to the beginning of the
regexp...
Say what you want, but Perl rocks when it comes to text processing
speed.
[/Quote]
So I'm asking what optimization Perl applies to that regex.
I hope this doesn't disturb everyone, thanks.
Most of the regular expression is literal text which cannot be
optimized. As in the second Ruby program, try changing '\/(.*?)\/.*"'
to '\/([^"\/]*)\/[^"]*"'.
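
A minimal sketch of how you might compare the two patterns yourself,
using the core Benchmark module. The sample line and the host
www.example.com are made up for illustration (I have not reproduced your
index.html), so any timings it prints say nothing about the real file:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample line; substitute lines from the real index.html.
my $line = '<a href="http://www.example.com/path/page.html" target="_blank">Example</a>' . "\n";

# The original pattern: lazy .*? for the host, greedy .* for the rest.
my $lazy    = qr{href="http://(.*?)/.*" target="_blank"};

# The suggested pattern: negated character classes instead of dots.
my $negated = qr{href="http://([^"/]*)/[^"]*" target="_blank"};

# On this sample both patterns capture the same host name.
for my $re ( $lazy, $negated ) {
    print "$1\n" if $line =~ $re;
}

# Run each pattern for at least one CPU second and compare the rates.
cmpthese( -1, {
    lazy    => sub { my $host = $line =~ $lazy    ? $1 : undef },
    negated => sub { my $host = $line =~ $negated ? $1 : undef },
} );
```

Note that the two patterns are not strictly equivalent. The greedy .*
can run across several links on one line up to the last
'" target="_blank"', while [^"]* cannot cross a double quote, so on
lines with more than one link the output may differ.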
John
--
The programmer is fighting against the two most
destructive forces in the universe: entropy and
human stupidity. -- Damian Conway