On Wed, Sep 19, 2012 at 11:41 AM, Peter Hickman
<[email protected]> wrote:
> On 19 September 2012 10:09, Carlos Agarie <[email protected]> wrote:
>> I'd like to know, too. I stumbled upon a similar problem, but it was long
>> ago.
>
> Ok here is a quick test that I hacked up. The data is a 2,659,800 line
> 639Mb text file. Some lines contain the string "FRED", count them

Let's see: that is about 252 chars per line on average.
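
As a rough check of that figure, here is the arithmetic spelled out. It assumes the quoted "639Mb" means 639 MiB (1024 * 1024 bytes each); with decimal megabytes the average would come out closer to 240.

```ruby
# Assumption: "639Mb" is 639 MiB, i.e. 639 * 1024 * 1024 bytes.
file_size_bytes = 639 * 1024 * 1024
line_count      = 2_659_800

# Average bytes per line, newline included.
avg_line_length = file_size_bytes / line_count.to_f
puts avg_line_length.round   # prints 252
```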

Here's how I generated the file:

$ ruby -e 'x="X"*243; 2_600_000.times {|i| printf "%7d%s%s\n",i,x,rand(1000)==0 ? "FRED" : "OOOO" }' >results201101.dat

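For anyone who wants to try this without writing a 600+ MB file first, a scaled-down sketch of the same generator might look like the following. The 1_000 line count and the names `padding` and `lines` are picked for illustration only and are not part of the original test.

```ruby
# Scaled-down sketch of the generator above: 1_000 lines kept in memory
# instead of 2_600_000 lines written to results201101.dat.
padding = "X" * 243
lines = Array.new(1_000) do |i|
  marker = rand(1000) == 0 ? "FRED" : "OOOO"
  format("%7d%s%s\n", i, padding, marker)
end

# 7 (line number) + 243 (padding) + 4 (marker) + 1 (newline) = 255 bytes
puts lines.map(&:bytesize).uniq.inspect   # prints [255]
```
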
> To be honest I suspect that it is more an issue with the regexes than
> file io and the real regexes are much more complicated than just match
> a string. I was a bit surprised that the index() wasn't faster.

Darn!  Maybe encoding plays a role here.  The pure IO is pretty fast
(see last test):

RUN 2
2659

real    0m3.520s
user    0m3.213s
sys     0m0.249s
./perl.pl
2659

real    0m2.220s
user    0m1.950s
sys     0m0.249s
./ruby-1.rb
2659

real    0m4.912s
user    0m4.383s
sys     0m0.498s
./ruby-2.rb

real    0m5.032s
user    0m4.336s
sys     0m0.639s
./ruby-3.rb

real    0m3.610s
user    0m3.276s
sys     0m0.312s
./ruby-4.rb
2659

real    0m5.004s
user    0m4.399s
sys     0m0.467s
./ruby-5.rb
2659

real    0m4.980s
user    0m4.430s
sys     0m0.451s
./ruby-6.rb
0

real    0m2.495s
user    0m2.012s
sys     0m0.420s

$ head -200 *.pl *.rb
==> perl.pl <==
#!/usr/bin/env perl

use strict;
use warnings;

my $logfile = 'results201101.dat';
my $counter = 0;
open FILE, "<$logfile" or die $!;
while(my $line = <FILE>) {
  if($line =~ /FRED/) {
    $counter++;
  }
}
close(FILE);
print "$counter\n";

==> ruby-1.rb <==
#!/usr/bin/env ruby

counter = 0
File.open("results201101.dat").each do |line|
  if line =~ /FRED/
    counter += 1
  end
end

puts counter


==> ruby-2.rb <==
#!/usr/bin/env ruby

r = Regexp.new('FRED')

counter = 0
File.open("results201101.dat").each do |line|
  if r.match(line)
    counter += 1
  end
end


==> ruby-3.rb <==
#!/usr/bin/env ruby

counter = 0
File.open("results201101.dat").each do |line|
  if line.index("FRED")
    counter += 1
  end
end


==> ruby-4.rb <==
#!/usr/bin/env ruby

count = 0

File.foreach "results201101.dat" do |line|
  count += 1 if /FRED/ =~ line
end

puts count


==> ruby-5.rb <==
#!/usr/bin/env ruby

count = 0

File.foreach "results201101.dat", encoding: "ASCII" do |line|
  count += 1 if /FRED/ =~ line
end

puts count


==> ruby-6.rb <==
#!/usr/bin/env ruby

count = 0

File.foreach "results201101.dat", encoding: "ASCII" do |line|
  # count += 1 if /FRED/ =~ line
end

puts count


And here's the test run

$ for i in {1..2}; do echo "RUN $i"; time fgrep -c FRED results201101.dat; for f in ./*.pl ./*.rb; do echo "$f"; time "$f"; done; done

This was all on Cygwin on a machine with plenty of memory, so the file most
likely came from the OS cache and there was no real disk IO.
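
On the encoding question: the sketch below only shows which encoding tag the lines carry when the `encoding:` option is passed to `File.foreach`. It uses a tiny throwaway file (hypothetical content, not the real data) and does not measure speed; in the timings above ruby-4.rb and ruby-5.rb came out practically the same anyway.

```ruby
require "tempfile"

seen = {}

# Hypothetical three-line sample standing in for results201101.dat.
Tempfile.create("sample") do |file|
  file.write("1 OOOO\n2 FRED\n3 OOOO\n")
  file.close

  %w[ASCII BINARY].each do |name|
    # Without a block, File.foreach returns an enumerator over the lines.
    first_line = File.foreach(file.path, encoding: name).first
    seen[name] = first_line.encoding
    puts "#{name}: #{first_line.encoding}"
  end
end
```

"ASCII" is an alias for US-ASCII and "BINARY" for ASCII-8BIT, so the second variant hands back untagged bytes.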

Ah, it gets a tad better without regexp:

$ time ./ruby-7.rb
2659

real    0m3.432s
user    0m2.869s
sys     0m0.529s
$ cat ruby-7.rb
#!/usr/bin/env ruby

count = 0
f = 'FRED'

File.foreach "results201101.dat", encoding: "BINARY" do |line|
  count += 1 if line.include? f
end

puts count
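
To look at the matching cost separately from file IO, something like the following could be used. It is only a sketch under stated assumptions: the data is a small in-memory array shaped like the generated file (with "FRED" placed deterministically on every 1000th line), and the `matchers` table is made up for illustration. Each variant goes through a lambda call, which adds the same overhead to all three.

```ruby
# Hypothetical in-memory data set, much smaller than the real file.
padding = "X" * 243
lines = Array.new(100_000) do |i|
  format("%7d%s%s\n", i, padding, i % 1000 == 0 ? "FRED" : "OOOO")
end

needle = "FRED"
regex  = /FRED/

# The three matching strategies used in the scripts above.
matchers = {
  "=~"       => ->(line) { regex =~ line },
  "index"    => ->(line) { line.index(needle) },
  "include?" => ->(line) { line.include?(needle) },
}

counts = {}
matchers.each do |name, matcher|
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  counts[name] = lines.count { |line| matcher.call(line) }
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  printf "%-9s %6d matches  %.3fs\n", name, counts[name], elapsed
end
```

All three should report the same number of matches; the timings will of course depend on the Ruby version and the machine.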


Kind regards

robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
