Hey Jon,

One more follow-up and then it's bedtime for me.  I wanted to take this
discussion a little further by porting the mmap solution that I applied
to Perl over to Ruby.  Now, all of a sudden, Ruby is much faster: the
truss -c totals below come to roughly 0.014 s for Ruby versus 0.050 s for
Perl.  The output and my Ruby source code follow.

Goodnight!

# ruby -W0 ./doit.rb | md5
786be54356a5832dcd1148c18de71fc8
# perl ./doit2.pl | md5
786be54356a5832dcd1148c18de71fc8


# truss -c ruby -W0 ./doit.rb
<!-- snip -->
                      ------------- ------- -------
                        0.014111502    1855     260

# truss -c perl ./doit2.pl
<!-- snip -->
                      ------------- ------- -------
                        0.049820267     777      52



-------------------------------------
require 'mmap'  # third-party mmap gem, not part of the standard library

# Load the stopword list into a hash for fast membership tests.
stopwords = {}
mmap_s = Mmap.new('stopwords.txt')
mmap_s.advise(Mmap::MADV_SEQUENTIAL)
mmap_s.each_line do |s|
  s.strip!
  stopwords[s] = 1
end

# Count every word that is not a stopword; missing keys default to 0.
count = Hash.new(0)
mmap_c = Mmap.new('words.txt')
mmap_c.advise(Mmap::MADV_SEQUENTIAL)
mmap_c.each_line do |s|
  s.strip!
  count[s] += 1 unless stopwords.key?(s)
end

# Sort by count, highest first, and print the top 20.
z = count.sort { |a1, a2| a2[1] <=> a1[1] }
z.take(20).each { |s| puts "#{s[0]} -> #{s[1]}" }
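
If you don't have the mmap gem installed, here is a rough stdlib-only
sketch of the same counting logic.  It is not what I measured above: it
goes through Ruby's ordinary buffered IO (File.foreach) instead of mmap,
the top_words name and its parameters are made up for illustration, and
breaking ties alphabetically is my own choice so the output is stable.

```ruby
require 'set'
require 'tempfile'

# Hypothetical helper: count the words in words_path, skipping anything
# listed in stopwords_path, and return the top `limit` [word, count]
# pairs, most frequent first (ties broken alphabetically).
def top_words(words_path, stopwords_path, limit = 20)
  stopwords = Set.new
  File.foreach(stopwords_path) { |line| stopwords << line.strip }

  counts = Hash.new(0)
  File.foreach(words_path) do |line|
    word = line.strip
    counts[word] += 1 unless stopwords.include?(word)
  end

  counts.sort_by { |word, n| [-n, word] }.first(limit)
end

# Tiny demo with throwaway files (the data here is invented).
Tempfile.create('stop') do |stop|
  Tempfile.create('words') do |words|
    stop.write("the\n")
    stop.flush
    words.write("the\ncat\ndog\ncat\n")
    words.flush
    top_words(words.path, stop.path).each { |w, n| puts "#{w} -> #{n}" }
  end
end
```

With the real files you would call top_words('words.txt', 'stopwords.txt')
and print the pairs the same way.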

On Sat, Jan 15, 2022 at 3:48 AM Paul Procacci <pproca...@gmail.com> wrote:

> Hey Jon,
>
> On Sat, Jan 15, 2022 at 3:04 AM Jon Smart <j...@smartown.nl> wrote:
>
>>
>> Hello Paul
>>
>> Do you mean by undef $/ and with <$fh> we can read the file into memory
>> at one time?
>>
>
> In most cases the short answer is yes.
> Being the 'geek' that I am, though, I have a problem with your wording.
> 'At one time' .... not quite.  In your example, for instance, the program
> made over 4000 read(2) syscalls.  That wouldn't have been 'at one time'.  ;)
>
>
>> Yes, that would be faster b/c we don't need to read the file line by line,
>> which increases the disk IO.
>>
>>
> It actually doesn't make it faster.
> Perl buffers its reads, as do all modern programming languages.  If you
> ask perl to give you 10 bytes it certainly will, but what you don't know is
> that perl has really read up to 8192 bytes.  It only gave you what you
> asked for and the rest is sitting in perl's buffers.
> To put this another way, you can put 8192 newline characters in a file and
> read this file line by line.  This doesn't equate to 8192 separate read(2)
> syscalls ... a single read(2) fetches all of the data.  It will be neither
> faster nor slower.
>
>
>
>> Another questions:
>> 1. what's the "truss" command?
>>
>
> truss is akin to strace.  If you're on linux, you can install strace and
> get much the same kind of utility.
> It allows you to trace system calls and see how much of your time for a
> given program is waiting on the kernel and/or how often it's asking the
> kernel to do something.
>
>> 2. what's the syntax "<:mmap"?
>>
> mmap is a method of mapping a file (among other things) into memory on an
> on-demand basis.
> Given the example you provided, this is actually where the speed up comes
> from.  This is because my version removes the 4000+ read(2) syscalls in
> favor of just 2 mmap(2) syscalls.
>
>> Thank you.
>
>
> ~Paul
>
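
To make the buffering point from the quoted message concrete, here is a
small Ruby sketch.  CountingLineReader is a hypothetical helper of my own,
not how Perl or Ruby actually implement their buffers: it refills an
8192-byte buffer with IO#sysread (an unbuffered read) and counts how many
times it has to go back to the OS while handing out lines one at a time.

```ruby
require 'tempfile'

# Hypothetical helper: a minimal buffered line reader that counts how many
# times it calls IO#sysread to refill its buffer.
class CountingLineReader
  attr_reader :reads

  def initialize(io, chunk_size = 8192)
    @io = io
    @chunk_size = chunk_size
    @buffer = String.new # empty, mutable, binary-encoded buffer
    @reads = 0
    @eof = false
  end

  # Yield each line; only touch the OS when the buffer holds no full line.
  def each_line
    loop do
      while (idx = @buffer.index("\n"))
        yield @buffer.slice!(0..idx)
      end
      break if @eof
      fill
    end
    # Hand out a final line that has no trailing newline, if there is one.
    yield @buffer.slice!(0..-1) unless @buffer.empty?
  end

  private

  # One refill attempt: either appends up to chunk_size bytes or notes EOF.
  def fill
    @reads += 1
    @buffer << @io.sysread(@chunk_size)
  rescue EOFError
    @eof = true
  end
end

# Demo: a file holding nothing but 8192 newline characters.
Tempfile.create('newlines') do |f|
  f.write("\n" * 8192)
  f.flush
  File.open(f.path, 'rb') do |io|
    reader = CountingLineReader.new(io)
    lines = 0
    reader.each_line { lines += 1 }
    puts "#{lines} lines, #{reader.reads} sysread calls"
  end
end
```

On a regular file I would expect this to report 8192 lines but only a
couple of sysread calls: one that returns the data and one that hits
end-of-file.  The number of lines has essentially no bearing on the
number of reads.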


-- 
__________________

:(){ :|:& };:
