On 2019-06-30 11:10 a.m., Peng Yu wrote:
> The problem with this kind of awk program is that everything will be loaded into memory.

Well, those are the two main options: store in memory, or resort to disk
I/O. Each has its own pros and cons.

> But bare `sort` uses external files to save memory.
Not exactly - the goal is not to "save" memory.
Sort resorts to external files so it can complete the sort even
when it runs out of its allotted memory (which can be controlled
with the "-S" parameter).
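For example (just an illustrative sketch - the file name and size limit here are arbitrary), capping GNU sort's buffer forces it to spill sorted runs to temporary files and merge them, rather than holding everything in memory:

```shell
# Illustrative: cap GNU sort's in-memory buffer at 1 MiB with -S.
# With input larger than the buffer, sort writes sorted runs to
# temporary files and merges them instead of keeping all lines in RAM.
printf '%s\n' banana apple cherry > /tmp/sort_s_demo
sort -S 1M /tmp/sort_s_demo
# prints: apple, banana, cherry (one per line)
```

The temporary-file location can likewise be redirected with sort's "-T" option if /tmp is too small.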

I'm not familiar with a program that implements hashing backed by
file storage, but perhaps such a program exists.

> When the hash in awk is too large, accessing it can become very slow (perhaps due to cache misses, or the hash slowing down as a function of its size).

Nothing is "free", and using a hash incurs its own costs.

If you're using the simple awk hashing program,
try AWK implementations other than GNU awk (e.g. I have had
some performance gains from switching to "mawk", the default awk in
Debian):

  $ printf "%s\n" a c b b b b b b c \
     | mawk 'a[$1]++ {} END { for (i in a) { print i, a[i] } }'
  a 1
  b 6
  c 2

Or, if your input is exceedingly large, perhaps consider pre-processing
it and splitting the input into smaller files - each one will contain
fewer strings, and hashing them will consume less memory.
The following example splits the input file into 27 files, based on
the first letter of each string (plus an "other" file for non-letters):

     mawk '{ l = tolower(substr($0,1,1)) ;
             if (l>="a" && l<="z") {
               print $0 > l
             } else {
               print $0 > "other"
             }
           }' INPUT

This is an O(N) operation that consumes almost no memory (just lots of
disk I/O) - and since the resulting files are much smaller, each can
then be hashed with less memory.
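The second pass could then look something like this (a sketch only - it assumes the per-letter files from the split above sit in the current directory; the sample data and scratch directory are made up, and plain awk is used, though mawk works the same):

```shell
cd "$(mktemp -d)"                  # scratch dir standing in for the split output
printf '%s\n' ant apple ant > a    # pretend these came from the first pass
printf '%s\n' bee > b
# Count strings one bucket at a time, so only a single bucket's
# hash is ever held in memory.
for f in [a-z] other; do
  [ -e "$f" ] || continue
  awk '{ a[$0]++ } END { for (i in a) print i, a[i] }' "$f"
done
```

Each bucket prints its strings with their counts (the order within a bucket is unspecified, since `for (i in a)` iterates in arbitrary order).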

Of course this can be extended to split into smaller-grained files.
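For instance (again only a sketch - the "bucket_" file names, scratch directory, and sample input are made up), keying on the first two characters yields up to 26*26 smaller files instead of 26:

```shell
cd "$(mktemp -d)"                            # scratch directory
printf '%s\n' apple apricot bee 42 > INPUT   # tiny stand-in input
# Split on the first two lowercase letters, writing to hypothetical
# files "bucket_aa" .. "bucket_zz", plus "bucket_other" for the rest.
awk '{ p = tolower(substr($0, 1, 2))
       if (p ~ /^[a-z][a-z]$/) print > ("bucket_" p)
       else print > "bucket_other" }' INPUT
# here: apple and apricot land in bucket_ap, bee in bucket_be,
# and 42 in bucket_other
```

Note the parentheses around ("bucket_" p) - awk requires them when the redirection target is a concatenation expression.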

-assaf
