Re: [OT] of the merit of using awk for performance or who's got the fastest quote() (Was: sh(1): is roundtripping of the positional parameter stack possible?)

Robert Elz Wed, 17 May 2017 00:34:22 -0700

    Date:        Tue, 16 May 2017 18:46:39 +0100
    From:        Stephane Chazelas <stephane.chaze...@gmail.com>
    Message-ID:  <20170516174639.ga19...@chaz.gmail.com>


  | Actually, even for a handful of arguments, and even with gawk,
  | it seems it's generally quicker to use awk in my tests:

One can make benchmarks that produce whatever result one wants, of
course.  Here's mine, for what I expect to be a more life-like
application than simply quoting zillions of strings, all of which
just happen to contain multiple single quote characters, by accident...

. "$1"                  # this supplies the definition of "quote" and no more

for file in *
do
        x=$(quote "$file")
done

Of course, while that is using "quote" the way I imagined it being
used, the real work of a real-life script is missing (what we actually
do with x once we have it, and everything else that happens) - but since
we're only interested in benchmarking quote implementations, not the "real"
part of the "real" application, and really, not the shell either, that's
fine I think.

This test means that we're not just measuring the startup time of the
shell concerned (by running the shell over and over.)

Steffen (and others) can decide if my test, or yours, is more relevant
given the use he had in mind originally.

I am running this with the script below (using bash for it, as I don't
believe the "time" reserved word and TIMEFORMAT are posix).  I ran it
3 times, in three different directories.

printf "In a directory with %s files...\n\n" $(ls | wc -l)

for shell in sh dash bash fbsh mksh zsh yash
do
        for script in $(ls /tmp/T)
        do
                TIMEFORMAT="Shell $shell using script $script:  %U"
                (time "$shell" /tmp/timer "/tmp/T/${script}") 2>&1 |
                         expand -t 40
        done
        printf '\n'
done

/tmp/timer is the first script I included above - though I also tested
a version of that which checked the results (by expanding $x and
verifying that the result was the same as $file - just to check that
the quote functions worked correctly .. both did.)
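
A checking variant along those lines might look like the sketch below.
It is not the script actually used for the timings.  The names
check_roundtrip, _cr_quoted and _cr_back are made up for illustration,
and the sed-based quote is only a stand-in so that the example runs on
its own (in the real test the definition comes from the file named
by "$1").

```sh
# Stand-in quote() so this sketch is self-contained; the real timer
# gets its definition from the file passed as "$1" instead.
quote() {
        printf '%s\n' "$1" | sed "s/'/'\\\\''/g; 1s/^/'/; \$s/\$/'/"
}

# Hypothetical helper: quote "$1", let the shell re-read the quoted
# text via eval, and succeed only if the original string comes back.
check_roundtrip() {
        _cr_quoted=$(quote "$1") || return 1
        eval "_cr_back=${_cr_quoted}"
        [ "${_cr_back}" = "$1" ]
}

for file in *
do
        check_roundtrip "$file" ||
                printf 'quote failed for: %s\n' "$file" >&2
done
```

The eval is what makes this a real check: it has the shell parse the
quoted text again, which is the only thing the output of quote is for.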

There is (or should be, no idea what the mailer will do) a tab before %U
in TIMEFORMAT; that and the expand are just to make the results look pretty...
This doesn't affect the results at all, nor does the subshell in that, and
while there is probably a better way to make pretty results (without
yet another awk script!), this worked well enough.
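
One way to get the same column alignment without the tab and the expand
would be to let printf pad the label.  The sketch below assumes the time
has already been captured into a variable (for example with TIMEFORMAT
set to just %U).  format_row is a made-up name, and this is not what
produced the results in this message.

```sh
# Hypothetical helper: left-justify the label in a 40 column field,
# then print the time, so every time starts in column 41.
format_row() {
        printf '%-40s%s\n' "$1" "$2"
}

format_row "Shell sh using script kre:" "0.011"
```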

In that list of shells, "sh" is (a quite old version of) the NetBSD shell,
"fbsh" is the FreeBSD shell (about a year old, as are most of the others)
and the rest you recognise.  No ksh93 in there, as (for reasons not relevant
here) I don't have that one installed on the system where I'm running this
test, and I don't have most of the others installed on a system where I do
have ksh93.

There are just 2 scripts in /tmp/T, yours (exactly as you gave it in the
last message) and mine.  Mine is modified from the version I had, which
dealt with leading/trailing ' chars by processing them specially (that
version was never posted here); it now uses Jilles' better solution to
that problem, though his generates even more redundant quotes in the cases
where it makes a difference - for a leading ' in the input, I would have
output a leading \' in the result, now we get ''\' instead.  Both are
correct of course.

Incidentally, to refer to another point you made, by "redundant quotes"
I did not really mean ones that actually quote something (though the example
I gave was not a good one) but extra '' pairs inserted into the output,
just because it is convenient for the implementation.

They accomplish nothing, hurt nothing (except my eye-balls), and add a
little extra processing time for the shell when it processes the result,
(more chars to deal with) but really don't matter one way or the other.
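
To make the leading ' case concrete: for the input 'abc (a single quote
followed by abc) the two forms would be as in the assignments below, and
the shell reads both back as the same string.  The variable names are
just for illustration.

```sh
# Two quoted forms of the 4-character string  'abc
with_pair=''\''abc'     # starts with a redundant '' pair
without_pair=\''abc'    # shorter form, starts directly with \'

# Both expand to the same value.
if [ "$with_pair" = "$without_pair" ]
then
        printf 'same string: %s\n' "$with_pair"
fi
```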

I'll include the two quote functions used, for completeness, at the end.

I did try a version of mine, modified in a non-posix way to use "local"
rather than explicit saving and restoring (etc), but that only makes any
difference at all when the arg to quote contains a single-quote char, which
is likely very rare.  So there was no real difference between the results
for it and the posix version of my script; it just cluttered the output,
so I removed that one (even though if I were actually using this function
that would be the version I'd use -- all sane shells support "local").
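
A "local" variant along those lines could look roughly like the sketch
below.  It is a reconstruction for illustration, not the version that
was actually timed (that one was not posted).  The variable names are
made up, and it assumes a shell where "local" accepts several names and
restores IFS on return.  The noglob setting is handled by hand via $-
rather than by any shell-specific extension.

```sh
quote() {
        case "$1" in
        *\'*)   ;;               # the harder case, handled below
        *)      printf "'%s'" "$1"; return 0;;
        esac

        # Assumption: "local" is available (it is not in posix).
        # IFS goes back to its previous state when the function returns.
        local IFS result piece noglob_was_on

        case "$-" in
        *f*)    noglob_was_on=yes;;
        *)      noglob_was_on=no; set -f;;
        esac

        IFS=\'
        set -- ''$1''            # split on ', keeping empty end fields
        result=$1
        shift

        for piece
        do
                result="${result}'\\''${piece}"
        done
        printf "'%s'" "${result}"

        [ "${noglob_was_on}" = yes ] || set +f
        return 0
}
```

The save and restore bookkeeping for IFS disappears, and nothing needs
to be unset at the end, but the fast path at the top is unchanged,
which is why the timings come out the same for names without a '.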

Now we can expect that most directories will not contain files with '
chars in their names, in fact, there are most likely none - but this is
also what is most likely to occur in any real life application.  We
have to deal with the possibility of it occurring, and handle it properly,
but we do not have to expect it and optimise for it.

I also realise that you could optimise your function in the same way I
did mine (from the first version) to deal with that case, in which case
the results from these tests would probably show almost the same times for
your function and mine, so there's no need to do that and reply...

Lastly, I have no idea which version of awk this is using (for your
script), that has never bothered me much, but it is most likely not
gawk (but with your version, you're always going to be taking your
chances with how fast/slow the locally installed awk happens to be.)

These are the results for the 3 directories I ran the script in: the first
is my $HOME (relatively small, but not so small that the test measures
nothing), the second a medium-sized directory with often quite long and
messy file names, and the third a directory with a large number of very
bland (reasonably short) file names.  The times are user CPU seconds
(the %U in TIMEFORMAT).

In a directory with 71 files...

Shell sh using script kre:              0.011
Shell sh using script stephane:         0.055

Shell dash using script kre:            0.010
Shell dash using script stephane:       0.057

Shell bash using script kre:            0.022
Shell bash using script stephane:       0.082

Shell fbsh using script kre:            0.011
Shell fbsh using script stephane:       0.046

Shell mksh using script kre:            0.052
Shell mksh using script stephane:       0.069

Shell zsh using script kre:             0.022
Shell zsh using script stephane:        0.077

Shell yash using script kre:            0.019
Shell yash using script stephane:       0.051

In a directory with 11522 files...

Shell sh using script kre:              1.956
Shell sh using script stephane:         8.410

Shell dash using script kre:            1.780
Shell dash using script stephane:       9.196

Shell bash using script kre:            6.656
Shell bash using script stephane:       15.623

Shell fbsh using script kre:            1.852
Shell fbsh using script stephane:       7.831

Shell mksh using script kre:            8.788
Shell mksh using script stephane:       10.599

Shell zsh using script kre:             4.510
Shell zsh using script stephane:        14.621

Shell yash using script kre:            3.489
Shell yash using script stephane:       8.678

In a directory with 77062 files...

Shell sh using script kre:              14.026
Shell sh using script stephane:         58.061

Shell dash using script kre:            13.398
Shell dash using script stephane:       64.897

Shell bash using script kre:            27.751
Shell bash using script stephane:       87.852

Shell fbsh using script kre:            13.407
Shell fbsh using script stephane:       53.715

Shell mksh using script kre:            63.250
Shell mksh using script stephane:       74.073

Shell zsh using script kre:             57.982
Shell zsh using script stephane:        132.030

Shell yash using script kre:            24.901
Shell yash using script stephane:       59.346



And finally, here are the scripts, first mine (/tmp/T/kre):

-------------------------------------------------------------
quote() {
        case "$1" in
        *\'*)   ;;               # the harder case, we will get to below.
        *)      printf "'%s'" "$1"; return 0;;
        esac

        _save_IFS="${IFS}"; ${IFS+":"} unset _save_IFS
        _save_OPTS="$(set +o)"
        IFS=\'
        set -f

        set -- ''$1''
        _result_="${1}"
        shift

        for __A__
        do
                _result_="${_result_}'\\''${__A__}"
        done
        printf "'%s'" "${_result_}"

        # now clean up

        IFS="${_save_IFS}"; ${_save_IFS+":"} unset IFS
        eval "${_save_OPTS}"
        unset _result_ _save_IFS _save_OPTS __A__

        return 0;
}
-------------------------------------------------------------

and then yours (/tmp/T/stephane):

-------------------------------------------------------------
quote() {
  LC_ALL=C awk -v q="'" -v b='\\' '
    function quote(s) {
      gsub(q, q b q q, s)
      return q s q
    }
    BEGIN {
      sep = ""
      for (i = 1; i < ARGC; i++) {
        printf "%s", sep quote(ARGV[i])
        sep = " "
      }
      if (sep) print ""
    }' "$@"
}
-------------------------------------------------------------

kre
