Hi!

----

We hit a serious performance issue with $ read -C array[index] # for
large arrays... the following testcase...
-- snip --

set -o nounset

builtin sum

bool -r slow_store=$1

# create a stream of compound variables and
# send it to file 'xxx'
print -u2 '# creating test data set...'
(
        integer i

        seq 4000 | while read i ; do
                (( i=i % 100 ))
                compound c
                integer c.i
                typeset c.s=$(printf "hello_%x" i)
                float c.as c.ac c.at
                (( c.i=i ))
                (( c.as=sin(i) ))
                (( c.ac=cos(i) ))
                (( c.at=tan(i) ))
                print -v c
        done >'xxx'
)

# read stream of compound variables
# for each compound variable we add two sort keys
# separated by a <TAB> character, followed by the
# plain compound variable printed in a single line
# via $ print -C comvar #
print -u2 '# adding sort keys...'
(
        compound comvar

        while read -C comvar ; do
                # print sort key
                printf 'KEY1=%d\t' comvar.i
                printf 'KEY2=%x\t' $((comvar.i%16))
                print -C comvar
        done <'xxx' >'xxx_unsorted'
)


print -u2 '# sorting...'
# sort data, either via KEY1 or KEY2
#sort -t $'\t' -k1 <'xxx_unsorted' >'xxx_sorted'
sort -t $'\t' -k2 <'xxx_unsorted' >'xxx_sorted'


# now read the sorted data, remove the KEY1/KEY2 sort
# keys and store the compound variables in array "sar"
print -u2 '# removing sort keys...'
(
        compound -a sar
        integer num_sar=0
        typeset dummy_key1 dummy_key2 s
        compound c

        while IFS=$'\t' read -r dummy_key1 dummy_key2 s ; do
                if ${slow_store} ; then
                        read -C sar[num_sar] <<<"$s"
                else
                        # bug: can't use sar[num_sar++] here
                        read -C c <<<"$s"
                        sar[num_sar]=c # copy c to sar[num_sar]
                fi
                (( num_sar++ ))
        done <'xxx_sorted'

        print -u2 '# storing data...'
        print -v sar >'xxx_final'
)

print -u2 -f '# output hash: %q\n' "$(sum -x md5 'xxx_final')"

print -u2 '#done.'

exit 0
-- snip --

... produces the following two very different execution times:

-- snip --
$ time ~/bin/ksh sortcompound.sh true
# creating test data set...
# adding sort keys...
# sorting...
# removing sort keys...
# storing data...
# output hash: '379a5de3bf696b75f267339bc2fa2e98 xxx_final'
#done.

real    2m52.371s
user    2m51.453s
sys     0m0.396s
$ time ~/bin/ksh sortcompound.sh false
# creating test data set...
# adding sort keys...
# sorting...
# removing sort keys...
# storing data...
# output hash: '379a5de3bf696b75f267339bc2fa2e98 xxx_final'
#done.

real    0m25.490s
user    0m25.217s
sys     0m0.181s
-- snip --

The difference is that the long-running version uses $ compound -a ar
; ... read -C ar[index] # and the much faster version uses $ compound
-a ar ; compound c ; ... read -C c ; ar[index]=c #, i.e. it first reads
the compound variable data into variable "c" and then copies the
whole compound variable into the array entry...

... erm... why does this happen? My guess is that $ read -C ... #
does an array lookup for each compound variable member as that member
is read, which makes storing directly into an array element very slow.
The fast version avoids the problem by loading all data into a local
compound variable _first_ and then copying it into the matching array
slot, avoiding the per-member array lookups...

... David... what do you think?

----

Bye,
Roland

-- 
  __ .  . __
 (o.\ \/ /.o) [email protected]
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 3992797
 (;O/ \/ \O;)
_______________________________________________
ast-developers mailing list
[email protected]
http://lists.research.att.com/mailman/listinfo/ast-developers