Hi!
----
We hit a serious performance issue with $ read -C array[index] # for
large arrays...
... the following test case ...
-- snip --
set -o nounset
builtin sum
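# first script argument selects the storage code path:
# 'true'  ... store via $ read -C sar[index] # directly (slow)
# 'false' ... read into a temporary compound 'c' first, then copy (fast)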
bool -r slow_store=$1
# create a stream of compound variables and
# send it to file 'xxx'
print -u2 '# creating test data set...'
(
    integer i
    seq 4000 | while read i ; do
        (( i=i % 100 ))
        compound c
        integer c.i
        typeset c.s=$(printf "hello_%x" i)
        float c.as c.ac c.at
        (( c.i=i ))
        (( c.as=sin(i) ))
        (( c.ac=cos(i) ))
        (( c.at=tan(i) ))
        print -v c
    done >'xxx'
)
# read stream of compound variables
# for each compound variable we add two sort keys
# separated by a <TAB> character, followed by the
# plain compound variable printed in a single line
# via $ print -C comvar #
print -u2 '# adding sort keys...'
(
    compound comvar
    while read -C comvar ; do
        # print the two sort keys
        printf 'KEY1=%d\t' comvar.i
        printf 'KEY2=%x\t' $((comvar.i%16))
        print -C comvar
    done <'xxx' >'xxx_unsorted'
)
print -u2 '# sorting...'
# sort data, either via KEY1 or KEY2
#sort -t $'\t' -k1 <'xxx_unsorted' >'xxx_sorted'
sort -t $'\t' -k2 <'xxx_unsorted' >'xxx_sorted'
# now read the sorted data, remove the KEY1/KEY2 sort
# keys and store the compound variables in array "sar"
print -u2 '# removing sort keys...'
(
    compound -a sar
    integer num_sar=0
    typeset dummy_key1 dummy_key2 s
    compound c
    while IFS=$'\t' read -r dummy_key1 dummy_key2 s ; do
        if ${slow_store} ; then
            read -C sar[num_sar] <<<"$s"
        else
            # bug: can't use sar[num_sar++] here
            read -C c <<<"$s"
            sar[num_sar]=c # copy c to sar[num_sar]
        fi
        (( num_sar++ ))
    done <'xxx_sorted'
    print -u2 '# storing data...'
    print -v sar >'xxx_final'
)
print -u2 -f '# output hash: %q\n' "$(sum -x md5 'xxx_final')"
print -u2 '#done.'
exit 0
-- snip --
... produces the following two very different execution times:
-- snip --
$ time ~/bin/ksh sortcompound.sh true
# creating test data set...
# adding sort keys...
# sorting...
# removing sort keys...
# storing data...
# output hash: '379a5de3bf696b75f267339bc2fa2e98 xxx_final'
#done.
real 2m52.371s
user 2m51.453s
sys 0m0.396s
$ time ~/bin/ksh sortcompound.sh false
# creating test data set...
# adding sort keys...
# sorting...
# removing sort keys...
# storing data...
# output hash: '379a5de3bf696b75f267339bc2fa2e98 xxx_final'
#done.
real 0m25.490s
user 0m25.217s
sys 0m0.181s
-- snip --
The difference is that the long-running version uses $ compound -a ar
; ... read -C ar[index] # while the much faster version uses $ compound
-a ar ; compound c ; ... read -C c ; ar[index]=c #, i.e. it first reads
the compound variable data into variable "c" and only later copies the
whole compound variable into the array entry...
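For reference, a stripped-down sketch of just the two storage idioms
(variable and file names here are made up for illustration):
-- snip --
# slow path: $ read -C # writes straight into an array element
compound -a ar
integer i=0
while read -C ar[i] ; do
    (( i++ ))
done <'data'

# fast path: read into a plain compound first, then copy it over
compound -a ar
compound c
integer i=0
while read -C c ; do
    ar[i]=c # copy the whole compound in one step
    (( i++ ))
done <'data'
-- snip --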
... erm... why does this happen? My guess is that $ read -C ... #
does an array lookup for each compound variable member as that member
is read, which makes each store very slow. The fast version avoids the
problem by loading all data into a local compound variable _first_ and
then copying it into the matching array slot in one step, avoiding the
per-member array lookups...
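If that guess is right it should show up even without the sort
machinery, by timing just the two $ read -C # targets in isolation.
A rough micro-benchmark sketch (the loop count, the file name
'one_compound' and the array names are made up for illustration):
-- snip --
compound -a sar1 sar2
compound c=( integer i=1 ; typeset s='hello' )
integer n
print -v c >'one_compound'
# read directly into a growing compound array (expected slow)
time for (( n=0 ; n < 4000 ; n++ )) ; do
    read -C sar1[n] <'one_compound'
done
# read into a scalar compound first, then copy (expected fast)
time for (( n=0 ; n < 4000 ; n++ )) ; do
    read -C c <'one_compound'
    sar2[n]=c
done
-- snip --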
... David... what do you think?
----
Bye,
Roland
--
__ . . __
(o.\ \/ /.o) [email protected]
\__\/\/__/ MPEG specialist, C&&JAVA&&Sun&&Unix programmer
/O /==\ O\ TEL +49 641 3992797
(;O/ \/ \O;)