I made a test implementation of a performance optimisation for GNU APL. The
short story is that it seems to work for limited tests, but there may be
issues with my approach and I don't want to spend any more time on it
unless Jürgen agrees it's a good idea. After all, I may have completely
misunderstood the architecture of GNU APL and my approach may be
fundamentally broken.

When doing some benchmarking I noticed a lot of time was spent allocating,
initialising and freeing Value instances. I realised that many such
allocations are not actually needed. Take the following expression for
example:

,1+2×foo


In this case, the following things happen:

   1. The multiplication function (implemented by eval_skalar_AB) creates a
   new array and fills it with the result of the operation (in this case,
   bif_multiply) on each cell.
   2. The addition function is then called, resulting in an identical flow
   as in the multiplication above.
   3. The ravel function is then called (Bif_COMMA::ravel), resulting in
   yet another array being created which is simply copied from B. Note that
   the ravel is identical between Z and B in this case.

>From a performance perspective, I saw a particular low-hanging fruit in the
ravel call in the last point. Since the value that came from the addition
is never used anywhere else, one can simply change the shape of B in-place
and return it immediately.

My experiment involved creating a new VF (value flag) value; VF_temp that
indicates that the value may be reused. This flag is then set for all
values that are only used once (for example, the return value from a
primitive or user function). If a value needs to be preserved, it is
copied, and the flag is not set.

What all of this means is that values that are returned from a function,
can be reused if it makes sense. There are several cases where this is the
case. For example:

   - The comma function (as discussed above)
   - Reshape, especially when the shape of Z matches that of B
   - Scalar functions. If the sizes of the arguments of plus, for example,
   matches, then A or B can be reused if they are marked as temporary.
   - The each operator can (maybe?) push the result straight into the
   source array.

I've only done very basic testing, but the ravel operator on a large array
went from taking a second or so to pretty much zero. Same should be true
for reshape, although that one crashes for me right now. :-)  Scalar
functions still have to process the elements, so their timing can't go to
zero, but at least the should save the time spent setting up the result
array and creating new values.

Please let me know what you think.

Elias

Reply via email to