On Wed, May 07, 2008 at 06:00:38AM +0200, Jens M Andreasen wrote: > One thing that I wonder is what the pattern of addition in Fons's > application really looks like. I assume the fftA * fftB is some windowed > precalculated impulsresponse and a signal? The addition/accumulate > suggests that the output fftD has been touched before, implying that > there could be more variables to work on at once or that the vectors > would still be in the cache if the order of addition was changed.
It's quite complex. I'll try to build up a picture in four steps. 1. The first step to imagine is just a scalar matrix operating on a number of input sample streams A [j][t] and producing a number of output sample streams B [i][t]: index i = output index j = input index t = time B [i][t] = sum_j (M [i][j] * A [j][t]) 2. Now imagine that the matrix is not scalar, but a matrix of FIR filters: B [i][t] = sum_j (sum_k (M [i][j][k] * A [j][t- k])) with k = 0...L-1, for filters of length L. We now need to store the past L-1 samples of each input as well. Assume this calculation is done for one value of t (i.e. one sample) at a time (we'll see why soon). There are six ways to organise the three nested loops. The one actually used is such that k is the innermost index. 3. Now imagine that each sample in the above equation is actually a block of P samples. This is how FFT based partitioned convolution works. We need a size 2P FFT on each input block (the second half are zeros) and on each matrix element (the latter is precomputed of course), a size 2P IFFT on each output block, and each of the multiplications now becomes a MAC loop over P complex values. The output size is 2P, i.e. the second half overlaps witht the next iteration and has to be stored. There are size(A) * size(B) * L such MAC loops for each block, so it helps to optimize this operation. 4. The scheme above is FFT-based partitioned convolution with a single partition size P. For efficiency you want large P, but this also introduces processing delay as you can only start the computation when P new input samples are available. To avoid this delay in a real-time application zita-convolver uses multiple partition sizes, small at the start of an IR, and larger ones for the later parts. There can be up to five sizes. Calculations for P == period size are performed directly in the JACK callback, the longer ones are performed by lower priority threads. A final optimisation is that a sparse matrix representation is used in all three dimensions, so no time or memory is wasted on zero-valued data. -- FA Laboratorio di Acustica ed Elettroacustica Parma, Italia Lascia la spina, cogli la rosa. _______________________________________________ Linux-audio-dev mailing list [email protected] http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev
