Hi,

I am using powerpc-eabi-gcc (3.4.1) and trying to retarget it for a fully pipelined FPU. I have a DFA model for the FPU. I am looking at the code produced for a simple FIR algorithm (a loop iterating over an array, with a multiply-add operation per iteration). I am not using the fused multiply-add instructions.
    for (i = 0; i < 64; i++)
        accum += z[i] * h[i];

I have the FIR loop partially unrolled, yet I am not seeing the multiply from, say, iteration i+1 overlapping with the multiply from iteration i. From the scheduling dumps, I can see that the compiler treats each use of a multiply result as incurring the full multiply latency, rather than hiding that latency by pipelining in software.

The adds are also completely linked by data flow, and the compiler does not seem to use temporary registers so that some of the adds could execute in parallel. Hence, each add stalls on the previous add:

    fadds f5,f0,f8
    fadds f4,f5,f6
    fadds f2,f4,f11
    fadds f1,f2,f3
    fadds f11,f1,f13

Register pressure is not very high; registers f15-f31 are not used at all.

My question is: am I expecting the wrong version of GCC to do this? I saw the following thread about SMS (swing modulo scheduling), which seems relevant:

http://gcc.gnu.org/ml/gcc/2003-09/msg00954.html

Would GCC 4.x be a better version for my requirement? If not, any ideas would be greatly appreciated.

Thanks in advance,
Vasanth
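
P.S. To make the question concrete, here is a rough hand-written sketch of the kind of transformation I was hoping the compiler would apply to the adds. It assumes the loop accumulates into accum and that z and h are float arrays of 64 elements. The function names, the FIR_TAPS macro, and the choice of four partial sums are only for illustration and are not taken from my actual code. Note that splitting the sum changes the order of the floating-point additions, so the rounding can differ slightly from the single-accumulator loop.

```c
#include <stddef.h>

#define FIR_TAPS 64  /* loop bound from the loop above; name is illustrative */

/* Reference version: one accumulator, so every add depends on the
   result of the previous add (the chain seen in the fadds sequence). */
float fir_single(const float *z, const float *h)
{
    float accum = 0.0f;
    size_t i;

    for (i = 0; i < FIR_TAPS; i++)
        accum += z[i] * h[i];
    return accum;
}

/* Hand-split version: four independent partial sums, so the four adds
   in one unrolled iteration do not depend on each other. The count of
   four is an arbitrary choice for this sketch and assumes FIR_TAPS is
   a multiple of four. */
float fir_split4(const float *z, const float *h)
{
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    size_t i;

    for (i = 0; i < FIR_TAPS; i += 4) {
        a0 += z[i]     * h[i];
        a1 += z[i + 1] * h[i + 1];
        a2 += z[i + 2] * h[i + 2];
        a3 += z[i + 3] * h[i + 3];
    }
    /* Combine the partial sums once, after the loop. */
    return (a0 + a1) + (a2 + a3);
}
```

In the split version only the final three adds are serialized; inside the loop each partial sum has its own dependency chain, which is what I would expect a pipelined FPU to be able to overlap.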