Hi everyone,

Since I'm masochistic in my desire to understand and improve the Free Pascal Compiler, I would like to add some vectorisation support to its optimisation cycle, since that is something many other compilers attempt to do these days. But before I begin, does FPC already support any kind of vectorisation? If it does, I haven't been able to find it yet, and I don't want to end up reinventing the wheel.
I recall cases where, for example, the following is not optimised even when the compiler is set to use SSE:

    type
      TVector4f = packed record
        X, Y, Z, W: Single;
      end;

    function VectorAdd(A, B: TVector4f): TVector4f;
    begin
      Result.X := A.X + B.X;
      Result.Y := A.Y + B.Y;
      Result.Z := A.Z + B.Z;
      Result.W := A.W + B.W;
    end;

The resultant assembler code uses an individual "MOVSS" and arithmetic instruction for each element, rather than combining the reads and writes into MOVUPS instructions and reducing the number of arithmetic instructions by a factor of 4.

For clarity, this is the assembler produced with '-CfSSE64':

    .section .text.n_p$testfile_$$_addvector$tvector4f$tvector4f$$tvector4f,"x"
        .balign 16,0x90
    .globl P$TESTFILE_$$_ADDVECTOR$TVECTOR4F$TVECTOR4F$$TVECTOR4F
    P$TESTFILE_$$_ADDVECTOR$TVECTOR4F$TVECTOR4F$$TVECTOR4F:
    .Lc1:
    .seh_proc P$TESTFILE_$$_ADDVECTOR$TVECTOR4F$TVECTOR4F$$TVECTOR4F
        leaq    -56(%rsp),%rsp
    .Lc3:
    .seh_stackalloc 56
    .seh_endprologue
        movq    %rcx,%rax
        movq    %rdx,(%rsp)
        movq    %r8,8(%rsp)
        movq    (%rsp),%rdx
        movq    (%rdx),%rcx
        movq    %rcx,16(%rsp)
        movq    8(%rdx),%rdx
        movq    %rdx,24(%rsp)
        movq    8(%rsp),%rdx
        movq    (%rdx),%rcx
        movq    %rcx,32(%rsp)
        movq    8(%rdx),%rdx
        movq    %rdx,40(%rsp)
        movss   16(%rsp),%xmm0
        addss   32(%rsp),%xmm0
        movss   %xmm0,(%rax)
        movss   20(%rsp),%xmm0
        addss   36(%rsp),%xmm0
        movss   %xmm0,4(%rax)
        movss   24(%rsp),%xmm0
        addss   40(%rsp),%xmm0
        movss   %xmm0,8(%rax)
        movss   28(%rsp),%xmm0
        addss   44(%rsp),%xmm0
        movss   %xmm0,12(%rax)
        leaq    56(%rsp),%rsp
        ret
    .seh_endproc
    .Lc2:

A good vectoriser (for lack of a better name!) would be able to reduce the 12 movss/addss instructions to just:

        movups  16(%rsp),%xmm0
        addps   32(%rsp),%xmm0
        movups  %xmm0,(%rax)

Since the stack is aligned to a 16-byte boundary, it could swap the first movups for a movaps too. I'm not sure what to do about everything being copied to the stack first, though.
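To make the goal concrete, here is a rough hand-written sketch of the kind of code I would hope a vectoriser could produce for the function above. It is only an illustration, not tested output from any compiler change: the name VectorAddSSE is made up, I have switched the parameters to "const", and the register assignments (RCX = address of the result, RDX = address of A, R8 = address of B) are taken from the Win64 listing above, so the assembler version is guarded to that target and falls back to the scalar code everywhere else.

```pascal
program VecSketch;

{$mode objfpc}
{$asmmode att}

type
  TVector4f = packed record
    X, Y, Z, W: Single;
  end;

{$if defined(CPUX86_64) and defined(WIN64)}
{ Hypothetical hand-written target code, Win64 only.
  Assumes the register layout seen in the compiler's own listing:
  RCX = address of Result, RDX = address of A, R8 = address of B. }
function VectorAddSSE(const A, B: TVector4f): TVector4f; assembler; nostackframe;
asm
  movups (%rdx), %xmm0     // load all four elements of A (unaligned load)
  movups (%r8), %xmm1      // load all four elements of B (unaligned load)
  addps  %xmm1, %xmm0      // four single-precision additions at once
  movups %xmm0, (%rcx)     // store all four elements of Result
  movq   %rcx, %rax        // return the result address, as the listing does
end;
{$else}
{ Scalar fallback for other targets, same semantics. }
function VectorAddSSE(const A, B: TVector4f): TVector4f;
begin
  Result.X := A.X + B.X;
  Result.Y := A.Y + B.Y;
  Result.Z := A.Z + B.Z;
  Result.W := A.W + B.W;
end;
{$endif}

var
  A, B, R: TVector4f;
begin
  A.X := 1;  A.Y := 2;  A.Z := 3;  A.W := 4;
  B.X := 10; B.Y := 20; B.Z := 30; B.W := 40;
  R := VectorAddSSE(A, B);
  { Print the four element-wise sums. }
  WriteLn(R.X:0:1, ' ', R.Y:0:1, ' ', R.Z:0:1, ' ', R.W:0:1);
end.
```

Both operands are loaded with movups here because, without the stack copy, nothing guarantees that the caller's records are 16-byte aligned, and addps with a memory operand requires alignment.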
I'm sure it's a mammoth task, but I would like to start somewhere with it - however, are there any design plans that I should be adhering to so I don't end up designing something that is disliked?

Kit

_______________________________________________
fpc-devel maillist - fpc-devel@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel