Sergei Gorelkin wrote:
willem wrote:

1 If we take the mandelbrot parameter N = 5000, then CalculatePoint will be called 25000000 times.
Each call to CalculatePoint pushes x*Step - 1.5 and Cy on the stack,
then executes the statements of CalculatePoint,
and finally pops a boolean from the stack.

It does not do any pushes and pops; you can compile to assembler and see for yourself. In this particular mandelbrot program, data is passed to CalculatePoint via global variables. In general, FPC passes up to 3 parameters to functions in registers, and simple function return values (boolean, integer and the like) are also returned in registers.
OK, that is very good.
Together that makes 50000000 pushes and pops.
There is a loop in CalculatePoint that executes 600 assembler instructions per point (on average). Saving 2 push/pop instructions will improve speed by 1/300, which won't be noticeable.
I have experience with Motorola assembler, but I am going to look at the assembly generated.
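
The numbers in this exchange can be checked with a few lines of C (used here only for illustration; the program under discussion is Pascal). This is a minimal sketch: the function names are hypothetical, and the figures 5000, 2 and 600 are simply the ones quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Number of CalculatePoint calls for an N x N image. */
uint64_t point_calls(uint64_t n)
{
    return n * n;
}

/* Fraction of the per-point work saved by removing `saved` instructions
   out of `per_point` instructions (both counts come from the thread). */
double saved_fraction(double saved, double per_point)
{
    assert(per_point > 0.0);
    return saved / per_point;
}
```

With N = 5000, point_calls gives 25000000 calls, twice that is 50000000 push/pop operations, and saved_fraction(2, 600) is 1/300, or about 0.33 percent.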
You can avoid that by making CalculatePoint inline.
So no function calls in the inner loops of mandelbrot.pas.
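
To make the inlining idea concrete, here is a rough C sketch of what it could look like. It is not the code from mandelbrot.pas: calculate_point and count_row_inside are hypothetical names, and the step of 2/N and the escape radius are assumptions. Free Pascal has an "inline" directive for the same purpose; in both languages it is only a request, and the compiler decides whether to expand the call.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical C counterpart of CalculatePoint: returns true when the
   point (cr, ci) has not escaped after max_iter iterations. */
static inline bool calculate_point(double cr, double ci, int max_iter)
{
    double zr = 0.0, zi = 0.0;
    assert(max_iter > 0);
    for (int i = 0; i < max_iter; i++) {
        double zr2 = zr * zr;
        double zi2 = zi * zi;
        if (zr2 + zi2 > 4.0)
            return false;            /* escaped: outside the set */
        zi = 2.0 * zr * zi + ci;     /* uses the old zr */
        zr = zr2 - zi2 + cr;
    }
    return true;
}

/* Inner loop over one row. Because calculate_point is marked inline,
   the compiler may expand its body here instead of emitting one call
   per pixel. */
int count_row_inside(int n, double ci, int max_iter)
{
    double step = 2.0 / n;           /* assumed step, as in x*Step - 1.5 */
    int inside = 0;
    for (int x = 0; x < n; x++)
        inside += calculate_point(x * step - 1.5, ci, max_iter);
    return inside;
}
```

Whether the call was really inlined can only be confirmed by looking at the generated assembly.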

2 SSE2
SSE is based on SIMD (single instruction, multiple data).
SSE can multiply 4 pairs of floats (two 4-element vectors, element by element) with one multiply instruction.
In CalculatePoint there are 4 multiplications:
2*Zr*Zi
Zi*Zi
Zr*Zr

SSE2 can multiply these operands in one instruction with the MULPS instruction (multiply packed single).
But the operands must be single precision, not double precision.
Single precision is 32 bits; I think it will not affect the outcome of the mandelbrot bitmap.
But 4 multiplications in 1 instruction will speed up the program.
we need only a 4*2 matrix  (a*b).
matrix a:
2*Zr*Zi*1
Zi*Zi*1*1
Zr*Zr*1*1
1*1*1*1
matrix b is the output.
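
One way to read the idea above is as an element-by-element multiply of two 4-float vectors. The C sketch below shows that under stated assumptions: the lane layout (Zr, Zi, Zr, 1) times (Zr, Zi, Zi, 1) is my own choice for illustration, not something taken from the posts, and mul4 and mandel_products are hypothetical names. The intrinsics _mm_loadu_ps, _mm_mul_ps and _mm_storeu_ps come from xmmintrin.h; a scalar fallback is used when SSE is not available.

```c
#include <assert.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Multiplies four pairs of floats lane by lane: out[i] = a[i] * b[i].
   With SSE this is a single MULPS between two unaligned loads and one
   unaligned store; otherwise a plain loop does the same work. */
void mul4(const float a[4], const float b[4], float out[4])
{
#if defined(__SSE__)
    __m128 va = _mm_loadu_ps(a);               /* MOVUPS */
    __m128 vb = _mm_loadu_ps(b);               /* MOVUPS */
    _mm_storeu_ps(out, _mm_mul_ps(va, vb));    /* MULPS, then MOVUPS */
#else
    for (int i = 0; i < 4; i++)
        out[i] = a[i] * b[i];
#endif
}

/* Assumed lane layout:
   a = (Zr, Zi, Zr, 1), b = (Zr, Zi, Zi, 1)
   product = (Zr*Zr, Zi*Zi, Zr*Zi, 1).
   The factor 2 is applied after the packed multiply. */
void mandel_products(float zr, float zi,
                     float *zr2, float *zi2, float *two_zr_zi)
{
    const float a[4] = { zr, zi, zr, 1.0f };
    const float b[4] = { zr, zi, zi, 1.0f };
    float p[4];
    assert(zr2 && zi2 && two_zr_zi);
    mul4(a, b, p);
    *zr2 = p[0];
    *zi2 = p[1];
    *two_zr_zi = 2.0f * p[2];
}
```

Note that filling the arrays a and b is itself work, so the multiply being one instruction does not mean the whole step is.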

But the question is - how many instructions will you need to arrange these matrices before they can be multiplied?
The SSE MOVUPS instruction (Move Unaligned Packed Single) can move the 2*Zr*Zi*1 vector into one of the special SSE registers, XMM0 through XMM7. That takes 1 instruction. You can multiply vector a * vector b with 5 SSE instructions. If the vector lies on a 16-byte boundary, then you can use the aligned move instruction (MOVAPS), which is much faster.
So 4 floats can be multiplied with 5 instructions.
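
The difference between the unaligned and the aligned move can be sketched in C as follows. This is only an illustration with hypothetical names (is_aligned16, mul4_auto): _mm_load_ps and _mm_store_ps require 16-byte-aligned addresses, while _mm_loadu_ps and _mm_storeu_ps accept any address. How much faster the aligned form is depends on the CPU, so no timing is claimed here.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* True when p lies on a 16-byte boundary. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* out[i] = a[i] * b[i] for four floats, choosing aligned moves when
   all three pointers allow it and unaligned moves otherwise. */
void mul4_auto(const float *a, const float *b, float *out)
{
    assert(a && b && out);
#if defined(__SSE__)
    if (is_aligned16(a) && is_aligned16(b) && is_aligned16(out)) {
        /* aligned loads and store (MOVAPS) */
        _mm_store_ps(out, _mm_mul_ps(_mm_load_ps(a), _mm_load_ps(b)));
    } else {
        /* unaligned loads and store (MOVUPS) */
        _mm_storeu_ps(out, _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
    }
#else
    for (int i = 0; i < 4; i++)
        out[i] = a[i] * b[i];
#endif
}
```

In a real inner loop the alignment would be fixed once when the data is declared, rather than tested on every call.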
The C g++ benchmark uses SSE2 and is very fast, so I have to figure out how they did it.
I have already downloaded the source code of benchmark.c and gcc. It has been done, so it is possible.
Regards,
Sergei


_________________________________________________________________
    To unsubscribe: mail [EMAIL PROTECTED] with
               "unsubscribe" as the Subject
  archives at http://www.lazarus.freepascal.org/mailarchives
