Sergei Gorelkin wrote:
willem wrote:
1 If we take the mandelbrot parameter N = 5000, then CalculatePoint will
be called 25000000 times (5000 * 5000).
Each call to CalculatePoint pushes x*Step - 1.5 and Cy on the stack,
then executes the statements of CalculatePoint,
and finally pops a boolean from the stack.
It does not do any pushes or pops; you can compile to assembler
and see for yourself. In this particular mandelbrot program, data is
passed to CalculatePoint via global variables.
In general, FPC passes up to 3 parameters into functions in
registers, and simple (boolean, integer and the like) function return
values are also passed in registers.
OK, that is very good.
Together that makes 50000000 pushes and pops.
There is a loop in CalculatePoint that executes 600 assembler
instructions per point (on average). Saving 2 push/pop instructions
will improve speed by 2/600 = 1/300, which won't be noticeable.
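To make the arithmetic above concrete, here is a rough sketch in C. The
function names are hypothetical, and the 600 instructions per point is
just the average quoted above, not something I measured.

```c
#include <assert.h>
#include <stdint.h>

/* Number of CalculatePoint calls for an N x N image. */
uint64_t point_calls(uint64_t n)
{
    return n * n;
}

/* Fraction of the per-point work that the call overhead represents,
   assuming overhead_instr extra instructions on top of a loop that
   runs loop_instr instructions per point (both are estimates). */
double overhead_fraction(double overhead_instr, double loop_instr)
{
    assert(loop_instr > 0.0);
    return overhead_instr / loop_instr;
}
```

With N = 5000, point_calls gives 25000000 calls, and
overhead_fraction(2.0, 600.0) comes out at about 0.0033, i.e. the 1/300
mentioned above.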
I have experience with Motorola assembler, but I am going to look
at the generated assembly.
You can avoid that by inlining CalculatePoint,
so there are no function calls in the inner loops of mandelbrot.pas.
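As an illustration of the inlining idea, here is a minimal sketch in C
rather than Pascal. It is not the code from mandelbrot.pas: the names
calculate_point and count_inside, the max_iter parameter, the escape
limit of 4.0 and the formula for Cy are all assumptions of mine; only
x*Step - 1.5 comes from the discussion above. Note that inline is a
hint, so the compiler may or may not remove the call.

```c
#include <assert.h>

/* Returns 1 if the point (cx, cy) did not escape within max_iter
   checks, 0 otherwise. Marked inline so the compiler can expand it
   in the caller's inner loop. */
static inline int calculate_point(double cx, double cy, int max_iter)
{
    double zr = 0.0, zi = 0.0;
    for (int i = 0; i < max_iter; i++) {
        double zr2 = zr * zr;
        double zi2 = zi * zi;
        if (zr2 + zi2 > 4.0)
            return 0;               /* escaped: outside the set */
        zi = 2.0 * zr * zi + cy;    /* uses the old zr */
        zr = zr2 - zi2 + cx;
    }
    return 1;
}

/* Counts the points of an n x n grid that stay bounded. The grid
   mapping (step = 2.0 / n, Cy = y*step - 1.0) is an assumption. */
int count_inside(int n, int max_iter)
{
    assert(n > 0);
    double step = 2.0 / n;
    int inside = 0;
    for (int y = 0; y < n; y++) {
        double cy = y * step - 1.0;
        for (int x = 0; x < n; x++)
            inside += calculate_point(x * step - 1.5, cy, max_iter);
    }
    return inside;
}
```

Whether this helps in practice is the question discussed above: if the
loop body dominates, removing the call changes very little.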
2 SSE2
SSE is based on SIMD (single instruction, multiple data).
SSE can multiply four packed values by four others with one multiply
instruction.
In CalculatePoint there are 4 multiplications:
2*Zr*Zi
Zi*Zi
Zr*Zr
SSE2 can multiply these operands in one instruction with the MULPS
instruction (multiply packed single).
But the operands must be single precision, not double precision.
Single precision is 32 bits; I think it will not affect the outcome
of the mandelbrot bitmap,
but 4 multiplications in 1 instruction will speed up the program.
we need only a 4*2 matrix (a*b).
matrix a:
2*Zr*Zi*1
Zi*Zi*1*1
Zr*Zr*1*1
1*1*1*1
matrix b is the output.
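To show what a packed multiply would have to compute here, this is a
scalar emulation in plain C, not real SSE code. The helper names mul4
and mandel_products and the lane layout are hypothetical. The sketch
assumes the factor 2 is folded in beforehand as zr + zr, so one 4-lane
multiply yields Zr*Zr, Zi*Zi and 2*Zr*Zi; the fourth lane is padding.

```c
#include <assert.h>

/* Emulates what a packed single multiply does: four independent
   single-precision products, lane by lane. */
void mul4(const float a[4], const float b[4], float out[4])
{
    assert(a && b && out);
    for (int i = 0; i < 4; i++)
        out[i] = a[i] * b[i];
}

/* Packs the operands for one iteration step and multiplies them.
   out[0] = Zr*Zr, out[1] = Zi*Zi, out[2] = 2*Zr*Zi, out[3] = 1. */
void mandel_products(float zr, float zi, float out[4])
{
    const float a[4] = { zr, zi, zr + zr, 1.0f };
    const float b[4] = { zr, zi, zi,      1.0f };
    mul4(a, b, out);
}
```

Building the arrays a and b is the "arranging" step; in real SSE code
that packing costs instructions of its own on top of the multiply.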
But the question is - how many instructions will you need to arrange
these matrices before they can be multiplied?
The SSE2 MOVUPS instruction (Move Unaligned Packed Single) can move the
2*Zr*Zi*1 vector into one of the special SSE registers,
XMM0 through XMM7. That takes 1 instruction. You can multiply vector a *
vector b with 5 SSE instructions.
If the vector lies on a 16-byte boundary, then you can use the aligned
move (MOVAPS) instead, which is much faster.
So 4 floats can be multiplied with 5 instructions.
The C g++ benchmark uses SSE2 and is very fast, so I have to figure
out how they did it.
I already downloaded the source code of benchmark.c and gcc. It has been
done, so it is possible.
Regards,
Sergei