Re: [SPAM] Re: [lazarus] madelbrot benchmark much faster

willem Thu, 03 Jan 2008 17:15:08 -0800

Sergei Gorelkin wrote:

willem wrote:
1 If we take the mandelbrot parameter N = 5000, then CalculatePoint will be called 25,000,000 times (5000 * 5000).
Each call to CalculatePoint pushes x*Step - 1.5 and Cy on the stack,
then it executes the statements of CalculatePoint,
and finally it pops a boolean from the stack.
It does not do any pushes or pops; you may compile to assembler and see it yourself. In this particular mandelbrot program, data is passed to CalculatePoint via global variables. In general, FPC passes up to 3 parameters into functions by using registers, and simple (boolean, integer and the like) function return values are also passed in registers.
OK, that is very good.
Together that is 50,000,000 pushes and pops.
There is a loop in CalculatePoint that executes 600 assembler instructions per point (on average). Saving 2 push/pop instructions will improve speed by 1/300, which won't be noticeable.
I have experience with Motorola assembler, but I am going to look at the assembly generated.
You can avoid that by inlining CalculatePoint.
So no function calls in the inner loops of mandelbrot.pas.
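
To make the call count and the inlining idea concrete, here is a minimal sketch in C. It is not the code from mandelbrot.pas: calculate_point, call_count and the max_iter parameter are hypothetical names chosen for illustration, and whether the compiler really inlines the function depends on its optimisation settings.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for CalculatePoint: returns true if the point
   (cx, cy) has not escaped after max_iter iterations. Marked
   "static inline" so the compiler may expand it in the caller's loop. */
static inline bool calculate_point(double cx, double cy, int max_iter)
{
    double zr = 0.0, zi = 0.0;
    for (int i = 0; i < max_iter; i++) {
        double tr = zr * zr;          /* Zr*Zr */
        double ti = zi * zi;          /* Zi*Zi */
        if (tr + ti > 4.0)
            return false;             /* the point escaped */
        zi = 2.0 * zr * zi + cy;      /* 2*Zr*Zi */
        zr = tr - ti + cx;
    }
    return true;
}

/* Number of CalculatePoint calls for an n x n bitmap. */
static long long call_count(int n)
{
    assert(n > 0);
    return (long long)n * n;
}
```

With n = 5000, call_count gives the 25,000,000 calls mentioned above. The per-call overhead only matters if it is large compared to the iteration loop inside the function.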

2 SSE2
SSE is based on SIMD (single instruction, multiple data).
SSE can multiply four pairs of single-precision floats with one multiply instruction.
In CalculatePoint there are 4 multiplications (2*Zr*Zi counts as two):
2*Zr*Zi
Zi*Zi
Zr*Zr
SSE can multiply these operands in one instruction with the MULPS instruction (multiply packed single).
But to multiply four at once the operands must be single precision, not double precision.
Single precision is 32 bits; I think it will not affect the outcome of the mandelbrot bitmap.
But 4 multiplications in 1 instruction will speed up the program.
We need only a 4*2 matrix (a*b).
matrix a:
2*Zr*Zi*1
Zi*Zi*1*1
Zr*Zr*1*1
1*1*1*1
matrix b is the output.
But the question is: how many instructions will you need to arrange these matrices before they can be multiplied?
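
As a rough illustration of both points (one packed multiply, plus the work needed to arrange the operands), here is a minimal sketch using the SSE intrinsics from xmmintrin.h. The function name mandel_products and the lane layout are assumptions made for this example, not something taken from the benchmark.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_mul_ps, ... */

/* Hypothetical helper: computes Zr*Zr, Zi*Zi and Zr*Zi with a single
   packed multiply. out[0] = Zr*Zr, out[1] = Zi*Zi, out[2] = Zr*Zi,
   out[3] = 0 (unused lane). */
void mandel_products(float zr, float zi, float out[4])
{
    assert(out != NULL);

    /* Arranging the operands: this packing is the extra work the
       question above refers to. */
    __m128 a = _mm_setr_ps(zr, zi, zr, 0.0f);
    __m128 b = _mm_setr_ps(zr, zi, zi, 0.0f);

    /* One MULPS: four single-precision multiplications at once. */
    __m128 p = _mm_mul_ps(a, b);

    /* Unaligned store, so out needs no special alignment. */
    _mm_storeu_ps(out, p);
}
```

The multiply itself is one instruction, but the two _mm_setr_ps calls typically compile to several move and shuffle instructions, so the gain depends on how cheaply the vectors can be kept packed between iterations.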

The SSE MOVUPS instruction (Move Unaligned Packed Single) can move the 2*Zr*Zi*1 vector into one of the special SSE registers, XMM0 through XMM7. That takes 1 instruction. You can multiply vector a * vector b with 5 SSE instructions. If the vector lies on a 16-byte boundary then you can use the aligned move instruction MOVAPS, which is much faster.

So 4 floats can be multiplied with 5 instructions.
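
The load, multiply, store sequence can be sketched in C with intrinsics as follows. The function names mul4_unaligned and mul4_aligned are hypothetical; _mm_loadu_ps/_mm_storeu_ps correspond to MOVUPS and _mm_load_ps/_mm_store_ps to MOVAPS, though the compiler is free to choose the exact instructions.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>

/* Multiplies 4 floats element-wise; the pointers may have any alignment. */
void mul4_unaligned(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);             /* unaligned load */
    __m128 vb = _mm_loadu_ps(b);             /* unaligned load */
    _mm_storeu_ps(out, _mm_mul_ps(va, vb));  /* multiply, unaligned store */
}

/* Same operation, but every pointer must lie on a 16-byte boundary. */
void mul4_aligned(const float *a, const float *b, float *out)
{
    assert((uintptr_t)a % 16 == 0);
    assert((uintptr_t)b % 16 == 0);
    assert((uintptr_t)out % 16 == 0);

    __m128 va = _mm_load_ps(a);              /* aligned load */
    __m128 vb = _mm_load_ps(b);              /* aligned load */
    _mm_store_ps(out, _mm_mul_ps(va, vb));   /* multiply, aligned store */
}
```

In C11 an array can be placed on a 16-byte boundary with _Alignas(16); passing a misaligned pointer to the aligned variant is an error.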

The C g++ benchmark uses SSE2 and is very fast. So I have to figure out how they did it.

I already downloaded the source code of benchmark.c and gcc. It has been done, so it is possible.
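
One common way to use SSE2 here, which may or may not be what the C benchmark does, is to keep double precision and iterate two neighbouring points at once in the two lanes of a register. The sketch below assumes that approach; calculate_two_points and its parameters are hypothetical names.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics: __m128d, _mm_mul_pd, ... */

/* Iterates the points (cx0, cy) and (cx1, cy) together.
   Returns a 2-bit mask: bit 0 is set if point 0 never escaped,
   bit 1 is set if point 1 never escaped. */
int calculate_two_points(double cx0, double cx1, double cy, int max_iter)
{
    assert(max_iter > 0);

    const __m128d cr   = _mm_setr_pd(cx0, cx1);  /* lane 0 = cx0, lane 1 = cx1 */
    const __m128d ci   = _mm_set1_pd(cy);
    const __m128d four = _mm_set1_pd(4.0);
    __m128d zr = _mm_setzero_pd();
    __m128d zi = _mm_setzero_pd();
    int inside = 3;                              /* both points still bounded */

    for (int i = 0; i < max_iter && inside != 0; i++) {
        __m128d tr  = _mm_mul_pd(zr, zr);        /* Zr*Zr for both points */
        __m128d ti  = _mm_mul_pd(zi, zi);        /* Zi*Zi for both points */
        __m128d zri = _mm_mul_pd(zr, zi);        /* Zr*Zi for both points */

        /* Clear the bit of every lane where Zr*Zr + Zi*Zi > 4. */
        inside &= _mm_movemask_pd(_mm_cmple_pd(_mm_add_pd(tr, ti), four));

        zi = _mm_add_pd(_mm_add_pd(zri, zri), ci);   /* 2*Zr*Zi + Cy */
        zr = _mm_add_pd(_mm_sub_pd(tr, ti), cr);     /* Zr*Zr - Zi*Zi + Cx */
    }
    return inside;
}
```

This keeps the precision of the original program and needs no rearranging of operands inside the loop, at the cost of doing two multiplications per instruction instead of four.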

Regards,
Sergei

