- vector version is about 3% faster than above instead of 10% slower - wow! So why is gcc 4.0 producing worse code when using Intel-style intrinsics, and why isn't the union version using builtins as fast as the vector version?
I can answer why unions are slower: that's because they are spilled to memory on every assignment -- GCC 4.0 knows how to replace structs with different scalar variables (one per item), but not unions. GCC 3.4 knew about none of these possibilities.
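To make the union-versus-vector distinction concrete, here is a minimal sketch of the two styles, assuming GCC's `vector_size` extension. The names `v4sf`, `vec4_union`, `sum_union`, `sum_vector` and `check_sums_agree` are made up for illustration and are not taken from the benchmark discussed in this thread. The sketch only shows the source-level difference; which code a given GCC version generates for each variant would have to be checked in the assembly output.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* GCC vector extension: four packed floats in 16 bytes. */
typedef float v4sf __attribute__((vector_size(16)));

/* Union style: the same storage viewed as a vector or as four scalars. */
typedef union {
    v4sf  v;
    float f[4];
} vec4_union;

/* Sum n_groups groups of four floats, going through the union on
 * every step. This is the access pattern described above as being
 * spilled to memory on each assignment. */
float sum_union(const float *data, size_t n_groups)
{
    vec4_union acc;
    v4sf zero = {0.0f, 0.0f, 0.0f, 0.0f};
    acc.v = zero;
    for (size_t i = 0; i < n_groups; ++i) {
        vec4_union x;
        for (int k = 0; k < 4; ++k)
            x.f[k] = data[4 * i + k];   /* scalar view */
        acc.v = acc.v + x.v;            /* vector view */
    }
    return acc.f[0] + acc.f[1] + acc.f[2] + acc.f[3];
}

/* Same computation using plain vector variables and no union. */
float sum_vector(const float *data, size_t n_groups)
{
    v4sf acc = {0.0f, 0.0f, 0.0f, 0.0f};
    for (size_t i = 0; i < n_groups; ++i) {
        v4sf x;
        memcpy(&x, data + 4 * i, sizeof x);  /* unaligned-safe load */
        acc = acc + x;
    }
    float out[4];
    memcpy(out, &acc, sizeof out);
    return out[0] + out[1] + out[2] + out[3];
}

/* Small self-check: both variants must agree on a fixed input. */
void check_sums_agree(void)
{
    float d[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    assert(sum_union(d, 2) == sum_vector(d, 2));
}
```

Compiling both functions with `-O2 -S` and comparing the loops is one way to see whether the union variant still goes through memory with a particular compiler version.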
OK, thanks for the explanation, though I am not sure whether I understood everything.
As for why vectors are faster: a lot of the vector support has been rewritten in GCC 4.0, so that may be the reason.
I do not know exactly why builtins are still slower, but you may want to create a PR and add me on the CC list ([EMAIL PROTECTED]).
I have some good news: my tests yesterday were flawed. Somehow the compiler wasn't completely switched. It seems gcc 3.4.3 was still being used, but I don't understand why it behaved differently after gcc 4 was installed. Perhaps something in the script for switching compilers is wrong....
Nevertheless, today I did it correctly and gcc 4 just blew it away. I didn't test the union version, only builtins vs. Intel style, and this time they were pretty much identical and, what is more, a hell of a lot faster than the code generated by gcc 3.4.3. The devs did really good work.
Some numbers:
Fastest runs with gcc 3.4.3: about 2.9 sec (slowest when using Intel style: 3.1 sec)
gcc 4.0: <2.4 sec
Just astonishing!
So I guess that, as far as gcc 4.0 is concerned, this report can be forgotten, but gcc 3.4.3 has serious issues. It even miscompiles SSE code. Are there plans to fix this, or won't time be spent on it anymore? I'd rather see gcc 4.0 getting into a wonderful stable state. :-)
Sorry for my mistake.
-- Prakash Punnoor
formerly known as Prakash K. Cheemplavam