Andrei:

> Third, it looks like larger unrolling limits is better - I only got a 
> plateau at 128!

But this is true only for a microbenchmark. In a real program the instruction
half of the CPU's L1 cache is quite limited, so the more code you have to push
through that small cache (code from many different functions), the more cache
misses you get, and this slows the program down. This is why too much unrolling
or too much inlining is bad, and why I have unrolled my sum() only once.
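
As a rough illustration, here is a sketch in C of what a once-unrolled sum()
might look like. It is not the actual sum() from this thread: the signature,
the element type, and the use of two accumulators are all assumptions made for
the example. The loop body is duplicated a single time, so the generated code
stays small and takes up little instruction cache.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sum(): the loop body is duplicated once, so each
 * iteration handles two elements. A larger unroll factor would repeat
 * the body more times and emit proportionally more machine code. */
static double sum(const double *a, size_t n)
{
    assert(a != NULL || n == 0);

    double s0 = 0.0, s1 = 0.0;   /* two independent accumulators */
    size_t i = 0;

    /* Main loop: two elements per iteration. */
    for (; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }

    /* Tail: at most one element is left when n is odd. */
    if (i < n)
        s0 += a[i];

    return s0 + s1;
}
```

Whether a larger unroll factor helps in a real program is something to measure
in that program, not only in an isolated loop.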

Bye,
bearophile
