On Wed, Oct 15, 2014 at 11:08 AM, Steven G. Johnson <[email protected]> wrote:
>
>
> On Wednesday, October 15, 2014 8:59:38 AM UTC-4, Erik Schnetter wrote:
>>
>> Modern x86 CPUs handle floats at about twice the speed as doubles. A
>> floating-point instruction usually takes one cycle, and each
>> instruction can execute multiple operations due to vectorization. With
>> doubles, you can have 4 operations per instruction, and with floats,
>> you can have 8 operations per instruction.
>
>
> That assumes that everything obtains optimal SIMD vectorization, which is
> usually false.

The original question stated "most time is spent in BLAS", in
particular in axpy. We can safely assume that axpy is vectorized.

-erik

-- 
Erik Schnetter <[email protected]>
http://www.perimeterinstitute.ca/personal/eschnetter/
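
For readers following the arithmetic in the quoted message, here is a
minimal Python sketch of where the "4 doubles / 8 floats per instruction"
figures come from, together with the reference semantics of axpy
(y <- a*x + y). It is only an illustration, not how any BLAS library is
implemented. The 256-bit register width (AVX) is an assumption chosen to
match the numbers in the thread, and the helper names lanes, axpy and
vector_instructions are hypothetical.

```python
from array import array

# Assumption: 256-bit SIMD registers (AVX), which matches the
# 4-double / 8-float figures quoted in the thread.
REGISTER_BITS = 256


def lanes(typecode, register_bits=REGISTER_BITS):
    """Elements of the given array typecode that fit in one register.

    'f' is a C float and 'd' a C double; on mainstream platforms these
    are 4 and 8 bytes respectively.
    """
    element_bits = array(typecode).itemsize * 8
    return register_bits // element_bits


def axpy(a, x, y):
    """Reference semantics of BLAS axpy: y <- a*x + y, updated in place."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]
    return y


def vector_instructions(n, typecode):
    """Idealized number of full-width multiply-add steps for n elements.

    Assumes perfect vectorization and ignores memory traffic.
    """
    width = lanes(typecode)
    return -(-n // width)  # ceiling division
```

Under these assumptions, lanes("d") is 4 and lanes("f") is 8, so an
idealized, fully vectorized axpy over the same number of elements needs
half as many vector steps in single precision as in double precision.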
