Automatic or manual vectorization can also pack twice as many `float32` numbers as `float64` into each vector instruction on x86, just as in the ARM and GPU cases. You may need `-march=native` or `-mavx` compiler flags (or manual intrinsics/assembly) to activate that, though, rather than targeting some lowest-common-denominator x86 CPU, and C compiler autovectorization can be finicky.
It is true that many calculations are memory-bandwidth bound, and there you still get the 2x improvement (half the bytes moved). However, many are not membw bound, or their working sets fit in fast caches, and for those the 2x-wider vectors are what help. (Funny: caches used to be almost entirely about latency, but in recent times they've become about both latency and bandwidth.) Obviously, the wrong answer faster is not helpful, but when `float32` precision suffices it is often close to 2x faster, depending on how vectorizable your workload is, the compiler, and the compiler flags (and/or manual assembly). Excess precision is also not helpful if its cost is not minimal.
