> Rationale for this is that modern processors have SSE instructions which
> could perform up to 4 mathematical operations in parallel (like sin, cos, exp,
> log, pow).

Not really. The x87 has built-in instructions for trig and exponential
functions, but SSE doesn't. SSE only gives you basic arithmetic (add, multiply,
divide, square root), so sin, cos, exp, log and pow have to be written in
software.
It's pretty hard to make an SSE version that is faster than simply calling the
scalar function on each element in a loop. If you only need approximate values,
it's possible to get a modest speedup, but if you need full accuracy, it's tough.
Essentially, that's because you can't have any branch instructions in the
calculation when four elements are processed at once, and working around this
quickly chews up the 4-at-a-time benefit.
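
To make the branch problem concrete, here is a rough sketch in portable C that
emulates four lanes with a plain loop (no real SSE intrinsics). The function
names, the 0.8 threshold and the two toy polynomials are all made up for
illustration and are not taken from any real math library. The scalar version
evaluates only the path each element needs. The 4-lane version has to evaluate
both paths for every lane and then blend them with a bit mask, which is roughly
what a compare followed by and/andnot/or does in SSE.

```c
#include <stdint.h>
#include <string.h>

/* Two hypothetical approximation paths, each valid on a different range. */
static float path_a(float x) { return x - x * x * x / 6.0f; }
static float path_b(float x) { float t = x - 1.5707964f; return 1.0f - t * t / 2.0f; }

/* Scalar code can branch, so each element pays for one path only. */
float toy_sin_scalar(float x)
{
    return (x < 0.8f) ? path_a(x) : path_b(x);
}

/* Bitwise select: take a where mask is all ones, b where mask is zero. */
static float select_bits(uint32_t mask, float a, float b)
{
    uint32_t ua, ub, ur;
    float r;
    memcpy(&ua, &a, sizeof ua);
    memcpy(&ub, &b, sizeof ub);
    ur = (ua & mask) | (ub & ~mask);
    memcpy(&r, &ur, sizeof r);
    return r;
}

/* Emulated 4-lane version: no per-lane branch is possible in real SIMD,
   so BOTH paths are computed for all four lanes and then blended. */
void toy_sin4(const float in[4], float out[4])
{
    for (int i = 0; i < 4; ++i) {
        float a = path_a(in[i]);                            /* always computed */
        float b = path_b(in[i]);                            /* always computed */
        uint32_t mask = (in[i] < 0.8f) ? 0xFFFFFFFFu : 0u;  /* lane compare */
        out[i] = select_bits(mask, a, b);
    }
}
```

With only two paths the vector version already does twice the arithmetic per
element. A full-accuracy routine typically has more special cases than that
(range reduction, tiny and huge arguments, NaN and infinity), and each one adds
another path that every lane has to pay for.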

You'd do this for syntax sugar, not for performance.

