This patch uses division by known sizes (which can usually be replaced
by a simple shift, because intrinsic types have power-of-two sizes) instead
of division by the size extracted from the array descriptor itself.
This should save about 20 cycles for a single calculation.
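
To illustrate the idea (this is only a sketch, not code from the patch;
the function names and the choice of int32_t as element type are
assumptions made for the example), compare a division by a size read at
run time with a division by a size known at compile time:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: the element size is only known at run time
   (e.g. read from a descriptor), so the compiler has to emit a
   general integer division. */
ptrdiff_t
index_runtime_size (ptrdiff_t byte_offset, ptrdiff_t elem_size)
{
  return byte_offset / elem_size;
}

/* Hypothetical helper: the element size is a compile-time constant
   and a power of two, so an optimizing compiler can typically
   replace the division by a shift (plus a small fix-up for negative
   signed values). */
ptrdiff_t
index_known_size (ptrdiff_t byte_offset)
{
  return byte_offset / (ptrdiff_t) sizeof (int32_t);
}
```

Both functions return the same result when elem_size is 4; only the
generated code differs.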
I'll go through the rest of the library to identify other places where
this can be applied.
Regression-tested, no new failures.
OK for the branch?
Full patch at http://gcc.gnu.org/ml/fortran/2012-03/msg00120.html