+#ifdef __LITTLE_ENDIAN__ + /* Sum across four integers with two integer results. */ + asm ("vsum2sws %0,%1,%2" : "=v" (result) : "v" (vsum), "v" (zero)); + /* Note: vec_sum2s could be used here, but on little-endian, vector + shifts are added that are not needed for this use-case. + A vector shift to correctly position the 32-bit integer results + (currently at [0] and [2]) to [1] and [3] would then need to be + swapped back again since the desired results are two 64-bit + integers ([1]|[0] and [3]|[2]). Thus, no shift is performed. */ +#else /* Sum across four integers with two integer results. */ result = vec_sum2s (vsum, (__vector signed int) zero);
If little-endian adds shifts to correct for the position and big-endian does not, why not use the inline asm without the shifts for both? It seems confusing to add the inline asm only for LE instead of always using it with the appropriate comment. It's a good and valuable optimization for LE. Fewer variants are less fragile, easier to test and easier to maintain. If you're going to go to the trouble of using inline asm for LE, use it for both. Thanks, David