+#ifdef __LITTLE_ENDIAN__
+  /* Sum across four integers with two integer results.  */
+  asm ("vsum2sws %0,%1,%2" : "=v" (result) : "v" (vsum), "v" (zero));
+  /* Note: vec_sum2s could be used here, but on little-endian it adds
+     vector shifts that are not needed for this use case: a shift to
+     move the 32-bit integer results (currently at [0] and [2]) to
+     [1] and [3] would only have to be undone again, since the
+     desired results are two 64-bit integers ([1]|[0] and [3]|[2]).
+     Thus, no shift is performed.  */
+#else
   /* Sum across four integers with two integer results.  */
   result = vec_sum2s (vsum, (__vector signed int) zero);

If vec_sum2s adds shifts on little-endian to correct the element
positions and adds none on big-endian, why not use the inline asm
without the shifts for both?  It seems confusing to add the inline asm
only for LE instead of always using it, with the appropriate comment.

It's a good and valuable optimization for LE.  Fewer variants are less
fragile, easier to test and easier to maintain.  If you're going to go
to the trouble of using inline asm for LE, use it for both.
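
Something along these lines (an untested sketch, reusing the variable
names from the quoted patch) should cover both cases, if I read the
patch comment right that the sums already land where this use case
wants them:

  /* Sum across four integers with two integer results.  Use the
     vsum2sws instruction directly: it leaves each 32-bit sum in the
     low half of its 64-bit doubleword on either endianness, which is
     what this use case wants, so no position-correcting shifts are
     needed (vec_sum2s would add them on little-endian).  */
  asm ("vsum2sws %0,%1,%2" : "=v" (result) : "v" (vsum), "v" (zero));

If that holds, the big-endian code generation should not change at
all, and the #ifdef goes away.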

Thanks, David
