Re: [PATCH, rs6000] 1/2 Add x86 SSE2 <emmintrin.h> intrinsics to GCC PPC64LE target

2017-10-24 Thread Steven Munroe
On Mon, 2017-10-23 at 16:21 -0500, Segher Boessenkool wrote:
> Hi!
> 
> On Tue, Oct 17, 2017 at 01:24:45PM -0500, Steven Munroe wrote:
> > Some inline assembler is required. There are several cases where we need
> > to generate Data Cache Block instructions. There are no existing builtins
> > for flush and touch for store transient.
> 
> Would builtins for those help?  Would anything else want to use such
> builtins, I mean?
> 
Yes, I think NVMe and in-memory DB workloads in general will want easy
access to these instructions.

Intel provides intrinsic functions for these operations.

A builtin or intrinsic will be easier than finding and reading the
PowerISA and trying to write your own inline asm.
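
For illustration, here is a hedged sketch (not part of the patch) of the
kind of inline asm these operations currently require, modeled on a
cache-line flush; the function name is hypothetical and the constraints
and form are assumptions, not a proposed builtin interface:

  /* Flush the cache line containing __A: dcbf (Data Cache Block Flush)
     writes the line back to memory and invalidates it.  */
  extern __inline void __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
  my_cacheline_flush (void const *__A)
  {
    __asm__ ("dcbf 0,%0" : : "b" (__A) : "memory");
  }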

> > +   For PowerISA Scalar double in FPRs (left most 64-bits of the
> > +   low 32 VSRs), while X86_64 SSE2 uses the right most 64-bits of
> > +   the XMM. These differences require extra steps on POWER to match
> > +   the SSE2 scalar double semantics.
> 
> Maybe say "is in FPRs"?  (And two space after a full stop, here and
> elsewhere).
> 
Ok

> > +/* We need definitions from the SSE header files*/
> 
> Dot space space.
>
Ok

> > +/* Sets the low DPFP value of A from the low value of B.  */
> > +extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, 
> > __artificial__))
> > +_mm_move_sd (__m128d __A, __m128d __B)
> > +{
> > +#if 1
> > +  __v2df result = (__v2df) __A;
> > +  result [0] = ((__v2df) __B)[0];
> > +  return (__m128d) result;
> > +#else
> > +  return (vec_xxpermdi(__A, __B, 1));
> > +#endif
> > +}
> 
> You probably forgot to finish this?  Or, what are the two versions,
> and why are they both here?  Same question later a few times.
> 
Meant to check what trunk generated and then pick one.

Done.
> 
> > +/* Add the lower double-precision (64-bit) floating-point element in
> > + * a and b, store the result in the lower element of dst, and copy
> > + * the upper element from a to the upper element of dst. */
> 
> No leading stars on block comments please.
> 
> > +extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, 
> > __artificial__))
> > +_mm_cmpnge_pd (__m128d __A, __m128d __B)
> > +{
> > +  return ((__m128d)vec_cmplt ((__v2df ) __A, (__v2df ) __B));
> > +}
> 
> You have some spaces before closing parentheses here (and elsewhere --
> please check).
> 
Ok

> > +extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, 
> > __artificial__))
> > +_mm_cvtpd_epi32 (__m128d __A)
> > +{
> > +  __v2df rounded = vec_rint (__A);
> > +  __v4si result, temp;
> > +  const __v4si vzero =
> > +{ 0, 0, 0, 0 };
> > +
> > +  /* VSX Vector truncate Double-Precision to integer and Convert to
> > +   Signed Integer Word format with Saturate.  */
> > +  __asm__(
> > +  "xvcvdpsxws %x0,%x1;\n"
> > +  : "=wa" (temp)
> > +  : "wa" (rounded)
> > +  : );
> 
> Why the ";\n"?  And no empty clobber list please.
> 
Ok
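
For illustration, the statement with both comments applied (reusing the
temp and rounded variables from the quoted _mm_cvtpd_epi32) would read
roughly:

  /* VSX Vector truncate Double-Precision to integer and Convert to
     Signed Integer Word format with Saturate.  */
  __asm__ ("xvcvdpsxws %x0,%x1"
           : "=wa" (temp)
           : "wa" (rounded));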

> > +extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, 
> > __artificial__))
> > +_mm_cvtps_pd (__m128 __A)
> > +{
> > +  /* Check if vec_doubleh is defined by . If so use that. */
> > +#ifdef vec_doubleh
> > +  return (__m128d) vec_doubleh ((__v4sf)__A);
> > +#else
> > +  /* Otherwise the compiler is not current and so need to generate the
> > + equivalent code.  */
> 
> Do we need this?  The compiler will always be current.
> 
The vec_double* and vec_float* builtins were in flux at the time, and
this deferred the problem.

Not sure what their status is now.

We would still need this if we want to backport to GCC7 (AT11), and
there are more places where we currently only have asm.


> > +extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, 
> > __artificial__))
> > +_mm_loadl_pd (__m128d __A, double const *__B)
> > +{
> > +  __v2df result = (__v2df)__A;
> > +  result [0] = *__B;
> > +  return (__m128d)result;
> > +}
> > +#ifdef _ARCH_PWR8
> > +/* Intrinsic functions that require PowerISA 2.07 minimum.  */
> 
> You want an empty line before that #ifdef.
> 
Ok fixed

> > +extern __inline int __attribute__((__gnu_inline__, __always_inline__, 
> > __artificial__))
> > +_mm_movemask_pd (__m128d  __A)
> > +{
> > +  __vector __m64 result;
> > +  static const __vector unsigned int perm_mask =
> > +{
> > +#ifdef __LITTLE_ENDIAN__
> > +   0x80800040, 0x80808080, 0x80808080, 0x80808080
> > +#elif __BIG_ENDIAN__
> > +  0x80808080, 0x80808080, 0x80808080, 0x80800040
> 
> Wrong indent in the LE case?
> 
OK fixed

> > +extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, 
> > __artificial__))
> > +_mm_slli_epi16 (__m128i __A, int __B)
> > +{
> > +  __v8hu lshift;
> > +  __v8hi result =
> > +{ 0, 0, 0, 0, 0, 0, 0, 0 };
> 
> Could as well fit that on the same line.
> 
Ok, not sure why the formatter does this.

> > +  if (__B < 16)
> > +{
> > +  if (__builtin_constant_p(__B))
> > +   {
> > + lshift = (__v8hu) vec_splat_s16(__B);
> > +   }
> > +  else
> > +   {
> > + lshift = vec_splats ((unsigned 

Re: [PATCH, rs6000] 1/2 Add x86 SSE2 <emmintrin.h> intrinsics to GCC PPC64LE target

2017-10-23 Thread Segher Boessenkool
Hi!

On Tue, Oct 17, 2017 at 01:24:45PM -0500, Steven Munroe wrote:
> Some inline assembler is required. There are several cases where we need
> to generate Data Cache Block instructions. There are no existing builtins
> for flush and touch for store transient.

Would builtins for those help?  Would anything else want to use such
builtins, I mean?

> +   For PowerISA Scalar double in FPRs (left most 64-bits of the
> +   low 32 VSRs), while X86_64 SSE2 uses the right most 64-bits of
> +   the XMM. These differences require extra steps on POWER to match
> +   the SSE2 scalar double semantics.

Maybe say "is in FPRs"?  (And two space after a full stop, here and
elsewhere).

> +/* We need definitions from the SSE header files*/

Dot space space.

> +/* Sets the low DPFP value of A from the low value of B.  */
> +extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, 
> __artificial__))
> +_mm_move_sd (__m128d __A, __m128d __B)
> +{
> +#if 1
> +  __v2df result = (__v2df) __A;
> +  result [0] = ((__v2df) __B)[0];
> +  return (__m128d) result;
> +#else
> +  return (vec_xxpermdi(__A, __B, 1));
> +#endif
> +}

You probably forgot to finish this?  Or, what are the two versions,
and why are they both here?  Same question later a few times.

> +/* Add the lower double-precision (64-bit) floating-point element in
> + * a and b, store the result in the lower element of dst, and copy
> + * the upper element from a to the upper element of dst. */

No leading stars on block comments please.

> +extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, 
> __artificial__))
> +_mm_cmpnge_pd (__m128d __A, __m128d __B)
> +{
> +  return ((__m128d)vec_cmplt ((__v2df ) __A, (__v2df ) __B));
> +}

You have some spaces before closing parentheses here (and elsewhere --
please check).

> +extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, 
> __artificial__))
> +_mm_cvtpd_epi32 (__m128d __A)
> +{
> +  __v2df rounded = vec_rint (__A);
> +  __v4si result, temp;
> +  const __v4si vzero =
> +{ 0, 0, 0, 0 };
> +
> +  /* VSX Vector truncate Double-Precision to integer and Convert to
> +   Signed Integer Word format with Saturate.  */
> +  __asm__(
> +  "xvcvdpsxws %x0,%x1;\n"
> +  : "=wa" (temp)
> +  : "wa" (rounded)
> +  : );

Why the ";\n"?  And no empty clobber list please.

> +extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, 
> __artificial__))
> +_mm_cvtps_pd (__m128 __A)
> +{
> +  /* Check if vec_doubleh is defined by . If so use that. */
> +#ifdef vec_doubleh
> +  return (__m128d) vec_doubleh ((__v4sf)__A);
> +#else
> +  /* Otherwise the compiler is not current and so need to generate the
> + equivalent code.  */

Do we need this?  The compiler will always be current.

> +extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, 
> __artificial__))
> +_mm_loadl_pd (__m128d __A, double const *__B)
> +{
> +  __v2df result = (__v2df)__A;
> +  result [0] = *__B;
> +  return (__m128d)result;
> +}
> +#ifdef _ARCH_PWR8
> +/* Intrinsic functions that require PowerISA 2.07 minimum.  */

You want an empty line before that #ifdef.

> +extern __inline int __attribute__((__gnu_inline__, __always_inline__, 
> __artificial__))
> +_mm_movemask_pd (__m128d  __A)
> +{
> +  __vector __m64 result;
> +  static const __vector unsigned int perm_mask =
> +{
> +#ifdef __LITTLE_ENDIAN__
> + 0x80800040, 0x80808080, 0x80808080, 0x80808080
> +#elif __BIG_ENDIAN__
> +  0x80808080, 0x80808080, 0x80808080, 0x80800040

Wrong indent in the LE case?

> +extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, 
> __artificial__))
> +_mm_slli_epi16 (__m128i __A, int __B)
> +{
> +  __v8hu lshift;
> +  __v8hi result =
> +{ 0, 0, 0, 0, 0, 0, 0, 0 };

Could as well fit that on the same line.

> +  if (__B < 16)
> +{
> +  if (__builtin_constant_p(__B))
> + {
> +   lshift = (__v8hu) vec_splat_s16(__B);
> + }
> +  else
> + {
> +   lshift = vec_splats ((unsigned short) __B);
> + }

No blocks please, for single-line cases.
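
For illustration, the quoted _mm_slli_epi16 fragment with that comment
(and the earlier one about the initializer fitting on one line) applied
would look roughly like this sketch, not necessarily the committed code:

  __v8hu lshift;
  __v8hi result = { 0, 0, 0, 0, 0, 0, 0, 0 };

  if (__B < 16)
    {
      if (__builtin_constant_p (__B))
        lshift = (__v8hu) vec_splat_s16 (__B);
      else
        lshift = vec_splats ((unsigned short) __B);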

> +extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, 
> __artificial__))
> +_mm_cmpeq_epi8 (__m128i __A, __m128i __B)
> +{
> +#if 1
> +  return (__m128i ) vec_cmpeq ((__v16qi) __A, (__v16qi)__B);
> +#else
> +  return (__m128i) ((__v16qi)__A == (__v16qi)__B);
> +#endif
> +}

Here's another #if 1.
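
For illustration only: once one of the two equivalent alternatives under
the #if is chosen, the function collapses to a single return, e.g. this
sketch using the first branch of the quoted code:

  extern __inline __m128i __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
  _mm_cmpeq_epi8 (__m128i __A, __m128i __B)
  {
    /* Element-wise compare of the sixteen byte elements.  */
    return (__m128i) vec_cmpeq ((__v16qi) __A, (__v16qi) __B);
  }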

> +extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, 
> __artificial__))
> +_mm_sad_epu8 (__m128i __A, __m128i __B)
> +{
> +  __v16qu a, b;
> +  __v16qu vmin, vmax, vabsdiff;
> +  __v4si vsum;
> +  const __v4su zero = { 0, 0, 0, 0 };
> +  __v4si result;
> +
> +  a = (__v16qu) __A;
> +  b = (__v16qu) __B;
> +  vmin = vec_min (a, b);
> +  vmax = vec_max (a, b);
> +  vabsdiff = vec_sub (vmax, vmin);
> +  /* Sum four groups of bytes into integers.  */
> +  vsum = (__vector signed int) vec_sum4s (vabsdiff, zero);
> +  /* Sum across four 

[PATCH, rs6000] 1/2 Add x86 SSE2 <emmintrin.h> intrinsics to GCC PPC64LE target

2017-10-17 Thread Steven Munroe
This is the fourth major contribution of X86 intrinsic equivalent
headers for PPC64LE.

X86 SSE2 technology adds double float (__m128d) support, fills in a
number of 128-bit vector integer (__m128i) operations, and adds some MMX
conversions to and from 128-bit vector (XMM) operations.

In general the SSE2 (__m128d) intrinsics are a good match to the
PowerISA VSX 128-bit vector double facilities. This allows direct
mapping of the __m128d type to the PowerPC __vector double type and
allows natural handling of parameter passing, return values, and SIMD
double operations.
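
For illustration, roughly how that mapping can be expressed, as a sketch
compiled with VSX enabled; the header's exact typedefs and attributes
may differ:

  /* Intel-style types as GNU C vector_size types that may alias other
     vector types, plus the native PowerPC vector types used with the
     vec_* builtins.  */
  typedef double __m128d __attribute__ ((__vector_size__ (16), __may_alias__));
  typedef long long __m128i __attribute__ ((__vector_size__ (16), __may_alias__));
  typedef __vector double __v2df;
  typedef __vector long long __v2di;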

However, while both ISAs support double and float scalars in vector
registers, X86_64 and PowerPC64LE use different formats (and different
bits within the vector register) for floating-point scalars. This
requires extra PowerISA operations to exactly match the X86 SSE scalar
(intrinsic functions ending in *_sd) semantics. The intent is to provide
a functionally correct implementation at some reduction in performance.
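
A hedged sketch of what those extra scalar (*_sd) semantics look like on
top of the vector double type: only element 0 is operated on, while
element 1 of the first operand passes through unchanged. This is
illustrative, not necessarily the patch's exact code:

  extern __inline __m128d __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
  _mm_add_sd (__m128d __A, __m128d __B)
  {
    __A[0] = __A[0] + __B[0];   /* Scalar add in the low element.  */
    return (__A);               /* High element of __A is preserved.  */
  }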

Some inline assembler is required. There are several cases where we need
to generate Data Cache Block instructions. There are no existing builtins
for flush and touch for store transient.  Also, some of the double to and
from 32-bit float and int conversions required assembler to get the
correct semantics at reasonable cost. Perhaps these can be revisited when
the team completes the builtins for vec_double* and vec_float*.

Part 2 adds the associated 131 DG test cases.

./gcc/ChangeLog:

2017-10-17  Steven Munroe  

* config.gcc (powerpc*-*-*): Add emmintrin.h.
* config/rs6000/emmintrin.h: New file.
* config/rs6000/x86intrin.h [__ALTIVEC__]: Include emmintrin.h.

Index: gcc/config.gcc
===
--- gcc/config.gcc  (revision 253786)
+++ gcc/config.gcc  (working copy)
@@ -459,7 +459,7 @@ powerpc*-*-*)
extra_objs="rs6000-string.o rs6000-p8swap.o"
extra_headers="ppc-asm.h altivec.h htmintrin.h htmxlintrin.h"
extra_headers="${extra_headers} bmi2intrin.h bmiintrin.h"
-   extra_headers="${extra_headers} xmmintrin.h mm_malloc.h"
+   extra_headers="${extra_headers} xmmintrin.h mm_malloc.h emmintrin.h"
extra_headers="${extra_headers} mmintrin.h x86intrin.h"
extra_headers="${extra_headers} ppu_intrinsics.h spu2vmx.h vec_types.h si2vmx.h"
extra_headers="${extra_headers} paired.h"
Index: gcc/config/rs6000/x86intrin.h
===
--- gcc/config/rs6000/x86intrin.h   (revision 253786)
+++ gcc/config/rs6000/x86intrin.h   (working copy)
@@ -39,6 +39,8 @@
 #include 
 
 #include 
+
+#include <emmintrin.h>
 #endif /* __ALTIVEC__ */
 
 #include 
Index: gcc/config/rs6000/emmintrin.h
===
--- gcc/config/rs6000/emmintrin.h   (revision 0)
+++ gcc/config/rs6000/emmintrin.h   (revision 0)
@@ -0,0 +1,2413 @@
+/* Copyright (C) 2003-2017 Free Software Foundation, Inc.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   GCC is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   .  */
+
+/* Implemented from the specification included in the Intel C++ Compiler
+   User Guide and Reference, version 9.0.  */
+
+#ifndef NO_WARN_X86_INTRINSICS
+/* This header is distributed to simplify porting x86_64 code that
+   makes explicit use of Intel intrinsics to powerpc64le.
+   It is the user's responsibility to determine if the results are
+   acceptable and make additional changes as necessary.
+   Note that much code that uses Intel intrinsics can be rewritten in
+   standard C or GNU C extensions, which are more portable and better
+   optimized across multiple targets.
+
+   In the specific case of X86 SSE2 (__m128i, __m128d) intrinsics,
+   the PowerPC VMX/VSX ISA is a good match for vector double SIMD
+   operations.  However scalar double operations in vector (XMM)
+   registers require the POWER8 VSX ISA (2.07) level. Also there are
+   important differences for data format and placement of double
+   scalars in the vector register.
+
+   For PowerISA Scalar double in FPRs