On Wed, Dec 30, 2015 at 5:53 PM, Roland Scheidegger <[email protected]> wrote:
> On 30.12.2015 10:57, Oded Gabbay wrote:
>> On Wed, Dec 30, 2015 at 3:18 AM, Roland Scheidegger <[email protected]> wrote:
>>> On 29.12.2015 17:12, Oded Gabbay wrote:
>>>> This file provides a portability layer that will make it easier to convert
>>>> SSE-based functions to VMX/VSX-based functions.
>>>>
>>>> All the functions implemented in this file are prefixed with "vec_".
>>>> Therefore, when converting an SSE-based function, one simply needs to
>>>> replace the "_mm_" prefix of the SSE function being called with "vec_".
>>>>
>>>> Having said that, not all functions could be converted this way, due to the
>>>> differences between the architectures. So, where such a conversion hurt
>>>> the performance, I preferred to implement a more ad-hoc solution. For
>>>> example, _mm_shuffle_epi32 had to be converted using ad-hoc masks instead
>>>> of a generic function.
>>>>
>>>> All the functions in this file support both little-endian and big-endian.
>>>>
>>>> All of the functions are implemented using Altivec/VMX intrinsics, except
>>>> one where I needed to use inline assembly (due to a missing intrinsic).
>>>>
>>>> Signed-off-by: Oded Gabbay <[email protected]>
>>>> ---
>>>>  src/gallium/auxiliary/util/u_pwr8.h | 298 ++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 298 insertions(+)
>>>>  create mode 100644 src/gallium/auxiliary/util/u_pwr8.h
>>>>
>>>> diff --git a/src/gallium/auxiliary/util/u_pwr8.h b/src/gallium/auxiliary/util/u_pwr8.h
>>>> new file mode 100644
>>>> index 0000000..c6e5fac
>>>> --- /dev/null
>>>> +++ b/src/gallium/auxiliary/util/u_pwr8.h
>>>> @@ -0,0 +1,298 @@
>>>> +/*
>>>> + * Copyright 2015 Red Hat Inc.
>>>> + *
>>>> + * Permission is hereby granted, free of charge, to any person obtaining a
>>>> + * copy of this software and associated documentation files (the "Software"),
>>>> + * to deal in the Software without restriction, including without limitation
>>>> + * on the rights to use, copy, modify, merge, publish, distribute, sub
>>>> + * license, and/or sell copies of the Software, and to permit persons to whom
>>>> + * the Software is furnished to do so, subject to the following conditions:
>>>> + *
>>>> + * The above copyright notice and this permission notice (including the next
>>>> + * paragraph) shall be included in all copies or substantial portions of the
>>>> + * Software.
>>>> + *
>>>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>>>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>>>> + * FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT. IN NO EVENT SHALL
>>>> + * THE AUTHOR(S) AND/OR THEIR SUPPLIERS BE LIABLE FOR ANY CLAIM,
>>>> + * DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
>>>> + * OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE
>>>> + * USE OR OTHER DEALINGS IN THE SOFTWARE.
>>>> + *
>>>> + * Author: Oded Gabbay <[email protected]>
>>>> + */
>>>> +
>>>> +/**
>>>> + * @file
>>>> + * POWER8 intrinsics portability header.
>>>> + *
>>>> + */
>>>> +
>>>> +#ifndef U_PWR8_H_
>>>> +#define U_PWR8_H_
>>>> +
>>>> +#ifdef _ARCH_PWR8
>>>> +
>>>> +#define VECTOR_ALIGN_16 __attribute__ ((__aligned__ (16)))
>>>> +
>>>> +typedef VECTOR_ALIGN_16 vector unsigned char __m128i;
>>>> +
>>>> +typedef VECTOR_ALIGN_16 union m128i {
>>>> +   __m128i m128i;
>>>> +   vector signed int m128si;
>>>> +   vector unsigned int m128ui;
>>>> +   ubyte ub[16];
>>>> +   ushort us[8];
>>>> +   int i[4];
>>>> +   uint ui[4];
>>>> +} __m128i_union;
>>>> +
>>>> +static inline __m128i
>>>> +vec_set_epi32 (int i3, int i2, int i1, int i0)
>>>> +{
>>>> +   __m128i_union vdst;
>>>> +
>>>> +#ifdef PIPE_ARCH_LITTLE_ENDIAN
>>>> +   vdst.i[0] = i0;
>>>> +   vdst.i[1] = i1;
>>>> +   vdst.i[2] = i2;
>>>> +   vdst.i[3] = i3;
>>>> +#else
>>>> +   vdst.i[3] = i0;
>>>> +   vdst.i[2] = i1;
>>>> +   vdst.i[1] = i2;
>>>> +   vdst.i[0] = i3;
>>>> +#endif
>>>> +
>>>> +   return (__m128i) vdst.m128si;
>>>> +}
>>>> +
>>>> +static inline __m128i
>>>> +vec_setr_epi32 (int i0, int i1, int i2, int i3)
>>>> +{
>>>> +   return vec_set_epi32 (i3, i2, i1, i0);
>>>> +}
>>>> +
>>>> +static inline __m128i
>>>> +vec_unpacklo_epi32 (__m128i even, __m128i odd)
>>>> +{
>>>> +   static const __m128i perm_mask =
>>>> +#ifdef PIPE_ARCH_LITTLE_ENDIAN
>>>> +      { 0, 1, 2, 3, 16, 17, 18, 19, 4, 5, 6, 7, 20, 21, 22, 23};
>>>> +#else
>>>> +      {24, 25, 26, 27, 8, 9, 10, 11, 28, 29, 30, 31, 12, 13, 14, 15};
>>>> +#endif
>>>> +
>>>> +   return vec_perm (even, odd, perm_mask);
>>>> +}
>>>> +
>>>> +static inline __m128i
>>>> +vec_unpackhi_epi32 (__m128i even, __m128i odd)
>>>> +{
>>>> +   static const __m128i perm_mask =
>>>> +#ifdef PIPE_ARCH_LITTLE_ENDIAN
>>>> +      { 8, 9, 10, 11, 24, 25, 26, 27, 12, 13, 14, 15, 28, 29, 30, 31};
>>>> +#else
>>>> +      {16, 17, 18, 19, 0, 1, 2, 3, 20, 21, 22, 23, 4, 5, 6, 7};
>>>> +#endif
>>>> +
>>>> +   return vec_perm (even, odd, perm_mask);
>>>> +}
>>>> +
>>>> +static inline __m128i
>>>> +vec_unpacklo_epi64 (__m128i even, __m128i odd)
>>>> +{
>>>> +   static const __m128i perm_mask =
>>>> +#ifdef PIPE_ARCH_LITTLE_ENDIAN
>>>> +      { 0, 1, 2, 3, 4, 5, 6, 7, 16, 17, 18, 19, 20, 21, 22, 23};
>>>> +#else
>>>> +      {24, 25, 26, 27, 28, 29, 30, 31, 8, 9, 10, 11, 12, 13, 14, 15};
>>>> +#endif
>>>> +
>>>> +   return vec_perm (even, odd, perm_mask);
>>>> +}
>>>> +
>>>> +static inline __m128i
>>>> +vec_unpackhi_epi64 (__m128i even, __m128i odd)
>>>> +{
>>>> +   static const __m128i perm_mask =
>>>> +#ifdef PIPE_ARCH_LITTLE_ENDIAN
>>>> +      { 8, 9, 10, 11, 12, 13, 14, 15, 24, 25, 26, 27, 28, 29, 30, 31};
>>>> +#else
>>>> +      {16, 17, 18, 19, 20, 21, 22, 23, 0, 1, 2, 3, 4, 5, 6, 7};
>>>> +#endif
>>>> +
>>>> +   return vec_perm (even, odd, perm_mask);
>>>> +}
>>>> +
>>>> +static inline __m128i
>>>> +vec_add_epi32 (__m128i a, __m128i b)
>>>> +{
>>>> +   return (__m128i) vec_add ((vector signed int) a, (vector signed int) b);
>>>> +}
>>>> +
>>>> +static inline __m128i
>>>> +vec_sub_epi32 (__m128i a, __m128i b)
>>>> +{
>>>> +   return (__m128i) vec_sub ((vector signed int) a, (vector signed int) b);
>>>> +}
>>>> +
>>>> +static inline __m128i
>>>> +vec_mullo_epi32 (__m128i a, __m128i b)
>>>> +{
>>>> +   __m128i v;
>>>> +
>>>> +   __asm__(
>>>> +           "vmuluwm %0, %1, %2   \n"
>>>> +           : "=v" (v)
>>>> +           : "v" (a), "v" (b)
>>>> +   );
>>>> +
>>>> +   return v;
>>>> +}
>>>> +
>>>> +static inline void
>>>> +transpose4_epi32(const __m128i * restrict a,
>>>> +                 const __m128i * restrict b,
>>>> +                 const __m128i * restrict c,
>>>> +                 const __m128i * restrict d,
>>>> +                 __m128i * restrict o,
>>>> +                 __m128i * restrict p,
>>>> +                 __m128i * restrict q,
>>>> +                 __m128i * restrict r)
>>>> +{
>>>> +   __m128i t0 = vec_unpacklo_epi32(*a, *b);
>>>> +   __m128i t1 = vec_unpacklo_epi32(*c, *d);
>>>> +   __m128i t2 = vec_unpackhi_epi32(*a, *b);
>>>> +   __m128i t3 = vec_unpackhi_epi32(*c, *d);
>>>> +
>>>> +   *o = vec_unpacklo_epi64(t0, t1);
>>>> +   *p = vec_unpackhi_epi64(t0, t1);
>>>> +   *q = vec_unpacklo_epi64(t2, t3);
>>>> +   *r = vec_unpackhi_epi64(t2, t3);
>>>> +}
>>>> +
>>>> +static inline __m128i
>>>> +vec_slli_epi32 (__m128i vsrc, unsigned int count)
>>>> +{
>>>> +   __m128i_union vec_count;
>>>> +
>>>> +   if (count >= 32)
>>>> +      return (__m128i) vec_splats (0);
>>>> +   else if (count == 0)
>>>> +      return vsrc;
>>>> +
>>>> +   /* In VMX, all shift count fields must contain the same value */
>>>> +   vec_count.m128si = (vector signed int) vec_splats (count);
>>>> +   return (__m128i) vec_sl ((vector signed int) vsrc, vec_count.m128ui);
>>>> +}
>>>> +
>>>> +static inline __m128i
>>>> +vec_srli_epi32 (__m128i vsrc, unsigned int count)
>>>> +{
>>>> +   __m128i_union vec_count;
>>>> +
>>>> +   if (count >= 32)
>>>> +      return (__m128i) vec_splats (0);
>>>> +   else if (count == 0)
>>>> +      return vsrc;
>>>> +
>>>> +   /* In VMX, all shift count fields must contain the same value */
>>>> +   vec_count.m128si = (vector signed int) vec_splats (count);
>>>> +   return (__m128i) vec_sr ((vector signed int) vsrc, vec_count.m128ui);
>>>> +}
>>>> +
>>>> +static inline __m128i
>>>> +vec_srai_epi32 (__m128i vsrc, unsigned int count)
>>>> +{
>>>> +   __m128i_union vec_count;
>>>> +
>>>> +   if (count >= 32)
>>>> +      return (__m128i) vec_splats (0);
>>>> +   else if (count == 0)
>>>> +      return vsrc;
>>>> +
>>>> +   /* In VMX, all shift count fields must contain the same value */
>>>> +   vec_count.m128si = (vector signed int) vec_splats (count);
>>>> +   return (__m128i) vec_sra ((vector signed int) vsrc, vec_count.m128ui);
>>>> +}
>>>> +
>>>> +static inline __m128i
>>>> +vec_cmpeq_epi32 (__m128i a, __m128i b)
>>>> +{
>>>> +   return (__m128i) vec_cmpeq ((vector signed int) a, (vector signed int) b);
>>>> +}
>>>> +
>>>> +static inline __m128i
>>>> +vec_loadu_si128 (const uint32_t* src)
>>>> +{
>>>> +   __m128i_union vsrc;
>>>> +
>>>> +#ifdef PIPE_ARCH_LITTLE_ENDIAN
>>>> +
>>>> +   vsrc.m128ui = *((vector unsigned int *) src);
>>>> +
>>>> +#else
>>>> +
>>>> +   __m128i vmask, tmp1, tmp2;
>>>> +
>>>> +   vmask = vec_lvsl(0, src);
>>>> +
>>>> +   tmp1 = (typeof(tmp1))vec_ld (0, src);
>>>> +   tmp2 = (typeof(tmp2))vec_ld (15, src);
>>>> +   vsrc.m128ui = (typeof(vsrc.m128ui)) vec_perm (tmp1, tmp2, vmask);
>>>> +
>>>> +#endif
>>>> +
>>>> +   return vsrc.m128i;
>>>> +}
>>>> +
>>>> +static inline void
>>>> +vec_store_si128 (uint32_t* dest, __m128i vdata)
>>>> +{
>>>> +   vec_st ((vector unsigned int) vdata, 0, dest);
>>>> +}
>>>> +
>>>> +static inline int
>>>> +vec_movemask_epi8 (__m128i vsrc)
>>>> +{
>>>> +   __m128i_union vtemp;
>>>> +   int result;
>>>> +
>>>> +   vtemp.m128i = __builtin_vec_vgbbd(vsrc);
>>>> +
>>>> +#ifdef PIPE_ARCH_LITTLE_ENDIAN
>>>> +   result = vtemp.ub[15] << 8 | vtemp.ub[7];
>>>> +#else
>>>> +   result = vtemp.ub[0] << 8 | vtemp.ub[8];
>>>> +#endif
>>>> +
>>>> +   return result;
>>>> +}
>>>> +
>>>> +static inline __m128i
>>>> +vec_packs_epi16 (__m128i a, __m128i b)
>>>> +{
>>>> +#ifdef PIPE_ARCH_LITTLE_ENDIAN
>>>> +   return (__m128i) vec_packs ((vector signed short) a,
>>>> +                               (vector signed short) b);
>>>> +#else
>>>> +   return (__m128i) vec_packs ((vector signed short) b,
>>>> +                               (vector signed short) a);
>>>> +#endif
>>>> +}
>>>> +
>>>> +static inline __m128i
>>>> +vec_packs_epi32 (__m128i a, __m128i b)
>>>> +{
>>>> +#ifdef PIPE_ARCH_LITTLE_ENDIAN
>>>> +   return (__m128i) vec_packs ((vector signed int) a, (vector signed int) b);
>>>> +#else
>>>> +   return (__m128i) vec_packs ((vector signed int) b, (vector signed int) a);
>>>> +#endif
>>>> +}
>>>> +
>>>> +#endif /* _ARCH_PWR8 */
>>>> +
>>>> +#endif /* U_PWR8_H_ */
>>>>
>>>
>>>
>>> For which functions do you actually need power8?
>>> I believe __builtin_vec_vgbbd used for vec_movemask_epi8 is one?
>> Yes, but not only that.
>> I also need power8 for vmuluwm, which is the parallel of
>> _mm_mullo_epi32. Without it, the only option is to just do four
>> regular multiplications, which again ruins the performance (makes it
>> worse than the original code).
> Ah I see. I don't know anything about power8, albeit I suppose it is
> possible the big penalty is more due to the domain transition than to
> the additional muls (if you just access the vector through memory, I
> have no idea how good the forwarding between the int and vector domains
> is). If so, it looks like you don't have 32bit integer vector muls - you
> could try to emulate that with some 16bit muls (3 should be sufficient)
> plus some more shuffle (or shift) and add (in fact it looks like altivec
> has some mul+sum instructions which should help with that). But if you
> don't care about power7, it's probably not worth it...

Yeah, that's my current thought.
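Just to spell out the arithmetic behind that suggestion (my reading of it,
shown only as an illustration): if each 32-bit lane is split into 16-bit
halves, only three of the four partial products matter, because the
high*high term never reaches the low 32 bits. A plain scalar sketch of what
a vec_mule/vec_mulo/vec_msum style emulation would have to compute per lane
(the helper name is made up for this mail):

   /* Illustration only: low 32 bits of a*b rebuilt from three 16-bit
    * partial products.  ah*bh only affects the high 32 bits, so it is
    * skipped.  Everything is modulo 2^32, so the wrap-around of the
    * unsigned adds is harmless. */
   static inline uint32_t
   mullo32_from_16bit_muls (uint32_t a, uint32_t b)
   {
      uint32_t al = a & 0xffff, ah = a >> 16;
      uint32_t bl = b & 0xffff, bh = b >> 16;

      /* a*b mod 2^32 == al*bl + ((al*bh + ah*bl) << 16) */
      return al * bl + ((al * bh + ah * bl) << 16);
   }

Vectorizing that needs the extra shuffles/shifts and adds you describe,
which is why I'd rather just require POWER8 and use vmuluwm directly.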
>
>>
>> When I started to write this patch series, I had no intention of
>> limiting it to POWER8. But without vec_vgbbd and vmuluwm, the
>> performance of the vector code is much worse (around 10-15%) than the
>> non-vector code.
>>
>>> I have actually been thinking about how we deal with these masks some
>>> time ago. It looks suboptimal since for the most part we're packing the
>>> masks and then just have to unpack them again in the jit code, so I
>>> always believed there's just overhead due to that (we could at least
>>> pass one byte per mask bit). The problem with that though is that those
>>> masks also interact with the c code, albeit typically just checking for
>>> a specific value, so as long as you can do that efficiently with vector
>>> instructions it should be possible to work around it (sse41 in fact has
>>> the very efficient ptest instruction, but unfortunately it can't really
>>> be integrated well into surrounding c code due to the fact that it just
>>> sets the flags register and has no return value, plus it would need
>>> runtime feature detection).
>>
>> I understand what you are saying, but I need to learn more about it to
>> be able to change the code. Although if I see you changing it for SSE,
>> I will be able to copy it to the VMX/VSX code.
> Yes, it's just something to consider for the future. I hope to be able
> to optimize this a bit at some point, though certainly I wouldn't mind
> you doing it :-).
>
>>
>>> Plus there's some problem since the "recursive" evaluation of blocks
>>> (for size 64/16/4) mostly uses the same functions, and the outer ones
>>> really only need some scalar value which can be efficiently dealt with
>>> in c code.
>>>
>>> FWIW it looks like the unaligned load is somewhat complex too.
>>
>> Fortunately, in POWER8, there is almost zero penalty for doing
>> unaligned loads into vector registers. That is a major improvement
>> over previous generations.
>>
>>> I am not
>>> entirely sure why we use unaligned values in the first place there; as
>>> far as I can tell it should be possible to fix this with a couple of
>>> PIPE_ALIGN_VAR(16) for the affected structs (surely we'd prefer aligned
>>> variables if we can). Unless there's some reason that doesn't work.
>>>
>>> Roland
>>>
>>
>> Yes, that may give a modest boost. I, for one, have put the
>> VECTOR_ALIGN_16 (which is __attribute__ ((__aligned__ (16)))) before
>> every vector register in my code (see the start of u_pwr8.h). But
>> adding that to additional structures might give even more of a boost.
>
> Yes, on x86 with newer cpus the penalties are very slim too (if you use
> unaligned loads but it happens to be aligned there's no penalty at all,
> and even if it's unaligned, it's pretty reasonable - don't try that on
> the P4, though...). But still, if the data can easily be aligned,
> there's no reason to use unaligned data. On a quick glance it actually
> seems all loadu there originate from non-aligned fixed_position structs,
> and they could trivially be aligned as far as I can tell.
>
> Roland
>

So I saw you changed that, and therefore I added a new vec_load_si128
function to this file.
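For completeness, since the body of the new function isn't quoted in this
mail: an aligned-load counterpart of vec_loadu_si128 can be as simple as
the sketch below (illustration only - the exact code that went into
u_pwr8.h may differ):

   static inline __m128i
   vec_load_si128 (const uint32_t* src)
   {
      /* Sketch of a plain aligned load: vec_ld expects a 16-byte aligned
       * address, which the now-aligned fixed_position data provides. */
      return (__m128i) vec_ld (0, (const vector unsigned int *) src);
   }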
Oded
_______________________________________________
mesa-dev mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/mesa-dev