Re: [PATCH] Inefficient code generation for 128-bit-≥256-bittypecast intrinsics (BZ #15712)

Craig Topper Sun, 04 Aug 2013 23:23:20 -0700

avxintrin.h is now fixed in r187716.


On Sat, Aug 3, 2013 at 10:45 AM, Craig Topper <[email protected]>wrote:

> Commited in r187694. And I'll fix the avxintrin header in just a moment.
>
>
> On Sat, Aug 3, 2013 at 1:08 AM, Craig Topper <[email protected]>wrote:
>
>> Option (2) directly matches the capabilities of the shufflevector
>> instruction in the LLVM IR. I have attached a patch that will allow -1 to
>> become undef in the IR.
>>
>> So
>>
>> __builtin_shufflevector( x, y, 0, 4, -1, 5 );
>>
>> becomes
>>
>> shufflevector <4 x float> %x, <4 x float> %y, <4 x i32> <i32 0, i32 4,
>> i32 undef, i32 5>
>>
>>
>> On Fri, Aug 2, 2013 at 6:15 PM, Katya Romanova <
>> [email protected]> wrote:
>>
>>>
>>>
>>> Craig Topper <craig.topper@...> writes:
>>>
>>> >
>>> >
>>> > Ok so -1 isn't valid for indices, and i have even more questions about
>>> __builtin_shufflevector the more i look at it. See my message in cfe-dev.
>>> >
>>> >
>>> > On Thu, Jul 18, 2013 at 6:12 PM, Chandler Carruth
>>> <[email protected]> wrote:
>>> >
>>> > On Thu, Jul 18, 2013 at 6:11 PM, Craig Topper
>>> <[email protected]> wrote:
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > Would __builtin_shufflevector(__a, __a, 0, 1, -1, -1)  work?
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > Personally, I would prefer a defined way to produce an undef input in
>>> general... but if folks are worried about exposing such an interface,
>>> then
>>> sure, we could just allow the shuffle builtin itself to designate an
>>> "undef"
>>> input with goofy indices.
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > On Thu, Jul 18, 2013 at 5:42 PM, Chandler Carruth
>>> <[email protected]> wrote:
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > On Thu, Jul 18, 2013 at 5:32 PM, Katya Romanova
>>> <[email protected]> wrote:-
>>>  __m128d __zero = _mm_setzero_pd();
>>> > -  return __builtin_shufflevector(__a, __zero, 0, 1, 2, 2);
>>> > +  return (__m256d)__builtin_ia32_pd256_pd((__v2df)__a);
>>> >
>>> >
>>> > I think this is the wrong approach.
>>> >
>>> > Rather than switching these to use an x86-specific builtin, instead it
>>> would be better to provide some generic form to produce an undef input
>>> to a
>>> shufflevector. That is a generally useful and completely target
>>> independent
>>> concept.
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > _______________________________________________
>>> > cfe-commits mailing listcfe-commits <at>
>>> cs.uiuc.eduhttp://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits
>>> >
>>> >
>>> > -- ~Craig
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > -- ~Craig
>>> >
>>> >
>>> >
>>> > _______________________________________________
>>> > cfe-commits mailing list
>>> > cfe-commits@...
>>> > http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits
>>> >
>>>
>>>
>>>
>>> I agree with Chandler that it's better to use a shuffle with undef input
>>> (which is target independent), even though we generate code for AVX
>>> intrinsics. The reason I initially ended up using a x86-specific builtin
>>> is
>>> because there I couldn't find a generic way to create "undef" input for a
>>> shuffle.
>>>
>>> I tried the following, but I didn't like it, because the compiler gives a
>>> warning when compiling avxintrin.h
>>>
>>> static __inline __m256d __attribute__((__always_inline__, __nodebug__))
>>> _mm256_castpd128_pd256(__m128d in)
>>> {
>>>   __m128d undef;
>>>   return __builtin_shufflevector(in, undef, 0, 1, 2, 2);
>>> }
>>>
>>> I tried this as well and I didn't like it either:
>>>
>>> static __inline __m256d __attribute__((__always_inline__, __nodebug__))
>>> _mm256_castpd128_pd256(__m128d in)
>>> {
>>>   __v2df __in = (__v2df) in;
>>>   __v4df ret;
>>>   ret[0]=in[0];
>>>   ret[1]=in[1];
>>>   return (__m256d)ret;
>>> }
>>>
>>> So, I ended up introducing a x86_64 builtin and lowered it later to a
>>> shuffle with undef (not a target-independent solution).
>>>
>>> static __inline __m256d __attribute__((__always_inline__, __nodebug__))
>>>  _mm256_castpd128_pd256(__m128d __a)  {
>>>   return (__m256d)__builtin_ia32_pd256_pd((__v2df)__a);
>>> }
>>>
>>>
>>> I've read Craig's proposal about using shuffle builtin with negative
>>> indeces
>>> (-1) to indicate shuffle with undef. This solution looks good. However,
>>> "-1"
>>> shuffle index is presently considered invalid. We need to discuss
>>> extending
>>> shuffle syntax/semantics and then implement this extension before I could
>>> use a shuffle with negative indices for AVX typecast builtins. It looks
>>> like
>>> it will take some time...
>>>
>>> I was wondering if it's possible to check in my current fix that is using
>>> x86_86 builtins (instead of a shuffle) for AVX typecast intrinsics for
>>> now.
>>> When shuffle learns to understand negative indices, I could easily
>>> replaces
>>> my changes with something like that:
>>>
>>> __builtin_shufflevector(__a, __a, 0, 1, -1, -1)
>>>
>>> If this interim solution doesn’t sound inappropriate, we should start a
>>> discussion about extending shuffle builtin functionality to understand
>>> negative indexes.
>>>
>>> Here are several ideas:
>>>
>>> We could use "unary" form of __builtin_shufflevector when negative
>>> indices
>>> are used.
>>> A "binary" form could be used with negive indexes as well, but semantic
>>> analysis should ensure that the first and the second parameter is
>>> actually
>>> the same vector. Here is the reason for this limitation:
>>>
>>> If negative indices specify "undef" and a binary form of
>>> __builtin_shufflevector is used with different first and second
>>> parameter,
>>> e.g. __builtin_shufflevector(a, b, 0, 1, 7, -1)
>>> then, in fact, we will be shuffling 3 vectors (a, b and undef). I don’t
>>> think that it’s a good idea to extend __builtin_shufflevector semantic
>>> to do
>>> that.
>>>
>>>
>>>  Which solution is preferred?
>>> (1) Support negative indices for unary form of __builtin_shufflevector
>>> only.
>>> (2) Support negative indices for binary form of __builtin_shufflevector
>>> only
>>> and ensure that the first and the second parameter is the same vector.
>>> (3) Support both (1) and (2).
>>> (4) Another possible (though very different from proposed above) solution
>>> that allows to use "undef" in shuffles would be adding a
>>> target-independent
>>> builtin (e.g __builtin_undef(vector a)), which creates an “undef” vector
>>> with the same type and the same number of elements as its vector
>>> argument.
>>> With this "undef" builtin, I could implement AVX typecast builtins like
>>> that:
>>>
>>> static __inline __m256d __attribute__((__always_inline__, __nodebug__))
>>> _mm256_castpd128_pd256(__m128d in)
>>> {
>>>   __m128d undef = __builtin_undef(in);
>>>   return __builtin_shufflevector(in, undef, 0, 1, 2, 2);
>>> }
>>>
>>> Thoughts?
>>>
>>>
>>> Thank you!
>>> Katya.
>>>
>>>
>>>
>>> _______________________________________________
>>> cfe-commits mailing list
>>> [email protected]
>>> http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits
>>>
>>
>>
>>
>> --
>> ~Craig
>>
>
>
>
> --
> ~Craig
>



-- 
~Craig

_______________________________________________
cfe-commits mailing list
[email protected]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits

Re: [PATCH] Inefficient code generation for 128-bit-≥256-bittypecast intrinsics (BZ #15712)

Reply via email to