On Wed, Oct 12, 2016 at 10:29 AM, Bin.Cheng <amker.ch...@gmail.com> wrote:
> On Wed, Oct 12, 2016 at 9:12 AM, Richard Biener
> <richard.guent...@gmail.com> wrote:
>> On Tue, Oct 11, 2016 at 5:03 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>> Hi,
>>> Given the test case below,
>>> int foo (unsigned short a[], unsigned int x)
>>> {
>>>   unsigned int i;
>>>   for (i = 0; i < 1000; i++)
>>>     {
>>>       x = a[i];
>>>       a[i] = (unsigned short)(x >= 32768 ? x - 32768 : 0);
>>>     }
>>>   return x;
>>> }
>>>
>>> it can now be vectorized on AArch64, but the generated assembly is far from
>>> optimal:
>>> .L4:
>>>         ldr     q4, [x3, x1]
>>>         add     w2, w2, 1
>>>         cmp     w2, w0
>>>         ushll   v1.4s, v4.4h, 0
>>>         ushll2  v0.4s, v4.8h, 0
>>>         add     v3.4s, v1.4s, v6.4s
>>>         add     v2.4s, v0.4s, v6.4s
>>>         cmhi    v1.4s, v1.4s, v5.4s
>>>         cmhi    v0.4s, v0.4s, v5.4s
>>>         and     v1.16b, v3.16b, v1.16b
>>>         and     v0.16b, v2.16b, v0.16b
>>>         xtn     v2.4h, v1.4s
>>>         xtn2    v2.8h, v0.4s
>>>         str     q2, [x3, x1]
>>>         add     x1, x1, 16
>>>         bcc     .L4
>>>
>>> The vectorized loop has 16 instructions, which can be greatly simplified by
>>> turning cond_expr into max_expr, as below:
>>> .L4:
>>>         ldr     q1, [x3, x1]
>>>         add     w2, w2, 1
>>>         cmp     w2, w0
>>>         umax    v0.8h, v1.8h, v2.8h
>>>         add     v0.8h, v0.8h, v2.8h
>>>         str     q0, [x3, x1]
>>>         add     x1, x1, 16
>>>         bcc     .L4
>>>
>>> This patch addresses the issue by adding a new vectorization pattern.
>>> Bootstrapped and tested on x86_64 and AArch64.  Is it OK?
>>
>> So the COND_EXPRs are generated this way by if-conversion, right?  I
> Though ?: is used in the source code, yes, it is a COND_EXPR regenerated by if-conv.
>> believe that the MAX/MIN_EXPR form is always preferable and thus it looks like
>> if-conversion might want to either directly generate it or make sure to fold
>> the introduced stmts (and have a match.pd pattern catching this).
> Hmm, I also noticed saturation cases which would be better
> transformed before vectorization in scalar optimizers.  But this case
> is a bit different because there is additional computation involved
> other than type conversion.  We need to prove the computation can be
> done in either large or small types.  It is quite a specific case and I
> don't see a good (general) solution in if-conv.  Vect-pattern looks like
> a natural place to do this.  I am also looking at general saturation
> cases, but this one is different?

(vect-patterns should go away ...)

But as if-conversion results may also prevail for scalar code, doing the
pattern in match.pd would be better - that is, "apply" the pattern
already during if-conversion.

Yes, if-conversion fails to fold the stmts it generates (it only uses
generic folding on the trees it builds - it could use some TLC here).

Richard.

> Thanks,
> bin
>>
>> Richard.
>>
>>> Thanks,
>>> bin
>>>
>>> 2016-10-11  Bin Cheng  <bin.ch...@arm.com>
>>>
>>>         * tree-vect-patterns.c (vect_recog_min_max_modify_pattern): New.
>>>         (vect_vect_recog_func_ptrs): New element for above pattern.
>>>         * tree-vectorizer.h (NUM_PATTERNS): Increase by 1.
>>>
>>> gcc/testsuite/ChangeLog
>>> 2016-10-11  Bin Cheng  <bin.ch...@arm.com>
>>>
>>>         * gcc.dg/vect/vect-umax-modify-pattern.c: New test.
>>>         * gcc.dg/vect/vect-umin-modify-pattern.c: New test.
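
For reference, the scalar identity behind the proposed cond_expr to max_expr
rewrite can be checked exhaustively over all 16-bit inputs.  The sketch below
is only an illustration and is not taken from the patch; the function names
(sub_sat_cond, sub_sat_max, sub_sat_max_add, count_mismatches,
check_equivalence) are hypothetical.  It compares the form from the test
case, computed in unsigned int, with the MAX form computed in unsigned short,
including the variant that adds 32768 with wraparound as the shorter assembly
does.

```c
#include <assert.h>
#include <stdint.h>

/* Form written in the test case: widen to unsigned int, then select
   between x - 32768 and 0.  */
static uint16_t sub_sat_cond(uint16_t v)
{
    unsigned int x = v;
    return (uint16_t)(x >= 32768 ? x - 32768 : 0);
}

/* MAX form: clamp from below at 32768, then subtract.  The result is
   truncated to 16 bits, so the whole computation fits the small type.  */
static uint16_t sub_sat_max(uint16_t v)
{
    uint16_t m = v > 32768 ? v : 32768;   /* per-lane effect of umax */
    return (uint16_t)(m - 32768);
}

/* Same as sub_sat_max, but using a wrapping add: modulo 65536, adding
   32768 is the same as subtracting 32768.  */
static uint16_t sub_sat_max_add(uint16_t v)
{
    uint16_t m = v > 32768 ? v : 32768;
    return (uint16_t)(m + 32768);
}

/* Count inputs for which the three forms disagree (hypothetical helper).  */
static unsigned int count_mismatches(void)
{
    unsigned int bad = 0;
    for (unsigned int v = 0; v <= 65535; v++) {
        uint16_t a = sub_sat_cond((uint16_t)v);
        if (a != sub_sat_max((uint16_t)v)
            || a != sub_sat_max_add((uint16_t)v))
            bad++;
    }
    return bad;
}

/* Aborts if any 16-bit input distinguishes the forms.  */
static void check_equivalence(void)
{
    assert(count_mismatches() == 0);
}
```

Calling check_equivalence() from a test driver exercises all 65536 inputs.
This only covers the constant 32768 from the test case; other constants or
signed types would need their own argument about the narrower type.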
