https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750
--- Comment #10 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Uroš Bizjak from comment #9)
> (In reply to Thiago Macieira from comment #0)
> > Testcase:
> ...
> > The assembly for this produces:
> >
> > vmovdqu16 (%rdi), %ymm1
> > vmovdqu16 32(%rdi), %ymm2
> > vpcmpuw $0, %ymm0, %ymm1, %k0
> > vpcmpuw $0, %ymm0, %ymm2, %k1
> > kmovw %k0, %edx
> > kmovw %k1, %eax
> > kortestw %k1, %k0
> > je .L10
> >
> > Those two KMOVW instructions aren't required for the check that follows.
> > They're also dispatched on port 0, same as the KORTESTW, meaning the KORTEST
> > can't be dispatched until those two have executed, thus introducing a
> > 2-cycle delay in this loop.
>
> These are not NOP moves but zero-extensions.
>
> vmovdqu16 (%rdi), %ymm1 # 93 [c=17 l=6]
> movv16hi_internal/2
> vmovdqu16 32(%rdi), %ymm2 # 94 [c=21 l=7]
> movv16hi_internal/2
> vpcmpuw $0, %ymm0, %ymm1, %k0 # 21 [c=4 l=7]
> avx512vl_ucmpv16hi3
> vpcmpuw $0, %ymm0, %ymm2, %k1 # 27 [c=4 l=7]
> avx512vl_ucmpv16hi3
> kmovw %k0, %edx # 30 [c=4 l=4] *zero_extendhisi2/1
> kmovw %k1, %eax # 29 [c=4 l=4] *zero_extendhisi2/1
> kortestw %k1, %k0 # 31 [c=4 l=4] kortesthi
>
> since for some reason tree optimizers give us:
>
> _28 = VIEW_CONVERT_EXPR<__v16hi>(_31);
> _29 = __builtin_ia32_ucmpw256_mask (_28, _20, 0, 65535);
> _26 = VIEW_CONVERT_EXPR<__v16hi>(_30);
> _27 = __builtin_ia32_ucmpw256_mask (_26, _20, 0, 65535);
> _2 = (int) _27;
> _3 = (int) _29;
> _15 = __builtin_ia32_kortestzhi (_3, _2);
>
>
Is there any way to avoid the zero-extension for
> _2 = (int) _27;
> _3 = (int) _29;
given that __builtin_ia32_kortestzhi is defined to accept two short parameters?
Also, the ABI doesn't require clearing the upper bits.
For example, for this testcase:
int
__attribute__((noipa))
foo (short a)
{
return a;
}
int
foo1 (short a)
{
return foo (a);
}
_Z3foos:
movswl %di, %eax
ret
_Z4foo1s:
movswl %di, %edi
jmp _Z3foos
The movswl in foo1 seems redundant, since foo sign-extends %di itself.
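
To make the two redundancy claims concrete, here is a minimal sketch in plain C that models the registers as 32-bit integers. It is not GCC code and uses no intrinsics; every helper name (sext16, zext16, kortestz16_model, foo_model and so on) is hypothetical and exists only for illustration. It assumes that the 16-bit KORTEST zero check looks only at bits 15:0 of the OR of its operands, and that foo reads only %di.

```c
#include <assert.h>
#include <stdint.h>

/* Models "movswl %di, %eax": sign-extend the low 16 bits of a register. */
static int32_t sext16(uint32_t reg)
{
    uint32_t low = reg & 0xFFFFu;
    return (low & 0x8000u) ? (int32_t)low - 0x10000 : (int32_t)low;
}

/* Models the *zero_extendhisi2 step: keep the low 16 bits, clear the rest. */
static uint32_t zext16(uint32_t reg)
{
    return reg & 0xFFFFu;
}

/* Assumed model of the 16-bit KORTEST zero check: only bits 15:0 matter. */
static int kortestz16_model(uint32_t a, uint32_t b)
{
    return ((a | b) & 0xFFFFu) == 0;
}

/* The same check with both operands zero-extended first. */
static int kortestz16_after_zext(uint32_t a, uint32_t b)
{
    return kortestz16_model(zext16(a), zext16(b));
}

/* foo: sign-extends its 16-bit argument on entry. */
static int32_t foo_model(uint32_t edi)
{
    return sext16(edi);
}

/* foo1 as currently compiled: sign-extend, then call foo. */
static int32_t foo1_with_movswl(uint32_t edi)
{
    return foo_model((uint32_t)sext16(edi));
}

/* foo1 without the extra movswl: pass the register through unchanged. */
static int32_t foo1_without_movswl(uint32_t edi)
{
    return foo_model(edi);
}

/* Spot-check that the extra extensions never change the result. */
static void check_models(void)
{
    const uint32_t samples[] = { 0u, 1u, 0x7FFFu, 0x8000u, 0xFFFFu,
                                 0x00010000u, 0xDEAD8001u, 0xFFFF0000u };
    const unsigned n = sizeof samples / sizeof samples[0];
    for (unsigned i = 0; i < n; i++) {
        assert(foo1_with_movswl(samples[i]) == foo1_without_movswl(samples[i]));
        for (unsigned j = 0; j < n; j++)
            assert(kortestz16_model(samples[i], samples[j]) ==
                   kortestz16_after_zext(samples[i], samples[j]));
    }
}
```

Under those assumptions the extension in front of a consumer that only looks at the low 16 bits is a no-op for the result, whatever the upper bits hold. The sketch says nothing about whether GCC can prove this at the tree or RTL level.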