Tamar Christina <[email protected]> writes:
>> -----Original Message-----
>> From: Richard Sandiford <[email protected]>
>> Sent: Thursday, February 1, 2024 4:42 PM
>> To: Tamar Christina <[email protected]>
>> Cc: Andrew Pinski <[email protected]>; [email protected]; nd
>> <[email protected]>; Richard Earnshaw <[email protected]>; Marcus
>> Shawcroft <[email protected]>; Kyrylo Tkachov
>> <[email protected]>
>> Subject: Re: [PATCH]AArch64: update vget_set_lane_1.c test output
>> 
>> Tamar Christina <[email protected]> writes:
>> >> -----Original Message-----
>> >> From: Richard Sandiford <[email protected]>
>> >> Sent: Thursday, February 1, 2024 2:24 PM
>> >> To: Andrew Pinski <[email protected]>
>> >> Cc: Tamar Christina <[email protected]>; [email protected]; nd
>> >> <[email protected]>; Richard Earnshaw <[email protected]>; Marcus
>> >> Shawcroft <[email protected]>; Kyrylo Tkachov
>> >> <[email protected]>
>> >> Subject: Re: [PATCH]AArch64: update vget_set_lane_1.c test output
>> >>
>> >> Andrew Pinski <[email protected]> writes:
>> >> > On Thu, Feb 1, 2024 at 1:26 AM Tamar Christina <[email protected]> wrote:
>> >> >>
>> >> >> Hi All,
>> >> >>
>> >> >> In the vget_set_lane_1.c test the following entries now generate a zip1
>> >> >> instead of an INS:
>> >> >>
>> >> >> BUILD_TEST (float32x2_t, float32x2_t, , , f32, 1, 0)
>> >> >> BUILD_TEST (int32x2_t,   int32x2_t,   , , s32, 1, 0)
>> >> >> BUILD_TEST (uint32x2_t,  uint32x2_t,  , , u32, 1, 0)
>> >> >>
>> >> >> This is because the non-Q variant for indices 0 and 1 is just shuffling
>> >> >> values.  There is no perf difference between an INS from SIMD to SIMD and
>> >> >> a ZIP, so just update the test file.
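>> >> >>
>> >> >> For reference, the shape of function involved is roughly the following
>> >> >> (a hand-written sketch using the ACLE intrinsics, not the actual test
>> >> >> file contents):
>> >> >>
>> >> >> #include <arm_neon.h>
>> >> >>
>> >> >> /* Insert lane 0 of b into lane 1 of a.  On 64-bit vectors this is a
>> >> >>    pure two-element shuffle, so it can be emitted either as
>> >> >>    ins v0.s[1], v1.s[0] or as zip1 v0.2s, v0.2s, v1.2s.  */
>> >> >> float32x2_t
>> >> >> foo_f32 (float32x2_t a, float32x2_t b)
>> >> >> {
>> >> >>   return vset_lane_f32 (vget_lane_f32 (b, 0), a, 1);
>> >> >> }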
>> >> > Hmm, is this true on all cores? I suspect there is a core out there
>> >> > where INS is implemented with a much lower latency than ZIP.
>> >> > If we look at config/aarch64/thunderx.md, we can see INS is 2 cycles
>> >> > while ZIP is 6 cycles (3/7 for the Q versions).
>> >> > Now I don't have any vested interest in that core any more but I
>> >> > just wanted to point out that it is not exactly true for all cores.
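>> >> >
>> >> > The thunderx.md entries are define_insn_reservations of roughly this
>> >> > shape (illustrative only; the real names, type attributes and pipeline
>> >> > strings are in config/aarch64/thunderx.md):
>> >> >
>> >> > (define_insn_reservation "thunderx_neon_ins" 2
>> >> >   (and (eq_attr "tune" "thunderx")
>> >> >        (eq_attr "type" "neon_ins"))
>> >> >   "thunderx_pipe1")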
>> >>
>> >> Thanks for the pointer.  In that case, perhaps we should prefer
>> >> aarch64_evpc_ins over aarch64_evpc_zip in aarch64_expand_vec_perm_const_1?
>> >> That's enough to fix this failure, but it'll probably require other
>> >> tests to be adjusted...
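>> >>
>> >> For context, aarch64_expand_vec_perm_const_1 tries the expanders in a
>> >> fixed order, so the change would just reorder the calls; a simplified
>> >> sketch of the relevant part, not the exact GCC source:
>> >>
>> >>   else if (aarch64_evpc_zip (d))
>> >>     return true;
>> >>   else if (aarch64_evpc_ins (d))
>> >>     return true;
>> >>
>> >> where swapping the two tests would make INS the preferred form.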
>> >
>> > I think, given that ThunderX is a 10-year-old micro-architecture with
>> > several cases where often-used instructions have very high latencies,
>> > generic codegen should not be blocked from progressing because of it.
>> >
>> > We use ZIPs in many things, and if ThunderX codegen is really of that
>> > much importance then I think the old codegen should be gated behind
>> > -mcpu=thunderx rather than preventing generic changes.
>> 
>> But you said there was no perf difference between INS and ZIP, so it
>> sounds like for all known cases, using INS rather than ZIP is either
>> neutral or better.
>> 
>> There's also the possible secondary benefit that the INS patterns use
>> standard RTL operations whereas the ZIP patterns use unspecs.
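>> 
>> Roughly, and simplified from aarch64-simd.md rather than quoted from it,
>> the two shapes are:
>> 
>>   (set (reg:V2SI 0)
>>        (vec_merge:V2SI (vec_duplicate:V2SI (reg:SI 1))
>>                        (reg:V2SI 0)
>>                        (const_int 1)))
>> 
>>   (set (reg:V2SI 0)
>>        (unspec:V2SI [(reg:V2SI 0) (reg:V2SI 1)] UNSPEC_ZIP1))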
>> 
>> Keeping ZIP seems OK if there's a specific reason to prefer it over INS
>> for more modern cores, though.
>
> Ok, that's a fair point.  Doing some due diligence, the Neoverse-E1 and
> Cortex-A65 SWoGs seem to imply that ZIP has better throughput than INS
> on those cores.  However the entries are inconsistent and I can't
> measure the difference, so I believe this to be a documentation bug.
>
> That said, switching the order seems to show one issue: preferring INS
> degrades code in cases where we insert the top bits of the first
> parameter into the bottom of the second parameter and return the result.
>
> ZIP, being a three-operand instruction, lets us put the result into the
> final destination register with one operation, whereas INS requires an
> extra fmov:
>
> foo_uzp1_s32:
>         ins     v0.s[1], v1.s[0]
>         fmov    d0, d0
>         ret
> foo_uzp2_s32:
>         ins     v1.s[0], v0.s[1]
>         fmov    d0, d1
>         ret
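>
> For comparison, the ZIP versions keep everything in one instruction
> (reconstructed by hand for illustration, not actual compiler output):
>
> foo_uzp1_s32:
>         zip1    v0.2s, v0.2s, v1.2s
>         ret
> foo_uzp2_s32:
>         zip2    v0.2s, v0.2s, v1.2s
>         ret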

Ah, yeah, I should have thought about that.

In that case, the original patch is OK, thanks.

Richard
