On Thu, 15 Feb 2024, Richard Henderson wrote:

> On 2/14/24 22:47, Alexander Monakov wrote:
> > 
> > On Wed, 14 Feb 2024, Richard Henderson wrote:
> > 
> >> Because non-embedded aarch64 is expected to have AdvSIMD enabled, merely
> >> double-check with the compiler flags for __ARM_NEON and don't bother with
> >> a runtime check.  Otherwise, model the loop after the x86 SSE2 function,
> >> and use VADDV to reduce the four vector comparisons.
> > 
> > I am not very familiar with Neon but I wonder if this couldn't use SHRN
> > for the final 128b->64b reduction similar to 2022 Glibc optimizations:
> > https://inbox.sourceware.org/libc-alpha/20220620174628.2820531-1-dani...@google.com/
> 
> The reason they use SHRN for memchr is that they have also applied a mask
> to the comparison so that they can identify which byte contained the match.
> That is not required here, so any reduction will do.

Right, so we can pick the cheapest reduction method. If I'm reading the
Neoverse-N1 SOG right, SHRN is marginally cheaper than ADDV (latency 2
instead of 3), and it should generally be preferable on other cores too, no?
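
To make that concrete, an untested sketch of the two reduction tails on a
single 16-byte accumulator (helper names are mine and the actual patch
reduces four vectors, so this only illustrates the final step):

#include <arm_neon.h>
#include <stdbool.h>
#include <stdint.h>

/* CMEQ+ADDV flavour: CMEQ against zero yields 0xFF for zero bytes; the
 * wrapped byte sum equals 0xF0 (16 * 0xFF mod 256) only when all 16
 * lanes matched. */
static inline bool chunk_is_zero_addv(uint8x16_t t)
{
    return vaddvq_u8(vceqzq_u8(t)) == 0xF0;
}

/* CMEQ+SHRN flavour: shift-right-narrow the mask by 4 into a 64-bit
 * "nibble mask"; the chunk is all zero iff that value is all ones. */
static inline bool chunk_is_zero_shrn(uint8x16_t t)
{
    uint8x8_t n = vshrn_n_u16(vreinterpretq_u16_u8(vceqzq_u8(t)), 4);
    return vget_lane_u64(vreinterpret_u64_u8(n), 0) == UINT64_MAX;
}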

For that matter, couldn't UQXTN (unsigned saturating extract narrow) be
used in place of CMEQ+ADDV here?
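
Untested as well, and reusing the includes from the sketch above, I mean
something along these lines (again, the helper name is just for
illustration):

/* UQXTN flavour: unsigned saturating narrow maps every nonzero 16-bit
 * lane of the raw data to a nonzero byte (anything above 0xFF saturates
 * to 0xFF), so no CMEQ mask is needed beforehand. */
static inline bool chunk_is_zero_uqxtn(uint8x16_t t)
{
    uint8x8_t n = vqmovn_u16(vreinterpretq_u16_u8(t));    /* UQXTN */
    return vget_lane_u64(vreinterpret_u64_u8(n), 0) == 0;
}

If I understand the pipelines right, in all three variants the reduced
value still has to be moved out to a general-purpose register for the
final compare, so the difference is really just the cost of producing it.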

Alexander
