Great catch! From the intrinsics manual entry for _mm512_castsi128_si512:

Cast vector of type __m128i to type __m512i; the upper 384 bits of the result 
are undefined.

Replacing that with _mm512_zextsi128_si512 fixes the problem. 
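
For anyone following along, here is a minimal sketch of why the zext variant is safe. This is not the attached reproducer: the function names check_zext and zext_upper_bits_are_zero are made up for illustration, and it assumes gcc or clang on x86-64. It widens a 128-bit value with _mm512_zextsi128_si512 and checks that the upper 384 bits really are zero. Doing the same check with _mm512_castsi128_si512 would not be a valid test, since those bits are undefined and may hold anything.

```c
#include <stdint.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define HAVE_X86_64_INTRINSICS 1
#endif

#ifdef HAVE_X86_64_INTRINSICS
/*
 * Widen a 128-bit vector to 512 bits with the zero-extending intrinsic and
 * verify the result.  Returns 1 if the low four lanes are preserved and the
 * upper twelve 32-bit lanes (384 bits) are zero, 0 otherwise.
 */
__attribute__((target("avx512f")))
static int
zext_upper_bits_are_zero(void)
{
	__m128i		x = _mm_set_epi32(4, 3, 2, 1);	/* lane 0 = 1 ... lane 3 = 4 */
	__m512i		z = _mm512_zextsi128_si512(x);	/* upper 384 bits zeroed */
	uint32_t	out[16];
	int			i;

	_mm512_storeu_si512(out, z);

	for (i = 0; i < 4; i++)
		if (out[i] != (uint32_t) (i + 1))
			return 0;
	for (i = 4; i < 16; i++)
		if (out[i] != 0)
			return 0;
	return 1;
}
#endif

/*
 * Hypothetical wrapper: returns 1 or 0 as above, or -1 if the check could
 * not be run (not x86-64, or no AVX-512F support at runtime).
 */
int
check_zext(void)
{
#ifdef HAVE_X86_64_INTRINSICS
	if (__builtin_cpu_supports("avx512f"))
		return zext_upper_bits_are_zero();
#endif
	return -1;
}
```

The target attribute plus the runtime check lets this build without -mavx512f and skip cleanly on machines without AVX-512.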

> -----Original Message-----
> From: Nathan Bossart <[email protected]>
> Sent: Monday, June 16, 2025 3:14 PM
> To: Devulapalli, Raghuveer <[email protected]>
> Cc: John Naylor <[email protected]>; Andy Fan
> <[email protected]>; Jesper Pedersen <[email protected]>;
> Tomas Vondra <[email protected]>; [email protected];
> Shankaran, Akash <[email protected]>
> Subject: Re: Improve CRC32C performance on SSE4.2
> 
> On Mon, Jun 16, 2025 at 06:31:11PM +0000, Devulapalli, Raghuveer wrote:
> > Attached is a simple reproducer. It passes with clang v16 -O0, but
> > fails with 17 and 18 only when built with -O0.
> 
> I've just started looking into this, but the difference in code generated for
> _mm512_castsi128_si512() between gcc, clang 16, and clang 17 looks 
> interesting.
> 
> --
> nathan

