Bug ID: 41512
Summary: Conversion from int to XMM is handled inefficiently on SSE4.1
OS: Windows NT
Component: Backend: X86
CC: craig.top...@gmail.com, firstname.lastname@example.org,
Created attachment 21786
In an attempt to switch all our builds from SSSE3 to SSE4, we found that code
as simple as
const __m128i lo = _mm_cvtsi32_si128(d0[value]);
const __m128i hi = _mm_cvtsi32_si128(d0[value+1024]);
val = _mm_add_epi64(val, _mm_unpacklo_epi64(lo, hi));
or, equivalently,
const __m128i all = _mm_set_epi32(0, d0[value], 0, d0[value+1024]);
val = _mm_add_epi64(val, all);
when inlined into a loop, performs worse when compiled with -sse4.1 than with
just SSSE3.
The problem is that _mm_cvtsi32_si128() and _mm_set_epi32() are both modeled
via an insertelement:
%13 = insertelement <4 x i32> <i32 undef, i32 0, i32 undef, i32 0>, i32 %12, i32 0, !dbg !287
This is lowered to a single movd instruction prior to SSE4, but to xor+pinsrd
on SSE4.
* Note that in a standalone kernel function, the second case still produces a
couple of movd's, but when used in a loop it results in a pair of pinsrd
instructions loading from memory into the same register.
This seems to me like poor instruction selection from both a performance and a
code-size perspective.
I suggest steering instruction selection for this idiomatic case from
INSERT_VECTOR_ELT to SCALAR_TO_VECTOR. This will directly lead to a movd
being selected.
Proposed change to lib/Target/X86/X86ISelLowering.cpp is attached.
llvm-bugs mailing list