https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109944
Bug ID: 109944
Summary: vector CTOR with byte elements and SSE2 has STLF fail
Product: gcc
Version: 13.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: rguenth at gcc dot gnu.org
Target Milestone: ---

I've experimented with CTORs from smaller elements, and byte handling with
plain SSE2 is quite bad (for word elements we have pinsrw).

void foo(char *a, char *m, char d, char e)
{
  char b = *m;
  char c = m[2];
  a[0] = b;
  a[1] = c;
  a[2] = d;
  a[3] = e;
  a[4] = b;
  a[5] = c;
  a[6] = b;
  a[7] = c;
  a[8] = b;
  a[9] = c;
  a[10] = b;
  a[11] = c;
  a[12] = b;
  a[13] = c;
  a[14] = b;
  a[15] = c;
}

generates

        movzbl  2(%rsi), %r8d
        movl    %edx, %r9d
        movzbl  (%rsi), %edx
        movzbl  %cl, %ecx
        movzbl  %r9b, %r9d
        movq    %r8, %rax
        salq    $8, %rax
        orq     %rdx, %rax
        salq    $8, %rax
        orq     %r8, %rax
        salq    $8, %rax
        orq     %rdx, %rax
        salq    $8, %rax
        orq     %rax, %rcx
        orq     %r8, %rax
        salq    $8, %rcx
        salq    $8, %rax
        orq     %r9, %rcx
        orq     %rdx, %rax
        salq    $8, %rcx
        salq    $8, %rax
        orq     %r8, %rcx
        orq     %r8, %rax
        salq    $8, %rcx
        salq    $8, %rax
        orq     %rdx, %rcx
        orq     %rdx, %rax
        movq    %rcx, -24(%rsp)
        movq    %rax, -16(%rsp)
        movdqa  -24(%rsp), %xmm0
        movups  %xmm0, (%rdi)

Note the two 8-byte stores to the stack followed by a 16-byte load of the
same slot, which is the store-to-load forwarding (STLF) failure from the
summary. In contrast, we can handle a splat from QImode via

        movzbl  (%rsi), %eax
        movd    %eax, %xmm0
        punpcklbw       %xmm0, %xmm0
        punpcklwd       %xmm0, %xmm0
        pshufd  $0, %xmm0, %xmm0
        movups  %xmm0, (%rdi)

I think that for a generic V16QImode CTOR with SSE2 we can create two
V8HImode vectors using pinsrw, the first from zero-extended QImode values
of the even elements and the second from zero-extended and left-shifted
values of the odd elements, and then IOR the two vectors. Alternatively,
the sequence above needs to be penalized more heavily in the cost model.

Btw, for HImode elements I see we do

        movzwl  (%rsi), %eax
        movd    %eax, %xmm0
        movdqa  %xmm0, %xmm1
        movdqa  %xmm0, %xmm2
        pinsrw  $1, 4(%rsi), %xmm1
        ...

Not sure why we don't do

        pxor    %xmm1, %xmm1
        pinsrw  $0, (%rsi), %xmm1

and thus avoid the round-trip through the GPR for the initial element?
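
For illustration only, the proposed two-vector pinsrw scheme can be sketched at the source level with SSE2 intrinsics. This is not what GCC emits and not a patch for the expander. The helper name build_v16qi_sse2 is hypothetical, and the sketch assumes an x86 target with SSE2 enabled (the default on x86-64) and little-endian lane order, so that word lane i holds bytes 2*i (low half) and 2*i+1 (high half).

```c
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_insert_epi16, _mm_or_si128 */
#include <stdint.h>

/* Hypothetical helper (not part of GCC): build a 16-byte vector from 16
   arbitrary bytes without going through the stack.  Even-indexed bytes
   are inserted zero-extended into one V8HI-style vector, odd-indexed
   bytes are inserted shifted left by 8 into a second one, and the two
   vectors are then ORed together.  */
static __m128i build_v16qi_sse2(const uint8_t b[16])
{
    __m128i even = _mm_setzero_si128();  /* pxor */
    __m128i odd  = _mm_setzero_si128();  /* pxor */

    /* The lane index of _mm_insert_epi16 must be a compile-time
       constant, so the eight lanes are unrolled with a macro.  */
#define INSERT_PAIR(i)                                               \
    do {                                                             \
        even = _mm_insert_epi16(even, b[2 * (i)], (i));              \
        odd  = _mm_insert_epi16(odd, (b[2 * (i) + 1] << 8), (i));    \
    } while (0)

    INSERT_PAIR(0);
    INSERT_PAIR(1);
    INSERT_PAIR(2);
    INSERT_PAIR(3);
    INSERT_PAIR(4);
    INSERT_PAIR(5);
    INSERT_PAIR(6);
    INSERT_PAIR(7);
#undef INSERT_PAIR

    /* Low bytes come from 'even', high bytes from 'odd'.  */
    return _mm_or_si128(even, odd);
}
```

Storing the result with _mm_storeu_si128 should reproduce the 16 input bytes in order. Whether the 16 pinsrw plus one por actually beat the GPR shift/or sequence depends on the microarchitecture, which is the cost-model question raised above.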