On Mon, 2025-11-24 at 15:27 +0800, Li Wei wrote:
> Currently, the shuffle in which LoongArch selects two vectors at
> corresponding positions is implemented through the [x]vshuf instruction,
> but this will introduce additional index copies. In this case, the
> [x]vbitsel.v instruction can be used for optimization.
We can use the [x]vshuf.b instruction instead. Like [x]vbitsel.v, it
does not require the output register to be the same as the input
register holding the selector. And compared to [x]vbitsel.v, it works
for any constant selector, not only for selectors that pick every
element from the corresponding position of one of the two inputs.
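To make "only some selectors" concrete, here is a rough C sketch of the
condition I have in mind. The helper name selector_fits_bitsel is made up
for illustration and is not taken from any patch. It assumes each selector
entry is an element index in [0, 2 * nelt) into the concatenation of the
two inputs.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only.
   Return true if a constant two-input shuffle selector takes every
   result element I from position I of one of the two inputs, i.e. the
   shuffle is really a per-element blend that a bitwise select can do.
   Assumes every sel[i] is in [0, 2 * nelt).  */
static bool
selector_fits_bitsel (const unsigned *sel, size_t nelt)
{
  for (size_t i = 0; i < nelt; i++)
    if (sel[i] % nelt != i)
      return false;
  return true;
}
```

With nelt = 4, a selector such as { 0, 5, 2, 7 } passes this check, while
{ 1, 1, 4, 5 } does not.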
For example
la.local $t0, .LC0
vori.b $vr2, $vr0, 0
vld $vr0, $t0, 0
vshuf.w $vr0, $vr2, $vr1
.LC0:
.word 1
.word 1
.word 4
.word 5
can be
la.local $t0, .LC0
vld $vr2, $t0, 0
vshuf.b $vr0, $vr0, $vr1, $vr2
.LC0:
.byte 4; .byte 5; .byte 6; .byte 7;
.byte 4; .byte 5; .byte 6; .byte 7;
.byte 16; .byte 17; .byte 18; .byte 19;
.byte 20; .byte 21; .byte 22; .byte 23;
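instead. But vbitsel.v cannot handle this case, because the selected
elements are not at the corresponding positions (for instance, result
element 0 comes from element 1 of an input).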
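The byte selector in the second .LC0 is just the word selector expanded
element by element. A minimal C sketch of that expansion follows. The
helper name expand_selector_to_bytes is hypothetical, it only covers the
128-bit vshuf.b case, and it assumes the byte indices number the same
two-vector concatenation as the element indices do, as the example above
implies.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only.
   Expand an element-wise selector (NELT entries, each element being
   ELT_SIZE bytes wide) into a byte-wise selector.  Element index S
   becomes the byte indices S * ELT_SIZE ... S * ELT_SIZE + ELT_SIZE - 1.
   OUT must have room for NELT * ELT_SIZE bytes.  */
static void
expand_selector_to_bytes (const unsigned *sel, size_t nelt,
                          size_t elt_size, uint8_t *out)
{
  for (size_t i = 0; i < nelt; i++)
    for (size_t j = 0; j < elt_size; j++)
      out[i * elt_size + j] = (uint8_t) (sel[i] * elt_size + j);
}
```

For the word selector { 1, 1, 4, 5 } with elt_size = 4 this produces the
bytes 4..7, 4..7, 16..19, 20..23, which is the table above.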
Locally I have a large (3000-line) patch containing various changes to
vector shuffling with constant selectors, including the lowering to
[x]vshuf.b. But I didn't have time to test it thoroughly and split it
into a reviewable patch series during GCC 16 stage 1.
IMO it should be deferred to GCC 17. However, if you think it's OK for
GCC 16 (even though we are in stage 3 now), I can extract the part that
lowers to [x]vshuf.b and send it.
--
Xi Ruoyao <[email protected]>