On 9/23/25 1:45 PM, Paul-Antoine Arras wrote:
I experimented with this patch, which makes it possible to remove a vfmv when a floating-point operand can instead be loaded directly from memory with a zero-stride vlse.
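
To make the pattern concrete, here is a minimal C sketch of the kind of code involved; the function and names are made up for illustration, not taken from the benchmarks:

    /* Hypothetical reproducer: the scalar multiplier lives in memory
       and gets broadcast across a vector register inside the
       vectorized loop.  */
    void
    scale (double *restrict y, const double *restrict x,
           const double *cp, long n)
    {
      double c = *cp;              /* scalar load from memory */
      for (long i = 0; i < n; i++)
        y[i] = x[i] * c;           /* c broadcast to a full vector */
    }

Roughly speaking, today the broadcast is emitted as fld fa0,0(aX) followed by vfmv.v.f vN,fa0; with the patch it becomes a single stride-0 load, vlse64.v vN,(aX),zero, which reads the same memory location into every element.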

In terms of benchmarks, I measured the following reductions in icount:
* 503.bwaves: -4.0%
* 538.imagick: -3.3%
* 549.fotonik3d: -0.34%

However, the icount for 507.cactuBSSN increased by 0.43%. In addition, measurements on the BPI board show that the patch actually increases execution times by 5 to 11%.

This may still be beneficial for some uarchs, but it would have to be tunable, wouldn't it?
Is it worth proceeding with this?
So I looked a bit deeper at the instruction mix data for bwaves; I was kind of hoping to see something odd happening that would explain the performance behavior, but no such luck.

If we were running into something weird, like a failure to hoist a memory reference out of a loop in the vlse64 version, we'd see significant discrepancies in how the icounts change.

In the original code we have approximately 14b fld instructions and 11b vfmv.v.f instructions. After the change we have roughly 11b vlse64 instructions, 3b fld instructions, and virtually no vfmv.v.f: each fld + vfmv.v.f pair has been replaced by a single vlse64, leaving about 3b standalone flds. It's almost an exact match for what one would expect.

All the meaningful changes happen in one qemu translation block:

ORIG:
  0x00000000000192c2   60,057,002,880 27.4641% mat_times_vec_
      mv                      t4,s4
      add                     a1,a5,s10
      add                     a6,a5,s11
      snez                    a4,a4
      ld                      a3,16(sp)
      mv                      t3,s3
      fld                     fa2,0(a1)
      add                     a1,a5,s7
      fld                     fa1,0(a6)
      neg                     a4,a4
      mv                      t1,s2
      mv                      a6,s0
      fld                     fa4,0(a1)
      vmv.v.x                 v0,a4
      mv                      a0,t2
      mv                      a1,t0
      mv                      a4,s5
      sh3add                  a7,a2,a5
      ld                      a2,8(sp)
      fld                     fa0,0(a7)
      vmsne.vi                v0,v0,0
      mv                      a7,s1
      vfmv.v.f                v15,ft0
      vfmv.v.f                v13,fa1
      sh3add                  a2,a2,a5
      add                     a5,a5,s6
      vfmv.v.f                v12,fa2
      fld                     fa3,0(a2)
      fld                     fa5,0(a5)
      vfmv.v.f                v10,fa4
      mv                      a2,s5
      vfmv.v.f                v14,fa0
      vfmv.v.f                v11,fa3
      vfmv.v.f                v9,fa5
      nop
      nop
      vsetvli                 a5,a3,e64,m1,ta,ma

NEW:
  0x00000000000192b8   56,810,678,400 27.0298% mat_times_vec_
      mv                      t4,s4
      mv                      t3,s3
      mv                      t1,s2
      add                     a3,a5,s10
      sh3add                  a4,s5,a5
      sh3add                  a1,s7,a5
      add                     a2,a5,s11
      vlse64.v                v9,(t6),zero
      mv                      a7,s1
      mv                      a6,s0
      mv                      a0,t2
      vlse64.v                v13,(a3),zero
      ld                      a3,24(sp)
      sd                      a3,8(sp)
      vlse64.v                v12,(a4),zero
      ld                      a4,16(sp)
      add                     a4,a4,a5
      add                     a5,a5,a3
      ld                      a3,32(sp)
      vlse64.v                v10,(a5),zero
      vlse64.v                v11,(a4),zero
      addi                    a4,t5,-1
      snez                    a4,a4
      neg                     a4,a4
      vmv.v.x                 v0,a4
      mv                      a4,s6
      vmsne.vi                v0,v0,0
      vlse64.v                v15,(a1),zero
      mv                      a1,t0
      vlse64.v                v14,(a2),zero
      mv                      a2,s6
      nop
      nop
      nop
      vsetvli                 a5,a3,e64,m1,ta,ma


This again looks exactly like what one would expect from this optimization. I haven't verified with 100% certainty, but I'm pretty sure the vectors in question are full 8 x 64-bit doubles, based on finding what I'm fairly sure is the vsetvl controlling these instructions.

I can only conclude that the optimization is behaving per design and that our uarch isn't handling this idiom performantly in the FP domain.

So what I would suggest is adding another tuning flag so that we can distinguish between the FP and integer cases and make this change conditional on the uarch asking for this behavior.

Given we haven't yet seen a design where this is profitable, I'd just make it false across the board for all the upstreamed uarchs, except at -Os where we likely want it on since it replaces a two-instruction fld + vfmv.v.f sequence with a single vlse64. Obviously it's disappointing, but I wouldn't want to lose the work, as I do think this performance quirk we're seeing will be fixed in future designs.
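
To sketch what that knob might look like, assuming it's modeled like the other per-uarch tuning knobs in the backend's tuning tables; the field and variable names below are hypothetical, not from any actual patch:

    /* Hypothetical sketch only; names are illustrative.  */
    struct riscv_tune_param
    {
      bool use_zero_stride_load;   /* prefer vlse64.v vN,(addr),zero
                                      over fld + vfmv.v.f broadcasts */
      /* ... the existing cost/tuning fields would remain ...  */
    };

    static const struct riscv_tune_param example_tune_info =
    {
      false,   /* use_zero_stride_load: off for upstreamed uarchs */
    };

The expander would then gate the transformation on something like the tuning field being set or optimize_size being in effect, so -Os still gets the code-size win.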

Other thoughts?

Jeff

