On Wed, 2024-03-27 at 18:39 +0800, Xi Ruoyao wrote: > On Wed, 2024-03-27 at 10:38 +0800, chenglulu wrote: > > > > 在 2024/3/26 下午5:48, Xi Ruoyao 写道: > > > The latency of LA464 and LA664 division instructions depends on the > > > input. When I updated the costs in r14-6642, I unintentionally set the > > > division costs to the best-case latency (when the first operand is 0). > > > Per a recent discussion [1] we should use "something sensible" instead > > > of it. > > > > > > Use the average of the minimum and maximum latency observed instead. > > > This enables multiplication to reciprocal sequence reduction and speeds > > > up the following test case for about 30%: > > > > > > int > > > main (void) > > > { > > > unsigned long stat = 0xdeadbeef; > > > for (int i = 0; i < 100000000; i++) > > > stat = (stat * stat + stat * 114514 + 1919810) % 1000000007; > > > asm(""::"r"(stat)); > > > } > > > > > > [1]: https://gcc.gnu.org/pipermail/gcc-patches/2024-March/648348.html > > > > The test case div-const-reduction.c is modified to assemble the instruction > > sequence as follows: > > lu12i.w $r12,999997440>>12 # 0x3b9ac000 > > ori $r12,$r12,2567 > > mod.w $r13,$r13,$r12 > > > > This sequence of instructions takes 5 clock cycles.
It actually may take 5 to 8 cycles depending on the input. And multiplication is fully pipelined while division is not, so the reciprocal sequence should still produce a better throughput. > Hmm indeed, it seems a waste to do this reduction for int / 1000000007. > I'll try to make a better heuristic as Richard suggests... Oops, it seems impossible (w/o refactoring the generic code). See my reply to Richi :(. Can you also try benchmarking with the costs of SI and DI division increased to (10, 10) instead of (14, 22) - allowing more CSE but not reciprocal sequence reduction, and (10, 22) - only allowing reduction for DI but not SI? -- Xi Ruoyao <xry...@xry111.site> School of Aerospace Science and Technology, Xidian University