On Wed, Jan 21, 2026 at 8:44 AM Feng Jiang <[email protected]> wrote:
> On 2026/1/20 15:36, Andy Shevchenko wrote:
> > On Tue, Jan 20, 2026 at 02:58:44PM +0800, Feng Jiang wrote:

...

> >> Performance Summary (Improvement %):
> >> ---------------------------------------------------------------
> >> Function  |  16 B (Short) |  512 B (Mid) |  4096 B (Long)
> >> ---------------------------------------------------------------
> >> strnlen   |    +64.0%     |   +346.2%    |    +410.7%
> >
> > This is still suspicious.
>
> Regarding the +410% gain, it becomes clearer when looking at the inner loop
> of the Zbb implementation
> (https://docs.riscv.org/reference/isa/unpriv/b-st-ext.html#zbb).
> For a 64-bit system, the core loop uses only 5 instructions to process
> 8 bytes. Note that t3 is pre-loaded with -1 (0xFFFFFFFFFFFFFFFF):
>
> 1:
>     REG_L   t1, SZREG(t0)      /* Load 8 bytes */
>     addi    t0, t0, SZREG      /* Advance pointer */
>     orc.b   t1, t1             /* OR-Combine: 0x00 stays 0x00, others 0xFF */
>     bgeu    t0, t4, 4f         /* Boundary check (max_len) */
>     beq     t1, t3, 1b         /* t1 == 0xFFFFFFFFFFFFFFFF (no NUL): loop */
>
> In contrast, the generic C implementation compares byte by byte, which costs
> a load and a branch for every single byte processed. The Zbb approach is
> much leaner: the orc.b instruction collapses the NUL check for an entire
> 8-byte word into a single step. Shifting from a byte-oriented loop to this
> hardware-accelerated word-at-a-time logic drastically reduces the
> instruction count and branch overhead, which explains the massive jump in
> TCG throughput for long strings.
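>
> For reference, the generic C strnlen() boils down to a byte-wise loop along
> these lines (a simplified sketch, not the exact lib/string.c source):
>
>     size_t strnlen(const char *s, size_t count)
>     {
>             const char *sc;
>
>             /* One load, one NUL compare and one count check per byte. */
>             for (sc = s; count-- && *sc != '\0'; sc++)
>                     /* nothing */;
>             return sc - s;
>     }
>
> Every byte therefore costs several instructions, versus roughly five
> instructions per eight bytes in the Zbb loop above.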
>
> Beyond the main loop, the Zbb implementation also utilizes ctz (Count
> Trailing Zeros) to handle the tail and alignment. Once orc.b identifies a
> NUL byte within a register, we can precisely locate its position in just
> three instructions:
>
>     not     t1, t1             /* Flip: NUL bytes -> 0xFF, others -> 0x00 */
>     ctz     t1, t1             /* Count bits before the first NUL byte */
>     srli    t1, t1, 3          /* Divide by 8 to get the byte offset */
>
> In a generic C implementation, calculating this byte offset typically
> requires a series of shifts, masks, or a sub-loop, which adds significant
> overhead. By combining orc.b and ctz, we eliminate all branching and lookup
> tables for the tail-end calculation, further contributing to the performance
> gains observed in the benchmarks.
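>
> For comparison, a portable C version of that tail calculation typically
> looks something like the sketch below (the classic zero-byte mask plus a
> compiler builtin, assuming a 64-bit little-endian machine; this is only an
> illustration, not code from this series):
>
>     /* 'word' is known to contain at least one NUL byte. */
>     static inline unsigned int first_nul_offset(unsigned long word)
>     {
>             /* Set bit 7 of every byte of 'word' that is 0x00. */
>             unsigned long mask = (word - 0x0101010101010101UL) &
>                                  ~word & 0x8080808080808080UL;
>
>             /* Bit index of the lowest marker bit, converted to bytes. */
>             return __builtin_ctzl(mask) >> 3;
>     }
>
> This is still branch-free, but the subtract/AND/AND masking is roughly the
> work that orc.b (plus the not) performs in a single instruction.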
>
> >> strchr    |    +4.0%      |   +6.4%      |    +1.5%
> >> strrchr   |    +6.6%      |   +2.8%      |    +0.0%
>
> As for strchr() and strrchr(), the relatively modest improvements are
> because the current versions in this series are implemented using simple
> byte-by-byte assembly. These primarily gain performance by reducing function
> call overhead and eliminating stack frame management compared to the generic
> C version.
>
> Unlike strnlen(), they do not yet utilize Zbb extensions. I plan to
> introduce Zbb-optimized versions for these functions in a future patch set,
> which I expect will bring performance gains similar to what we now see with
> strnlen().

Thanks for the details on the native assembly implementation.

> >> word-at-a-time logic, showing significant gains as the string length
> >> increases.
> >
> > Hmm... Have you tried to optimise the generic implementation to use
> > word-at-a-time logic and compare?
>
> Regarding the generic implementation, even if we were to optimize the C code
> to use word-at-a-time logic (the has_zero() style bit manipulation), it
> still wouldn't match the Zbb version's efficiency.
>
> The traditional C-based word-level detection requires a sequence of
> arithmetic operations to identify NUL bytes. In contrast, the RISC-V orc.b
> instruction collapses this entire check into a single instruction. I've
> focused on this architectural approach to fully leverage these specific Zbb
> features, which provide a level of instruction density that generic C
> arithmetic cannot achieve.

I understand that. My point is that if we move the generic implementation to
the word-at-a-time technique, the difference should not be 4x, right?
Perhaps 1.5x or so. I believe this would be a very useful exercise.
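
To make the suggestion concrete, I mean something along the lines of the
rough, untested sketch below (names are for illustration only; it open-codes
the zero-byte trick for a 64-bit machine instead of reusing the has_zero()
helpers you mention, and ignores strict aliasing since the kernel builds with
-fno-strict-aliasing):

    #include <stddef.h>    /* size_t, for a userspace test build */

    static inline int word_has_nul(unsigned long w)
    {
            /* Non-zero iff some byte of w is 0x00. */
            return ((w - 0x0101010101010101UL) &
                    ~w & 0x8080808080808080UL) != 0;
    }

    size_t strnlen_wordwise(const char *s, size_t count)
    {
            const char *p = s;

            /* Head: walk bytes until the pointer is word-aligned. */
            while (count && ((unsigned long)p & (sizeof(long) - 1))) {
                    if (!*p)
                            return p - s;
                    p++;
                    count--;
            }

            /* Body: one load and one zero-byte test per word. */
            while (count >= sizeof(long)) {
                    unsigned long w = *(const unsigned long *)p;

                    if (word_has_nul(w))
                            break;
                    p += sizeof(long);
                    count -= sizeof(long);
            }

            /* Tail: finish byte by byte (covers the word with the NUL). */
            while (count && *p) {
                    p++;
                    count--;
            }
            return p - s;
    }

That keeps the hot loop at one load plus a handful of ALU ops per word, which
I would expect to close a good part of the gap to the Zbb version.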

-- 
With Best Regards,
Andy Shevchenko
