On Wed, Jan 21, 2026 at 8:44 AM Feng Jiang <[email protected]> wrote:
> On 2026/1/20 15:36, Andy Shevchenko wrote:
> > On Tue, Jan 20, 2026 at 02:58:44PM +0800, Feng Jiang wrote:
...

> >> Performance Summary (Improvement %):
> >> ---------------------------------------------------------------
> >> Function   | 16 B (Short) | 512 B (Mid) | 4096 B (Long)
> >> ---------------------------------------------------------------
> >> strnlen    |       +64.0% |     +346.2% |       +410.7%
> >
> > This is still suspicious.
>
> Regarding the +410% gain, it becomes clearer when looking at the inner
> loop of the Zbb implementation
> (https://docs.riscv.org/reference/isa/unpriv/b-st-ext.html#zbb).
> For a 64-bit system, the core loop uses only 5 instructions to process
> 8 bytes. Note that t3 is pre-loaded with -1 (0xFFFFFFFFFFFFFFFF):
>
> 1:
>         REG_L  t1, SZREG(t0)  /* Load 8 bytes */
>         addi   t0, t0, SZREG  /* Advance pointer */
>         orc.b  t1, t1         /* Bitwise OR-Combine: 0x00 becomes 0x00, others 0xFF */
>         bgeu   t0, t4, 4f     /* Boundary check (max_len) */
>         beq    t1, t3, 1b     /* If t1 == 0xFFFFFFFFFFFFFFFF (no NUL), loop */
>
> In contrast, the generic C implementation performs byte-by-byte
> comparisons, which involve significantly more loads and branches for
> every single byte processed. The Zbb approach is much leaner: the
> orc.b instruction collapses the NUL check for an entire 8-byte word
> into a single step. By shifting from a byte-oriented loop to this
> hardware-accelerated word-at-a-time logic, we drastically reduce the
> instruction count and branch overhead, which explains the massive jump
> in TCG throughput for long strings.
>
> Beyond the main loop, the Zbb implementation also utilizes ctz (Count
> Trailing Zeros) to handle the tail and alignment. Once orc.b
> identifies a NUL byte within a register, we can precisely locate its
> position in just three instructions:
>
>         not   t1, t1      /* Flip bits: NUL byte (0x00) becomes 0xFF */
>         ctz   t1, t1      /* Count bits before the first NUL byte */
>         srli  t1, t1, 3   /* Divide by 8 to get byte offset */
>
> In a generic C implementation, calculating this byte offset typically
> requires a series of shifts, masks, or a sub-loop, which adds
> significant overhead. By combining orc.b and ctz, we eliminate all
> branching and lookup tables for the tail-end calculation, further
> contributing to the performance gains observed in the benchmarks.
>
> >> strchr     |        +4.0% |       +6.4% |         +1.5%
> >> strrchr    |        +6.6% |       +2.8% |         +0.0%
>
> As for strchr() and strrchr(), the relatively modest improvements are
> because the current versions in this series are implemented using
> simple byte-by-byte assembly. These primarily gain performance by
> reducing function call overhead and eliminating stack frame management
> compared to the generic C version.
>
> Unlike strnlen(), they do not yet utilize Zbb extensions. I plan to
> introduce Zbb-optimized versions for these functions in a future patch
> set, which I expect will bring performance gains similar to what we
> now see with strnlen().

Thanks for the details regarding the native assembly implementation.

> >> word-at-a-time logic, showing significant gains as the string
> >> length increases.
> >
> > Hmm... Have you tried to optimise the generic implementation to use
> > word-at-a-time logic and compare?
>
> Regarding the generic implementation, even if we were to optimize the
> C code to use word-at-a-time logic (the has_zero()-style bit
> manipulation), it still wouldn't match the Zbb version's efficiency.
>
> The traditional C-based word-level detection requires a sequence of
> arithmetic operations to identify NUL bytes.
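> For reference, the usual portable check looks roughly like the sketch
> below (a simplified illustration of the has_zero()-style trick, with
> made-up names rather than the exact generic kernel helpers):
>
>         #include <stdint.h>
>
>         #define WORD_ONES  0x0101010101010101ULL
>         #define WORD_HIGHS 0x8080808080808080ULL
>
>         /* Non-zero iff the 64-bit word contains at least one 0x00 byte. */
>         static inline uint64_t word_has_zero(uint64_t v)
>         {
>                 return (v - WORD_ONES) & ~v & WORD_HIGHS;
>         }
>
>         /*
>          * Byte offset of the first NUL in a little-endian word: the
>          * lowest set bit of the mask belongs to the first zero byte.
>          */
>         static inline unsigned int word_first_zero(uint64_t mask)
>         {
>                 return __builtin_ctzll(mask) >> 3;
>         }
>
> Even in this form, every 8-byte word costs a subtract, a NOT and two
> ANDs plus a branch just to detect the NUL, and then the ctz-based math
> for the final offset.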
> In contrast, the RISC-V orc.b instruction collapses this entire check
> into a single instruction. I've focused on this architectural approach
> to fully leverage these specific Zbb features, which provides a level
> of instruction density that generic C math cannot achieve.

I understand that. My point is that if we move the generic
implementation to the word-at-a-time technique, the difference should
not be 4x, right? Perhaps 1.5x or so. I believe this will be a very
useful exercise.

--
With Best Regards,
Andy Shevchenko
