perf on fsck.erofs in gcc reports that z_erofs_load_compact_lcluster was spending 20% of its time doing the div instruction. While the function itself is ~40% of user runtime. In the source code, it seems that dividing by vcnt doesn't optimize to a shift despite the two possible states being powers of 2.
Changing the division into a ilog2() function call encourages the compiler to recognize it as a power of 2. Thus performing a shift. Running a benchmark on lzma compressed freebsd code on x86, shows there is a ~4% increase in performance in gcc. While clang shows virtually no regression in performance. The tradeoff is slightly obfuscated source code. The following command was run locally on x86. $ hyperfine -w 5 -m 30 "./fsck.erofs ./bsd.erofs.lzma" patch on gcc 15.2.1 Time (mean ± σ): 354.8 ms ± 6.0 ms \ [User: 227.8 ms, System: 126.1 ms] Range (min … max): 345.8 ms … 366.2 ms 30 runs dev on gcc 15.2.1 Time (mean ± σ): 370.7 ms ± 6.7 ms \ [User: 246.5 ms, System: 123.4 ms] Range (min … max): 362.7 ms … 390.7 ms 30 runs patch on clang 21.1.8 Time (mean ± σ): 371.9 ms ± 2.4 ms \ [User: 247.2 ms, System: 123.9 ms] Range (min … max): 369.1 ms … 380.0 ms 30 runs dev on clang 21.1.8 Time (mean ± σ): 371.0 ms ± 1.9 ms \ [User: 245.5 ms, System: 124.5 ms] Range (min … max): 368.4 ms … 377.7 ms 30 runs Signed-off-by: Ashley Lee <[email protected]> --- v2: changed vdiv to ilog2 call lib/zmap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/zmap.c b/lib/zmap.c index baec278..3ac7fe9 100644 --- a/lib/zmap.c +++ b/lib/zmap.c @@ -160,7 +160,7 @@ static int z_erofs_load_compact_lcluster(struct z_erofs_maprecorder *m, m->nextpackoff = round_down(pos, vcnt << amortizedshift) + (vcnt << amortizedshift); lobits = max(lclusterbits, ilog2(Z_EROFS_LI_D0_CBLKCNT) + 1U); - encodebits = ((vcnt << amortizedshift) - sizeof(__le32)) * 8 / vcnt; + encodebits = (((vcnt << amortizedshift) - sizeof(__le32)) * 8) >> ilog2(vcnt); bytes = pos & ((vcnt << amortizedshift) - 1); in -= bytes; i = bytes >> amortizedshift; -- 2.53.0
