Hi Lee,

A quick search shows that x86 kernel implementation also use `div` instruction

under Linux v6.19 and GCC 15.2.1, add GCC correctly generate shift instruction

in my arm64 machine with GCC 14.2.0.

Could you also consider evaluate this optimization in kernel?


Thanks,

Yifan

On 2/26/2026 1:30 AM, Ashley Lee wrote:
perf on fsck.erofs in gcc reports that z_erofs_load_compact_lcluster
was spending 20% of its time doing the div instruction. While the
function itself is ~40% of user runtime. In the source code, it seems
that dividing by vcnt doesn't optimize to a shift despite the two
possible states being powers of 2.

Changing the division into a ilog2() function call encourages the
compiler to recognize it as a power of 2. Thus performing a shift.

Running a benchmark on lzma compressed freebsd code on x86, shows
there is a ~4% increase in performance in gcc. While clang shows
virtually no regression in performance. The tradeoff is slightly
obfuscated source code.

The following command was run locally on x86.

$ hyperfine -w 5 -m 30 "./fsck.erofs ./bsd.erofs.lzma"

patch on gcc 15.2.1
Time (mean ± σ):     354.8 ms ±   6.0 ms    \
   [User: 227.8 ms, System: 126.1 ms]
Range (min … max):   345.8 ms … 366.2 ms    30 runs

dev on gcc 15.2.1
Time (mean ± σ):     370.7 ms ±   6.7 ms    \
   [User: 246.5 ms, System: 123.4 ms]
Range (min … max):   362.7 ms … 390.7 ms    30 runs

patch on clang 21.1.8
Time (mean ± σ):     371.9 ms ±   2.4 ms    \
   [User: 247.2 ms, System: 123.9 ms]
Range (min … max):   369.1 ms … 380.0 ms    30 runs

dev on clang 21.1.8
Time (mean ± σ):     371.0 ms ±   1.9 ms    \
   [User: 245.5 ms, System: 124.5 ms]
Range (min … max):   368.4 ms … 377.7 ms    30 runs

Signed-off-by: Ashley Lee <[email protected]>
---
v2: changed vdiv to ilog2 call

  lib/zmap.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/zmap.c b/lib/zmap.c
index baec278..3ac7fe9 100644
--- a/lib/zmap.c
+++ b/lib/zmap.c
@@ -160,7 +160,7 @@ static int z_erofs_load_compact_lcluster(struct 
z_erofs_maprecorder *m,
        m->nextpackoff = round_down(pos, vcnt << amortizedshift) +
                         (vcnt << amortizedshift);
        lobits = max(lclusterbits, ilog2(Z_EROFS_LI_D0_CBLKCNT) + 1U);
-       encodebits = ((vcnt << amortizedshift) - sizeof(__le32)) * 8 / vcnt;
+       encodebits = (((vcnt << amortizedshift) - sizeof(__le32)) * 8) >> 
ilog2(vcnt);
        bytes = pos & ((vcnt << amortizedshift) - 1);
        in -= bytes;
        i = bytes >> amortizedshift;


Reply via email to