https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809
--- Comment #10 from Qing Zhao <qing.zhao at oracle dot com> ---
>> From the data, we can see the inlined version of strcmp (by glibc) is much
>> slower than the direct call to strcmp. (this is for size 2)
>> I am using GCC farm machine gcc116:
>
> This result doesn't make sense - it looks like GCC is moving the strcmp call
> in the 2nd case as a loop invariant, so you're just measuring a loop with
> just a subtract and orr instruction…

Yes, Wilco is right here. -ftree-loop-im moves the call to strcmp out of the
loop.

To avoid this issue, I changed the options to -O -fno-tree-loop-im and
checked the assembly of the routine “cmp2” for the inlined and non-inlined
versions.

Inlined version:

cmp2:
        mov     x4, x0
        mov     w2, 51712
        movk    w2, 0x3b9a, lsl 16
        mov     w0, 0
        mov     w3, 102
        b       .L3
.L2:
        neg     w1, w1
        orr     w0, w0, w1
        subs    w2, w2, #1
        beq     .L5
.L3:
        ldrb    w1, [x4]
        subs    w1, w3, w1
        bne     .L2
        ldrb    w1, [x4, 1]
        neg     w1, w1
        b       .L2
.L5:
        ret

Non-inlined version:

cmp2:
        stp     x29, x30, [sp, -48]!
        add     x29, sp, 0
        stp     x19, x20, [sp, 16]
        stp     x21, x22, [sp, 32]
        mov     x22, x0
        mov     w19, 51712
        movk    w19, 0x3b9a, lsl 16
        mov     w20, 0
        adrp    x21, .LC0
        add     x21, x21, :lo12:.LC0
.L2:
        mov     x1, x21
        mov     x0, x22
        bl      strcmp
        orr     w20, w20, w0
        subs    w19, w19, #1
        bne     .L2
        mov     w0, w20
        ldp     x19, x20, [sp, 16]
        ldp     x21, x22, [sp, 32]
        ldp     x29, x30, [sp], 48
        ret

Then, the run-time performance data is:

qinzhao@gcc116:~/Bugs/78809/const_cmp/perf$ sh t_p
/home/qinzhao/Install/latest/bin/gcc -O -fno-tree-loop-im t_p_1.c t_p.c -DINLINED
inlined version
34.73user 0.00system 0:34.73elapsed 99%CPU (0avgtext+0avgdata 360maxresident)k
0inputs+0outputs (0major+135minor)pagefaults 0swaps
/home/qinzhao/Install/latest/bin/gcc -O -fno-tree-loop-im t_p_1.c t_p.c
non-inlined version
138.79user 0.00system 2:18.77elapsed 100%CPU (0avgtext+0avgdata 356maxresident)k
0inputs+0outputs (0major+135minor)pagefaults 0swaps

Yes, it looks like the inlined version is much faster than the non-inlined
version on the aarch64 platform (34.73s versus 138.79s of user time, about 4x).
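
For readers who want to relate the two assembly listings back to C, here is a rough sketch of what each loop appears to compute. The source of t_p.c is not shown in this comment, so this is a reconstruction from the assembly only: the constant 102 is 'f', which suggests the call is strcmp(s, "f"), and 51712 + (0x3b9a << 16) is 1,000,000,000 iterations. The names strcmp_f_expanded, cmp_loop_call and cmp_loop_expanded are hypothetical, and the iteration count is made a parameter here (the original cmp2 hard-codes it) so the sketch can be tried with small counts.

```c
#include <string.h>

/* Hand-written equivalent of the inlined loop body for strcmp(s, "f"):
   compare the first byte against 'f' (102) and, only if it matches,
   compare the second byte against the terminating NUL.
   Hypothetical helper, reconstructed from the assembly. */
static int strcmp_f_expanded(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    int diff = p[0] - 'f';
    if (diff == 0)
        diff = p[1] - '\0';
    return diff;
}

/* Hypothetical stand-in for the non-inlined cmp2: the library call
   stays inside the loop and its results are OR-ed together.
   Without -fno-tree-loop-im, GCC may hoist the call out of the loop. */
int cmp_loop_call(const char *s, long iterations)
{
    int result = 0;
    for (long i = 0; i < iterations; i++)
        result |= strcmp(s, "f");
    return result;
}

/* Hypothetical stand-in for the inlined cmp2: same loop, but with the
   two-byte comparison expanded in place of the call. */
int cmp_loop_expanded(const char *s, long iterations)
{
    int result = 0;
    for (long i = 0; i < iterations; i++)
        result |= strcmp_f_expanded(s);
    return result;
}
```

Calling either loop with iterations set to 1000000000 would match the count seen in the listings. To keep the comparison meaningful, both variants would need to be built with the same options as above (-O -fno-tree-loop-im), so that neither loop body is moved out as a loop invariant.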