Hi all,
in the course of doing some benchmarks on arm with -Os, I noticed that
some list traversion code became significantly slower since gcc 5.3 when
instruction caches are cold.
Reproducer with relevant defines copied from the Linux kernel:
struct list_head {
struct list_head *next, *prev;
};
#define list_for_each(pos, head) \
for (pos = (head)->next; pos != (head); pos = pos->next)
void g(struct list_head *l);
void h(void);
void f(struct list_head *head)
{
struct list_head *l;
list_for_each(l, head) {
g(l);
}
h();
}
This behaviour can be tracked down to
r228318 ("bb-reorder: Add -freorder-blocks-algorithm= and wire it up")
Since this commit up to and including svn trunk, the output of the
above snippet compiled with -Os looks like
00000000 <f>:
0: e92d4070 push {r4, r5, r6, lr}
4: e1a05000 mov r5, r0
8: e5904000 ldr r4, [r0]
c: e1540005 cmp r4, r5
10: 1a000001 bne 1c <f+0x1c>
14: e8bd4070 pop {r4, r5, r6, lr}
18: eafffffe b 0 <h>
1c: e1a00004 mov r0, r4
20: ebfffffe bl 0 <g>
24: e5944000 ldr r4, [r4]
28: eafffff7 b c <f+0xc>
Observe how the bb corresponding to the loop's body, i.e. the one
invoking g(), gets jumped to by a conditional *forward* jump which is
bad for static branch prediction and thus, prefetching.
Compare this to gcc 4.9's output with -Os or alternatively, trunk's
output with -Os -freorder-blocks-algorithm=stc:
00000000 <f>:
0: e92d4070 push {r4, r5, r6, lr}
4: e1a05000 mov r5, r0
8: e5904000 ldr r4, [r0]
c: e1540005 cmp r4, r5
10: 0a000003 beq 24 <f+0x24>
14: e1a00004 mov r0, r4
18: ebfffffe bl 0 <g>
1c: e5944000 ldr r4, [r4]
20: eafffff9 b c <f+0xc>
24: e8bd4070 pop {r4, r5, r6, lr}
28: eafffffe b 0 <h>
Here, the loop body's bb is the fallthrough of the loop's conditional
branch which is good.
Measurements on a memory stressed system, i.e. one where the instruction
caches can be considered cold at the loop's first iteration, do indeed
show runtime differences of ~0.5us on average.
I believe that this drop in performance is due to the code of g() not
getting prefetched properly.
That being said, I could certainly go and submit a patch to the Linux
kernel setting -freorder-blocks-algorithm=stc for the -Os case.
Before I do this, I'd be glad to hear your opinions about this though as
I can't really quantify the implications.
>From the discussion on gcc-patches [1] of what is now the aforementioned
r228318 ("bb-reorder: Add -freorder-blocks-algorithm= and wire it up"),
it is not clear to me whether this change can actually reduce code size
beyond those 0.1% given there for -Os.
A sloppy test with an, admittedly heavily stripped down, Linux kernel
gives me exactly these 0.1%, too.
So first question:
Do you guys know of any code where there are more significant code size
savings achieved?
And second question:
If that isn't the case, would it possibly make sense to partly revert
gcc's behaviour and set -freorder-blocks-algorithm=stc at -Os?
Thank you very much!
Nicolai
[1] https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01869.html