On Thu, Feb 4, 2016 at 2:56 AM, Ingo Molnar <mi...@kernel.org> wrote: > > * Ingo Molnar <mi...@kernel.org> wrote: > >> s/!CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS >> >> > + >> > + /* Check length */ >> > +10: cmpl $8, %esi >> > + jg 30f >> > + jl 20f >> > + >> > + /* Exactly 8 bytes length */ >> > + addl (%rdi), %eax >> > + adcl 4(%rdi), %eax >> > + RETURN >> > + >> > + /* Less than 8 bytes length */ >> > +20: clc >> > + jmpq *branch_tbl_len(, %rsi, 8) >> > + >> > + /* Greater than 8 bytes length. Determine number of quads (n). Sum >> > + * over first n % 16 quads >> > + */ >> > +30: movl %esi, %ecx >> > + shrl $3, %ecx >> > + andl $0xf, %ecx >> > + negq %rcx >> > + lea 40f(, %rcx, 4), %r11 >> > + clc >> > + jmp *%r11 >> >> Are you absolutely sure that a jump table is the proper solution here? It >> defeats branch prediction on most x86 uarchs. Why not label the loop stages >> and >> jump in directly with a binary tree of branches? > > So just to expand on this a bit, attached below is a quick & simple & stupid > testcase that generates a 16 entries call table. (Indirect jumps and indirect > calls are predicted similarly on most x86 uarchs.) Just built it with: > > gcc -Wall -O2 -o jump-table jump-table.c > > Even on relatively modern x86 uarchs (I ran this on a post Nehalem IVB Intel > CPU > and also on AMD Opteron. The numbers are from the Intel box.) this gives a > high > branch miss rate with a 16 entries jump table: > > triton:~> taskset 1 perf stat --repeat 10 ./jump-table 16 > ... using 16 jump table entries. > ... using 100000000 loop iterations. > ... result: 10000000100000000 > [...] > > Performance counter stats for './jump-table 16' (10 runs): > > 1386.131780 task-clock (msec) # 1.001 CPUs utilized > ( +- 0.18% ) > 33 context-switches # 0.024 K/sec > ( +- 1.71% ) > 0 cpu-migrations # 0.000 K/sec > 52 page-faults # 0.037 K/sec > ( +- 0.71% ) > 6,247,215,683 cycles # 4.507 GHz > ( +- 0.18% ) > 3,895,337,877 stalled-cycles-frontend # 62.35% frontend cycles > idle ( +- 0.30% ) > 1,404,014,996 instructions # 0.22 insns per cycle > # 2.77 stalled cycles > per insn ( +- 0.02% ) > 300,820,988 branches # 217.022 M/sec > ( +- 0.02% ) > 87,518,741 branch-misses # 29.09% of all branches > ( +- 0.01% ) > > 1.385240076 seconds time elapsed > ( +- 0.21% ) > > ... as you can see the branch miss rate is very significant, causing a stalled > decoder and very low instruction throughput. > > I have to reduce the jump table to a single entry (!) to get good performance > on > this CPU: > > Performance counter stats for './jump-table 1' (10 runs): > > 739.173505 task-clock (msec) # 1.001 CPUs utilized > ( +- 0.26% ) > 37 context-switches # 0.051 K/sec > ( +- 16.79% ) > 0 cpu-migrations # 0.000 K/sec > 52 page-faults # 0.070 K/sec > ( +- 0.41% ) > 3,331,328,405 cycles # 4.507 GHz > ( +- 0.26% ) > 2,012,973,596 stalled-cycles-frontend # 60.43% frontend cycles > idle ( +- 0.47% ) > 1,403,880,792 instructions # 0.42 insns per cycle > # 1.43 stalled cycles > per insn ( +- 0.05% ) > 300,817,064 branches # 406.964 M/sec > ( +- 0.05% ) > 12,177 branch-misses # 0.00% of all branches > ( +- 12.39% ) > > 0.738616356 seconds time elapsed > ( +- 0.26% ) > > Note how the runtime got halved: that is because stalls got halved and the > instructions per cycle throughput doubled. > > Even a two entries jump table performs poorly: > > Performance counter stats for './jump-table 2' (10 runs): > > 1493.790686 task-clock (msec) # 1.001 CPUs utilized > ( +- 0.06% ) > 39 context-switches # 0.026 K/sec > ( +- 4.73% ) > 0 cpu-migrations # 0.000 K/sec > 52 page-faults # 0.035 K/sec > ( +- 0.26% ) > 6,732,372,612 cycles # 4.507 GHz > ( +- 0.06% ) > 4,229,130,302 stalled-cycles-frontend # 62.82% frontend cycles > idle ( +- 0.09% ) > 1,407,803,145 instructions # 0.21 insns per cycle > # 3.00 stalled cycles > per insn ( +- 0.08% ) > 301,688,946 branches # 201.962 M/sec > ( +- 0.09% ) > 100,022,150 branch-misses # 33.15% of all branches > ( +- 0.00% ) > > 1.492820524 seconds time elapsed > ( +- 0.08% ) > > In fact it's worse than the 16 entries runtime. > > ( Note that Intel SkyLake improved the indirect jump/call branch predictor. > On another Intel CPU I have the boundary between good and bad prediction is > at 4-6 entries. So this is highly uarch (and code layout) dependent. ) > > In contrast, a static hierarchy/tree of branches is predicted a lot better on > most > x86 uarchs, even with highly variable input. (This does not even count the > extra > calculations needed to calculate the jump table index, which, as you coded it > in > your patch, add 2-3 cycles of extra latency as well.) > > Such branch miss characteristics matter and they can become more visible with > smaller skb sizes. > > So I'm generally sceptical of jump tables and I'd like to see careful and > convincing perf stat runs on modern x86 uarchs, comparing the jump table > solution > to a branch hierarchy solution, before accepting a jump table solution into > the > x86 architecture. > > x86 uarchs are generally good at predicting function pointer indirect jumps > (which > tend to be static) - multi-target jump tables are a lot harder nut to crack, > especially if the jump index is calculated right before the jump is performed, > like your patch does it. > > The impact of branch misses is more pronounced on modern Intel CPUs, because > speculation is deeper. > > But as always I can be convinced with contrary numbers! :-) > Hi Ingo,
Thanks for the explanation and sample code. Expanding on your example, I added a switch statement to perform to function (code below). gcc seems to be implementing this switch as a jump table: jmpq *0x400ac8(,%rdx,8) Which is a different call than the function table implementation: callq *(%rsp,%rdx,8) The switch jump table method is what I coded in the assembly for the checksum routine (I basically just copied the assembly from what gcc was doing with the C code for the same function). Running using the functptr and switch table gives: taskset 1 perf stat --repeat 10 ./jump_table -S 16 ... using funcptr ... result: 10000000100000000 Performance counter stats for './jump_table -S 16' (10 runs): 2318.192392 task-clock (msec) # 0.999 CPUs utilized ( +- 1.02% ) 9 context-switches # 0.004 K/sec ( +- 9.51% ) 1 cpu-migrations # 0.000 K/sec ( +- 68.31% ) 132 page-faults # 0.057 K/sec ( +- 0.10% ) 6,328,766,014 cycles # 2.730 GHz ( +- 0.04% ) [49.99%] 3,325,346,741 stalled-cycles-frontend # 52.54% frontend cycles idle ( +- 0.10% ) [50.01%] 1,793,615,301 stalled-cycles-backend # 28.34% backend cycles idle ( +- 2.84% ) [50.04%] 1,509,733,276 instructions # 0.24 insns per cycle # 2.20 stalled cycles per insn ( +- 0.02% ) [50.03%] 301,788,275 branches # 130.183 M/sec ( +- 0.01% ) [50.01%] 100,030,120 branch-misses # 33.15% of all branches ( +- 0.01% ) [49.98%] 2.320510795 seconds time elapsed ( +- 1.02% ) taskset 1 perf stat --repeat 10 ./jump_table -S 16 -x ... using switch ... result: 10000000100000000 Performance counter stats for './jump_table -S 16 -x' (10 runs): 942.117858 task-clock (msec) # 0.999 CPUs utilized ( +- 1.07% ) 5 context-switches # 0.005 K/sec ( +- 13.04% ) 2 cpu-migrations # 0.003 K/sec ( +- 42.27% ) 132 page-faults # 0.140 K/sec ( +- 0.12% ) 2,521,873,028 cycles # 2.677 GHz ( +- 0.09% ) [50.01%] 1,021,412,581 stalled-cycles-frontend # 40.50% frontend cycles idle ( +- 0.17% ) [50.06%] 255,591,030 stalled-cycles-backend # 10.13% backend cycles idle ( +- 13.11% ) [50.10%] 1,104,081,483 instructions # 0.44 insns per cycle # 0.93 stalled cycles per insn ( +- 0.01% ) [50.05%] 300,713,493 branches # 319.189 M/sec ( +- 0.01% ) [49.98%] 11,500 branch-misses # 0.00% of all branches ( +- 12.18% ) [49.95%] 0.943088132 seconds time elapsed ( +- 1.07% ) So the switch implementation does not seem to be causing branch-misses to be counted and is a lot faster. This is also what I see when looking at the branch-misses with the checksum code-- I haven't yet found any test case thrashing lengths or alignments increase branch-misses and shows a performance degradation over the case where they are static. Tom /* * * Simple testcase to generate jump table usage. * */ #include <stdio.h> #include <stdlib.h> #include <string.h> #include <getopt.h> typedef unsigned long (fn_t) (unsigned long param); #define TABLE_SIZE_MAX 16L unsigned long global; unsigned long table_size = TABLE_SIZE_MAX; unsigned long loops = 100000000; int doswitch = 0; #define DEFINE_FUN(x) \ \ unsigned long fun_##x(unsigned long param) \ { \ return param+global; \ } DEFINE_FUN( 0); DEFINE_FUN( 1); DEFINE_FUN( 2); DEFINE_FUN( 3); DEFINE_FUN( 4); DEFINE_FUN( 5); DEFINE_FUN( 6); DEFINE_FUN( 7); DEFINE_FUN( 8); DEFINE_FUN( 9); DEFINE_FUN(10); DEFINE_FUN(11); DEFINE_FUN(12); DEFINE_FUN(13); DEFINE_FUN(14); DEFINE_FUN(15); static void usage(void) { printf("usage: jump-table [-S <table_size>] [-L <loops> ] [-h] [-x]\n"); exit(0); } unsigned long do_funcptr(void) { int i; unsigned long local = 0; fn_t *fn_ptr [TABLE_SIZE_MAX] = { fun_0 , fun_1 , fun_2 , fun_3 , fun_4 , fun_5 , fun_6 , fun_7 , fun_8 , fun_9, fun_10, fun_11, fun_12, fun_13, fun_14, fun_15, }; /* Call a set of simple arithmetic functions from a jump table: */ for (i = 0; i < loops; i++) { global++; local += fn_ptr[global % table_size](global); } return local; } unsigned long do_switch(void) { int i; unsigned long local = 0; /* Call a set of simple arithmetic functions via switch: */ #define CASE(X) case X: \ local += fun_##X(global); \ break; for (i = 0; i < loops; i++) { global++; switch (global % table_size) { CASE(0) CASE(1) CASE(2) CASE(3) CASE(4) CASE(5) CASE(6) CASE(7) CASE(8) CASE(9) CASE(10) CASE(11) CASE(12) CASE(13) CASE(14) CASE(15) } } return local; } int main(int argc, char **argv) { unsigned long local = 0; int c; while ((c = getopt(argc, argv, "hS:xL:")) != -1) switch (c) { case 'S': table_size = atoi(optarg); if (table_size <= 1) table_size = 1; if (table_size > TABLE_SIZE_MAX) table_size = TABLE_SIZE_MAX; break; case 'x': doswitch = 1; break; case 'L': loops = atoi(optarg); case 'h': default: usage(); } printf("... using %lu jump table entries.\n", table_size); printf("... using %lu loop iterations.\n", loops); if (doswitch) { printf("... using switch\n"); } else { printf("... using funcptr\n"); } local = doswitch ? do_switch() : do_funcptr(); printf("... result: %lu\n", local); return 0; } > Thanks, > > Ingo > > --------------------------------------> > /* > * Simple testcase to generate jump table usage. > */ > #include <stdio.h> > #include <stdlib.h> > #include <string.h> > > typedef unsigned long (fn_t) (unsigned long param); > > unsigned long global; > > #define DEFINE_FUN(x) \ > \ > unsigned long fun_##x(unsigned long param) \ > { \ > return param+global; \ > } > > #define TABLE_SIZE_MAX 16L > > DEFINE_FUN( 1); > DEFINE_FUN( 2); > DEFINE_FUN( 3); > DEFINE_FUN( 4); > DEFINE_FUN( 5); > DEFINE_FUN( 6); > DEFINE_FUN( 7); > DEFINE_FUN( 8); > DEFINE_FUN( 9); > DEFINE_FUN(10); > DEFINE_FUN(11); > DEFINE_FUN(12); > DEFINE_FUN(13); > DEFINE_FUN(14); > DEFINE_FUN(15); > DEFINE_FUN(16); > > int main(int argc, char **argv) > { > fn_t *fn_ptr [TABLE_SIZE_MAX] = { fun_1 , fun_2 , fun_3 , > fun_4 , > fun_5 , fun_6 , fun_7 , > fun_8 , > fun_9 , fun_10, fun_11, > fun_12, > fun_13, fun_14, fun_15, > fun_16, > > }; > unsigned long local = 0; > unsigned long i; > unsigned long table_size = TABLE_SIZE_MAX; > unsigned long loops = 100000000; > > > if ((argc >= 2 && !strcmp(argv[1], "-h")) || argc >= 4) { > printf("usage: jump-table <table_size(1-%lu)(default: %lu)> > <loops(default: %lu)>\n", TABLE_SIZE_MAX, TABLE_SIZE_MAX, loops); > exit(0); > } > > if (argc >= 2) { > table_size = atol(argv[1]); > if (table_size <= 1) > table_size = 1; > if (table_size > TABLE_SIZE_MAX) > table_size = TABLE_SIZE_MAX; > } > printf("... using %lu jump table entries.\n", table_size); > > if (argc >= 3) > loops = atol(argv[2]); > printf("... using %lu loop iterations.\n", loops); > > /* Call a set of simple arithmetic functions from a jump table: */ > for (i = 0; i < loops; i++) { > global++; > local += fn_ptr[global % table_size](global); > } > > printf("... result: %lu\n", local); > }