RE: [PATCH net] x86: bpf_jit: fix compilation of large bpf programs
From: Eric Dumazet > Sent: 26 May 2015 16:30 > > > Yes, interesting, a benchmark that manages to run a lot of code 'cold > > cache'. > > We have binaries here at Google with 400 or 500 MBytes of text. > > Not benchmark, super real workloads you know. Indeed, and a lot of the code is likely to be running 'cold cache'. I was alluding to the problem where people will benchmark a small function by running in 1000s of times in a tight loop with exactly the same data. Not only is it 'hot cache' but any dynamic branch prediction is 'trained' to the specific data. David
Re: [PATCH net] x86: bpf_jit: fix compilation of large bpf programs
On Tue, 2015-05-26 at 15:13 +, David Laight wrote: > Yes, interesting, a benchmark that manages to run a lot of code 'cold cache'. We have binaries here at Google with 400 or 500 MBytes of text. Not benchmark, super real workloads you know. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH net] x86: bpf_jit: fix compilation of large bpf programs
From: Eric Dumazet > Sent: 26 May 2015 15:35 > On Tue, 2015-05-26 at 13:40 +, David Laight wrote: > > > If the JIT compiler is only changing the encoding of the constants > > in the x86 instructions (rather than changing the instructions themselves) > > then there is likely to me an unmeasurable change in the execution time. > > For instance I don't remember there being a difference in execution time > > between long and short branches - the only difference is the amount of > > cache they use. > > icache is precisely the matter here. In the end, it makes a difference. > > You could check this interesting study Ingo did recently : > > https://lkml.org/lkml/2015/5/19/1009 Yes, interesting, a benchmark that manages to run a lot of code 'cold cache'. Possibly dominated by having a full cache line of code to execute at the beginning of each function. My guess is that aligning function bodies helps, aligning branch targets (as some old versions of gcc used to do aggressively) is more likely to have a negative effect (increases the icache footprint too much). Unrolling loops further than needed to avoid data stalls needs very careful study. For JIT, once the obviously short branches have been reduced, further code size reduction might not be worth while. David N�r��yb�X��ǧv�^�){.n�+���z�^�)w*jg����ݢj/���z�ޖ��2�ޙ&�)ߡ�a�����G���h��j:+v���w��٥
Re: [PATCH net] x86: bpf_jit: fix compilation of large bpf programs
On Tue, 2015-05-26 at 13:40 +, David Laight wrote: > If the JIT compiler is only changing the encoding of the constants > in the x86 instructions (rather than changing the instructions themselves) > then there is likely to me an unmeasurable change in the execution time. > For instance I don't remember there being a difference in execution time > between long and short branches - the only difference is the amount of > cache they use. icache is precisely the matter here. In the end, it makes a difference. You could check this interesting study Ingo did recently : https://lkml.org/lkml/2015/5/19/1009 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH net] x86: bpf_jit: fix compilation of large bpf programs
From: Alexei Starovoitov > Sent: 22 May 2015 23:43 > x86 has variable length encoding. x86 JIT compiler is trying > to pick the shortest encoding for given bpf instruction. > While doing so the jump targets are changing, so JIT is doing > multiple passes over the program. Typical program needs 3 passes. > Some very short programs converge with 2 passes. Large programs > may need 4 or 5. But specially crafted bpf programs may hit the > pass limit and if the program converges on the last iteration > the JIT compiler will be producing an image full of 'int 3' insns. > Fix this corner case by doing final iteration over bpf program. If the JIT compiler is only changing the encoding of the constants in the x86 instructions (rather than changing the instructions themselves) then there is likely to me an unmeasurable change in the execution time. For instance I don't remember there being a difference in execution time between long and short branches - the only difference is the amount of cache they use. David -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net] x86: bpf_jit: fix compilation of large bpf programs
From: Alexei Starovoitov Date: Fri, 22 May 2015 15:42:55 -0700 > x86 has variable length encoding. x86 JIT compiler is trying > to pick the shortest encoding for given bpf instruction. > While doing so the jump targets are changing, so JIT is doing > multiple passes over the program. Typical program needs 3 passes. > Some very short programs converge with 2 passes. Large programs > may need 4 or 5. But specially crafted bpf programs may hit the > pass limit and if the program converges on the last iteration > the JIT compiler will be producing an image full of 'int 3' insns. > Fix this corner case by doing final iteration over bpf program. > > Fixes: 0a14842f5a3c ("net: filter: Just In Time compiler for x86-64") > Reported-by: Daniel Borkmann > Signed-off-by: Alexei Starovoitov Applied and queued up for -stable, thanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net] x86: bpf_jit: fix compilation of large bpf programs
On 05/23/2015 12:42 AM, Alexei Starovoitov wrote: x86 has variable length encoding. x86 JIT compiler is trying to pick the shortest encoding for given bpf instruction. While doing so the jump targets are changing, so JIT is doing multiple passes over the program. Typical program needs 3 passes. Some very short programs converge with 2 passes. Large programs may need 4 or 5. But specially crafted bpf programs may hit the pass limit and if the program converges on the last iteration the JIT compiler will be producing an image full of 'int 3' insns. Fix this corner case by doing final iteration over bpf program. Fixes: 0a14842f5a3c ("net: filter: Just In Time compiler for x86-64") Reported-by: Daniel Borkmann Signed-off-by: Alexei Starovoitov LGTM, thanks! Tested-by: Daniel Borkmann Acked-by: Daniel Borkmann -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html