Re: [PATCH v2 net-next 0/2] split BPF out of core networking
On 06/23/2014 11:57 PM, Alexei Starovoitov wrote: On Fri, Jun 20, 2014 at 9:44 AM, Chema Gonzalez wrote: ... - nop: I'd like to have a nop. Do I know why? Nope. nope. Let's not add unnecessary instructions. A valid nop is a useful instruction: padding, filling up arrays of sock_filter_int correctly (as in lib/test_bpf.c, where we're currently using a "ld #0", which loads zero to register A), and other use cases (see http://en.wikipedia.org/wiki/NOP ). especially I don't like to add 'nop' instruction. code==0 to mean 'ld #0' is one of classic BPF ugliness. I think it was probably unintended to be able to have unreachable code e.g. filled with 'nops' where both jt, jf just jump over it, but that quirk we cannot change anymore in the classic checker and have to carry onwards. We're not filling up arrays with nops in lib/test_bpf.c Zero is invalid opcode in eBPF and should stay so, since it's an easy check for humans like me who are looking at eBPF in hex. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 net-next 0/2] split BPF out of core networking
On 06/23/2014 11:57 PM, Alexei Starovoitov wrote: On Fri, Jun 20, 2014 at 9:44 AM, Chema Gonzalez ch...@google.com wrote: ... - nop: I'd like to have a nop. Do I know why? Nope. nope. Let's not add unnecessary instructions. A valid nop is a useful instruction: padding, filling up arrays of sock_filter_int correctly (as in lib/test_bpf.c, where we're currently using a ld #0, which loads zero to register A), and other use cases (see http://en.wikipedia.org/wiki/NOP ). especially I don't like to add 'nop' instruction. code==0 to mean 'ld #0' is one of classic BPF ugliness. I think it was probably unintended to be able to have unreachable code e.g. filled with 'nops' where both jt, jf just jump over it, but that quirk we cannot change anymore in the classic checker and have to carry onwards. We're not filling up arrays with nops in lib/test_bpf.c Zero is invalid opcode in eBPF and should stay so, since it's an easy check for humans like me who are looking at eBPF in hex. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 net-next 0/2] split BPF out of core networking
On Fri, Jun 20, 2014 at 9:44 AM, Chema Gonzalez wrote: >> >> Model eBPF based on MIPS ISA? Ouch. >> That would be one ugly ISA that is not JITable on x64. > > Definitely I wasn't making my point clear: IMO if we're redesigning > the BPF ISA, we should get a clean one (clean=something that is simple > enough to be readable by a newcomer). I mentioned MIPS because it's > the one I know the most, and it's kind of clean. I'm definitely not > proposing to use MIPS as the BPF ISA. mips is a clean isa? When it was designed 30 years ago, it was good, but today it really shows its age: delay slots, integer arithmetic that traps on overflow, lack of real comparison operations, hi/lo, etc. I strongly believe eBPF isa is way cleaner and easier to understand than mips isa. It seems your proposals to make eBPF 'cleaner' are based on HW mindset, which is not applicable here. Details below: > 1. how we codify the ISA. Both BPF and eBPF devote 50% of the insn > space to load/store (4 opcodes/instruction class values of 8 > possible). In comparison, MIPS uses just 14% (9 of 64 opcodes). That that is a misleading comparison and leads to the wrong conclusions. Your proposal to remove useful instruction just to save a bit in bpf class encoding just doesn't make sense. We have infinite room to add new instructions and opcodes. Unlike HW ISA eBPF is not limited by 8-bit opcodes and 8-byte instructions. eBPF is not designed to be run directly by HW, so we're not trying to save rtl gates here. Among other things we're optimizing for interpreter performance, so removing instructions can only hurt. > gives me pause. I'm definitely not suggesting adding more insn just > because physical ISAs have it, but I think it makes sense to avoid > using the whole insn space just because it's there, or because classic > BPF was using it all. wrong. Classic BPF is a legacy that we have to live with and we should not sacrifice performance of classic bpf filters just to reduce number of eBPF instructions. It made sense only for one classic instruction: BPF_LD + MSH It is really single purpose instruction to do X = ip->length * 4. We didn't carry this ugliness into eBPF, since it's not generic, can be easily represented by generic instructions, complicates JITs. Since MSH is used at most once per tcpdump filter, few extra insns add a tiny penalty to overall filter execution time in interpreter and give no performance penalty at all when filter is JITed. > 2. instructions (sock_filter_int) hardcode 2 immediate values. This is > very unusual in ISAs. We're effectively doubling the insn size (from > 32 to 64 bits), and still we are cramming the whole insn space (only 1 > reserved instruction class of 8 possible ones). The rationale is to Comparisons with HW encoding is not applicable. mips picked 4 byte wide instruction and had to live with it. Just look at hi/lo insns and what compilers have to do with it. All eBPF instructions today are 8 byte wide. There is no reason to redesign them into <8 bytes or squeeze bits. It will hurt performance without giving us anything back. At the same time we can add 16-byte instruction to represent load 64-bit immediate, but as I was saying before many factors need to be considered before we proceed. > 3. name reuse: we're reusing names b etween classic BPF and eBPF. For > example, BPF_LD*|BPF_MEM in classic BPF refers to access to the M[] > buffer. BPF_LDX|BPF_MEM in eBPF is a generic memory access. I find > this very confusing, especially because we'll have to live with That's one opinion. I think names are fine and Documentation/networking/filter.txt explains both classic and eBPF encoding well enough. > classic BPF in userland filters for a long time. In fact, if you ask > me, I'll come up with some generic name for the generic linux > filtering mechanism (eBPF and internal BPF sound too much like BPF), > to make it clear that this is not just BPF. I don't think it's a good idea. I like BPF abbreviation, since the name implies the use case. Renaming eBPF to ISA_X will be confusing. Now everyone understands that BPF is a safe instruction set that can be dynamically loaded into the kernel. eBPF is the same plus more. > 4. simplicity: both BPF and eBPF have 4 ld/st operations types (LD, > LDX, ST, STX) and many addressing modes/modifiers (6 for BPF, 4 for > eBPF), where only a subset of the {operation types, modifier} tuples > are valid. I think we can make it simpler. For the eBPF case, we > currently have 6 valid combinations: That's a good summary. I think documentation explained it already, but if you feel it's still missing pieces, please send a patch to improve the doc. > The two eBPF BPF_LD insn are BPF_LDX|BPF_MEM insn to access an skbuff. > For example, BPF_LD|BPF_ABS does "dst_reg = packet[imm32]", which is a > BPF_LDX|BPF_MEM with the following differences: > a. packet is skb->data == CTX == BPF_R6 (this is equivalent to > src_reg==R6 in BPF_LDX|BPF_MEM) > b. output is left in R0, not in
RE: [PATCH v2 net-next 0/2] split BPF out of core networking
From: Chema Gonzalez ... > 4.5. BPF_ST|BPF_MEM > Operation: *(size *) (dst_reg + off16) = imm32 > > This insn encodes 2 immediate values (the offset and the imm32 value) > in the insn, and actually forces the sock_filter_int 64-bit struct to > have both a 16-bit offset field and a 32-bit immediate field). In > fact, it's the only instructions that uses .off and .imm at the same > time (for all other instructions, at least one of the fields is always > 0). > > This did not exist in classic BPF (where BPF_ST|BPF_MEM actually did > "mem[pc->k] = A;"). In fact, it's rare to find an ISA that allows > encoding 2 immediate values in a single insn. My impression (after > checking the x86 JIT implementation, which works on the eBPF code) is > that this was added as an x86 optimization, because x86 allows > encoding 2 values (offset and immediate) by using the displacement and > immediate suffixes. I wonder whether the ISA would be more readable if > we did this in 2 insn, one to put dst_reg+off16 in a temporary > register, and the second a simpler BPF_STX|BPF_MEM. Then we could use > the same space for the immediate and offset fields. One option is to add code to the x86 JIT to detect the two instruction sequence and generate a single instruction. Thinks further, the JIT might be easier to write if there is a temporary register that is defined to be only valid for the next instruction (or two). Then the JIT can completely optimise away any assignments to it without requiring a full analysis of the entire program. David
Re: [PATCH v2 net-next 0/2] split BPF out of core networking
On Fri, Jun 20, 2014 at 9:44 AM, Chema Gonzalez ch...@google.com wrote: Model eBPF based on MIPS ISA? Ouch. That would be one ugly ISA that is not JITable on x64. Definitely I wasn't making my point clear: IMO if we're redesigning the BPF ISA, we should get a clean one (clean=something that is simple enough to be readable by a newcomer). I mentioned MIPS because it's the one I know the most, and it's kind of clean. I'm definitely not proposing to use MIPS as the BPF ISA. mips is a clean isa? When it was designed 30 years ago, it was good, but today it really shows its age: delay slots, integer arithmetic that traps on overflow, lack of real comparison operations, hi/lo, etc. I strongly believe eBPF isa is way cleaner and easier to understand than mips isa. It seems your proposals to make eBPF 'cleaner' are based on HW mindset, which is not applicable here. Details below: 1. how we codify the ISA. Both BPF and eBPF devote 50% of the insn space to load/store (4 opcodes/instruction class values of 8 possible). In comparison, MIPS uses just 14% (9 of 64 opcodes). That that is a misleading comparison and leads to the wrong conclusions. Your proposal to remove useful instruction just to save a bit in bpf class encoding just doesn't make sense. We have infinite room to add new instructions and opcodes. Unlike HW ISA eBPF is not limited by 8-bit opcodes and 8-byte instructions. eBPF is not designed to be run directly by HW, so we're not trying to save rtl gates here. Among other things we're optimizing for interpreter performance, so removing instructions can only hurt. gives me pause. I'm definitely not suggesting adding more insn just because physical ISAs have it, but I think it makes sense to avoid using the whole insn space just because it's there, or because classic BPF was using it all. wrong. Classic BPF is a legacy that we have to live with and we should not sacrifice performance of classic bpf filters just to reduce number of eBPF instructions. It made sense only for one classic instruction: BPF_LD + MSH It is really single purpose instruction to do X = ip-length * 4. We didn't carry this ugliness into eBPF, since it's not generic, can be easily represented by generic instructions, complicates JITs. Since MSH is used at most once per tcpdump filter, few extra insns add a tiny penalty to overall filter execution time in interpreter and give no performance penalty at all when filter is JITed. 2. instructions (sock_filter_int) hardcode 2 immediate values. This is very unusual in ISAs. We're effectively doubling the insn size (from 32 to 64 bits), and still we are cramming the whole insn space (only 1 reserved instruction class of 8 possible ones). The rationale is to Comparisons with HW encoding is not applicable. mips picked 4 byte wide instruction and had to live with it. Just look at hi/lo insns and what compilers have to do with it. All eBPF instructions today are 8 byte wide. There is no reason to redesign them into 8 bytes or squeeze bits. It will hurt performance without giving us anything back. At the same time we can add 16-byte instruction to represent load 64-bit immediate, but as I was saying before many factors need to be considered before we proceed. 3. name reuse: we're reusing names b etween classic BPF and eBPF. For example, BPF_LD*|BPF_MEM in classic BPF refers to access to the M[] buffer. BPF_LDX|BPF_MEM in eBPF is a generic memory access. I find this very confusing, especially because we'll have to live with That's one opinion. I think names are fine and Documentation/networking/filter.txt explains both classic and eBPF encoding well enough. classic BPF in userland filters for a long time. In fact, if you ask me, I'll come up with some generic name for the generic linux filtering mechanism (eBPF and internal BPF sound too much like BPF), to make it clear that this is not just BPF. I don't think it's a good idea. I like BPF abbreviation, since the name implies the use case. Renaming eBPF to ISA_X will be confusing. Now everyone understands that BPF is a safe instruction set that can be dynamically loaded into the kernel. eBPF is the same plus more. 4. simplicity: both BPF and eBPF have 4 ld/st operations types (LD, LDX, ST, STX) and many addressing modes/modifiers (6 for BPF, 4 for eBPF), where only a subset of the {operation types, modifier} tuples are valid. I think we can make it simpler. For the eBPF case, we currently have 6 valid combinations: That's a good summary. I think documentation explained it already, but if you feel it's still missing pieces, please send a patch to improve the doc. The two eBPF BPF_LD insn are BPF_LDX|BPF_MEM insn to access an skbuff. For example, BPF_LD|BPF_ABS does dst_reg = packet[imm32], which is a BPF_LDX|BPF_MEM with the following differences: a. packet is skb-data == CTX == BPF_R6 (this is equivalent to src_reg==R6 in BPF_LDX|BPF_MEM) b. output is left in R0, not in dst_reg c. result is
RE: [PATCH v2 net-next 0/2] split BPF out of core networking
From: Chema Gonzalez ... 4.5. BPF_ST|BPF_MEM Operation: *(size *) (dst_reg + off16) = imm32 This insn encodes 2 immediate values (the offset and the imm32 value) in the insn, and actually forces the sock_filter_int 64-bit struct to have both a 16-bit offset field and a 32-bit immediate field). In fact, it's the only instructions that uses .off and .imm at the same time (for all other instructions, at least one of the fields is always 0). This did not exist in classic BPF (where BPF_ST|BPF_MEM actually did mem[pc-k] = A;). In fact, it's rare to find an ISA that allows encoding 2 immediate values in a single insn. My impression (after checking the x86 JIT implementation, which works on the eBPF code) is that this was added as an x86 optimization, because x86 allows encoding 2 values (offset and immediate) by using the displacement and immediate suffixes. I wonder whether the ISA would be more readable if we did this in 2 insn, one to put dst_reg+off16 in a temporary register, and the second a simpler BPF_STX|BPF_MEM. Then we could use the same space for the immediate and offset fields. One option is to add code to the x86 JIT to detect the two instruction sequence and generate a single instruction. Thinks further, the JIT might be easier to write if there is a temporary register that is defined to be only valid for the next instruction (or two). Then the JIT can completely optimise away any assignments to it without requiring a full analysis of the entire program. David
Re: [PATCH v2 net-next 0/2] split BPF out of core networking
[Sorry for the delay in the answer. Been mired somewhere else.] On Tue, Jun 3, 2014 at 5:38 PM, Alexei Starovoitov wrote: > On Tue, Jun 3, 2014 at 2:40 PM, Chema Gonzalez wrote: >> First of all, and just to join the crowd, kernel/bpf/ FTW. >> >> Now, I have some suggestions about eBPF. IMO classic BPF is an ISA >> oriented to filter (meaning returning a single integer that states how >> many bytes of the packet must be captured) packets (e.g. consider the >> 6 load modes, where 3 provide access the packet -- abs, ind, msh --, >> one to an skb field -- len--, the 5th one to the memory itself -- mem >> --, and the 6th is an immediate set mode --imm-- ) that has been used >> in other environments (seccomp, tracing, etc.) by (a) extending the >> idea of a "packet" into a "buffer", and (b) adding ancillary loads. >> >> eBPF should be a generic ISA that can be used by many environments, >> including those served today by classic BPF. IMO, we should get a >> nicely-defined ISA (MIPS anyone?) and check what should go into eBPF >> and what should not. > > Model eBPF based on MIPS ISA? Ouch. > That would be one ugly ISA that is not JITable on x64. Definitely I wasn't making my point clear: IMO if we're redesigning the BPF ISA, we should get a clean one (clean=something that is simple enough to be readable by a newcomer). I mentioned MIPS because it's the one I know the most, and it's kind of clean. I'm definitely not proposing to use MIPS as the BPF ISA. In particular, I have 4 cleanliness concerns related to load/stores: 1. how we codify the ISA. Both BPF and eBPF devote 50% of the insn space to load/store (4 opcodes/instruction class values of 8 possible). In comparison, MIPS uses just 14% (9 of 64 opcodes). That gives me pause. I'm definitely not suggesting adding more insn just because physical ISAs have it, but I think it makes sense to avoid using the whole insn space just because it's there, or because classic BPF was using it all. 2. instructions (sock_filter_int) hardcode 2 immediate values. This is very unusual in ISAs. We're effectively doubling the insn size (from 32 to 64 bits), and still we are cramming the whole insn space (only 1 reserved instruction class of 8 possible ones). The rationale is to support a new addressing mode (BPF_ST|BPF_MEM), where both the offset to src_reg and the immediate value are hardcoded in the insn. (See more below.) 3. name reuse: we're reusing names b etween classic BPF and eBPF. For example, BPF_LD*|BPF_MEM in classic BPF refers to access to the M[] buffer. BPF_LDX|BPF_MEM in eBPF is a generic memory access. I find this very confusing, especially because we'll have to live with classic BPF in userland filters for a long time. In fact, if you ask me, I'll come up with some generic name for the generic linux filtering mechanism (eBPF and internal BPF sound too much like BPF), to make it clear that this is not just BPF. 4. simplicity: both BPF and eBPF have 4 ld/st operations types (LD, LDX, ST, STX) and many addressing modes/modifiers (6 for BPF, 4 for eBPF), where only a subset of the {operation types, modifier} tuples are valid. I think we can make it simpler. For the eBPF case, we currently have 6 valid combinations: 4.1. BPF_LDX|BPF_MEM Operation: dst_reg = *(size *) (src_reg + off16) This is the basic load insn. It's used to convert most of the classic BPF addressing modes by setting the right src_reg (FP in the classic BPF M[] access, CTX for the classic BPF BPF_LD*|BPF_LEN access and seccomp_data access, etc.) 4.2. BPF_LD|BPF_ABS Operation: BPF_R0 = ntoh(*(size *) (skb->data + imm32)) 4.3. BPF_LD|BPF_IND Operation: BPF_R0 = ntoh(*(size *) (skb->data + src_reg + imm32)) The two eBPF BPF_LD insn are BPF_LDX|BPF_MEM insn to access an skbuff. For example, BPF_LD|BPF_ABS does "dst_reg = packet[imm32]", which is a BPF_LDX|BPF_MEM with the following differences: a. packet is skb->data == CTX == BPF_R6 (this is equivalent to src_reg==R6 in BPF_LDX|BPF_MEM) b. output is left in R0, not in dst_reg c. result is ntohs()|ntohl() d. every packet access is checked using load_pointer()/skb_header_pointer()/bpf_internal_load_pointer_neg_helper() Now, (a,b,c) seem like details that should be worked with multiple instructions (in fact the x86 JIT does that). (d) is IMO the only part important enough to justify a different insn. I'd call this mode BPF_SKBUFF_PROTECTED (or something like that), because that is the main idea of this instruction: that any memory access (ld or st) is checked during runtime assuming it's an skbuff. The BPF_LD|BPF_IND insn could be replaced in 2 steps, one to get src_reg+imm32 into a tmp register, and the other to perform the final load based on the tmp register. 4.4. BPF_STX|BPF_MEM Operation: *(size *) (dst_reg + off16) = src_reg This is the basic store insn. LGTM. 4.5. BPF_ST|BPF_MEM Operation: *(size *) (dst_reg + off16) = imm32 This insn encodes 2 immediate values (the offset and the imm32 value) in the insn, and actually forces the
Re: [PATCH v2 net-next 0/2] split BPF out of core networking
[Sorry for the delay in the answer. Been mired somewhere else.] On Tue, Jun 3, 2014 at 5:38 PM, Alexei Starovoitov a...@plumgrid.com wrote: On Tue, Jun 3, 2014 at 2:40 PM, Chema Gonzalez ch...@google.com wrote: First of all, and just to join the crowd, kernel/bpf/ FTW. Now, I have some suggestions about eBPF. IMO classic BPF is an ISA oriented to filter (meaning returning a single integer that states how many bytes of the packet must be captured) packets (e.g. consider the 6 load modes, where 3 provide access the packet -- abs, ind, msh --, one to an skb field -- len--, the 5th one to the memory itself -- mem --, and the 6th is an immediate set mode --imm-- ) that has been used in other environments (seccomp, tracing, etc.) by (a) extending the idea of a packet into a buffer, and (b) adding ancillary loads. eBPF should be a generic ISA that can be used by many environments, including those served today by classic BPF. IMO, we should get a nicely-defined ISA (MIPS anyone?) and check what should go into eBPF and what should not. Model eBPF based on MIPS ISA? Ouch. That would be one ugly ISA that is not JITable on x64. Definitely I wasn't making my point clear: IMO if we're redesigning the BPF ISA, we should get a clean one (clean=something that is simple enough to be readable by a newcomer). I mentioned MIPS because it's the one I know the most, and it's kind of clean. I'm definitely not proposing to use MIPS as the BPF ISA. In particular, I have 4 cleanliness concerns related to load/stores: 1. how we codify the ISA. Both BPF and eBPF devote 50% of the insn space to load/store (4 opcodes/instruction class values of 8 possible). In comparison, MIPS uses just 14% (9 of 64 opcodes). That gives me pause. I'm definitely not suggesting adding more insn just because physical ISAs have it, but I think it makes sense to avoid using the whole insn space just because it's there, or because classic BPF was using it all. 2. instructions (sock_filter_int) hardcode 2 immediate values. This is very unusual in ISAs. We're effectively doubling the insn size (from 32 to 64 bits), and still we are cramming the whole insn space (only 1 reserved instruction class of 8 possible ones). The rationale is to support a new addressing mode (BPF_ST|BPF_MEM), where both the offset to src_reg and the immediate value are hardcoded in the insn. (See more below.) 3. name reuse: we're reusing names b etween classic BPF and eBPF. For example, BPF_LD*|BPF_MEM in classic BPF refers to access to the M[] buffer. BPF_LDX|BPF_MEM in eBPF is a generic memory access. I find this very confusing, especially because we'll have to live with classic BPF in userland filters for a long time. In fact, if you ask me, I'll come up with some generic name for the generic linux filtering mechanism (eBPF and internal BPF sound too much like BPF), to make it clear that this is not just BPF. 4. simplicity: both BPF and eBPF have 4 ld/st operations types (LD, LDX, ST, STX) and many addressing modes/modifiers (6 for BPF, 4 for eBPF), where only a subset of the {operation types, modifier} tuples are valid. I think we can make it simpler. For the eBPF case, we currently have 6 valid combinations: 4.1. BPF_LDX|BPF_MEM Operation: dst_reg = *(size *) (src_reg + off16) This is the basic load insn. It's used to convert most of the classic BPF addressing modes by setting the right src_reg (FP in the classic BPF M[] access, CTX for the classic BPF BPF_LD*|BPF_LEN access and seccomp_data access, etc.) 4.2. BPF_LD|BPF_ABS Operation: BPF_R0 = ntohsize(*(size *) (skb-data + imm32)) 4.3. BPF_LD|BPF_IND Operation: BPF_R0 = ntohsize(*(size *) (skb-data + src_reg + imm32)) The two eBPF BPF_LD insn are BPF_LDX|BPF_MEM insn to access an skbuff. For example, BPF_LD|BPF_ABS does dst_reg = packet[imm32], which is a BPF_LDX|BPF_MEM with the following differences: a. packet is skb-data == CTX == BPF_R6 (this is equivalent to src_reg==R6 in BPF_LDX|BPF_MEM) b. output is left in R0, not in dst_reg c. result is ntohs()|ntohl() d. every packet access is checked using load_pointer()/skb_header_pointer()/bpf_internal_load_pointer_neg_helper() Now, (a,b,c) seem like details that should be worked with multiple instructions (in fact the x86 JIT does that). (d) is IMO the only part important enough to justify a different insn. I'd call this mode BPF_SKBUFF_PROTECTED (or something like that), because that is the main idea of this instruction: that any memory access (ld or st) is checked during runtime assuming it's an skbuff. The BPF_LD|BPF_IND insn could be replaced in 2 steps, one to get src_reg+imm32 into a tmp register, and the other to perform the final load based on the tmp register. 4.4. BPF_STX|BPF_MEM Operation: *(size *) (dst_reg + off16) = src_reg This is the basic store insn. LGTM. 4.5. BPF_ST|BPF_MEM Operation: *(size *) (dst_reg + off16) = imm32 This insn encodes 2 immediate values (the offset and the imm32 value) in the insn, and actually forces the
Re: [PATCH v2 net-next 0/2] split BPF out of core networking
On Tue, Jun 3, 2014 at 2:40 PM, Chema Gonzalez wrote: > First of all, and just to join the crowd, kernel/bpf/ FTW. > > Now, I have some suggestions about eBPF. IMO classic BPF is an ISA > oriented to filter (meaning returning a single integer that states how > many bytes of the packet must be captured) packets (e.g. consider the > 6 load modes, where 3 provide access the packet -- abs, ind, msh --, > one to an skb field -- len--, the 5th one to the memory itself -- mem > --, and the 6th is an immediate set mode --imm-- ) that has been used > in other environments (seccomp, tracing, etc.) by (a) extending the > idea of a "packet" into a "buffer", and (b) adding ancillary loads. > > eBPF should be a generic ISA that can be used by many environments, > including those served today by classic BPF. IMO, we should get a > nicely-defined ISA (MIPS anyone?) and check what should go into eBPF > and what should not. Model eBPF based on MIPS ISA? Ouch. That would be one ugly ISA that is not JITable on x64. eBPF ISA wasn't invented overnight. It was a gigantic effort that took a lot of time to narrow down x64 into _verifiable_ ISA. I had to take into account arm64, mips64, sparcv9 architectures too. Of course, minor things can be improved here or there. Ugliness of ISA hits compiler writers first. I've seen many times how cpu designers add new instructions only to be told by compiler guys that they just wasted silicon. Fact that llvm/gcc compile C into eBPF is the strongest statement that eBPF ISA is 99.9% complete. New instructions may or may not make sense. Let's examine your proposal: > - 1. we should considering separating the eBPF ISA farther from classic BPF > - eBPF still uses a_reg and x_reg as the names of the 2 op > registers. This is very confusing, especially when dealing with > translated filters that do move data between A and X. I've had a_reg > being X, and x_reg being A. We should rename them d_reg and s_reg. that is renaming of two fields in sock_filter_int structure. No change to actual ISA. You're proposing the following: diff --git a/include/linux/filter.h b/include/linux/filter.h index 0e463ee77bb2..bf50fa440ef8 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -287,8 +287,8 @@ enum { struct sock_filter_int { __u8code; /* opcode */ - __u8a_reg:4;/* dest register */ - __u8x_reg:4;/* source register */ + __u8dst_reg:4; /* dest register */ + __u8src_reg:4; /* source register */ __s16 off;/* signed offset */ __s32 imm;/* signed immediate constant */ }; sure. I thought comment was explicit enough, but I agree the fields could have been named better. Will do a patch to rename it. > - BPF_LD vs. BPF_LDX: this made sense in classic BPF, as there was > only one register, and d_reg was implicit in the name of the insn > code. Now, why are we keeping both in eBPF, when the register we're > writing to is made explicit in d_reg (I already forgot if d_reg was > a_reg or x_reg ;) ? Removing one of them will save us 1/8th of the > insns. > - BPF_ST vs. BPF_STX: same here. Note that the current > sk_convert_filter() just converts all stores to BPF_STX. nope. No extra bits can be saved here. STX means: *(dest_reg + off) = src_reg; ST means: *(dest_reg + off) = imm; LDX means: dest_reg = *(src_reg + off) LD we had to carry over from classic as only two non-generic instructions: LD_ABS and LD_IND. > - 2. there are other insn that we should consider adding: > - lui: AFAICT, there is no clean way to build a 64-bit number (you > can LD_IMM the upper part, lsh 32, and then add the lower part). correct. in tracing filters I do this: + /* construct 64-bit address */ + emit(BPF_ALU64_IMM(BPF_MOV, BPF_REG_2, addr >> 32), ctx); + emit(BPF_ALU64_IMM(BPF_LSH, BPF_REG_2, 32), ctx); + emit(BPF_ALU32_IMM(BPF_MOV, BPF_REG_1, (u32) addr), ctx); + emit(BPF_ALU64_REG(BPF_OR, BPF_REG_1, BPF_REG_2), ctx); so there is a way to construct 64-bit immediate. The question is how often do we need to do it? Is it in critical path? Naive answer "one instruction is better than 4" doesn't count. See my point above 'cpu designer vs compiler writer'. None of the risc ISAs have 64-bit imm and eBPF has to consider simplicity of JITs otherwise those architectures will have a hard time mapping eBPF to native. If JIT would be too difficult to do, then there will be no JIT. I don't want eBPF to become an instruction set that can be JITed only on one architecture... 'mov dest_reg, imm64' may still be ok to add, since x64 can JIT it with one instruction, arm64 with 4 instructions, but JITs for other archs will be ugly. They can JIT it as load from memory, but I need to think it through. Let me explore it more carefully. Two must have requirements for eBPF: 1. verifiable
Re: [PATCH v2 net-next 0/2] split BPF out of core networking
First of all, and just to join the crowd, kernel/bpf/ FTW. Now, I have some suggestions about eBPF. IMO classic BPF is an ISA oriented to filter (meaning returning a single integer that states how many bytes of the packet must be captured) packets (e.g. consider the 6 load modes, where 3 provide access the packet -- abs, ind, msh --, one to an skb field -- len--, the 5th one to the memory itself -- mem --, and the 6th is an immediate set mode --imm-- ) that has been used in other environments (seccomp, tracing, etc.) by (a) extending the idea of a "packet" into a "buffer", and (b) adding ancillary loads. eBPF should be a generic ISA that can be used by many environments, including those served today by classic BPF. IMO, we should get a nicely-defined ISA (MIPS anyone?) and check what should go into eBPF and what should not. - 1. we should considering separating the eBPF ISA farther from classic BPF - eBPF still uses a_reg and x_reg as the names of the 2 op registers. This is very confusing, especially when dealing with translated filters that do move data between A and X. I've had a_reg being X, and x_reg being A. We should rename them d_reg and s_reg. - BPF_LD vs. BPF_LDX: this made sense in classic BPF, as there was only one register, and d_reg was implicit in the name of the insn code. Now, why are we keeping both in eBPF, when the register we're writing to is made explicit in d_reg (I already forgot if d_reg was a_reg or x_reg ;) ? Removing one of them will save us 1/8th of the insns. - BPF_ST vs. BPF_STX: same here. Note that the current sk_convert_filter() just converts all stores to BPF_STX. - 2. there are other insn that we should consider adding: - lui: AFAICT, there is no clean way to build a 64-bit number (you can LD_IMM the upper part, lsh 32, and then add the lower part). - nop: I'd like to have a nop. Do I know why? Nope. On Tue, Jun 3, 2014 at 1:58 PM, Alexei Starovoitov wrote: > On Tue, Jun 3, 2014 at 1:35 PM, Daniel Borkmann wrote: >> On 06/03/2014 05:44 PM, Alexei Starovoitov wrote: >> ... >>> >>> All of your points are valid. They are right questions to ask. I just >>> >>> don't see why you're still arguing about first step of filter.c split, >>> whereas your concerns are about steps 2, 3, 4. >> >> >> Fair enough, lets keep them in mind though for future work. Btw, > > Ok :) > >> are other files planned for kernel/bpf/ or should it instead just >> simply be kernel/bpf.c? > > The most obvious one is eBPF verifier in separate file (kernel/bpf/verifier.c) > bpf maps is yet another thing, but that's different topic. > Probably a set of bpf-callable functions in another file. Like right now > for sockets these helpers are __skb_get_pay_offset(), __skb_get_nlattr() > For tracing there will be a different set of helper functions and eventually > some will be common. Like __get_raw_cpu_id() from filter.c could > eventually move to kernel/bpf/helpers.c LGTM. I like the idea of every user (packet filter, seccomp, etc.) providing a map of the bpf calls that are ok, as in the packet filter stating that {1->__skb_get_pay_offset(), 2->__skb_get_nlattr(), ...}, but seccomp providing a completely different (or even empty) map. -Chema > I'm not a fan of squeezing different logic into one file. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 net-next 0/2] split BPF out of core networking
On Tue, Jun 3, 2014 at 1:35 PM, Daniel Borkmann wrote: > On 06/03/2014 05:44 PM, Alexei Starovoitov wrote: > ... >> >> All of your points are valid. They are right questions to ask. I just >> >> don't see why you're still arguing about first step of filter.c split, >> whereas your concerns are about steps 2, 3, 4. > > > Fair enough, lets keep them in mind though for future work. Btw, Ok :) > are other files planned for kernel/bpf/ or should it instead just > simply be kernel/bpf.c? The most obvious one is eBPF verifier in separate file (kernel/bpf/verifier.c) bpf maps is yet another thing, but that's different topic. Probably a set of bpf-callable functions in another file. Like right now for sockets these helpers are __skb_get_pay_offset(), __skb_get_nlattr() For tracing there will be a different set of helper functions and eventually some will be common. Like __get_raw_cpu_id() from filter.c could eventually move to kernel/bpf/helpers.c I'm not a fan of squeezing different logic into one file. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 net-next 0/2] split BPF out of core networking
On 06/03/2014 05:44 PM, Alexei Starovoitov wrote: ... All of your points are valid. They are right questions to ask. I just don't see why you're still arguing about first step of filter.c split, whereas your concerns are about steps 2, 3, 4. Fair enough, lets keep them in mind though for future work. Btw, are other files planned for kernel/bpf/ or should it instead just simply be kernel/bpf.c? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 net-next 0/2] split BPF out of core networking
* Alexei Starovoitov wrote: > On Mon, Jun 2, 2014 at 7:16 AM, Arnaldo Carvalho de Melo > wrote: > > Em Mon, Jun 02, 2014 at 09:24:56AM -0400, Steven Rostedt escreveu: > >> On Mon, 2 Jun 2014 08:15:45 -0500 > >> Jonathan Corbet wrote: > > > >> > On Mon, 2 Jun 2014 00:01:44 -0700 > >> > Alexei Starovoitov wrote: > > > >> > > This patch set splits BPF out of core networking into generic component > > > >> > Quick, probably dumb question: if you're going to split it out, why not > >> > split it out entirely, into kernel/ or (perhaps better) lib/? The > >> > whole point seems to be that BPF is outgrowing its networking home, so > >> > it seems like it might be better to make it truly generic. > > > >> I believe this is what Ingo suggested as well. If it is become generic, > >> it belongs in lib/ > > > > Yes, that was his suggestion, which I agree with, FWIW. > > I guess I posted v2 too quickly :) v2 splits filter.c into > kernel/bpf/. I think it's a better location than lib/bpf, since lib > feels too constrained by definition of 'library'. bpf is more than a > set of library calls. Yeah, the upgrade to kernel/bpf/ is a better place for BPF IMO: BPF is really an 'active', stateful subsystem, with non-trivial per arch implementations, while lib/ is generally for standalone, generic, platform-decoupled library functions (with a few exceptions). Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 net-next 0/2] split BPF out of core networking
On Tue, Jun 3, 2014 at 1:56 AM, Daniel Borkmann wrote: > On 06/02/2014 09:02 PM, Alexei Starovoitov wrote: > ... >> >> Classic has all sorts of hard coded assumptions. The whole >> >> concept of 'load from magic constant' to mean different things >> is flawed. We all got used to it and now think that it's normal >> for "ld_abs -4056" to mean "a ^= x" > > > I think everyone knows that, no? Sure it doesn't fit into the > concept, but I think at the time BPF extensions were introduced, > it was probably seen as the best trade-off available to access > useful skb fields while still trying to minimize exposure to uapi > as much as possible. Exactly. It _was_ seen as right trade-off in the past. Now we have a lot more bpf users, so considerations are different. >> This split is not trying to make classic easier to hack. >> With eBPF underneath classic, it got a lot easier to add extensions >> to classic, but we shouldn't be doing it. >> Classic BPF is not generic and cannot become one. It's eBPF's job. >> >> The split is mainly helping to clearly see the boundary of eBPF core >> vs its socket use case. It doesn't change or add any API. > > > So what's the plan with everything in arch/*/net/, tools/net/ and > in Documentation/networking/filter.txt, plus MAINTAINERS file, that > the current patch doesn't address? I have multi-year long plan of actions in eBPF area and as was seen in past several month I would have to adjust it many times based on community feedback. The plan includes taking care of arch/*/net, but I'm not bringing it up right now, since the filter.c split itself doesn't depend on what we're going to do with JITs in arch/*/net/ As you saw I mentioned JITs in the cover letter, so I obviously thought about it before proposing this filter.c split. I even have rough patches to take care of it, but let's not get ahead of ourselves. My plan also includes upstreaming of LLVM eBPF backend, but linux needs to expose it to userspace first. It includes eBPF assembler to write programs like: r1 = r5 *(u32 *) (fp - 10) = 0 call foo if (r0 == 0) goto Label ^^above is assembler. I don't like current bpf_asm syntax, since it's too assemblish. C-looking assembler is easier to understand. It includes bpf maps, 'perf run filter.c' and all sorts of other things. I cannot put the year long plan in one email, since tl;dr kicks in. filter.c split is a tiny first step. next step is filter.h split renaming arch/*/net/bpf_jit_comp.c into arch/*/bpf/jit_comp.c is the least of my concerns. If JITs stay with strong dependency to NET, it's also fine. As I said in cover letter filter.c split is not about NET dependency. Even tiny embedded systems rely on networking, so all real world .config's will include 'NET'. The split is about logical separation of eBPF vs sockets. Having them in one file just not doing any good, since people are jumping into hacking things quickly without seeing that eBPF is not only about sockets. MAINTAINERS file is a good question too. I would be happy to maintain bpf/ebpf, since it's my full time job anyway, but again let's not jump the gun. > We want changes to go via net...@vger.kernel.org as they always > did, since [ although other use cases pop up ] the main user, as > I said, is simply still packet filtering in various networking > subsystems, no? Obviously sockets is the main, but not the only user, so I think both lkml and netdev would need to be cc-ed in the future. Or we can create 'bpf' alias for anyone interested. All of your points are valid. They are right questions to ask. I just don't see why you're still arguing about first step of filter.c split, whereas your concerns are about steps 2, 3, 4. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 net-next 0/2] split BPF out of core networking
On 06/02/2014 09:02 PM, Alexei Starovoitov wrote: ... Classic has all sorts of hard coded assumptions. The whole concept of 'load from magic constant' to mean different things is flawed. We all got used to it and now think that it's normal for "ld_abs -4056" to mean "a ^= x" I think everyone knows that, no? Sure it doesn't fit into the concept, but I think at the time BPF extensions were introduced, it was probably seen as the best trade-off available to access useful skb fields while still trying to minimize exposure to uapi as much as possible. This split is not trying to make classic easier to hack. With eBPF underneath classic, it got a lot easier to add extensions to classic, but we shouldn't be doing it. Classic BPF is not generic and cannot become one. It's eBPF's job. The split is mainly helping to clearly see the boundary of eBPF core vs its socket use case. It doesn't change or add any API. So what's the plan with everything in arch/*/net/, tools/net/ and in Documentation/networking/filter.txt, plus MAINTAINERS file, that the current patch doesn't address? We want changes to go via net...@vger.kernel.org as they always did, since [ although other use cases pop up ] the main user, as I said, is simply still packet filtering in various networking subsystems, no? This in-kernel API cleanup was done in commit 5fe821a9dee2 You even acked it back then :) I agreed with that change, otherwise I wouldn't have acked it, of course. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 net-next 0/2] split BPF out of core networking
On 06/02/2014 09:02 PM, Alexei Starovoitov wrote: ... Classic has all sorts of hard coded assumptions. The whole concept of 'load from magic constant' to mean different things is flawed. We all got used to it and now think that it's normal for ld_abs -4056 to mean a ^= x I think everyone knows that, no? Sure it doesn't fit into the concept, but I think at the time BPF extensions were introduced, it was probably seen as the best trade-off available to access useful skb fields while still trying to minimize exposure to uapi as much as possible. This split is not trying to make classic easier to hack. With eBPF underneath classic, it got a lot easier to add extensions to classic, but we shouldn't be doing it. Classic BPF is not generic and cannot become one. It's eBPF's job. The split is mainly helping to clearly see the boundary of eBPF core vs its socket use case. It doesn't change or add any API. So what's the plan with everything in arch/*/net/, tools/net/ and in Documentation/networking/filter.txt, plus MAINTAINERS file, that the current patch doesn't address? We want changes to go via net...@vger.kernel.org as they always did, since [ although other use cases pop up ] the main user, as I said, is simply still packet filtering in various networking subsystems, no? This in-kernel API cleanup was done in commit 5fe821a9dee2 You even acked it back then :) I agreed with that change, otherwise I wouldn't have acked it, of course. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 net-next 0/2] split BPF out of core networking
On Tue, Jun 3, 2014 at 1:56 AM, Daniel Borkmann dbork...@redhat.com wrote: On 06/02/2014 09:02 PM, Alexei Starovoitov wrote: ... Classic has all sorts of hard coded assumptions. The whole concept of 'load from magic constant' to mean different things is flawed. We all got used to it and now think that it's normal for ld_abs -4056 to mean a ^= x I think everyone knows that, no? Sure it doesn't fit into the concept, but I think at the time BPF extensions were introduced, it was probably seen as the best trade-off available to access useful skb fields while still trying to minimize exposure to uapi as much as possible. Exactly. It _was_ seen as right trade-off in the past. Now we have a lot more bpf users, so considerations are different. This split is not trying to make classic easier to hack. With eBPF underneath classic, it got a lot easier to add extensions to classic, but we shouldn't be doing it. Classic BPF is not generic and cannot become one. It's eBPF's job. The split is mainly helping to clearly see the boundary of eBPF core vs its socket use case. It doesn't change or add any API. So what's the plan with everything in arch/*/net/, tools/net/ and in Documentation/networking/filter.txt, plus MAINTAINERS file, that the current patch doesn't address? I have multi-year long plan of actions in eBPF area and as was seen in past several month I would have to adjust it many times based on community feedback. The plan includes taking care of arch/*/net, but I'm not bringing it up right now, since the filter.c split itself doesn't depend on what we're going to do with JITs in arch/*/net/ As you saw I mentioned JITs in the cover letter, so I obviously thought about it before proposing this filter.c split. I even have rough patches to take care of it, but let's not get ahead of ourselves. My plan also includes upstreaming of LLVM eBPF backend, but linux needs to expose it to userspace first. It includes eBPF assembler to write programs like: r1 = r5 *(u32 *) (fp - 10) = 0 call foo if (r0 == 0) goto Label ^^above is assembler. I don't like current bpf_asm syntax, since it's too assemblish. C-looking assembler is easier to understand. It includes bpf maps, 'perf run filter.c' and all sorts of other things. I cannot put the year long plan in one email, since tl;dr kicks in. filter.c split is a tiny first step. next step is filter.h split renaming arch/*/net/bpf_jit_comp.c into arch/*/bpf/jit_comp.c is the least of my concerns. If JITs stay with strong dependency to NET, it's also fine. As I said in cover letter filter.c split is not about NET dependency. Even tiny embedded systems rely on networking, so all real world .config's will include 'NET'. The split is about logical separation of eBPF vs sockets. Having them in one file just not doing any good, since people are jumping into hacking things quickly without seeing that eBPF is not only about sockets. MAINTAINERS file is a good question too. I would be happy to maintain bpf/ebpf, since it's my full time job anyway, but again let's not jump the gun. We want changes to go via net...@vger.kernel.org as they always did, since [ although other use cases pop up ] the main user, as I said, is simply still packet filtering in various networking subsystems, no? Obviously sockets is the main, but not the only user, so I think both lkml and netdev would need to be cc-ed in the future. Or we can create 'bpf' alias for anyone interested. All of your points are valid. They are right questions to ask. I just don't see why you're still arguing about first step of filter.c split, whereas your concerns are about steps 2, 3, 4. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 net-next 0/2] split BPF out of core networking
* Alexei Starovoitov a...@plumgrid.com wrote: On Mon, Jun 2, 2014 at 7:16 AM, Arnaldo Carvalho de Melo a...@kernel.org wrote: Em Mon, Jun 02, 2014 at 09:24:56AM -0400, Steven Rostedt escreveu: On Mon, 2 Jun 2014 08:15:45 -0500 Jonathan Corbet cor...@lwn.net wrote: On Mon, 2 Jun 2014 00:01:44 -0700 Alexei Starovoitov a...@plumgrid.com wrote: This patch set splits BPF out of core networking into generic component Quick, probably dumb question: if you're going to split it out, why not split it out entirely, into kernel/ or (perhaps better) lib/? The whole point seems to be that BPF is outgrowing its networking home, so it seems like it might be better to make it truly generic. I believe this is what Ingo suggested as well. If it is become generic, it belongs in lib/ Yes, that was his suggestion, which I agree with, FWIW. I guess I posted v2 too quickly :) v2 splits filter.c into kernel/bpf/. I think it's a better location than lib/bpf, since lib feels too constrained by definition of 'library'. bpf is more than a set of library calls. Yeah, the upgrade to kernel/bpf/ is a better place for BPF IMO: BPF is really an 'active', stateful subsystem, with non-trivial per arch implementations, while lib/ is generally for standalone, generic, platform-decoupled library functions (with a few exceptions). Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 net-next 0/2] split BPF out of core networking
On 06/03/2014 05:44 PM, Alexei Starovoitov wrote: ... All of your points are valid. They are right questions to ask. I just don't see why you're still arguing about first step of filter.c split, whereas your concerns are about steps 2, 3, 4. Fair enough, lets keep them in mind though for future work. Btw, are other files planned for kernel/bpf/ or should it instead just simply be kernel/bpf.c? -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 net-next 0/2] split BPF out of core networking
On Tue, Jun 3, 2014 at 1:35 PM, Daniel Borkmann dbork...@redhat.com wrote: On 06/03/2014 05:44 PM, Alexei Starovoitov wrote: ... All of your points are valid. They are right questions to ask. I just don't see why you're still arguing about first step of filter.c split, whereas your concerns are about steps 2, 3, 4. Fair enough, lets keep them in mind though for future work. Btw, Ok :) are other files planned for kernel/bpf/ or should it instead just simply be kernel/bpf.c? The most obvious one is eBPF verifier in separate file (kernel/bpf/verifier.c) bpf maps is yet another thing, but that's different topic. Probably a set of bpf-callable functions in another file. Like right now for sockets these helpers are __skb_get_pay_offset(), __skb_get_nlattr() For tracing there will be a different set of helper functions and eventually some will be common. Like __get_raw_cpu_id() from filter.c could eventually move to kernel/bpf/helpers.c I'm not a fan of squeezing different logic into one file. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 net-next 0/2] split BPF out of core networking
First of all, and just to join the crowd, kernel/bpf/ FTW. Now, I have some suggestions about eBPF. IMO classic BPF is an ISA oriented to filter (meaning returning a single integer that states how many bytes of the packet must be captured) packets (e.g. consider the 6 load modes, where 3 provide access the packet -- abs, ind, msh --, one to an skb field -- len--, the 5th one to the memory itself -- mem --, and the 6th is an immediate set mode --imm-- ) that has been used in other environments (seccomp, tracing, etc.) by (a) extending the idea of a packet into a buffer, and (b) adding ancillary loads. eBPF should be a generic ISA that can be used by many environments, including those served today by classic BPF. IMO, we should get a nicely-defined ISA (MIPS anyone?) and check what should go into eBPF and what should not. - 1. we should considering separating the eBPF ISA farther from classic BPF - eBPF still uses a_reg and x_reg as the names of the 2 op registers. This is very confusing, especially when dealing with translated filters that do move data between A and X. I've had a_reg being X, and x_reg being A. We should rename them d_reg and s_reg. - BPF_LD vs. BPF_LDX: this made sense in classic BPF, as there was only one register, and d_reg was implicit in the name of the insn code. Now, why are we keeping both in eBPF, when the register we're writing to is made explicit in d_reg (I already forgot if d_reg was a_reg or x_reg ;) ? Removing one of them will save us 1/8th of the insns. - BPF_ST vs. BPF_STX: same here. Note that the current sk_convert_filter() just converts all stores to BPF_STX. - 2. there are other insn that we should consider adding: - lui: AFAICT, there is no clean way to build a 64-bit number (you can LD_IMM the upper part, lsh 32, and then add the lower part). - nop: I'd like to have a nop. Do I know why? Nope. On Tue, Jun 3, 2014 at 1:58 PM, Alexei Starovoitov a...@plumgrid.com wrote: On Tue, Jun 3, 2014 at 1:35 PM, Daniel Borkmann dbork...@redhat.com wrote: On 06/03/2014 05:44 PM, Alexei Starovoitov wrote: ... All of your points are valid. They are right questions to ask. I just don't see why you're still arguing about first step of filter.c split, whereas your concerns are about steps 2, 3, 4. Fair enough, lets keep them in mind though for future work. Btw, Ok :) are other files planned for kernel/bpf/ or should it instead just simply be kernel/bpf.c? The most obvious one is eBPF verifier in separate file (kernel/bpf/verifier.c) bpf maps is yet another thing, but that's different topic. Probably a set of bpf-callable functions in another file. Like right now for sockets these helpers are __skb_get_pay_offset(), __skb_get_nlattr() For tracing there will be a different set of helper functions and eventually some will be common. Like __get_raw_cpu_id() from filter.c could eventually move to kernel/bpf/helpers.c LGTM. I like the idea of every user (packet filter, seccomp, etc.) providing a map of the bpf calls that are ok, as in the packet filter stating that {1-__skb_get_pay_offset(), 2-__skb_get_nlattr(), ...}, but seccomp providing a completely different (or even empty) map. -Chema I'm not a fan of squeezing different logic into one file. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 net-next 0/2] split BPF out of core networking
On Tue, Jun 3, 2014 at 2:40 PM, Chema Gonzalez ch...@google.com wrote: First of all, and just to join the crowd, kernel/bpf/ FTW. Now, I have some suggestions about eBPF. IMO classic BPF is an ISA oriented to filter (meaning returning a single integer that states how many bytes of the packet must be captured) packets (e.g. consider the 6 load modes, where 3 provide access the packet -- abs, ind, msh --, one to an skb field -- len--, the 5th one to the memory itself -- mem --, and the 6th is an immediate set mode --imm-- ) that has been used in other environments (seccomp, tracing, etc.) by (a) extending the idea of a packet into a buffer, and (b) adding ancillary loads. eBPF should be a generic ISA that can be used by many environments, including those served today by classic BPF. IMO, we should get a nicely-defined ISA (MIPS anyone?) and check what should go into eBPF and what should not. Model eBPF based on MIPS ISA? Ouch. That would be one ugly ISA that is not JITable on x64. eBPF ISA wasn't invented overnight. It was a gigantic effort that took a lot of time to narrow down x64 into _verifiable_ ISA. I had to take into account arm64, mips64, sparcv9 architectures too. Of course, minor things can be improved here or there. Ugliness of ISA hits compiler writers first. I've seen many times how cpu designers add new instructions only to be told by compiler guys that they just wasted silicon. Fact that llvm/gcc compile C into eBPF is the strongest statement that eBPF ISA is 99.9% complete. New instructions may or may not make sense. Let's examine your proposal: - 1. we should considering separating the eBPF ISA farther from classic BPF - eBPF still uses a_reg and x_reg as the names of the 2 op registers. This is very confusing, especially when dealing with translated filters that do move data between A and X. I've had a_reg being X, and x_reg being A. We should rename them d_reg and s_reg. that is renaming of two fields in sock_filter_int structure. No change to actual ISA. You're proposing the following: diff --git a/include/linux/filter.h b/include/linux/filter.h index 0e463ee77bb2..bf50fa440ef8 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -287,8 +287,8 @@ enum { struct sock_filter_int { __u8code; /* opcode */ - __u8a_reg:4;/* dest register */ - __u8x_reg:4;/* source register */ + __u8dst_reg:4; /* dest register */ + __u8src_reg:4; /* source register */ __s16 off;/* signed offset */ __s32 imm;/* signed immediate constant */ }; sure. I thought comment was explicit enough, but I agree the fields could have been named better. Will do a patch to rename it. - BPF_LD vs. BPF_LDX: this made sense in classic BPF, as there was only one register, and d_reg was implicit in the name of the insn code. Now, why are we keeping both in eBPF, when the register we're writing to is made explicit in d_reg (I already forgot if d_reg was a_reg or x_reg ;) ? Removing one of them will save us 1/8th of the insns. - BPF_ST vs. BPF_STX: same here. Note that the current sk_convert_filter() just converts all stores to BPF_STX. nope. No extra bits can be saved here. STX means: *(dest_reg + off) = src_reg; ST means: *(dest_reg + off) = imm; LDX means: dest_reg = *(src_reg + off) LD we had to carry over from classic as only two non-generic instructions: LD_ABS and LD_IND. - 2. there are other insn that we should consider adding: - lui: AFAICT, there is no clean way to build a 64-bit number (you can LD_IMM the upper part, lsh 32, and then add the lower part). correct. in tracing filters I do this: + /* construct 64-bit address */ + emit(BPF_ALU64_IMM(BPF_MOV, BPF_REG_2, addr 32), ctx); + emit(BPF_ALU64_IMM(BPF_LSH, BPF_REG_2, 32), ctx); + emit(BPF_ALU32_IMM(BPF_MOV, BPF_REG_1, (u32) addr), ctx); + emit(BPF_ALU64_REG(BPF_OR, BPF_REG_1, BPF_REG_2), ctx); so there is a way to construct 64-bit immediate. The question is how often do we need to do it? Is it in critical path? Naive answer one instruction is better than 4 doesn't count. See my point above 'cpu designer vs compiler writer'. None of the risc ISAs have 64-bit imm and eBPF has to consider simplicity of JITs otherwise those architectures will have a hard time mapping eBPF to native. If JIT would be too difficult to do, then there will be no JIT. I don't want eBPF to become an instruction set that can be JITed only on one architecture... 'mov dest_reg, imm64' may still be ok to add, since x64 can JIT it with one instruction, arm64 with 4 instructions, but JITs for other archs will be ugly. They can JIT it as load from memory, but I need to think it through. Let me explore it more carefully. Two must have requirements for eBPF: 1. verifiable instructions, meaning that
Re: [PATCH v2 net-next 0/2] split BPF out of core networking
On Mon, Jun 2, 2014 at 10:04 AM, Daniel Borkmann wrote: > On 06/02/2014 05:41 PM, Alexei Starovoitov wrote: > ... > >> Glad you brought up this point :) >> 100% agree that current double verification done by seccomp is far from >> being generic and quite hard to maintain, since any change done to >> classic BPF verifier needs to be thought through from >> seccomp_check_filter() >> perspective as well. > > > Glad we're on the same page. > > >> BPF's input context, set of allowed calls need to be expressed in a >> generic way. >> Obviously this split by itself won't make classic BPF all of a sudden >> generic. >> It rather defines a boundary of eBPF core. > > > Note, I'm not at all against using it in tracing, I think it's probably > a good idea, but shouldn't we _first_ think about how to overcome such > deficits as above by improving upon its in-kernel API design, thus to > better prepare it to be generic? I feel this step is otherwise just > skipped and quickly 'hacked' around ... ;) Are you talking about classic 'deficit' or eBPF 'deficit' ? Classic has all sorts of hard coded assumptions. The whole concept of 'load from magic constant' to mean different things is flawed. We all got used to it and now think that it's normal for "ld_abs -4056" to mean "a ^= x" This split is not trying to make classic easier to hack. With eBPF underneath classic, it got a lot easier to add extensions to classic, but we shouldn't be doing it. Classic BPF is not generic and cannot become one. It's eBPF's job. The split is mainly helping to clearly see the boundary of eBPF core vs its socket use case. It doesn't change or add any API. We need to carefully design eBPF APIs when we expose it to user space. I have a proposal for that too, but that's separate discussion. In terms of in-kernel eBPF API there is nothing to be done. eBPF program 'prog' is generated by whatever means and then: struct sk_filter *fp; fp = kzalloc(sk_filter_size(prog_len), GFP_KERNEL); memcpy(fp->insni, prog, prog_len * sizeof(fp->insni[0])); fp->len = prog_len; sk_filter_select_runtime(fp); // select interpreter or JIT SK_RUN_FILTER(fp, ctx); // run the program sk_filter_free(fp); // free program that's how sockets, testsuite, seccomp, tracing are doing it. All have different ways of producing 'prog' and 'prog_len'. This in-kernel API cleanup was done in commit 5fe821a9dee2 You even acked it back then :) If you're referring to eBPF verifier in-kernel API then yeah, it's missing, just like the whole eBPF verifier :) Ideally any kernel component that generates eBPF on the fly sends eBPF program to verifier first just to double check that generated program is valid. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 net-next 0/2] split BPF out of core networking
On 06/02/2014 05:41 PM, Alexei Starovoitov wrote: ... Glad you brought up this point :) 100% agree that current double verification done by seccomp is far from being generic and quite hard to maintain, since any change done to classic BPF verifier needs to be thought through from seccomp_check_filter() perspective as well. Glad we're on the same page. BPF's input context, set of allowed calls need to be expressed in a generic way. Obviously this split by itself won't make classic BPF all of a sudden generic. It rather defines a boundary of eBPF core. Note, I'm not at all against using it in tracing, I think it's probably a good idea, but shouldn't we _first_ think about how to overcome such deficits as above by improving upon its in-kernel API design, thus to better prepare it to be generic? I feel this step is otherwise just skipped and quickly 'hacked' around ... ;) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 net-next 0/2] split BPF out of core networking
On Mon, Jun 2, 2014 at 1:57 AM, Daniel Borkmann wrote: > On 06/02/2014 09:01 AM, Alexei Starovoitov wrote: >> >> This patch set splits BPF out of core networking into generic component >> >> patch #1 splits filter.c into two logical pieces: generic BPF core and >> socket >> filters. It only moves functions around. No real changes. >> >> patch #2 adds hidden CONFIG_BPF that seccomp/tracing can select >> >> The main value of the patch is not a NET separation, but rather logical >> boundary >> between generic BPF core and socket filtering. All socket specific code >> stays in >> net/core/filter.c and kernel/bpf/core.c is for generic BPF infrastructure >> (both >> classic and internal). >> >> Note that CONFIG_BPF_JIT is still under NET, so NET-less configs cannot >> use >> BPF JITs yet. This can be cleaned up in the future. Also it seems to makes >> sense >> to split up filter.h into generic and socket specific as well to cleanup >> the >> boundary further. > > > Hm, I really don't like that 'ripping code and headers apart' and then we > believe > it's a generic abstraction. So far seccomp-BPF could live with the current > state > since it was introduced, the rest of users (vast majority) is in the > networking > domain (and invoked through tcpdump et al) ... > > There are still parts in seccomp that show some BPF weaknesses in terms of > being > 'generic', for example shown in seccomp, we need to go once again over the > filter > instructions after doing the usual filter sanity checks, just to whitelist > what > seccomp may do in BPF. > > I have not yet thought about it deeply enough, but I think we should avoid > something similar in other non-networking areas but abstract that cleanly > w/o > such hacks first, for example. Glad you brought up this point :) 100% agree that current double verification done by seccomp is far from being generic and quite hard to maintain, since any change done to classic BPF verifier needs to be thought through from seccomp_check_filter() perspective as well. imo lack of generality in classic BPF is the main reason why we should stop adding extensions to classic and switch to eBPF for any new features. the eBPF verifier I posted now long ago is trying to be generic through customization. The verifier core needs to stay independent of the use case. BPF's input context, set of allowed calls need to be expressed in a generic way. Obviously this split by itself won't make classic BPF all of a sudden generic. It rather defines a boundary of eBPF core. In eBPF only two instructions are not generic. It's LD_ABS/LD_IND which are legacy instruction that we had to carry over from classic. They require input context == sk_buff. That's why core.c had to #include and do '__weak skb_copy_bits()'. Alternative to that was to #ifdef these two instructions out of interpreter and #ifndef NET #include and ld_abs helper functions in core.c IMO that would have been ugly for code style, maintenance and testing, but then core.c would have only one #include and we can say: 'look eBPF core.c is really generic' In the next set of patches I'll repost verifier and will explain how single eBPF verifier core can be used for socket, seccomp, tracing and other things. Note I'm not saying that we should use eBPF now everywhere. Classic BPF has its niche and that niche we have to maintain forever. So let's make sure that eBPF interpreter, its instruction set and its verifier are staying generic. This split is only first step in that direction that creates a file boundary between eBPF core and sockets. > >> Tested with several NET and NET-less configs on arm and x86 >> >> V1->V2: >> rebase on top of net-next >> split filter.c into kernel/bpf/core.c instead of net/bpf/core.c >> >> Alexei Starovoitov (2): >>net: filter: split filter.c into two files >>net: filter: split BPF out of core networking >> >> arch/Kconfig |6 +- >> include/linux/filter.h |2 + >> kernel/Makefile|1 + >> kernel/bpf/Makefile|5 + >> kernel/bpf/core.c | 1063 >> >> net/Kconfig|1 + >> net/core/filter.c | 1023 >> +- >> 7 files changed, 1079 insertions(+), 1022 deletions(-) >> create mode 100644 kernel/bpf/Makefile >> create mode 100644 kernel/bpf/core.c >> > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 net-next 0/2] split BPF out of core networking
On Mon, Jun 2, 2014 at 7:16 AM, Arnaldo Carvalho de Melo wrote: > Em Mon, Jun 02, 2014 at 09:24:56AM -0400, Steven Rostedt escreveu: >> On Mon, 2 Jun 2014 08:15:45 -0500 >> Jonathan Corbet wrote: > >> > On Mon, 2 Jun 2014 00:01:44 -0700 >> > Alexei Starovoitov wrote: > >> > > This patch set splits BPF out of core networking into generic component > >> > Quick, probably dumb question: if you're going to split it out, why not >> > split it out entirely, into kernel/ or (perhaps better) lib/? The >> > whole point seems to be that BPF is outgrowing its networking home, so >> > it seems like it might be better to make it truly generic. > >> I believe this is what Ingo suggested as well. If it is become generic, >> it belongs in lib/ > > Yes, that was his suggestion, which I agree with, FWIW. I guess I posted v2 too quickly :) v2 splits filter.c into kernel/bpf/. I think it's a better location than lib/bpf, since lib feels too constrained by definition of 'library'. bpf is more than a set of library calls. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 net-next 0/2] split BPF out of core networking
Em Mon, Jun 02, 2014 at 09:24:56AM -0400, Steven Rostedt escreveu: > On Mon, 2 Jun 2014 08:15:45 -0500 > Jonathan Corbet wrote: > > On Mon, 2 Jun 2014 00:01:44 -0700 > > Alexei Starovoitov wrote: > > > This patch set splits BPF out of core networking into generic component > > Quick, probably dumb question: if you're going to split it out, why not > > split it out entirely, into kernel/ or (perhaps better) lib/? The > > whole point seems to be that BPF is outgrowing its networking home, so > > it seems like it might be better to make it truly generic. > I believe this is what Ingo suggested as well. If it is become generic, > it belongs in lib/ Yes, that was his suggestion, which I agree with, FWIW. - Arnaldo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 net-next 0/2] split BPF out of core networking
On Mon, 2 Jun 2014 08:15:45 -0500 Jonathan Corbet wrote: > On Mon, 2 Jun 2014 00:01:44 -0700 > Alexei Starovoitov wrote: > > > This patch set splits BPF out of core networking into generic component > > Quick, probably dumb question: if you're going to split it out, why not > split it out entirely, into kernel/ or (perhaps better) lib/? The > whole point seems to be that BPF is outgrowing its networking home, so > it seems like it might be better to make it truly generic. I believe this is what Ingo suggested as well. If it is become generic, it belongs in lib/ -- Steve -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 net-next 0/2] split BPF out of core networking
On Mon, 2 Jun 2014 00:01:44 -0700 Alexei Starovoitov wrote: > This patch set splits BPF out of core networking into generic component Quick, probably dumb question: if you're going to split it out, why not split it out entirely, into kernel/ or (perhaps better) lib/? The whole point seems to be that BPF is outgrowing its networking home, so it seems like it might be better to make it truly generic. jon -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 net-next 0/2] split BPF out of core networking
On 06/02/2014 09:01 AM, Alexei Starovoitov wrote: This patch set splits BPF out of core networking into generic component patch #1 splits filter.c into two logical pieces: generic BPF core and socket filters. It only moves functions around. No real changes. patch #2 adds hidden CONFIG_BPF that seccomp/tracing can select The main value of the patch is not a NET separation, but rather logical boundary between generic BPF core and socket filtering. All socket specific code stays in net/core/filter.c and kernel/bpf/core.c is for generic BPF infrastructure (both classic and internal). Note that CONFIG_BPF_JIT is still under NET, so NET-less configs cannot use BPF JITs yet. This can be cleaned up in the future. Also it seems to makes sense to split up filter.h into generic and socket specific as well to cleanup the boundary further. Hm, I really don't like that 'ripping code and headers apart' and then we believe it's a generic abstraction. So far seccomp-BPF could live with the current state since it was introduced, the rest of users (vast majority) is in the networking domain (and invoked through tcpdump et al) ... There are still parts in seccomp that show some BPF weaknesses in terms of being 'generic', for example shown in seccomp, we need to go once again over the filter instructions after doing the usual filter sanity checks, just to whitelist what seccomp may do in BPF. I have not yet thought about it deeply enough, but I think we should avoid something similar in other non-networking areas but abstract that cleanly w/o such hacks first, for example. Tested with several NET and NET-less configs on arm and x86 V1->V2: rebase on top of net-next split filter.c into kernel/bpf/core.c instead of net/bpf/core.c Alexei Starovoitov (2): net: filter: split filter.c into two files net: filter: split BPF out of core networking arch/Kconfig |6 +- include/linux/filter.h |2 + kernel/Makefile|1 + kernel/bpf/Makefile|5 + kernel/bpf/core.c | 1063 net/Kconfig|1 + net/core/filter.c | 1023 +- 7 files changed, 1079 insertions(+), 1022 deletions(-) create mode 100644 kernel/bpf/Makefile create mode 100644 kernel/bpf/core.c -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 net-next 0/2] split BPF out of core networking
On 06/02/2014 09:01 AM, Alexei Starovoitov wrote: This patch set splits BPF out of core networking into generic component patch #1 splits filter.c into two logical pieces: generic BPF core and socket filters. It only moves functions around. No real changes. patch #2 adds hidden CONFIG_BPF that seccomp/tracing can select The main value of the patch is not a NET separation, but rather logical boundary between generic BPF core and socket filtering. All socket specific code stays in net/core/filter.c and kernel/bpf/core.c is for generic BPF infrastructure (both classic and internal). Note that CONFIG_BPF_JIT is still under NET, so NET-less configs cannot use BPF JITs yet. This can be cleaned up in the future. Also it seems to makes sense to split up filter.h into generic and socket specific as well to cleanup the boundary further. Hm, I really don't like that 'ripping code and headers apart' and then we believe it's a generic abstraction. So far seccomp-BPF could live with the current state since it was introduced, the rest of users (vast majority) is in the networking domain (and invoked through tcpdump et al) ... There are still parts in seccomp that show some BPF weaknesses in terms of being 'generic', for example shown in seccomp, we need to go once again over the filter instructions after doing the usual filter sanity checks, just to whitelist what seccomp may do in BPF. I have not yet thought about it deeply enough, but I think we should avoid something similar in other non-networking areas but abstract that cleanly w/o such hacks first, for example. Tested with several NET and NET-less configs on arm and x86 V1-V2: rebase on top of net-next split filter.c into kernel/bpf/core.c instead of net/bpf/core.c Alexei Starovoitov (2): net: filter: split filter.c into two files net: filter: split BPF out of core networking arch/Kconfig |6 +- include/linux/filter.h |2 + kernel/Makefile|1 + kernel/bpf/Makefile|5 + kernel/bpf/core.c | 1063 net/Kconfig|1 + net/core/filter.c | 1023 +- 7 files changed, 1079 insertions(+), 1022 deletions(-) create mode 100644 kernel/bpf/Makefile create mode 100644 kernel/bpf/core.c -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 net-next 0/2] split BPF out of core networking
On Mon, 2 Jun 2014 00:01:44 -0700 Alexei Starovoitov a...@plumgrid.com wrote: This patch set splits BPF out of core networking into generic component Quick, probably dumb question: if you're going to split it out, why not split it out entirely, into kernel/ or (perhaps better) lib/? The whole point seems to be that BPF is outgrowing its networking home, so it seems like it might be better to make it truly generic. jon -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 net-next 0/2] split BPF out of core networking
On Mon, 2 Jun 2014 08:15:45 -0500 Jonathan Corbet cor...@lwn.net wrote: On Mon, 2 Jun 2014 00:01:44 -0700 Alexei Starovoitov a...@plumgrid.com wrote: This patch set splits BPF out of core networking into generic component Quick, probably dumb question: if you're going to split it out, why not split it out entirely, into kernel/ or (perhaps better) lib/? The whole point seems to be that BPF is outgrowing its networking home, so it seems like it might be better to make it truly generic. I believe this is what Ingo suggested as well. If it is become generic, it belongs in lib/ -- Steve -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 net-next 0/2] split BPF out of core networking
Em Mon, Jun 02, 2014 at 09:24:56AM -0400, Steven Rostedt escreveu: On Mon, 2 Jun 2014 08:15:45 -0500 Jonathan Corbet cor...@lwn.net wrote: On Mon, 2 Jun 2014 00:01:44 -0700 Alexei Starovoitov a...@plumgrid.com wrote: This patch set splits BPF out of core networking into generic component Quick, probably dumb question: if you're going to split it out, why not split it out entirely, into kernel/ or (perhaps better) lib/? The whole point seems to be that BPF is outgrowing its networking home, so it seems like it might be better to make it truly generic. I believe this is what Ingo suggested as well. If it is become generic, it belongs in lib/ Yes, that was his suggestion, which I agree with, FWIW. - Arnaldo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 net-next 0/2] split BPF out of core networking
On Mon, Jun 2, 2014 at 7:16 AM, Arnaldo Carvalho de Melo a...@kernel.org wrote: Em Mon, Jun 02, 2014 at 09:24:56AM -0400, Steven Rostedt escreveu: On Mon, 2 Jun 2014 08:15:45 -0500 Jonathan Corbet cor...@lwn.net wrote: On Mon, 2 Jun 2014 00:01:44 -0700 Alexei Starovoitov a...@plumgrid.com wrote: This patch set splits BPF out of core networking into generic component Quick, probably dumb question: if you're going to split it out, why not split it out entirely, into kernel/ or (perhaps better) lib/? The whole point seems to be that BPF is outgrowing its networking home, so it seems like it might be better to make it truly generic. I believe this is what Ingo suggested as well. If it is become generic, it belongs in lib/ Yes, that was his suggestion, which I agree with, FWIW. I guess I posted v2 too quickly :) v2 splits filter.c into kernel/bpf/. I think it's a better location than lib/bpf, since lib feels too constrained by definition of 'library'. bpf is more than a set of library calls. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 net-next 0/2] split BPF out of core networking
On Mon, Jun 2, 2014 at 1:57 AM, Daniel Borkmann dbork...@redhat.com wrote: On 06/02/2014 09:01 AM, Alexei Starovoitov wrote: This patch set splits BPF out of core networking into generic component patch #1 splits filter.c into two logical pieces: generic BPF core and socket filters. It only moves functions around. No real changes. patch #2 adds hidden CONFIG_BPF that seccomp/tracing can select The main value of the patch is not a NET separation, but rather logical boundary between generic BPF core and socket filtering. All socket specific code stays in net/core/filter.c and kernel/bpf/core.c is for generic BPF infrastructure (both classic and internal). Note that CONFIG_BPF_JIT is still under NET, so NET-less configs cannot use BPF JITs yet. This can be cleaned up in the future. Also it seems to makes sense to split up filter.h into generic and socket specific as well to cleanup the boundary further. Hm, I really don't like that 'ripping code and headers apart' and then we believe it's a generic abstraction. So far seccomp-BPF could live with the current state since it was introduced, the rest of users (vast majority) is in the networking domain (and invoked through tcpdump et al) ... There are still parts in seccomp that show some BPF weaknesses in terms of being 'generic', for example shown in seccomp, we need to go once again over the filter instructions after doing the usual filter sanity checks, just to whitelist what seccomp may do in BPF. I have not yet thought about it deeply enough, but I think we should avoid something similar in other non-networking areas but abstract that cleanly w/o such hacks first, for example. Glad you brought up this point :) 100% agree that current double verification done by seccomp is far from being generic and quite hard to maintain, since any change done to classic BPF verifier needs to be thought through from seccomp_check_filter() perspective as well. imo lack of generality in classic BPF is the main reason why we should stop adding extensions to classic and switch to eBPF for any new features. the eBPF verifier I posted now long ago is trying to be generic through customization. The verifier core needs to stay independent of the use case. BPF's input context, set of allowed calls need to be expressed in a generic way. Obviously this split by itself won't make classic BPF all of a sudden generic. It rather defines a boundary of eBPF core. In eBPF only two instructions are not generic. It's LD_ABS/LD_IND which are legacy instruction that we had to carry over from classic. They require input context == sk_buff. That's why core.c had to #include skbuff.h and do '__weak skb_copy_bits()'. Alternative to that was to #ifdef these two instructions out of interpreter and #ifndef NET #include skbuff.h and ld_abs helper functions in core.c IMO that would have been ugly for code style, maintenance and testing, but then core.c would have only one #include filter.h and we can say: 'look eBPF core.c is really generic' In the next set of patches I'll repost verifier and will explain how single eBPF verifier core can be used for socket, seccomp, tracing and other things. Note I'm not saying that we should use eBPF now everywhere. Classic BPF has its niche and that niche we have to maintain forever. So let's make sure that eBPF interpreter, its instruction set and its verifier are staying generic. This split is only first step in that direction that creates a file boundary between eBPF core and sockets. Tested with several NET and NET-less configs on arm and x86 V1-V2: rebase on top of net-next split filter.c into kernel/bpf/core.c instead of net/bpf/core.c Alexei Starovoitov (2): net: filter: split filter.c into two files net: filter: split BPF out of core networking arch/Kconfig |6 +- include/linux/filter.h |2 + kernel/Makefile|1 + kernel/bpf/Makefile|5 + kernel/bpf/core.c | 1063 net/Kconfig|1 + net/core/filter.c | 1023 +- 7 files changed, 1079 insertions(+), 1022 deletions(-) create mode 100644 kernel/bpf/Makefile create mode 100644 kernel/bpf/core.c -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 net-next 0/2] split BPF out of core networking
On 06/02/2014 05:41 PM, Alexei Starovoitov wrote: ... Glad you brought up this point :) 100% agree that current double verification done by seccomp is far from being generic and quite hard to maintain, since any change done to classic BPF verifier needs to be thought through from seccomp_check_filter() perspective as well. Glad we're on the same page. BPF's input context, set of allowed calls need to be expressed in a generic way. Obviously this split by itself won't make classic BPF all of a sudden generic. It rather defines a boundary of eBPF core. Note, I'm not at all against using it in tracing, I think it's probably a good idea, but shouldn't we _first_ think about how to overcome such deficits as above by improving upon its in-kernel API design, thus to better prepare it to be generic? I feel this step is otherwise just skipped and quickly 'hacked' around ... ;) -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 net-next 0/2] split BPF out of core networking
On Mon, Jun 2, 2014 at 10:04 AM, Daniel Borkmann dbork...@redhat.com wrote: On 06/02/2014 05:41 PM, Alexei Starovoitov wrote: ... Glad you brought up this point :) 100% agree that current double verification done by seccomp is far from being generic and quite hard to maintain, since any change done to classic BPF verifier needs to be thought through from seccomp_check_filter() perspective as well. Glad we're on the same page. BPF's input context, set of allowed calls need to be expressed in a generic way. Obviously this split by itself won't make classic BPF all of a sudden generic. It rather defines a boundary of eBPF core. Note, I'm not at all against using it in tracing, I think it's probably a good idea, but shouldn't we _first_ think about how to overcome such deficits as above by improving upon its in-kernel API design, thus to better prepare it to be generic? I feel this step is otherwise just skipped and quickly 'hacked' around ... ;) Are you talking about classic 'deficit' or eBPF 'deficit' ? Classic has all sorts of hard coded assumptions. The whole concept of 'load from magic constant' to mean different things is flawed. We all got used to it and now think that it's normal for ld_abs -4056 to mean a ^= x This split is not trying to make classic easier to hack. With eBPF underneath classic, it got a lot easier to add extensions to classic, but we shouldn't be doing it. Classic BPF is not generic and cannot become one. It's eBPF's job. The split is mainly helping to clearly see the boundary of eBPF core vs its socket use case. It doesn't change or add any API. We need to carefully design eBPF APIs when we expose it to user space. I have a proposal for that too, but that's separate discussion. In terms of in-kernel eBPF API there is nothing to be done. eBPF program 'prog' is generated by whatever means and then: struct sk_filter *fp; fp = kzalloc(sk_filter_size(prog_len), GFP_KERNEL); memcpy(fp-insni, prog, prog_len * sizeof(fp-insni[0])); fp-len = prog_len; sk_filter_select_runtime(fp); // select interpreter or JIT SK_RUN_FILTER(fp, ctx); // run the program sk_filter_free(fp); // free program that's how sockets, testsuite, seccomp, tracing are doing it. All have different ways of producing 'prog' and 'prog_len'. This in-kernel API cleanup was done in commit 5fe821a9dee2 You even acked it back then :) If you're referring to eBPF verifier in-kernel API then yeah, it's missing, just like the whole eBPF verifier :) Ideally any kernel component that generates eBPF on the fly sends eBPF program to verifier first just to double check that generated program is valid. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/