Re: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-24 Thread Daniel Borkmann

On 06/23/2014 11:57 PM, Alexei Starovoitov wrote:

On Fri, Jun 20, 2014 at 9:44 AM, Chema Gonzalez  wrote:

...

   - nop: I'd like to have a nop. Do I know why? Nope.

nope. Let's not add unnecessary instructions.

A valid nop is a useful instruction: padding, filling up arrays of
sock_filter_int correctly (as in lib/test_bpf.c, where we're currently
using a "ld #0", which loads zero to register A), and other use cases
(see http://en.wikipedia.org/wiki/NOP ).


especially I don't like to add 'nop' instruction.
code==0 to mean 'ld #0' is one of classic BPF ugliness.


I think it was probably unintended to be able to have unreachable
code e.g. filled with 'nops' where both jt, jf just jump over it,
but that quirk we cannot change anymore in the classic checker
and have to carry onwards.


We're not filling up arrays with nops in lib/test_bpf.c
Zero is invalid opcode in eBPF and should stay so, since it's
an easy check for humans like me who are looking at eBPF in hex.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-24 Thread Daniel Borkmann

On 06/23/2014 11:57 PM, Alexei Starovoitov wrote:

On Fri, Jun 20, 2014 at 9:44 AM, Chema Gonzalez ch...@google.com wrote:

...

   - nop: I'd like to have a nop. Do I know why? Nope.

nope. Let's not add unnecessary instructions.

A valid nop is a useful instruction: padding, filling up arrays of
sock_filter_int correctly (as in lib/test_bpf.c, where we're currently
using a ld #0, which loads zero to register A), and other use cases
(see http://en.wikipedia.org/wiki/NOP ).


especially I don't like to add 'nop' instruction.
code==0 to mean 'ld #0' is one of classic BPF ugliness.


I think it was probably unintended to be able to have unreachable
code e.g. filled with 'nops' where both jt, jf just jump over it,
but that quirk we cannot change anymore in the classic checker
and have to carry onwards.


We're not filling up arrays with nops in lib/test_bpf.c
Zero is invalid opcode in eBPF and should stay so, since it's
an easy check for humans like me who are looking at eBPF in hex.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-23 Thread Alexei Starovoitov
On Fri, Jun 20, 2014 at 9:44 AM, Chema Gonzalez  wrote:
>>
>> Model eBPF based on MIPS ISA? Ouch.
>> That would be one ugly ISA that is not JITable on x64.
>
> Definitely I wasn't making my point clear: IMO if we're redesigning
> the BPF ISA, we should get a clean one (clean=something that is simple
> enough to be readable by a newcomer). I mentioned MIPS because it's
> the one I know the most, and it's kind of clean. I'm definitely not
> proposing to use MIPS as the BPF ISA.

mips is a clean isa? When it was designed 30 years ago, it was good,
but today it really shows its age: delay slots, integer arithmetic that
traps on overflow, lack of real comparison operations, hi/lo, etc.
I strongly believe eBPF isa is way cleaner and easier to understand
than mips isa.

It seems your proposals to make eBPF 'cleaner' are based on
HW mindset, which is not applicable here. Details below:

> 1. how we codify the ISA. Both BPF and eBPF devote 50% of the insn
> space to load/store (4 opcodes/instruction class values of 8
> possible). In comparison, MIPS uses just 14% (9 of 64 opcodes). That

that is a misleading comparison and leads to the wrong conclusions.
Your proposal to remove useful instruction just to save a bit in bpf
class encoding just doesn't make sense. We have infinite room to
add new instructions and opcodes. Unlike HW ISA eBPF is not
limited by 8-bit opcodes and 8-byte instructions. eBPF is not designed
to be run directly by HW, so we're not trying to save rtl gates here.
Among other things we're optimizing for interpreter performance,
so removing instructions can only hurt.

> gives me pause. I'm definitely not suggesting adding more insn just
> because physical ISAs have it, but I think it makes sense to avoid
> using the whole insn space just because it's there, or because classic
> BPF was using it all.

wrong. Classic BPF is a legacy that we have to live with and we
should not sacrifice performance of classic bpf filters just to reduce
number of eBPF instructions.
It made sense only for one classic instruction: BPF_LD + MSH
It is really single purpose instruction to do X = ip->length * 4.
We didn't carry this ugliness into eBPF, since it's not generic,
can be easily represented by generic instructions, complicates JITs.
Since MSH is used at most once per tcpdump filter, few extra
insns add a tiny penalty to overall filter execution time in interpreter
and give no performance penalty at all when filter is JITed.

> 2. instructions (sock_filter_int) hardcode 2 immediate values. This is
> very unusual in ISAs. We're effectively doubling the insn size (from
> 32 to 64 bits), and still we are cramming the whole insn space (only 1
> reserved instruction class of 8 possible ones). The rationale is to

Comparisons with HW encoding is not applicable.
mips picked 4 byte wide instruction and had to live with it. Just look
at hi/lo insns and what compilers have to do with it.
All eBPF instructions today are 8 byte wide. There is no reason
to redesign them into <8 bytes or squeeze bits. It will hurt
performance without giving us anything back.
At the same time we can add 16-byte instruction to represent
load 64-bit immediate, but as I was saying before many factors
need to be considered before we proceed.

> 3. name reuse: we're reusing names b etween classic BPF and eBPF. For
> example, BPF_LD*|BPF_MEM in classic BPF refers to access to the M[]
> buffer. BPF_LDX|BPF_MEM in eBPF is a generic memory access. I find
> this very confusing, especially because we'll have to live with

That's one opinion. I think names are fine and
Documentation/networking/filter.txt explains both classic and eBPF
encoding well enough.

> classic BPF in userland filters for a long time. In fact, if you ask
> me, I'll come up with some generic name for the generic linux
> filtering mechanism (eBPF and internal BPF sound too much like BPF),
> to make it clear that this is not just BPF.

I don't think it's a good idea. I like BPF abbreviation, since the name
implies the use case. Renaming eBPF to ISA_X will be confusing.
Now everyone understands that BPF is a safe instruction set that
can be dynamically loaded into the kernel. eBPF is the same
plus more.

> 4. simplicity: both BPF and eBPF have 4 ld/st operations types (LD,
> LDX, ST, STX) and many addressing modes/modifiers (6 for BPF, 4 for
> eBPF), where only a subset of the {operation types, modifier} tuples
> are valid. I think we can make it simpler. For the eBPF case, we
> currently have 6 valid combinations:

That's a good summary. I think documentation explained it already,
but if you feel it's still missing pieces, please send a patch to
improve the doc.

> The two eBPF BPF_LD insn are BPF_LDX|BPF_MEM insn to access an skbuff.
> For example, BPF_LD|BPF_ABS does "dst_reg = packet[imm32]", which is a
> BPF_LDX|BPF_MEM with the following differences:
> a. packet is skb->data == CTX == BPF_R6 (this is equivalent to
> src_reg==R6 in BPF_LDX|BPF_MEM)
> b. output is left in R0, not in 

RE: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-23 Thread David Laight
From: Chema Gonzalez
...
> 4.5. BPF_ST|BPF_MEM
> Operation: *(size *) (dst_reg + off16) = imm32
> 
> This insn encodes 2 immediate values (the offset and the imm32 value)
> in the insn, and actually forces the sock_filter_int 64-bit struct to
> have both a 16-bit offset field and a 32-bit immediate field). In
> fact, it's the only instructions that uses .off and .imm at the same
> time (for all other instructions, at least one of the fields is always
> 0).
> 
> This did not exist in classic BPF (where BPF_ST|BPF_MEM actually did
> "mem[pc->k] = A;"). In fact, it's rare to find an ISA that allows
> encoding 2 immediate values in a single insn. My impression (after
> checking the x86 JIT implementation, which works on the eBPF code) is
> that this was added as an x86 optimization, because x86 allows
> encoding 2 values (offset and immediate) by using the displacement and
> immediate suffixes. I wonder whether the ISA would be more readable if
> we did this in 2 insn, one to put dst_reg+off16 in a temporary
> register, and the second a simpler BPF_STX|BPF_MEM. Then we could use
> the same space for the immediate and offset fields.

One option is to add code to the x86 JIT to detect the two instruction
sequence and generate a single instruction.

Thinks further, the JIT might be easier to write if there is a temporary
register that is defined to be only valid for the next instruction (or two).
Then the JIT can completely optimise away any assignments to it without
requiring a full analysis of the entire program.

David



Re: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-23 Thread Alexei Starovoitov
On Fri, Jun 20, 2014 at 9:44 AM, Chema Gonzalez ch...@google.com wrote:

 Model eBPF based on MIPS ISA? Ouch.
 That would be one ugly ISA that is not JITable on x64.

 Definitely I wasn't making my point clear: IMO if we're redesigning
 the BPF ISA, we should get a clean one (clean=something that is simple
 enough to be readable by a newcomer). I mentioned MIPS because it's
 the one I know the most, and it's kind of clean. I'm definitely not
 proposing to use MIPS as the BPF ISA.

mips is a clean isa? When it was designed 30 years ago, it was good,
but today it really shows its age: delay slots, integer arithmetic that
traps on overflow, lack of real comparison operations, hi/lo, etc.
I strongly believe eBPF isa is way cleaner and easier to understand
than mips isa.

It seems your proposals to make eBPF 'cleaner' are based on
HW mindset, which is not applicable here. Details below:

 1. how we codify the ISA. Both BPF and eBPF devote 50% of the insn
 space to load/store (4 opcodes/instruction class values of 8
 possible). In comparison, MIPS uses just 14% (9 of 64 opcodes). That

that is a misleading comparison and leads to the wrong conclusions.
Your proposal to remove useful instruction just to save a bit in bpf
class encoding just doesn't make sense. We have infinite room to
add new instructions and opcodes. Unlike HW ISA eBPF is not
limited by 8-bit opcodes and 8-byte instructions. eBPF is not designed
to be run directly by HW, so we're not trying to save rtl gates here.
Among other things we're optimizing for interpreter performance,
so removing instructions can only hurt.

 gives me pause. I'm definitely not suggesting adding more insn just
 because physical ISAs have it, but I think it makes sense to avoid
 using the whole insn space just because it's there, or because classic
 BPF was using it all.

wrong. Classic BPF is a legacy that we have to live with and we
should not sacrifice performance of classic bpf filters just to reduce
number of eBPF instructions.
It made sense only for one classic instruction: BPF_LD + MSH
It is really single purpose instruction to do X = ip-length * 4.
We didn't carry this ugliness into eBPF, since it's not generic,
can be easily represented by generic instructions, complicates JITs.
Since MSH is used at most once per tcpdump filter, few extra
insns add a tiny penalty to overall filter execution time in interpreter
and give no performance penalty at all when filter is JITed.

 2. instructions (sock_filter_int) hardcode 2 immediate values. This is
 very unusual in ISAs. We're effectively doubling the insn size (from
 32 to 64 bits), and still we are cramming the whole insn space (only 1
 reserved instruction class of 8 possible ones). The rationale is to

Comparisons with HW encoding is not applicable.
mips picked 4 byte wide instruction and had to live with it. Just look
at hi/lo insns and what compilers have to do with it.
All eBPF instructions today are 8 byte wide. There is no reason
to redesign them into 8 bytes or squeeze bits. It will hurt
performance without giving us anything back.
At the same time we can add 16-byte instruction to represent
load 64-bit immediate, but as I was saying before many factors
need to be considered before we proceed.

 3. name reuse: we're reusing names b etween classic BPF and eBPF. For
 example, BPF_LD*|BPF_MEM in classic BPF refers to access to the M[]
 buffer. BPF_LDX|BPF_MEM in eBPF is a generic memory access. I find
 this very confusing, especially because we'll have to live with

That's one opinion. I think names are fine and
Documentation/networking/filter.txt explains both classic and eBPF
encoding well enough.

 classic BPF in userland filters for a long time. In fact, if you ask
 me, I'll come up with some generic name for the generic linux
 filtering mechanism (eBPF and internal BPF sound too much like BPF),
 to make it clear that this is not just BPF.

I don't think it's a good idea. I like BPF abbreviation, since the name
implies the use case. Renaming eBPF to ISA_X will be confusing.
Now everyone understands that BPF is a safe instruction set that
can be dynamically loaded into the kernel. eBPF is the same
plus more.

 4. simplicity: both BPF and eBPF have 4 ld/st operations types (LD,
 LDX, ST, STX) and many addressing modes/modifiers (6 for BPF, 4 for
 eBPF), where only a subset of the {operation types, modifier} tuples
 are valid. I think we can make it simpler. For the eBPF case, we
 currently have 6 valid combinations:

That's a good summary. I think documentation explained it already,
but if you feel it's still missing pieces, please send a patch to
improve the doc.

 The two eBPF BPF_LD insn are BPF_LDX|BPF_MEM insn to access an skbuff.
 For example, BPF_LD|BPF_ABS does dst_reg = packet[imm32], which is a
 BPF_LDX|BPF_MEM with the following differences:
 a. packet is skb-data == CTX == BPF_R6 (this is equivalent to
 src_reg==R6 in BPF_LDX|BPF_MEM)
 b. output is left in R0, not in dst_reg
 c. result is 

RE: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-23 Thread David Laight
From: Chema Gonzalez
...
 4.5. BPF_ST|BPF_MEM
 Operation: *(size *) (dst_reg + off16) = imm32
 
 This insn encodes 2 immediate values (the offset and the imm32 value)
 in the insn, and actually forces the sock_filter_int 64-bit struct to
 have both a 16-bit offset field and a 32-bit immediate field). In
 fact, it's the only instructions that uses .off and .imm at the same
 time (for all other instructions, at least one of the fields is always
 0).
 
 This did not exist in classic BPF (where BPF_ST|BPF_MEM actually did
 mem[pc-k] = A;). In fact, it's rare to find an ISA that allows
 encoding 2 immediate values in a single insn. My impression (after
 checking the x86 JIT implementation, which works on the eBPF code) is
 that this was added as an x86 optimization, because x86 allows
 encoding 2 values (offset and immediate) by using the displacement and
 immediate suffixes. I wonder whether the ISA would be more readable if
 we did this in 2 insn, one to put dst_reg+off16 in a temporary
 register, and the second a simpler BPF_STX|BPF_MEM. Then we could use
 the same space for the immediate and offset fields.

One option is to add code to the x86 JIT to detect the two instruction
sequence and generate a single instruction.

Thinks further, the JIT might be easier to write if there is a temporary
register that is defined to be only valid for the next instruction (or two).
Then the JIT can completely optimise away any assignments to it without
requiring a full analysis of the entire program.

David



Re: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-20 Thread Chema Gonzalez
[Sorry for the delay in the answer. Been mired somewhere else.]

On Tue, Jun 3, 2014 at 5:38 PM, Alexei Starovoitov  wrote:
> On Tue, Jun 3, 2014 at 2:40 PM, Chema Gonzalez  wrote:
>> First of all, and just to join the crowd, kernel/bpf/ FTW.
>>
>> Now, I have some suggestions about eBPF. IMO classic BPF is an ISA
>> oriented to filter (meaning returning a single integer that states how
>> many bytes of the packet must be captured) packets (e.g. consider the
>> 6 load modes, where 3 provide access the packet -- abs, ind, msh --,
>> one to an skb field -- len--, the 5th one to the memory itself -- mem
>> --, and the 6th is an immediate set mode --imm-- ) that has been used
>> in other environments (seccomp, tracing, etc.) by (a) extending the
>> idea of a "packet" into a "buffer", and (b) adding ancillary loads.
>>
>> eBPF should be a generic ISA that can be used by many environments,
>> including those served today by classic BPF. IMO, we should get a
>> nicely-defined ISA (MIPS anyone?) and check what should go into eBPF
>> and what should not.
>
> Model eBPF based on MIPS ISA? Ouch.
> That would be one ugly ISA that is not JITable on x64.

Definitely I wasn't making my point clear: IMO if we're redesigning
the BPF ISA, we should get a clean one (clean=something that is simple
enough to be readable by a newcomer). I mentioned MIPS because it's
the one I know the most, and it's kind of clean. I'm definitely not
proposing to use MIPS as the BPF ISA.

In particular, I have 4 cleanliness concerns related to load/stores:

1. how we codify the ISA. Both BPF and eBPF devote 50% of the insn
space to load/store (4 opcodes/instruction class values of 8
possible). In comparison, MIPS uses just 14% (9 of 64 opcodes). That
gives me pause. I'm definitely not suggesting adding more insn just
because physical ISAs have it, but I think it makes sense to avoid
using the whole insn space just because it's there, or because classic
BPF was using it all.

2. instructions (sock_filter_int) hardcode 2 immediate values. This is
very unusual in ISAs. We're effectively doubling the insn size (from
32 to 64 bits), and still we are cramming the whole insn space (only 1
reserved instruction class of 8 possible ones). The rationale is to
support a new addressing mode (BPF_ST|BPF_MEM), where both the offset
to src_reg and the immediate value are hardcoded in the insn. (See
more below.)

3. name reuse: we're reusing names b etween classic BPF and eBPF. For
example, BPF_LD*|BPF_MEM in classic BPF refers to access to the M[]
buffer. BPF_LDX|BPF_MEM in eBPF is a generic memory access. I find
this very confusing, especially because we'll have to live with
classic BPF in userland filters for a long time. In fact, if you ask
me, I'll come up with some generic name for the generic linux
filtering mechanism (eBPF and internal BPF sound too much like BPF),
to make it clear that this is not just BPF.

4. simplicity: both BPF and eBPF have 4 ld/st operations types (LD,
LDX, ST, STX) and many addressing modes/modifiers (6 for BPF, 4 for
eBPF), where only a subset of the {operation types, modifier} tuples
are valid. I think we can make it simpler. For the eBPF case, we
currently have 6 valid combinations:

4.1. BPF_LDX|BPF_MEM
Operation: dst_reg = *(size *) (src_reg + off16)

This is the basic load insn. It's used to convert most of the classic
BPF addressing modes by setting the right src_reg (FP in the classic
BPF M[] access, CTX for the classic BPF BPF_LD*|BPF_LEN access and
seccomp_data access, etc.)

4.2. BPF_LD|BPF_ABS
Operation: BPF_R0 = ntoh(*(size *) (skb->data + imm32))

4.3. BPF_LD|BPF_IND
Operation: BPF_R0 = ntoh(*(size *) (skb->data + src_reg + imm32))

The two eBPF BPF_LD insn are BPF_LDX|BPF_MEM insn to access an skbuff.
For example, BPF_LD|BPF_ABS does "dst_reg = packet[imm32]", which is a
BPF_LDX|BPF_MEM with the following differences:
a. packet is skb->data == CTX == BPF_R6 (this is equivalent to
src_reg==R6 in BPF_LDX|BPF_MEM)
b. output is left in R0, not in dst_reg
c. result is ntohs()|ntohl()
d. every packet access is checked using
load_pointer()/skb_header_pointer()/bpf_internal_load_pointer_neg_helper()

Now, (a,b,c) seem like details that should be worked with multiple
instructions (in fact the x86 JIT does that). (d) is IMO the only part
important enough to justify a different insn. I'd call this mode
BPF_SKBUFF_PROTECTED (or something like that), because that is the
main idea of this instruction: that any memory access (ld or st) is
checked during runtime assuming it's an skbuff.

The BPF_LD|BPF_IND insn could be replaced in 2 steps, one to get
src_reg+imm32 into a tmp register, and the other to perform the final
load based on the tmp register.

4.4. BPF_STX|BPF_MEM
Operation: *(size *) (dst_reg + off16) = src_reg

This is the basic store insn. LGTM.

4.5. BPF_ST|BPF_MEM
Operation: *(size *) (dst_reg + off16) = imm32

This insn encodes 2 immediate values (the offset and the imm32 value)
in the insn, and actually forces the 

Re: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-20 Thread Chema Gonzalez
[Sorry for the delay in the answer. Been mired somewhere else.]

On Tue, Jun 3, 2014 at 5:38 PM, Alexei Starovoitov a...@plumgrid.com wrote:
 On Tue, Jun 3, 2014 at 2:40 PM, Chema Gonzalez ch...@google.com wrote:
 First of all, and just to join the crowd, kernel/bpf/ FTW.

 Now, I have some suggestions about eBPF. IMO classic BPF is an ISA
 oriented to filter (meaning returning a single integer that states how
 many bytes of the packet must be captured) packets (e.g. consider the
 6 load modes, where 3 provide access the packet -- abs, ind, msh --,
 one to an skb field -- len--, the 5th one to the memory itself -- mem
 --, and the 6th is an immediate set mode --imm-- ) that has been used
 in other environments (seccomp, tracing, etc.) by (a) extending the
 idea of a packet into a buffer, and (b) adding ancillary loads.

 eBPF should be a generic ISA that can be used by many environments,
 including those served today by classic BPF. IMO, we should get a
 nicely-defined ISA (MIPS anyone?) and check what should go into eBPF
 and what should not.

 Model eBPF based on MIPS ISA? Ouch.
 That would be one ugly ISA that is not JITable on x64.

Definitely I wasn't making my point clear: IMO if we're redesigning
the BPF ISA, we should get a clean one (clean=something that is simple
enough to be readable by a newcomer). I mentioned MIPS because it's
the one I know the most, and it's kind of clean. I'm definitely not
proposing to use MIPS as the BPF ISA.

In particular, I have 4 cleanliness concerns related to load/stores:

1. how we codify the ISA. Both BPF and eBPF devote 50% of the insn
space to load/store (4 opcodes/instruction class values of 8
possible). In comparison, MIPS uses just 14% (9 of 64 opcodes). That
gives me pause. I'm definitely not suggesting adding more insn just
because physical ISAs have it, but I think it makes sense to avoid
using the whole insn space just because it's there, or because classic
BPF was using it all.

2. instructions (sock_filter_int) hardcode 2 immediate values. This is
very unusual in ISAs. We're effectively doubling the insn size (from
32 to 64 bits), and still we are cramming the whole insn space (only 1
reserved instruction class of 8 possible ones). The rationale is to
support a new addressing mode (BPF_ST|BPF_MEM), where both the offset
to src_reg and the immediate value are hardcoded in the insn. (See
more below.)

3. name reuse: we're reusing names b etween classic BPF and eBPF. For
example, BPF_LD*|BPF_MEM in classic BPF refers to access to the M[]
buffer. BPF_LDX|BPF_MEM in eBPF is a generic memory access. I find
this very confusing, especially because we'll have to live with
classic BPF in userland filters for a long time. In fact, if you ask
me, I'll come up with some generic name for the generic linux
filtering mechanism (eBPF and internal BPF sound too much like BPF),
to make it clear that this is not just BPF.

4. simplicity: both BPF and eBPF have 4 ld/st operations types (LD,
LDX, ST, STX) and many addressing modes/modifiers (6 for BPF, 4 for
eBPF), where only a subset of the {operation types, modifier} tuples
are valid. I think we can make it simpler. For the eBPF case, we
currently have 6 valid combinations:

4.1. BPF_LDX|BPF_MEM
Operation: dst_reg = *(size *) (src_reg + off16)

This is the basic load insn. It's used to convert most of the classic
BPF addressing modes by setting the right src_reg (FP in the classic
BPF M[] access, CTX for the classic BPF BPF_LD*|BPF_LEN access and
seccomp_data access, etc.)

4.2. BPF_LD|BPF_ABS
Operation: BPF_R0 = ntohsize(*(size *) (skb-data + imm32))

4.3. BPF_LD|BPF_IND
Operation: BPF_R0 = ntohsize(*(size *) (skb-data + src_reg + imm32))

The two eBPF BPF_LD insn are BPF_LDX|BPF_MEM insn to access an skbuff.
For example, BPF_LD|BPF_ABS does dst_reg = packet[imm32], which is a
BPF_LDX|BPF_MEM with the following differences:
a. packet is skb-data == CTX == BPF_R6 (this is equivalent to
src_reg==R6 in BPF_LDX|BPF_MEM)
b. output is left in R0, not in dst_reg
c. result is ntohs()|ntohl()
d. every packet access is checked using
load_pointer()/skb_header_pointer()/bpf_internal_load_pointer_neg_helper()

Now, (a,b,c) seem like details that should be worked with multiple
instructions (in fact the x86 JIT does that). (d) is IMO the only part
important enough to justify a different insn. I'd call this mode
BPF_SKBUFF_PROTECTED (or something like that), because that is the
main idea of this instruction: that any memory access (ld or st) is
checked during runtime assuming it's an skbuff.

The BPF_LD|BPF_IND insn could be replaced in 2 steps, one to get
src_reg+imm32 into a tmp register, and the other to perform the final
load based on the tmp register.

4.4. BPF_STX|BPF_MEM
Operation: *(size *) (dst_reg + off16) = src_reg

This is the basic store insn. LGTM.

4.5. BPF_ST|BPF_MEM
Operation: *(size *) (dst_reg + off16) = imm32

This insn encodes 2 immediate values (the offset and the imm32 value)
in the insn, and actually forces the 

Re: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-03 Thread Alexei Starovoitov
On Tue, Jun 3, 2014 at 2:40 PM, Chema Gonzalez  wrote:
> First of all, and just to join the crowd, kernel/bpf/ FTW.
>
> Now, I have some suggestions about eBPF. IMO classic BPF is an ISA
> oriented to filter (meaning returning a single integer that states how
> many bytes of the packet must be captured) packets (e.g. consider the
> 6 load modes, where 3 provide access the packet -- abs, ind, msh --,
> one to an skb field -- len--, the 5th one to the memory itself -- mem
> --, and the 6th is an immediate set mode --imm-- ) that has been used
> in other environments (seccomp, tracing, etc.) by (a) extending the
> idea of a "packet" into a "buffer", and (b) adding ancillary loads.
>
> eBPF should be a generic ISA that can be used by many environments,
> including those served today by classic BPF. IMO, we should get a
> nicely-defined ISA (MIPS anyone?) and check what should go into eBPF
> and what should not.

Model eBPF based on MIPS ISA? Ouch.
That would be one ugly ISA that is not JITable on x64.

eBPF ISA wasn't invented overnight. It was a gigantic effort that
took a lot of time to narrow down x64 into _verifiable_ ISA.
I had to take into account arm64, mips64, sparcv9 architectures too.
Of course, minor things can be improved here or there.
Ugliness of ISA hits compiler writers first. I've seen many
times how cpu designers add new instructions only to be told
by compiler guys that they just wasted silicon.
Fact that llvm/gcc compile C into eBPF is the strongest
statement that eBPF ISA is 99.9% complete.
New instructions may or may not make sense.
Let's examine your proposal:

> - 1. we should considering separating the eBPF ISA farther from classic BPF
>   - eBPF still uses a_reg and x_reg as the names of the 2 op
> registers. This is very confusing, especially when dealing with
> translated filters that do move data between A and X. I've had a_reg
> being X, and x_reg being A. We should rename them d_reg and s_reg.

that is renaming of two fields in sock_filter_int structure.
No change to actual ISA.

You're proposing the following:
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 0e463ee77bb2..bf50fa440ef8 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -287,8 +287,8 @@ enum {

 struct sock_filter_int {
__u8code;   /* opcode */
-   __u8a_reg:4;/* dest register */
-   __u8x_reg:4;/* source register */
+   __u8dst_reg:4;  /* dest register */
+   __u8src_reg:4;  /* source register */
__s16   off;/* signed offset */
__s32   imm;/* signed immediate constant */
 };
sure. I thought comment was explicit enough, but I agree
the fields could have been named better.
Will do a patch to rename it.

>   - BPF_LD vs. BPF_LDX: this made sense in classic BPF, as there was
> only one register, and d_reg was implicit in the name of the insn
> code. Now, why are we keeping both in eBPF, when the register we're
> writing to is made explicit in d_reg (I already forgot if d_reg was
> a_reg or x_reg ;) ? Removing one of them will save us 1/8th of the
> insns.
>   - BPF_ST vs. BPF_STX: same here. Note that the current
> sk_convert_filter() just converts all stores to BPF_STX.

nope. No extra bits can be saved here.
STX means:
*(dest_reg + off) = src_reg;
ST means:
*(dest_reg + off) = imm;
LDX means:
dest_reg = *(src_reg + off)

LD we had to carry over from classic as only two non-generic
instructions: LD_ABS and LD_IND.

> - 2. there are other insn that we should consider adding:
>   - lui: AFAICT, there is no clean way to build a 64-bit number (you
> can LD_IMM the upper part, lsh 32, and then add the lower part).

correct. in tracing filters I do this:
+   /* construct 64-bit address */
+   emit(BPF_ALU64_IMM(BPF_MOV, BPF_REG_2, addr >>
32), ctx);
+   emit(BPF_ALU64_IMM(BPF_LSH, BPF_REG_2, 32), ctx);
+   emit(BPF_ALU32_IMM(BPF_MOV, BPF_REG_1, (u32)
addr), ctx);
+   emit(BPF_ALU64_REG(BPF_OR, BPF_REG_1, BPF_REG_2), ctx);

so there is a way to construct 64-bit immediate.
The question is how often do we need to do it? Is it in critical path?
Naive answer "one instruction is better than 4" doesn't count.
See my point above 'cpu designer vs compiler writer'.
None of the risc ISAs have 64-bit imm and eBPF has to consider
simplicity of JITs otherwise those architectures will have a hard time
mapping eBPF to native. If JIT would be too difficult to do, then
there will be no JIT. I don't want eBPF to become an instruction
set that can be JITed only on one architecture...

'mov dest_reg, imm64' may still be ok to add, since x64 can
JIT it with one instruction, arm64 with 4 instructions, but JITs
for other archs will be ugly. They can JIT it as load from memory,
but I need to think it through. Let me explore it more carefully.

Two must have requirements for eBPF:
1. verifiable 

Re: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-03 Thread Chema Gonzalez
First of all, and just to join the crowd, kernel/bpf/ FTW.

Now, I have some suggestions about eBPF. IMO classic BPF is an ISA
oriented to filter (meaning returning a single integer that states how
many bytes of the packet must be captured) packets (e.g. consider the
6 load modes, where 3 provide access the packet -- abs, ind, msh --,
one to an skb field -- len--, the 5th one to the memory itself -- mem
--, and the 6th is an immediate set mode --imm-- ) that has been used
in other environments (seccomp, tracing, etc.) by (a) extending the
idea of a "packet" into a "buffer", and (b) adding ancillary loads.

eBPF should be a generic ISA that can be used by many environments,
including those served today by classic BPF. IMO, we should get a
nicely-defined ISA (MIPS anyone?) and check what should go into eBPF
and what should not.

- 1. we should considering separating the eBPF ISA farther from classic BPF
  - eBPF still uses a_reg and x_reg as the names of the 2 op
registers. This is very confusing, especially when dealing with
translated filters that do move data between A and X. I've had a_reg
being X, and x_reg being A. We should rename them d_reg and s_reg.
  - BPF_LD vs. BPF_LDX: this made sense in classic BPF, as there was
only one register, and d_reg was implicit in the name of the insn
code. Now, why are we keeping both in eBPF, when the register we're
writing to is made explicit in d_reg (I already forgot if d_reg was
a_reg or x_reg ;) ? Removing one of them will save us 1/8th of the
insns.
  - BPF_ST vs. BPF_STX: same here. Note that the current
sk_convert_filter() just converts all stores to BPF_STX.

- 2. there are other insn that we should consider adding:
  - lui: AFAICT, there is no clean way to build a 64-bit number (you
can LD_IMM the upper part, lsh 32, and then add the lower part).
  - nop: I'd like to have a nop. Do I know why? Nope.


On Tue, Jun 3, 2014 at 1:58 PM, Alexei Starovoitov  wrote:
> On Tue, Jun 3, 2014 at 1:35 PM, Daniel Borkmann  wrote:
>> On 06/03/2014 05:44 PM, Alexei Starovoitov wrote:
>> ...
>>>
>>> All of your points are valid. They are right questions to ask. I just
>>>
>>> don't see why you're still arguing about first step of filter.c split,
>>> whereas your concerns are about steps 2, 3, 4.
>>
>>
>> Fair enough, lets keep them in mind though for future work. Btw,
>
> Ok :)
>
>> are other files planned for kernel/bpf/ or should it instead just
>> simply be kernel/bpf.c?
>
> The most obvious one is eBPF verifier in separate file (kernel/bpf/verifier.c)
> bpf maps is yet another thing, but that's different topic.
> Probably a set of bpf-callable functions in another file. Like right now
> for sockets these helpers are __skb_get_pay_offset(), __skb_get_nlattr()
> For tracing there will be a different set of helper functions and eventually
> some will be common. Like __get_raw_cpu_id() from filter.c could
> eventually move to kernel/bpf/helpers.c
LGTM.

I like the idea of every user (packet filter, seccomp, etc.) providing
a map of the bpf calls that are ok, as in the packet filter stating
that {1->__skb_get_pay_offset(), 2->__skb_get_nlattr(), ...}, but
seccomp providing a completely different (or even empty) map.

-Chema


> I'm not a fan of squeezing different logic into one file.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-03 Thread Alexei Starovoitov
On Tue, Jun 3, 2014 at 1:35 PM, Daniel Borkmann  wrote:
> On 06/03/2014 05:44 PM, Alexei Starovoitov wrote:
> ...
>>
>> All of your points are valid. They are right questions to ask. I just
>>
>> don't see why you're still arguing about first step of filter.c split,
>> whereas your concerns are about steps 2, 3, 4.
>
>
> Fair enough, lets keep them in mind though for future work. Btw,

Ok :)

> are other files planned for kernel/bpf/ or should it instead just
> simply be kernel/bpf.c?

The most obvious one is eBPF verifier in separate file (kernel/bpf/verifier.c)
bpf maps is yet another thing, but that's different topic.
Probably a set of bpf-callable functions in another file. Like right now
for sockets these helpers are __skb_get_pay_offset(), __skb_get_nlattr()
For tracing there will be a different set of helper functions and eventually
some will be common. Like __get_raw_cpu_id() from filter.c could
eventually move to kernel/bpf/helpers.c
I'm not a fan of squeezing different logic into one file.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-03 Thread Daniel Borkmann

On 06/03/2014 05:44 PM, Alexei Starovoitov wrote:
...

All of your points are valid. They are right questions to ask. I just
don't see why you're still arguing about first step of filter.c split,
whereas your concerns are about steps 2, 3, 4.


Fair enough, lets keep them in mind though for future work. Btw,
are other files planned for kernel/bpf/ or should it instead just
simply be kernel/bpf.c?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-03 Thread Ingo Molnar

* Alexei Starovoitov  wrote:

> On Mon, Jun 2, 2014 at 7:16 AM, Arnaldo Carvalho de Melo
>  wrote:
> > Em Mon, Jun 02, 2014 at 09:24:56AM -0400, Steven Rostedt escreveu:
> >> On Mon, 2 Jun 2014 08:15:45 -0500
> >> Jonathan Corbet  wrote:
> >
> >> > On Mon,  2 Jun 2014 00:01:44 -0700
> >> > Alexei Starovoitov  wrote:
> >
> >> > > This patch set splits BPF out of core networking into generic component
> >
> >> > Quick, probably dumb question: if you're going to split it out, why not
> >> > split it out entirely, into kernel/ or (perhaps better) lib/?  The
> >> > whole point seems to be that BPF is outgrowing its networking home, so
> >> > it seems like it might be better to make it truly generic.
> >
> >> I believe this is what Ingo suggested as well. If it is become generic,
> >> it belongs in lib/
> >
> > Yes, that was his suggestion, which I agree with, FWIW.
> 
> I guess I posted v2 too quickly :) v2 splits filter.c into 
> kernel/bpf/. I think it's a better location than lib/bpf, since lib 
> feels too constrained by definition of 'library'. bpf is more than a 
> set of library calls.

Yeah, the upgrade to kernel/bpf/ is a better place for BPF IMO: BPF is 
really an 'active', stateful subsystem, with non-trivial per arch 
implementations, while lib/ is generally for standalone, generic, 
platform-decoupled library functions (with a few exceptions).

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-03 Thread Alexei Starovoitov
On Tue, Jun 3, 2014 at 1:56 AM, Daniel Borkmann  wrote:
> On 06/02/2014 09:02 PM, Alexei Starovoitov wrote:
> ...
>>
>> Classic has all sorts of hard coded assumptions. The whole
>>
>> concept of 'load from magic constant' to mean different things
>> is flawed. We all got used to it and now think that it's normal
>> for "ld_abs -4056" to mean "a ^= x"
>
>
> I think everyone knows that, no? Sure it doesn't fit into the
> concept, but I think at the time BPF extensions were introduced,
> it was probably seen as the best trade-off available to access
> useful skb fields while still trying to minimize exposure to uapi
> as much as possible.

Exactly. It _was_ seen as right trade-off in the past.
Now we have a lot more bpf users, so considerations are different.

>> This split is not trying to make classic easier to hack.
>> With eBPF underneath classic, it got a lot easier to add extensions
>> to classic, but we shouldn't be doing it.
>> Classic BPF is not generic and cannot become one. It's eBPF's job.
>>
>> The split is mainly helping to clearly see the boundary of eBPF core
>> vs its socket use case. It doesn't change or add any API.
>
>
> So what's the plan with everything in arch/*/net/, tools/net/ and
> in Documentation/networking/filter.txt, plus MAINTAINERS file, that
> the current patch doesn't address?

I have multi-year long plan of actions in eBPF area and as
was seen in past several month I would have to adjust it many times
based on community feedback.
The plan includes taking care of arch/*/net, but I'm not bringing it up
right now, since the filter.c split itself doesn't depend on what
we're going to do with JITs in arch/*/net/
As you saw I mentioned JITs in the cover letter, so I obviously
thought about it before proposing this filter.c split.
I even have rough patches to take care of it, but let's not get ahead
of ourselves.
My plan also includes upstreaming of LLVM eBPF backend, but linux
needs to expose it to userspace first.
It includes eBPF assembler to write programs like:
 r1 = r5
 *(u32 *) (fp - 10) = 0
 call foo
 if (r0 == 0) goto Label
^^above is assembler. I don't like current bpf_asm syntax,
since it's too assemblish. C-looking assembler is easier to understand.
It includes bpf maps, 'perf run filter.c' and all sorts of other things.
I cannot put the year long plan in one email, since tl;dr kicks in.

filter.c split is a tiny first step.
next step is filter.h split
renaming arch/*/net/bpf_jit_comp.c into arch/*/bpf/jit_comp.c is
the least of my concerns. If JITs stay with strong dependency
to NET, it's also fine.
As I said in cover letter filter.c split is not about NET dependency.
Even tiny embedded systems rely on networking, so all real world
.config's will include 'NET'. The split is about logical separation
of eBPF vs sockets. Having them in one file just not doing any good,
since people are jumping into hacking things quickly without seeing
that eBPF is not only about sockets.

MAINTAINERS file is a good question too.
I would be happy to maintain bpf/ebpf, since it's my full time job anyway,
but again let's not jump the gun.

> We want changes to go via net...@vger.kernel.org as they always
> did, since [ although other use cases pop up ] the main user, as
> I said, is simply still packet filtering in various networking
> subsystems, no?

Obviously sockets is the main, but not the only user, so I think
both lkml and netdev would need to be cc-ed in the future.
Or we can create 'bpf' alias for anyone interested.

All of your points are valid. They are right questions to ask. I just
don't see why you're still arguing about first step of filter.c split,
whereas your concerns are about steps 2, 3, 4.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-03 Thread Daniel Borkmann

On 06/02/2014 09:02 PM, Alexei Starovoitov wrote:
...

Classic has all sorts of hard coded assumptions. The whole
concept of 'load from magic constant' to mean different things
is flawed. We all got used to it and now think that it's normal
for "ld_abs -4056" to mean "a ^= x"


I think everyone knows that, no? Sure it doesn't fit into the
concept, but I think at the time BPF extensions were introduced,
it was probably seen as the best trade-off available to access
useful skb fields while still trying to minimize exposure to uapi
as much as possible.


This split is not trying to make classic easier to hack.
With eBPF underneath classic, it got a lot easier to add extensions
to classic, but we shouldn't be doing it.
Classic BPF is not generic and cannot become one. It's eBPF's job.

The split is mainly helping to clearly see the boundary of eBPF core
vs its socket use case. It doesn't change or add any API.


So what's the plan with everything in arch/*/net/, tools/net/ and
in Documentation/networking/filter.txt, plus MAINTAINERS file, that
the current patch doesn't address?

We want changes to go via net...@vger.kernel.org as they always
did, since [ although other use cases pop up ] the main user, as
I said, is simply still packet filtering in various networking
subsystems, no?


This in-kernel API cleanup was done in commit 5fe821a9dee2
You even acked it back then :)


I agreed with that change, otherwise I wouldn't have acked it,
of course.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-03 Thread Daniel Borkmann

On 06/02/2014 09:02 PM, Alexei Starovoitov wrote:
...

Classic has all sorts of hard coded assumptions. The whole
concept of 'load from magic constant' to mean different things
is flawed. We all got used to it and now think that it's normal
for ld_abs -4056 to mean a ^= x


I think everyone knows that, no? Sure it doesn't fit into the
concept, but I think at the time BPF extensions were introduced,
it was probably seen as the best trade-off available to access
useful skb fields while still trying to minimize exposure to uapi
as much as possible.


This split is not trying to make classic easier to hack.
With eBPF underneath classic, it got a lot easier to add extensions
to classic, but we shouldn't be doing it.
Classic BPF is not generic and cannot become one. It's eBPF's job.

The split is mainly helping to clearly see the boundary of eBPF core
vs its socket use case. It doesn't change or add any API.


So what's the plan with everything in arch/*/net/, tools/net/ and
in Documentation/networking/filter.txt, plus MAINTAINERS file, that
the current patch doesn't address?

We want changes to go via net...@vger.kernel.org as they always
did, since [ although other use cases pop up ] the main user, as
I said, is simply still packet filtering in various networking
subsystems, no?


This in-kernel API cleanup was done in commit 5fe821a9dee2
You even acked it back then :)


I agreed with that change, otherwise I wouldn't have acked it,
of course.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-03 Thread Alexei Starovoitov
On Tue, Jun 3, 2014 at 1:56 AM, Daniel Borkmann dbork...@redhat.com wrote:
 On 06/02/2014 09:02 PM, Alexei Starovoitov wrote:
 ...

 Classic has all sorts of hard coded assumptions. The whole

 concept of 'load from magic constant' to mean different things
 is flawed. We all got used to it and now think that it's normal
 for ld_abs -4056 to mean a ^= x


 I think everyone knows that, no? Sure it doesn't fit into the
 concept, but I think at the time BPF extensions were introduced,
 it was probably seen as the best trade-off available to access
 useful skb fields while still trying to minimize exposure to uapi
 as much as possible.

Exactly. It _was_ seen as right trade-off in the past.
Now we have a lot more bpf users, so considerations are different.

 This split is not trying to make classic easier to hack.
 With eBPF underneath classic, it got a lot easier to add extensions
 to classic, but we shouldn't be doing it.
 Classic BPF is not generic and cannot become one. It's eBPF's job.

 The split is mainly helping to clearly see the boundary of eBPF core
 vs its socket use case. It doesn't change or add any API.


 So what's the plan with everything in arch/*/net/, tools/net/ and
 in Documentation/networking/filter.txt, plus MAINTAINERS file, that
 the current patch doesn't address?

I have multi-year long plan of actions in eBPF area and as
was seen in past several month I would have to adjust it many times
based on community feedback.
The plan includes taking care of arch/*/net, but I'm not bringing it up
right now, since the filter.c split itself doesn't depend on what
we're going to do with JITs in arch/*/net/
As you saw I mentioned JITs in the cover letter, so I obviously
thought about it before proposing this filter.c split.
I even have rough patches to take care of it, but let's not get ahead
of ourselves.
My plan also includes upstreaming of LLVM eBPF backend, but linux
needs to expose it to userspace first.
It includes eBPF assembler to write programs like:
 r1 = r5
 *(u32 *) (fp - 10) = 0
 call foo
 if (r0 == 0) goto Label
^^above is assembler. I don't like current bpf_asm syntax,
since it's too assemblish. C-looking assembler is easier to understand.
It includes bpf maps, 'perf run filter.c' and all sorts of other things.
I cannot put the year long plan in one email, since tl;dr kicks in.

filter.c split is a tiny first step.
next step is filter.h split
renaming arch/*/net/bpf_jit_comp.c into arch/*/bpf/jit_comp.c is
the least of my concerns. If JITs stay with strong dependency
to NET, it's also fine.
As I said in cover letter filter.c split is not about NET dependency.
Even tiny embedded systems rely on networking, so all real world
.config's will include 'NET'. The split is about logical separation
of eBPF vs sockets. Having them in one file just not doing any good,
since people are jumping into hacking things quickly without seeing
that eBPF is not only about sockets.

MAINTAINERS file is a good question too.
I would be happy to maintain bpf/ebpf, since it's my full time job anyway,
but again let's not jump the gun.

 We want changes to go via net...@vger.kernel.org as they always
 did, since [ although other use cases pop up ] the main user, as
 I said, is simply still packet filtering in various networking
 subsystems, no?

Obviously sockets is the main, but not the only user, so I think
both lkml and netdev would need to be cc-ed in the future.
Or we can create 'bpf' alias for anyone interested.

All of your points are valid. They are right questions to ask. I just
don't see why you're still arguing about first step of filter.c split,
whereas your concerns are about steps 2, 3, 4.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-03 Thread Ingo Molnar

* Alexei Starovoitov a...@plumgrid.com wrote:

 On Mon, Jun 2, 2014 at 7:16 AM, Arnaldo Carvalho de Melo
 a...@kernel.org wrote:
  Em Mon, Jun 02, 2014 at 09:24:56AM -0400, Steven Rostedt escreveu:
  On Mon, 2 Jun 2014 08:15:45 -0500
  Jonathan Corbet cor...@lwn.net wrote:
 
   On Mon,  2 Jun 2014 00:01:44 -0700
   Alexei Starovoitov a...@plumgrid.com wrote:
 
This patch set splits BPF out of core networking into generic component
 
   Quick, probably dumb question: if you're going to split it out, why not
   split it out entirely, into kernel/ or (perhaps better) lib/?  The
   whole point seems to be that BPF is outgrowing its networking home, so
   it seems like it might be better to make it truly generic.
 
  I believe this is what Ingo suggested as well. If it is become generic,
  it belongs in lib/
 
  Yes, that was his suggestion, which I agree with, FWIW.
 
 I guess I posted v2 too quickly :) v2 splits filter.c into 
 kernel/bpf/. I think it's a better location than lib/bpf, since lib 
 feels too constrained by definition of 'library'. bpf is more than a 
 set of library calls.

Yeah, the upgrade to kernel/bpf/ is a better place for BPF IMO: BPF is 
really an 'active', stateful subsystem, with non-trivial per arch 
implementations, while lib/ is generally for standalone, generic, 
platform-decoupled library functions (with a few exceptions).

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-03 Thread Daniel Borkmann

On 06/03/2014 05:44 PM, Alexei Starovoitov wrote:
...

All of your points are valid. They are right questions to ask. I just
don't see why you're still arguing about first step of filter.c split,
whereas your concerns are about steps 2, 3, 4.


Fair enough, lets keep them in mind though for future work. Btw,
are other files planned for kernel/bpf/ or should it instead just
simply be kernel/bpf.c?
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-03 Thread Alexei Starovoitov
On Tue, Jun 3, 2014 at 1:35 PM, Daniel Borkmann dbork...@redhat.com wrote:
 On 06/03/2014 05:44 PM, Alexei Starovoitov wrote:
 ...

 All of your points are valid. They are right questions to ask. I just

 don't see why you're still arguing about first step of filter.c split,
 whereas your concerns are about steps 2, 3, 4.


 Fair enough, lets keep them in mind though for future work. Btw,

Ok :)

 are other files planned for kernel/bpf/ or should it instead just
 simply be kernel/bpf.c?

The most obvious one is eBPF verifier in separate file (kernel/bpf/verifier.c)
bpf maps is yet another thing, but that's different topic.
Probably a set of bpf-callable functions in another file. Like right now
for sockets these helpers are __skb_get_pay_offset(), __skb_get_nlattr()
For tracing there will be a different set of helper functions and eventually
some will be common. Like __get_raw_cpu_id() from filter.c could
eventually move to kernel/bpf/helpers.c
I'm not a fan of squeezing different logic into one file.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-03 Thread Chema Gonzalez
First of all, and just to join the crowd, kernel/bpf/ FTW.

Now, I have some suggestions about eBPF. IMO classic BPF is an ISA
oriented to filter (meaning returning a single integer that states how
many bytes of the packet must be captured) packets (e.g. consider the
6 load modes, where 3 provide access the packet -- abs, ind, msh --,
one to an skb field -- len--, the 5th one to the memory itself -- mem
--, and the 6th is an immediate set mode --imm-- ) that has been used
in other environments (seccomp, tracing, etc.) by (a) extending the
idea of a packet into a buffer, and (b) adding ancillary loads.

eBPF should be a generic ISA that can be used by many environments,
including those served today by classic BPF. IMO, we should get a
nicely-defined ISA (MIPS anyone?) and check what should go into eBPF
and what should not.

- 1. we should considering separating the eBPF ISA farther from classic BPF
  - eBPF still uses a_reg and x_reg as the names of the 2 op
registers. This is very confusing, especially when dealing with
translated filters that do move data between A and X. I've had a_reg
being X, and x_reg being A. We should rename them d_reg and s_reg.
  - BPF_LD vs. BPF_LDX: this made sense in classic BPF, as there was
only one register, and d_reg was implicit in the name of the insn
code. Now, why are we keeping both in eBPF, when the register we're
writing to is made explicit in d_reg (I already forgot if d_reg was
a_reg or x_reg ;) ? Removing one of them will save us 1/8th of the
insns.
  - BPF_ST vs. BPF_STX: same here. Note that the current
sk_convert_filter() just converts all stores to BPF_STX.

- 2. there are other insn that we should consider adding:
  - lui: AFAICT, there is no clean way to build a 64-bit number (you
can LD_IMM the upper part, lsh 32, and then add the lower part).
  - nop: I'd like to have a nop. Do I know why? Nope.


On Tue, Jun 3, 2014 at 1:58 PM, Alexei Starovoitov a...@plumgrid.com wrote:
 On Tue, Jun 3, 2014 at 1:35 PM, Daniel Borkmann dbork...@redhat.com wrote:
 On 06/03/2014 05:44 PM, Alexei Starovoitov wrote:
 ...

 All of your points are valid. They are right questions to ask. I just

 don't see why you're still arguing about first step of filter.c split,
 whereas your concerns are about steps 2, 3, 4.


 Fair enough, lets keep them in mind though for future work. Btw,

 Ok :)

 are other files planned for kernel/bpf/ or should it instead just
 simply be kernel/bpf.c?

 The most obvious one is eBPF verifier in separate file (kernel/bpf/verifier.c)
 bpf maps is yet another thing, but that's different topic.
 Probably a set of bpf-callable functions in another file. Like right now
 for sockets these helpers are __skb_get_pay_offset(), __skb_get_nlattr()
 For tracing there will be a different set of helper functions and eventually
 some will be common. Like __get_raw_cpu_id() from filter.c could
 eventually move to kernel/bpf/helpers.c
LGTM.

I like the idea of every user (packet filter, seccomp, etc.) providing
a map of the bpf calls that are ok, as in the packet filter stating
that {1-__skb_get_pay_offset(), 2-__skb_get_nlattr(), ...}, but
seccomp providing a completely different (or even empty) map.

-Chema


 I'm not a fan of squeezing different logic into one file.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-03 Thread Alexei Starovoitov
On Tue, Jun 3, 2014 at 2:40 PM, Chema Gonzalez ch...@google.com wrote:
 First of all, and just to join the crowd, kernel/bpf/ FTW.

 Now, I have some suggestions about eBPF. IMO classic BPF is an ISA
 oriented to filter (meaning returning a single integer that states how
 many bytes of the packet must be captured) packets (e.g. consider the
 6 load modes, where 3 provide access the packet -- abs, ind, msh --,
 one to an skb field -- len--, the 5th one to the memory itself -- mem
 --, and the 6th is an immediate set mode --imm-- ) that has been used
 in other environments (seccomp, tracing, etc.) by (a) extending the
 idea of a packet into a buffer, and (b) adding ancillary loads.

 eBPF should be a generic ISA that can be used by many environments,
 including those served today by classic BPF. IMO, we should get a
 nicely-defined ISA (MIPS anyone?) and check what should go into eBPF
 and what should not.

Model eBPF based on MIPS ISA? Ouch.
That would be one ugly ISA that is not JITable on x64.

eBPF ISA wasn't invented overnight. It was a gigantic effort that
took a lot of time to narrow down x64 into _verifiable_ ISA.
I had to take into account arm64, mips64, sparcv9 architectures too.
Of course, minor things can be improved here or there.
Ugliness of ISA hits compiler writers first. I've seen many
times how cpu designers add new instructions only to be told
by compiler guys that they just wasted silicon.
Fact that llvm/gcc compile C into eBPF is the strongest
statement that eBPF ISA is 99.9% complete.
New instructions may or may not make sense.
Let's examine your proposal:

 - 1. we should considering separating the eBPF ISA farther from classic BPF
   - eBPF still uses a_reg and x_reg as the names of the 2 op
 registers. This is very confusing, especially when dealing with
 translated filters that do move data between A and X. I've had a_reg
 being X, and x_reg being A. We should rename them d_reg and s_reg.

that is renaming of two fields in sock_filter_int structure.
No change to actual ISA.

You're proposing the following:
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 0e463ee77bb2..bf50fa440ef8 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -287,8 +287,8 @@ enum {

 struct sock_filter_int {
__u8code;   /* opcode */
-   __u8a_reg:4;/* dest register */
-   __u8x_reg:4;/* source register */
+   __u8dst_reg:4;  /* dest register */
+   __u8src_reg:4;  /* source register */
__s16   off;/* signed offset */
__s32   imm;/* signed immediate constant */
 };
sure. I thought comment was explicit enough, but I agree
the fields could have been named better.
Will do a patch to rename it.

   - BPF_LD vs. BPF_LDX: this made sense in classic BPF, as there was
 only one register, and d_reg was implicit in the name of the insn
 code. Now, why are we keeping both in eBPF, when the register we're
 writing to is made explicit in d_reg (I already forgot if d_reg was
 a_reg or x_reg ;) ? Removing one of them will save us 1/8th of the
 insns.
   - BPF_ST vs. BPF_STX: same here. Note that the current
 sk_convert_filter() just converts all stores to BPF_STX.

nope. No extra bits can be saved here.
STX means:
*(dest_reg + off) = src_reg;
ST means:
*(dest_reg + off) = imm;
LDX means:
dest_reg = *(src_reg + off)

LD we had to carry over from classic as only two non-generic
instructions: LD_ABS and LD_IND.

 - 2. there are other insn that we should consider adding:
   - lui: AFAICT, there is no clean way to build a 64-bit number (you
 can LD_IMM the upper part, lsh 32, and then add the lower part).

correct. in tracing filters I do this:
+   /* construct 64-bit address */
+   emit(BPF_ALU64_IMM(BPF_MOV, BPF_REG_2, addr 
32), ctx);
+   emit(BPF_ALU64_IMM(BPF_LSH, BPF_REG_2, 32), ctx);
+   emit(BPF_ALU32_IMM(BPF_MOV, BPF_REG_1, (u32)
addr), ctx);
+   emit(BPF_ALU64_REG(BPF_OR, BPF_REG_1, BPF_REG_2), ctx);

so there is a way to construct 64-bit immediate.
The question is how often do we need to do it? Is it in critical path?
Naive answer one instruction is better than 4 doesn't count.
See my point above 'cpu designer vs compiler writer'.
None of the risc ISAs have 64-bit imm and eBPF has to consider
simplicity of JITs otherwise those architectures will have a hard time
mapping eBPF to native. If JIT would be too difficult to do, then
there will be no JIT. I don't want eBPF to become an instruction
set that can be JITed only on one architecture...

'mov dest_reg, imm64' may still be ok to add, since x64 can
JIT it with one instruction, arm64 with 4 instructions, but JITs
for other archs will be ugly. They can JIT it as load from memory,
but I need to think it through. Let me explore it more carefully.

Two must have requirements for eBPF:
1. verifiable instructions, meaning that 

Re: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-02 Thread Alexei Starovoitov
On Mon, Jun 2, 2014 at 10:04 AM, Daniel Borkmann  wrote:
> On 06/02/2014 05:41 PM, Alexei Starovoitov wrote:
> ...
>
>> Glad you brought up this point :)
>> 100% agree that current double verification done by seccomp is far from
>> being generic and quite hard to maintain, since any change done to
>> classic BPF verifier needs to be thought through from
>> seccomp_check_filter()
>> perspective as well.
>
>
> Glad we're on the same page.
>
>
>> BPF's input context, set of allowed calls need to be expressed in a
>> generic way.
>> Obviously this split by itself won't make classic BPF all of a sudden
>> generic.
>> It rather defines a boundary of eBPF core.
>
>
> Note, I'm not at all against using it in tracing, I think it's probably
> a good idea, but shouldn't we _first_ think about how to overcome such
> deficits as above by improving upon its in-kernel API design, thus to
> better prepare it to be generic? I feel this step is otherwise just
> skipped and quickly 'hacked' around ... ;)

Are you talking about classic 'deficit' or eBPF 'deficit' ?
Classic has all sorts of hard coded assumptions. The whole
concept of 'load from magic constant' to mean different things
is flawed. We all got used to it and now think that it's normal
for "ld_abs -4056" to mean "a ^= x"
This split is not trying to make classic easier to hack.
With eBPF underneath classic, it got a lot easier to add extensions
to classic, but we shouldn't be doing it.
Classic BPF is not generic and cannot become one. It's eBPF's job.

The split is mainly helping to clearly see the boundary of eBPF core
vs its socket use case. It doesn't change or add any API.
We need to carefully design eBPF APIs when we expose it
to user space. I have a proposal for that too, but that's separate
discussion.
In terms of in-kernel eBPF API there is nothing to be done.
eBPF program 'prog' is generated by whatever means and then:
  struct sk_filter *fp;

  fp = kzalloc(sk_filter_size(prog_len), GFP_KERNEL);
  memcpy(fp->insni, prog, prog_len * sizeof(fp->insni[0]));
  fp->len = prog_len;

  sk_filter_select_runtime(fp); // select interpreter or JIT
  SK_RUN_FILTER(fp, ctx);  // run the program
  sk_filter_free(fp); // free program

that's how sockets, testsuite, seccomp, tracing are doing it.
All have different ways of producing 'prog' and 'prog_len'.
This in-kernel API cleanup was done in commit 5fe821a9dee2
You even acked it back then :)

If you're referring to eBPF verifier in-kernel API then yeah, it's
missing, just like the whole eBPF verifier :)
Ideally any kernel component that generates eBPF on the fly
sends eBPF program to verifier first just to double check
that generated program is valid.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-02 Thread Daniel Borkmann

On 06/02/2014 05:41 PM, Alexei Starovoitov wrote:
...

Glad you brought up this point :)
100% agree that current double verification done by seccomp is far from
being generic and quite hard to maintain, since any change done to
classic BPF verifier needs to be thought through from seccomp_check_filter()
perspective as well.


Glad we're on the same page.


BPF's input context, set of allowed calls need to be expressed in a generic way.
Obviously this split by itself won't make classic BPF all of a sudden generic.
It rather defines a boundary of eBPF core.


Note, I'm not at all against using it in tracing, I think it's probably
a good idea, but shouldn't we _first_ think about how to overcome such
deficits as above by improving upon its in-kernel API design, thus to
better prepare it to be generic? I feel this step is otherwise just
skipped and quickly 'hacked' around ... ;)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-02 Thread Alexei Starovoitov
On Mon, Jun 2, 2014 at 1:57 AM, Daniel Borkmann  wrote:
> On 06/02/2014 09:01 AM, Alexei Starovoitov wrote:
>>
>> This patch set splits BPF out of core networking into generic component
>>
>> patch #1 splits filter.c into two logical pieces: generic BPF core and
>> socket
>> filters. It only moves functions around. No real changes.
>>
>> patch #2 adds hidden CONFIG_BPF that seccomp/tracing can select
>>
>> The main value of the patch is not a NET separation, but rather logical
>> boundary
>> between generic BPF core and socket filtering. All socket specific code
>> stays in
>> net/core/filter.c and kernel/bpf/core.c is for generic BPF infrastructure
>> (both
>> classic and internal).
>>
>> Note that CONFIG_BPF_JIT is still under NET, so NET-less configs cannot
>> use
>> BPF JITs yet. This can be cleaned up in the future. Also it seems to makes
>> sense
>> to split up filter.h into generic and socket specific as well to cleanup
>> the
>> boundary further.
>
>
> Hm, I really don't like that 'ripping code and headers apart' and then we
> believe
> it's a generic abstraction. So far seccomp-BPF could live with the current
> state
> since it was introduced, the rest of users (vast majority) is in the
> networking
> domain (and invoked through tcpdump et al) ...
>
> There are still parts in seccomp that show some BPF weaknesses in terms of
> being
> 'generic', for example shown in seccomp, we need to go once again over the
> filter
> instructions after doing the usual filter sanity checks, just to whitelist
> what
> seccomp may do in BPF.
>
> I have not yet thought about it deeply enough, but I think we should avoid
> something similar in other non-networking areas but abstract that cleanly
> w/o
> such hacks first, for example.

Glad you brought up this point :)
100% agree that current double verification done by seccomp is far from
being generic and quite hard to maintain, since any change done to
classic BPF verifier needs to be thought through from seccomp_check_filter()
perspective as well.
imo lack of generality in classic BPF is the main reason why we should stop
adding extensions to classic and switch to eBPF for any new features.
the eBPF verifier I posted now long ago is trying to be generic through
customization. The verifier core needs to stay independent of the use case.
BPF's input context, set of allowed calls need to be expressed in a generic way.
Obviously this split by itself won't make classic BPF all of a sudden generic.
It rather defines a boundary of eBPF core.
In eBPF only two instructions are not generic. It's LD_ABS/LD_IND
which are legacy instruction that we had to carry over from classic.
They require input context == sk_buff. That's why core.c had to
#include  and do '__weak skb_copy_bits()'.
Alternative to that was to #ifdef these two instructions out of interpreter
and #ifndef NET #include  and ld_abs helper functions in core.c
IMO that would have been ugly for code style, maintenance and testing,
but then core.c would have only one #include  and we can
say: 'look eBPF core.c is really generic'

In the next set of patches I'll repost verifier and will explain how single
eBPF verifier core can be used for socket, seccomp, tracing and other
things. Note I'm not saying that we should use eBPF now everywhere.
Classic BPF has its niche and that niche we have to maintain forever.
So let's make sure that eBPF interpreter, its instruction set and its
verifier are staying generic.
This split is only first step in that direction that creates a file boundary
between eBPF core and sockets.

>
>> Tested with several NET and NET-less configs on arm and x86
>>
>> V1->V2:
>> rebase on top of net-next
>> split filter.c into kernel/bpf/core.c instead of net/bpf/core.c
>>
>> Alexei Starovoitov (2):
>>net: filter: split filter.c into two files
>>net: filter: split BPF out of core networking
>>
>>   arch/Kconfig   |6 +-
>>   include/linux/filter.h |2 +
>>   kernel/Makefile|1 +
>>   kernel/bpf/Makefile|5 +
>>   kernel/bpf/core.c  | 1063
>> 
>>   net/Kconfig|1 +
>>   net/core/filter.c  | 1023
>> +-
>>   7 files changed, 1079 insertions(+), 1022 deletions(-)
>>   create mode 100644 kernel/bpf/Makefile
>>   create mode 100644 kernel/bpf/core.c
>>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-02 Thread Alexei Starovoitov
On Mon, Jun 2, 2014 at 7:16 AM, Arnaldo Carvalho de Melo
 wrote:
> Em Mon, Jun 02, 2014 at 09:24:56AM -0400, Steven Rostedt escreveu:
>> On Mon, 2 Jun 2014 08:15:45 -0500
>> Jonathan Corbet  wrote:
>
>> > On Mon,  2 Jun 2014 00:01:44 -0700
>> > Alexei Starovoitov  wrote:
>
>> > > This patch set splits BPF out of core networking into generic component
>
>> > Quick, probably dumb question: if you're going to split it out, why not
>> > split it out entirely, into kernel/ or (perhaps better) lib/?  The
>> > whole point seems to be that BPF is outgrowing its networking home, so
>> > it seems like it might be better to make it truly generic.
>
>> I believe this is what Ingo suggested as well. If it is become generic,
>> it belongs in lib/
>
> Yes, that was his suggestion, which I agree with, FWIW.

I guess I posted v2 too quickly :) v2 splits filter.c into kernel/bpf/.
I think it's a better location than lib/bpf, since lib feels too constrained
by definition of 'library'. bpf is more than a set of library calls.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-02 Thread Arnaldo Carvalho de Melo
Em Mon, Jun 02, 2014 at 09:24:56AM -0400, Steven Rostedt escreveu:
> On Mon, 2 Jun 2014 08:15:45 -0500
> Jonathan Corbet  wrote:
 
> > On Mon,  2 Jun 2014 00:01:44 -0700
> > Alexei Starovoitov  wrote:

> > > This patch set splits BPF out of core networking into generic component

> > Quick, probably dumb question: if you're going to split it out, why not
> > split it out entirely, into kernel/ or (perhaps better) lib/?  The
> > whole point seems to be that BPF is outgrowing its networking home, so
> > it seems like it might be better to make it truly generic.
 
> I believe this is what Ingo suggested as well. If it is become generic,
> it belongs in lib/

Yes, that was his suggestion, which I agree with, FWIW.

- Arnaldo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-02 Thread Steven Rostedt
On Mon, 2 Jun 2014 08:15:45 -0500
Jonathan Corbet  wrote:

> On Mon,  2 Jun 2014 00:01:44 -0700
> Alexei Starovoitov  wrote:
> 
> > This patch set splits BPF out of core networking into generic component
> 
> Quick, probably dumb question: if you're going to split it out, why not
> split it out entirely, into kernel/ or (perhaps better) lib/?  The
> whole point seems to be that BPF is outgrowing its networking home, so
> it seems like it might be better to make it truly generic.

I believe this is what Ingo suggested as well. If it is become generic,
it belongs in lib/

-- Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-02 Thread Jonathan Corbet
On Mon,  2 Jun 2014 00:01:44 -0700
Alexei Starovoitov  wrote:

> This patch set splits BPF out of core networking into generic component

Quick, probably dumb question: if you're going to split it out, why not
split it out entirely, into kernel/ or (perhaps better) lib/?  The
whole point seems to be that BPF is outgrowing its networking home, so
it seems like it might be better to make it truly generic.

jon
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-02 Thread Daniel Borkmann

On 06/02/2014 09:01 AM, Alexei Starovoitov wrote:

This patch set splits BPF out of core networking into generic component

patch #1 splits filter.c into two logical pieces: generic BPF core and socket
filters. It only moves functions around. No real changes.

patch #2 adds hidden CONFIG_BPF that seccomp/tracing can select

The main value of the patch is not a NET separation, but rather logical boundary
between generic BPF core and socket filtering. All socket specific code stays in
net/core/filter.c and kernel/bpf/core.c is for generic BPF infrastructure (both
classic and internal).

Note that CONFIG_BPF_JIT is still under NET, so NET-less configs cannot use
BPF JITs yet. This can be cleaned up in the future. Also it seems to makes sense
to split up filter.h into generic and socket specific as well to cleanup the
boundary further.


Hm, I really don't like that 'ripping code and headers apart' and then we 
believe
it's a generic abstraction. So far seccomp-BPF could live with the current state
since it was introduced, the rest of users (vast majority) is in the networking
domain (and invoked through tcpdump et al) ...

There are still parts in seccomp that show some BPF weaknesses in terms of being
'generic', for example shown in seccomp, we need to go once again over the 
filter
instructions after doing the usual filter sanity checks, just to whitelist what
seccomp may do in BPF.

I have not yet thought about it deeply enough, but I think we should avoid
something similar in other non-networking areas but abstract that cleanly w/o
such hacks first, for example.


Tested with several NET and NET-less configs on arm and x86

V1->V2:
rebase on top of net-next
split filter.c into kernel/bpf/core.c instead of net/bpf/core.c

Alexei Starovoitov (2):
   net: filter: split filter.c into two files
   net: filter: split BPF out of core networking

  arch/Kconfig   |6 +-
  include/linux/filter.h |2 +
  kernel/Makefile|1 +
  kernel/bpf/Makefile|5 +
  kernel/bpf/core.c  | 1063 
  net/Kconfig|1 +
  net/core/filter.c  | 1023 +-
  7 files changed, 1079 insertions(+), 1022 deletions(-)
  create mode 100644 kernel/bpf/Makefile
  create mode 100644 kernel/bpf/core.c


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-02 Thread Daniel Borkmann

On 06/02/2014 09:01 AM, Alexei Starovoitov wrote:

This patch set splits BPF out of core networking into generic component

patch #1 splits filter.c into two logical pieces: generic BPF core and socket
filters. It only moves functions around. No real changes.

patch #2 adds hidden CONFIG_BPF that seccomp/tracing can select

The main value of the patch is not a NET separation, but rather logical boundary
between generic BPF core and socket filtering. All socket specific code stays in
net/core/filter.c and kernel/bpf/core.c is for generic BPF infrastructure (both
classic and internal).

Note that CONFIG_BPF_JIT is still under NET, so NET-less configs cannot use
BPF JITs yet. This can be cleaned up in the future. Also it seems to makes sense
to split up filter.h into generic and socket specific as well to cleanup the
boundary further.


Hm, I really don't like that 'ripping code and headers apart' and then we 
believe
it's a generic abstraction. So far seccomp-BPF could live with the current state
since it was introduced, the rest of users (vast majority) is in the networking
domain (and invoked through tcpdump et al) ...

There are still parts in seccomp that show some BPF weaknesses in terms of being
'generic', for example shown in seccomp, we need to go once again over the 
filter
instructions after doing the usual filter sanity checks, just to whitelist what
seccomp may do in BPF.

I have not yet thought about it deeply enough, but I think we should avoid
something similar in other non-networking areas but abstract that cleanly w/o
such hacks first, for example.


Tested with several NET and NET-less configs on arm and x86

V1-V2:
rebase on top of net-next
split filter.c into kernel/bpf/core.c instead of net/bpf/core.c

Alexei Starovoitov (2):
   net: filter: split filter.c into two files
   net: filter: split BPF out of core networking

  arch/Kconfig   |6 +-
  include/linux/filter.h |2 +
  kernel/Makefile|1 +
  kernel/bpf/Makefile|5 +
  kernel/bpf/core.c  | 1063 
  net/Kconfig|1 +
  net/core/filter.c  | 1023 +-
  7 files changed, 1079 insertions(+), 1022 deletions(-)
  create mode 100644 kernel/bpf/Makefile
  create mode 100644 kernel/bpf/core.c


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-02 Thread Jonathan Corbet
On Mon,  2 Jun 2014 00:01:44 -0700
Alexei Starovoitov a...@plumgrid.com wrote:

 This patch set splits BPF out of core networking into generic component

Quick, probably dumb question: if you're going to split it out, why not
split it out entirely, into kernel/ or (perhaps better) lib/?  The
whole point seems to be that BPF is outgrowing its networking home, so
it seems like it might be better to make it truly generic.

jon
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-02 Thread Steven Rostedt
On Mon, 2 Jun 2014 08:15:45 -0500
Jonathan Corbet cor...@lwn.net wrote:

 On Mon,  2 Jun 2014 00:01:44 -0700
 Alexei Starovoitov a...@plumgrid.com wrote:
 
  This patch set splits BPF out of core networking into generic component
 
 Quick, probably dumb question: if you're going to split it out, why not
 split it out entirely, into kernel/ or (perhaps better) lib/?  The
 whole point seems to be that BPF is outgrowing its networking home, so
 it seems like it might be better to make it truly generic.

I believe this is what Ingo suggested as well. If it is become generic,
it belongs in lib/

-- Steve
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-02 Thread Arnaldo Carvalho de Melo
Em Mon, Jun 02, 2014 at 09:24:56AM -0400, Steven Rostedt escreveu:
 On Mon, 2 Jun 2014 08:15:45 -0500
 Jonathan Corbet cor...@lwn.net wrote:
 
  On Mon,  2 Jun 2014 00:01:44 -0700
  Alexei Starovoitov a...@plumgrid.com wrote:

   This patch set splits BPF out of core networking into generic component

  Quick, probably dumb question: if you're going to split it out, why not
  split it out entirely, into kernel/ or (perhaps better) lib/?  The
  whole point seems to be that BPF is outgrowing its networking home, so
  it seems like it might be better to make it truly generic.
 
 I believe this is what Ingo suggested as well. If it is become generic,
 it belongs in lib/

Yes, that was his suggestion, which I agree with, FWIW.

- Arnaldo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-02 Thread Alexei Starovoitov
On Mon, Jun 2, 2014 at 7:16 AM, Arnaldo Carvalho de Melo
a...@kernel.org wrote:
 Em Mon, Jun 02, 2014 at 09:24:56AM -0400, Steven Rostedt escreveu:
 On Mon, 2 Jun 2014 08:15:45 -0500
 Jonathan Corbet cor...@lwn.net wrote:

  On Mon,  2 Jun 2014 00:01:44 -0700
  Alexei Starovoitov a...@plumgrid.com wrote:

   This patch set splits BPF out of core networking into generic component

  Quick, probably dumb question: if you're going to split it out, why not
  split it out entirely, into kernel/ or (perhaps better) lib/?  The
  whole point seems to be that BPF is outgrowing its networking home, so
  it seems like it might be better to make it truly generic.

 I believe this is what Ingo suggested as well. If it is become generic,
 it belongs in lib/

 Yes, that was his suggestion, which I agree with, FWIW.

I guess I posted v2 too quickly :) v2 splits filter.c into kernel/bpf/.
I think it's a better location than lib/bpf, since lib feels too constrained
by definition of 'library'. bpf is more than a set of library calls.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-02 Thread Alexei Starovoitov
On Mon, Jun 2, 2014 at 1:57 AM, Daniel Borkmann dbork...@redhat.com wrote:
 On 06/02/2014 09:01 AM, Alexei Starovoitov wrote:

 This patch set splits BPF out of core networking into generic component

 patch #1 splits filter.c into two logical pieces: generic BPF core and
 socket
 filters. It only moves functions around. No real changes.

 patch #2 adds hidden CONFIG_BPF that seccomp/tracing can select

 The main value of the patch is not a NET separation, but rather logical
 boundary
 between generic BPF core and socket filtering. All socket specific code
 stays in
 net/core/filter.c and kernel/bpf/core.c is for generic BPF infrastructure
 (both
 classic and internal).

 Note that CONFIG_BPF_JIT is still under NET, so NET-less configs cannot
 use
 BPF JITs yet. This can be cleaned up in the future. Also it seems to makes
 sense
 to split up filter.h into generic and socket specific as well to cleanup
 the
 boundary further.


 Hm, I really don't like that 'ripping code and headers apart' and then we
 believe
 it's a generic abstraction. So far seccomp-BPF could live with the current
 state
 since it was introduced, the rest of users (vast majority) is in the
 networking
 domain (and invoked through tcpdump et al) ...

 There are still parts in seccomp that show some BPF weaknesses in terms of
 being
 'generic', for example shown in seccomp, we need to go once again over the
 filter
 instructions after doing the usual filter sanity checks, just to whitelist
 what
 seccomp may do in BPF.

 I have not yet thought about it deeply enough, but I think we should avoid
 something similar in other non-networking areas but abstract that cleanly
 w/o
 such hacks first, for example.

Glad you brought up this point :)
100% agree that current double verification done by seccomp is far from
being generic and quite hard to maintain, since any change done to
classic BPF verifier needs to be thought through from seccomp_check_filter()
perspective as well.
imo lack of generality in classic BPF is the main reason why we should stop
adding extensions to classic and switch to eBPF for any new features.
the eBPF verifier I posted now long ago is trying to be generic through
customization. The verifier core needs to stay independent of the use case.
BPF's input context, set of allowed calls need to be expressed in a generic way.
Obviously this split by itself won't make classic BPF all of a sudden generic.
It rather defines a boundary of eBPF core.
In eBPF only two instructions are not generic. It's LD_ABS/LD_IND
which are legacy instruction that we had to carry over from classic.
They require input context == sk_buff. That's why core.c had to
#include skbuff.h and do '__weak skb_copy_bits()'.
Alternative to that was to #ifdef these two instructions out of interpreter
and #ifndef NET #include skbuff.h and ld_abs helper functions in core.c
IMO that would have been ugly for code style, maintenance and testing,
but then core.c would have only one #include filter.h and we can
say: 'look eBPF core.c is really generic'

In the next set of patches I'll repost verifier and will explain how single
eBPF verifier core can be used for socket, seccomp, tracing and other
things. Note I'm not saying that we should use eBPF now everywhere.
Classic BPF has its niche and that niche we have to maintain forever.
So let's make sure that eBPF interpreter, its instruction set and its
verifier are staying generic.
This split is only first step in that direction that creates a file boundary
between eBPF core and sockets.


 Tested with several NET and NET-less configs on arm and x86

 V1-V2:
 rebase on top of net-next
 split filter.c into kernel/bpf/core.c instead of net/bpf/core.c

 Alexei Starovoitov (2):
net: filter: split filter.c into two files
net: filter: split BPF out of core networking

   arch/Kconfig   |6 +-
   include/linux/filter.h |2 +
   kernel/Makefile|1 +
   kernel/bpf/Makefile|5 +
   kernel/bpf/core.c  | 1063
 
   net/Kconfig|1 +
   net/core/filter.c  | 1023
 +-
   7 files changed, 1079 insertions(+), 1022 deletions(-)
   create mode 100644 kernel/bpf/Makefile
   create mode 100644 kernel/bpf/core.c


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-02 Thread Daniel Borkmann

On 06/02/2014 05:41 PM, Alexei Starovoitov wrote:
...

Glad you brought up this point :)
100% agree that current double verification done by seccomp is far from
being generic and quite hard to maintain, since any change done to
classic BPF verifier needs to be thought through from seccomp_check_filter()
perspective as well.


Glad we're on the same page.


BPF's input context, set of allowed calls need to be expressed in a generic way.
Obviously this split by itself won't make classic BPF all of a sudden generic.
It rather defines a boundary of eBPF core.


Note, I'm not at all against using it in tracing, I think it's probably
a good idea, but shouldn't we _first_ think about how to overcome such
deficits as above by improving upon its in-kernel API design, thus to
better prepare it to be generic? I feel this step is otherwise just
skipped and quickly 'hacked' around ... ;)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 net-next 0/2] split BPF out of core networking

2014-06-02 Thread Alexei Starovoitov
On Mon, Jun 2, 2014 at 10:04 AM, Daniel Borkmann dbork...@redhat.com wrote:
 On 06/02/2014 05:41 PM, Alexei Starovoitov wrote:
 ...

 Glad you brought up this point :)
 100% agree that current double verification done by seccomp is far from
 being generic and quite hard to maintain, since any change done to
 classic BPF verifier needs to be thought through from
 seccomp_check_filter()
 perspective as well.


 Glad we're on the same page.


 BPF's input context, set of allowed calls need to be expressed in a
 generic way.
 Obviously this split by itself won't make classic BPF all of a sudden
 generic.
 It rather defines a boundary of eBPF core.


 Note, I'm not at all against using it in tracing, I think it's probably
 a good idea, but shouldn't we _first_ think about how to overcome such
 deficits as above by improving upon its in-kernel API design, thus to
 better prepare it to be generic? I feel this step is otherwise just
 skipped and quickly 'hacked' around ... ;)

Are you talking about classic 'deficit' or eBPF 'deficit' ?
Classic has all sorts of hard coded assumptions. The whole
concept of 'load from magic constant' to mean different things
is flawed. We all got used to it and now think that it's normal
for ld_abs -4056 to mean a ^= x
This split is not trying to make classic easier to hack.
With eBPF underneath classic, it got a lot easier to add extensions
to classic, but we shouldn't be doing it.
Classic BPF is not generic and cannot become one. It's eBPF's job.

The split is mainly helping to clearly see the boundary of eBPF core
vs its socket use case. It doesn't change or add any API.
We need to carefully design eBPF APIs when we expose it
to user space. I have a proposal for that too, but that's separate
discussion.
In terms of in-kernel eBPF API there is nothing to be done.
eBPF program 'prog' is generated by whatever means and then:
  struct sk_filter *fp;

  fp = kzalloc(sk_filter_size(prog_len), GFP_KERNEL);
  memcpy(fp-insni, prog, prog_len * sizeof(fp-insni[0]));
  fp-len = prog_len;

  sk_filter_select_runtime(fp); // select interpreter or JIT
  SK_RUN_FILTER(fp, ctx);  // run the program
  sk_filter_free(fp); // free program

that's how sockets, testsuite, seccomp, tracing are doing it.
All have different ways of producing 'prog' and 'prog_len'.
This in-kernel API cleanup was done in commit 5fe821a9dee2
You even acked it back then :)

If you're referring to eBPF verifier in-kernel API then yeah, it's
missing, just like the whole eBPF verifier :)
Ideally any kernel component that generates eBPF on the fly
sends eBPF program to verifier first just to double check
that generated program is valid.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/