On Wed, Mar 14, 2018 at 03:37:01PM +0100, Daniel Borkmann wrote:
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -133,6 +133,8 @@ enum bpf_prog_type {
> >     BPF_PROG_TYPE_SOCK_OPS,
> >     BPF_PROG_TYPE_SK_SKB,
> >     BPF_PROG_TYPE_CGROUP_DEVICE,
> > +   BPF_PROG_TYPE_CGROUP_INET4_BIND,
> > +   BPF_PROG_TYPE_CGROUP_INET6_BIND,
> 
> Could those all be merged into BPF_PROG_TYPE_SOCK_OPS? I'm slowly getting
> confused by the many sock_*/sk_* prog types we have. The attach hook could
> still be something like BPF_CGROUP_BIND/BPF_CGROUP_CONNECT. Potentially
> storing some prog-type specific void *private_data in prog's aux during
> verification could be a way (similarly as you mention) which can later be
> retrieved at attach time to reject with -ENOTSUPP or such.

that's exacly what I mentioned in the cover letter,
but we need to extend attach cmd with verifier-like log_buf+log_size.
since simple enotsupp will be impossible to debug.
That's the main question of the RFC.

> >  };
> >  
> >  enum bpf_attach_type {
> > @@ -143,6 +145,8 @@ enum bpf_attach_type {
> >     BPF_SK_SKB_STREAM_PARSER,
> >     BPF_SK_SKB_STREAM_VERDICT,
> >     BPF_CGROUP_DEVICE,
> > +   BPF_CGROUP_INET4_BIND,
> > +   BPF_CGROUP_INET6_BIND,
> 
> Binding to v4 mapped v6 address would work as well, right? Can't this be
> squashed into one attach type as mentioned?

explained the reasons for this in the cover letter and proposed extension
to attach cmd.

> > +struct bpf_sock_addr {
> > +   __u32 user_family;      /* Allows 4-byte read, but no write. */
> > +   __u32 user_ip4;         /* Allows 1,2,4-byte read and 4-byte write.
> > +                            * Stored in network byte order.
> > +                            */
> > +   __u32 user_ip6[4];      /* Allows 1,2,4-byte read an 4-byte write.
> > +                            * Stored in network byte order.
> > +                            */
> > +   __u32 user_port;        /* Allows 4-byte read and write.
> > +                            * Stored in network byte order
> > +                            */
> > +   __u32 family;           /* Allows 4-byte read, but no write */
> > +   __u32 type;             /* Allows 4-byte read, but no write */
> > +   __u32 protocol;         /* Allows 4-byte read, but no write */
> 
> I recall bind to prefix came up from time to time in BPF context in the sense
> to let the app itself be more flexible to bind to BPF prog. Do you see also 
> app
> to be able to add a BPF prog into the array itself?

I'm not following. In this case the container management framework
will attach bpf progs to cgroups and apps inside the cgroups will
get their bind/connects overwritten when necessary.

> > +int __cgroup_bpf_run_filter_sock_addr(struct sock *sk,
> > +                                 struct sockaddr *uaddr,
> > +                                 enum bpf_attach_type type)
> > +{
> > +   struct bpf_sock_addr_kern ctx = {
> > +           .sk = sk,
> > +           .uaddr = uaddr,
> > +   };
> > +   struct cgroup *cgrp;
> > +   int ret;
> > +
> > +   /* Check socket family since not all sockets represent network
> > +    * endpoint (e.g. AF_UNIX).
> > +    */
> > +   if (sk->sk_family != AF_INET && sk->sk_family != AF_INET6)
> > +           return 0;
> > +
> > +   cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> > +   ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], &ctx, BPF_PROG_RUN);
> > +
> > +   return ret == 1 ? 0 : -EPERM;
> 
> Semantics may be a little bit strange, though this would perhaps be at the 
> risk
> of the orchestrator(s) (?). Basically when you run through the prog array, 
> then
> the last attached program in that array has the final /real/ say to which 
> address
> to bind/connect to; all the others decisions can freely be overridden, so this
> is dependent on the order the BPF progs how they were attached. I think we 
> don't
> have this case for context in multi-prog tracing, cgroup/inet (only filtering)
> and cgroup/dev. Although in cgroup/sock_ops for the tcp/BPF hooks progs can 
> already
> override the sock_ops.reply (and sk_txhash which may be less relevant) field 
> from
> the ctx, so whatever one prog is assumed to reply back to the caller, another 
> one
> could override it. 

correct. tcp-bpf is in the same boat. When progs override the decision the last
prog in the prog_run_array is effective. Remember that
 * The programs of sub-cgroup are executed first, then programs of
 * this cgroup and then programs of parent cgroup.
so outer cgroup controlled by container management is running last.
If it would want to let children do nested overwrittes it could look at the same
sockaddr memory region and will see what children's prog or children's tasks
did with sockaddr and make approriate decision.

> Wouldn't it make more sense to just have a single prog instead
> to avoid this override/ordering issue?

I don't think there is any ordering issue, but yes, if parent is paranoid
it can install no-override program on the cgroup. Which is default anyway.

Reply via email to