Alexei Starovoitov <a...@fb.com> writes:

> in cases where bpf programs are looking at sockets and packets
> that belong to different netns, it could be useful to get an id
> that uniquely identify a netns within the whole system.

It could be useful but there is no unique namespace id.

> Therefore introduce 'u64 bpf_sk_netns_id(sk);' helper. It returns
> unique value that identifies netns of given socket or dev_net(skb->dev)
> The upper 32-bits of the return value contain device id where namespace
> filesystem resides and lower 32-bits contain inode number within that 
> filesystem.
> It's the same as
>  struct stat st;
>  stat("/proc/pid/ns/net", &st);
>  return (st->st_dev << 32)  | st->st_ino;

The function is fundamentally buggy.  Inode numbers are 64bit and need
to be 64bit whenever we expose them to userspace.  Otherwise we are
painting ourselves into a corner with respect to future expansion.

> For example to disallow raw sockets in all non-init netns
> the bpf_type_cgroup_sock program can do:
> if (sk->type == SOCK_RAW && bpf_sk_netns_id(sk) != 0x3f0000075)
>   return 0;
> where 0x3f0000075 comes from combination of st_dev and st_ino
> of /proc/pid/ns/net

Which is generally a reasonable type of thing to do.

However if we make the logic look like:

if (sk->type == SOCK_RAW && bpf_sk_net(sk, 0x3f, 0x75))
        return 0;

With the comparison in the function call itself.  That will solve the
32 vs 64bit inode number issue as well putting the burden on matching
what userspace sees to what the kernel sees to the kernel.  Which
is much more future proof.

I suspect the bpf verifier can even be enhanced to check that the last
two arguments are constants.  Limiting the device number and inode
number to constants will make further optimizations/simplifcations
possible.   But that is just a nice to have.

But the key thing here is that if we pass the device number and the
inode to the kernel and ask it to compare, the kernel can lookup up the
namespace by device+inode and see if it matches what is on the socket
without any need for that to be a unique name of the network namespace
which is 1000 times more maintainable then returning a magic string.

Which means even if all we do in kernel churn is go back to the
implementation that existed a little while ago where the device number
depended upon which mount of proc you looked at, the bpf filters written
today can all be made to work with any challenge.

Does that make sense?

Eric

Reply via email to