On 03/13/2018 08:39 PM, Alexei Starovoitov wrote:
From: Andrey Ignatov <r...@fb.com>
== The problem ==
There is a use-case where all processes inside a cgroup should use one
single IP address on a host that has multiple IPs configured. Those
processes should use the IP for both ingress and egress, for TCP and UDP
traffic. So TCP/UDP servers should be bound to that IP to accept
incoming connections on it, and TCP/UDP clients should make outgoing
connections from that IP. It should not require changing application
code since it's often not possible.
Currently it's solved by intercepting glibc wrappers around syscalls
such as `bind(2)` and `connect(2)`. It's done by a shared library that
is preloaded for every process in a cgroup so that whenever a TCP/UDP
server calls `bind(2)`, the library replaces the IP in sockaddr before
passing arguments to the syscall. When an application calls `connect(2)`, the
library transparently binds the local end of the connection to that IP
(`bind(2)` with `IP_BIND_ADDRESS_NO_PORT` to avoid a performance penalty).
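For illustration, the connect-side step could look roughly like the sketch below. It is not the code of the preloaded library; `bind_source_ip` is a hypothetical helper name, and the sketch assumes IPv4 only. With `IP_BIND_ADDRESS_NO_PORT` set, `bind(2)` to port 0 fixes the source IP but leaves port selection to the later `connect(2)`.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

#ifndef IP_BIND_ADDRESS_NO_PORT
#define IP_BIND_ADDRESS_NO_PORT 24 /* value from linux/in.h, for older libc headers */
#endif

/*
 * Hypothetical helper: pin the local end of an IPv4 socket to src_ip
 * without reserving a port. Returns 0 on success, -1 on error.
 */
int bind_source_ip(int fd, const char *src_ip)
{
    struct sockaddr_in local;
    int one = 1;

    memset(&local, 0, sizeof(local));
    local.sin_family = AF_INET;
    local.sin_port = 0; /* port 0: the port is chosen later, at connect(2) */
    if (inet_pton(AF_INET, src_ip, &local.sin_addr) != 1)
        return -1; /* src_ip is not a valid IPv4 address string */

    /* Ask the kernel not to reserve an ephemeral port at bind(2) time. */
    if (setsockopt(fd, IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT,
                   &one, sizeof(one)) < 0)
        return -1;

    return bind(fd, (struct sockaddr *)&local, sizeof(local));
}
```

A wrapper around `connect(2)` would call something like `bind_source_ip(fd, "192.0.2.10")` first (the address is a placeholder) and then pass the original arguments through.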
The shared library approach is fragile though, e.g.:
* some applications clear env vars (incl. `LD_PRELOAD`);
* `/etc/ld.so.preload` doesn't help since some applications are linked
with option `-z nodefaultlib`;
* other applications don't use glibc and there is nothing to intercept.
== The solution ==
The patch provides a much more reliable in-kernel solution for the 1st
part of the problem: binding TCP/UDP servers to the desired IP. It does not
depend on application environment and implementation details (whether
glibc is used or not).
If I understand correctly, strace(1) will not show the real (after
modification by eBPF) IP/port?
What about SELinux and other LSMs?
We now have network namespaces for full isolation. Soon ILA will come.
The argument that it is not convenient (or even possible) to change the
application or to use modern isolation is quite strange, considering the
added burden/complexity/bloat to the kernel.
The post hook for sys_bind is clearly a failure of the model, since
releasing the port might already be too late: another thread might fail
to get it during a non-zero time window.
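A userspace sketch of what that window looks like from the other thread's side, assuming the post hook runs only after the kernel has already reserved the port. This is not the hook itself: `second_bind_errno` is a hypothetical helper, and the first socket simply stands in for a bind that a post hook is about to reject.

```c
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical demo: returns the errno seen by a second bind(2) to a
 * port that a first socket still holds, 0 if the second bind
 * unexpectedly succeeds, or -1 if the setup itself fails.
 */
int second_bind_errno(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int first, second, err = -1;

    first = socket(AF_INET, SOCK_STREAM, 0);
    second = socket(AF_INET, SOCK_STREAM, 0);
    if (first < 0 || second < 0)
        goto out;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; /* let the kernel pick a free port */

    /* The first bind reserves a port; read back which one was picked. */
    if (bind(first, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        getsockname(first, (struct sockaddr *)&addr, &len) < 0)
        goto out;

    /* Until the first socket releases the port, a second bind loses. */
    err = bind(second, (struct sockaddr *)&addr, sizeof(addr)) < 0 ? errno : 0;
out:
    if (first >= 0)
        close(first);
    if (second >= 0)
        close(second);
    return err;
}
```

On Linux the second `bind(2)` fails with `EADDRINUSE`, even though in the post-hook scenario the first bind would be undone a moment later.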
It seems this is exactly the case where a netns would be the correct answer.
If you want to provide an alternate port allocation strategy, better
provide a correct eBPF hook.