Since Go 1.22.0, net.TCPConn implements the io.WriterTo interface [1],
whose WriteTo method internally uses the splice(2) syscall to transfer
data between file descriptors [2].
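
For context, the kind of Go code affected can be sketched as a plain
io.Copy between two TCP connections. This is only an illustration, not
the reproducer from [3]: the helper names relay, loopbackPair and
relayOnce are hypothetical, and no sockmap or BPF program is attached,
so the copy succeeds here. On Linux, io.Copy may hand such a transfer
to the net package's splice-based fast path.

```go
package main

import (
	"fmt"
	"io"
	"net"
)

// relay copies everything from src to dst and returns the byte count.
// Between two *net.TCPConn values on Linux, io.Copy may be served by
// splice(2) inside the net package instead of a userspace buffer loop.
func relay(dst, src *net.TCPConn) (int64, error) {
	return io.Copy(dst, src)
}

// loopbackPair returns two connected TCP endpoints on 127.0.0.1.
func loopbackPair() (client, server *net.TCPConn, err error) {
	ln, err := net.ListenTCP("tcp", &net.TCPAddr{IP: net.IPv4(127, 0, 0, 1)})
	if err != nil {
		return nil, nil, err
	}
	defer ln.Close()
	client, err = net.DialTCP("tcp", nil, ln.Addr().(*net.TCPAddr))
	if err != nil {
		return nil, nil, err
	}
	server, err = ln.AcceptTCP()
	if err != nil {
		client.Close()
		return nil, nil, err
	}
	return client, server, nil
}

// relayOnce sends payload into one connection pair, relays it to a
// second pair with io.Copy, and returns what arrives at the far end.
func relayOnce(payload []byte) (int64, []byte, error) {
	inClient, inServer, err := loopbackPair()
	if err != nil {
		return 0, nil, err
	}
	defer inClient.Close()
	defer inServer.Close()
	outClient, outServer, err := loopbackPair()
	if err != nil {
		return 0, nil, err
	}
	defer outClient.Close()
	defer outServer.Close()

	if _, err := inClient.Write(payload); err != nil {
		return 0, nil, err
	}
	// Half-close so that io.Copy sees EOF after the payload.
	if err := inClient.CloseWrite(); err != nil {
		return 0, nil, err
	}
	n, err := relay(outClient, inServer)
	if err != nil {
		return n, nil, err
	}
	if err := outClient.CloseWrite(); err != nil {
		return n, nil, err
	}
	got, err := io.ReadAll(outServer)
	return n, got, err
}

func main() {
	n, got, err := relayOnce([]byte("hello via io.Copy"))
	fmt.Println(n, string(got), err)
}
```

With a sockmap-redirected source socket, the same relay call is where
the 0-byte result described below would surface.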

However, for sockets with sockmap enabled, sk_prot is replaced with
tcp_bpf_prots which does not provide a splice_read callback. When data
is redirected to a socket's psock ingress queue via bpf_msg_redirect,
splice(2) cannot read from it because the splice path has no knowledge
of the psock queue. This causes TCPConn.WriteTo to return 0 bytes,
effectively breaking Go applications that rely on io.Copy between TCP
connections when sockmap/BPF is in use [3].

The simplest fix would be to register a splice callback that just calls
copy_splice_read(), but this results in redundant copies (socket -> kernel
buffer -> pipe -> destination), which defeats the purpose of splice.
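
As a reminder of what the socket -> pipe leg looks like from user
space, here is a minimal Linux-only sketch. It is an assumption-laden
illustration: it uses an AF_UNIX socketpair instead of a TCP socket to
stay short, the function name spliceSocketToPipe is hypothetical, and
it says nothing about how many copies the kernel makes internally.

```go
package main

import (
	"fmt"
	"syscall"
)

// spliceSocketToPipe writes payload into one end of a socketpair, moves
// it from the other end into a pipe with splice(2), and returns what
// can then be read from the pipe. Payload must fit in the pipe buffer.
func spliceSocketToPipe(payload []byte) ([]byte, error) {
	sp, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_STREAM, 0)
	if err != nil {
		return nil, err
	}
	defer syscall.Close(sp[0])
	defer syscall.Close(sp[1])

	p := make([]int, 2) // p[0] is the read end, p[1] the write end
	if err := syscall.Pipe(p); err != nil {
		return nil, err
	}
	defer syscall.Close(p[0])
	defer syscall.Close(p[1])

	if _, err := syscall.Write(sp[0], payload); err != nil {
		return nil, err
	}
	// Socket -> pipe: the step that needs splice_read support on the
	// socket side in the kernel.
	n, err := syscall.Splice(sp[1], nil, p[1], nil, len(payload), 0)
	if err != nil {
		return nil, err
	}
	buf := make([]byte, n)
	m, err := syscall.Read(p[0], buf)
	if err != nil {
		return nil, err
	}
	return buf[:m], nil
}

func main() {
	out, err := spliceSocketToPipe([]byte("ping"))
	fmt.Println(string(out), err)
}
```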

Patch 1 adds splice_read to struct proto and sets it in TCP.
Patch 2 adds inet_splice_read and uses it in inet_stream_ops.
Patch 3 refactors tcp_bpf recvmsg with a read actor abstraction.
Patch 4 adds basic splice_read support for sockmap, but this still
involves two data copies.
Patch 5 optimizes the splice implementation by transferring page
ownership directly into the pipe, achieving true zero-copy. Benchmarks
show performance on par with the read(2) path.
Patch 6 adds splice selftests. Since splice can transparently replace
read operations, we redefine read as splice in the existing selftests
so that all existing test cases also cover the splice path.
Patch 7 adds splice to the sockmap benchmark, which also serves to
verify the effectiveness of our zero-copy implementation.

Benchmark results with rx-verdict-ingress mode (loopback, 8 CPUs):

  read(2):                  ~4292 MB/s
  splice(2) + zero-copy:    ~4270 MB/s
  splice(2) + always-copy:  ~2770 MB/s

Zero-copy splice comes within ~0.5% of read(2), while the always-copy
fallback is ~35% slower.
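
The percentages follow directly from the throughput figures above. A
small sketch of the arithmetic (the helper name slowdownPercent is
introduced here only for illustration):

```go
package main

import "fmt"

// slowdownPercent returns how much slower `other` is than `baseline`,
// expressed as a percentage of the baseline throughput.
func slowdownPercent(baseline, other float64) float64 {
	return (baseline - other) / baseline * 100
}

func main() {
	const (
		readMBs       = 4292.0 // read(2)
		zeroCopyMBs   = 4270.0 // splice(2) + zero-copy
		alwaysCopyMBs = 2770.0 // splice(2) + always-copy
	)
	fmt.Printf("zero-copy:   %.1f%% slower than read(2)\n",
		slowdownPercent(readMBs, zeroCopyMBs)) // prints 0.5%
	fmt.Printf("always-copy: %.1f%% slower than read(2)\n",
		slowdownPercent(readMBs, alwaysCopyMBs)) // prints 35.5%
}
```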

[1] https://github.com/golang/go/blob/master/src/net/tcpsock.go#L173
[2] https://github.com/golang/go/blob/fdf3bee/src/net/tcpsock_posix.go#L57
[3] https://github.com/jschwinger233/bpf_msg_redirect_bug_reproducer

Jiayuan Chen (7):
  net: add splice_read to struct proto and set it in tcp_prot/tcpv6_prot
  inet: add inet_splice_read() and use it in
    inet_stream_ops/inet6_stream_ops
  tcp_bpf: refactor recvmsg with read actor abstraction
  tcp_bpf: add splice_read support for sockmap
  tcp_bpf: optimize splice_read with zero-copy for non-slab pages
  selftests/bpf: add splice_read tests for sockmap
  selftests/bpf: add splice option to sockmap benchmark

 include/linux/skmsg.h                         |  12 +-
 include/net/inet_common.h                     |   3 +
 include/net/sock.h                            |   3 +
 net/core/skmsg.c                              |  34 ++-
 net/ipv4/af_inet.c                            |  15 +-
 net/ipv4/tcp_bpf.c                            | 227 +++++++++++++++---
 net/ipv4/tcp_ipv4.c                           |   1 +
 net/ipv6/af_inet6.c                           |   2 +-
 net/ipv6/tcp_ipv6.c                           |   1 +
 .../selftests/bpf/benchs/bench_sockmap.c      |  57 ++++-
 .../selftests/bpf/prog_tests/sockmap_basic.c  |  28 ++-
 .../bpf/prog_tests/sockmap_helpers.h          |  62 +++++
 .../selftests/bpf/prog_tests/sockmap_strp.c   |  28 ++-
 13 files changed, 421 insertions(+), 52 deletions(-)

-- 
2.43.0

