January 7, 2026 at 22:23, "Jakub Sitnicki" <[email protected]> wrote:
>
> On Tue, Jan 06, 2026 at 01:14 PM +08, Jiayuan Chen wrote:
>
> >
> > A socket using sockmap has its own independent receive queue: ingress_msg.
> > This queue may contain data from its own protocol stack or from other
> > sockets.
> >
> > Therefore, for sockmap, relying solely on copied_seq and rcv_nxt to
> > calculate FIONREAD is not enough.
> >
> > This patch adds a new ingress_size field in the psock structure to record
> > the data length in ingress_msg. Additionally, we implement new ioctl
> > interfaces for TCP and UDP to intercept FIONREAD operations. While Unix
> > and VSOCK also support sockmap and have similar FIONREAD calculation
> > issues, fixing them would require more extensive changes
> > (please let me know if modifications are needed). I believe it's not
> > appropriate to include those changes under this fix patch.
> >
> Nit: These last two lines don't really belong in the commit message.
> Side notes for reviewers can be added after the "---" marker.
>
> >
> > Previous work by John Fastabend made some progress towards FIONREAD support:
> > commit e5c6de5fa025 ("bpf, sockmap: Incorrectly handling copied_seq").
> > Although the current patch builds on that work, it is acceptable for our
> > Fixes tag to point to the same commit.
> >
> > FD1:read()
> >  -- FD1->copied_seq++
> >                                          |  [read data]
> >                                          |
> >                [enqueue data]            v
> > [sockmap]  ->  ingress to self  ->  ingress_msg queue
> > FD1 native stack ------>                 ^
> >  -- FD1->rcv_nxt++ -> redirect to other  |  [enqueue data]
> >                            |             |
> >                            |       ingress to FD1
> >                            v             ^
> >                           ...            | [sockmap]
> >                                   FD2 native stack
> >
> > Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
> > Signed-off-by: Jiayuan Chen <[email protected]>
> > ---
> > include/linux/skmsg.h | 67 +++++++++++++++++++++++++++++++++++++++++--
> > net/core/skmsg.c | 3 ++
> > net/ipv4/tcp_bpf.c | 21 ++++++++++++++
> > net/ipv4/udp_bpf.c | 25 +++++++++++++---
> > 4 files changed, 110 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
> > index 0323a2b6cf5e..1fa03953043f 100644
> > --- a/include/linux/skmsg.h
> > +++ b/include/linux/skmsg.h
> > @@ -97,6 +97,7 @@ struct sk_psock {
> > struct sk_buff_head ingress_skb;
> > struct list_head ingress_msg;
> > spinlock_t ingress_lock;
> > + ssize_t ingress_size;
> >
> The name is not great because we already have `ingress_bytes`.
> I suggest renaming it and adding a doc string. Also, we don't expect the
> count to ever be negative, so why ssize_t when all the other byte counts
> there are stored as u32?
>
> /** @msg_tot_len: Total bytes queued in ingress_msg list. */
> u32 msg_tot_len;
>
> >
> > unsigned long state;
> > struct list_head link;
> > spinlock_t link_lock;
> > @@ -321,6 +322,27 @@ static inline void sock_drop(struct sock *sk, struct sk_buff *skb)
> > kfree_skb(skb);
> > }
> >
> > +static inline ssize_t sk_psock_get_msg_size_nolock(struct sk_psock *psock)
> > +{
> > + /* Used by ioctl to read ingress_size only; lock-free for performance */
> > + return READ_ONCE(psock->ingress_size);
> > +}
> > +
> > +static inline void sk_psock_inc_msg_size_locked(struct sk_psock *psock, ssize_t diff)
> > +{
> > + /* Use WRITE_ONCE to ensure correct read in sk_psock_get_msg_size_nolock().
> > + * ingress_lock should be held to prevent concurrent updates to ingress_size
> > + */
> > + WRITE_ONCE(psock->ingress_size, psock->ingress_size + diff);
> > +}
> > +
> > +static inline void sk_psock_inc_msg_size(struct sk_psock *psock, ssize_t diff)
> >
> Not sure about this function name. "inc" usually means increment by one.
> Was that modeled after some existing interface?
>
> If not, I'd rename it to sk_psock_msg_len_add(..., int delta)
>
> Following the naming convention from sk_forward_alloc_add(),
> skb_frag_size_add(), skb_len_add(), etc.
>
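
Agreed, the *_add() convention reads better. Roughly what this could look
like on top of the msg_tot_len/u32 rename suggested above (only a sketch;
the _locked variant name is a placeholder, not the final code):

	/* Caller must hold psock->ingress_lock; delta may be negative. */
	static inline void sk_psock_msg_len_add_locked(struct sk_psock *psock, int delta)
	{
		WRITE_ONCE(psock->msg_tot_len, psock->msg_tot_len + delta);
	}

	static inline void sk_psock_msg_len_add(struct sk_psock *psock, int delta)
	{
		spin_lock_bh(&psock->ingress_lock);
		sk_psock_msg_len_add_locked(psock, delta);
		spin_unlock_bh(&psock->ingress_lock);
	}
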
> >
> > +{
> > + spin_lock_bh(&psock->ingress_lock);
> > + sk_psock_inc_msg_size_locked(psock, diff);
> > + spin_unlock_bh(&psock->ingress_lock);
> > +}
> > +
> > static inline bool sk_psock_queue_msg(struct sk_psock *psock,
> > struct sk_msg *msg)
> > {
> > @@ -329,6 +351,7 @@ static inline bool sk_psock_queue_msg(struct sk_psock *psock,
> > spin_lock_bh(&psock->ingress_lock);
> > if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
> > list_add_tail(&msg->list, &psock->ingress_msg);
> > + sk_psock_inc_msg_size_locked(psock, msg->sg.size);
> > ret = true;
> > } else {
> > sk_msg_free(psock->sk, msg);
> > @@ -345,18 +368,25 @@ static inline struct sk_msg *sk_psock_dequeue_msg(struct sk_psock *psock)
> >
> > spin_lock_bh(&psock->ingress_lock);
> > msg = list_first_entry_or_null(&psock->ingress_msg, struct sk_msg, list);
> > - if (msg)
> > + if (msg) {
> > list_del(&msg->list);
> > + sk_psock_inc_msg_size_locked(psock, -msg->sg.size);
> > + }
> > spin_unlock_bh(&psock->ingress_lock);
> > return msg;
> > }
> >
> > +static inline struct sk_msg *sk_psock_peek_msg_locked(struct sk_psock *psock)
> > +{
> > + return list_first_entry_or_null(&psock->ingress_msg, struct sk_msg, list);
> > +}
> > +
> > static inline struct sk_msg *sk_psock_peek_msg(struct sk_psock *psock)
> > {
> > struct sk_msg *msg;
> >
> > spin_lock_bh(&psock->ingress_lock);
> > - msg = list_first_entry_or_null(&psock->ingress_msg, struct sk_msg, list);
> > + msg = sk_psock_peek_msg_locked(psock);
> > spin_unlock_bh(&psock->ingress_lock);
> > return msg;
> > }
> > @@ -523,6 +553,39 @@ static inline bool sk_psock_strp_enabled(struct sk_psock *psock)
> > return !!psock->saved_data_ready;
> > }
> >
> > +/* for tcp only, sk is locked */
> > +static inline ssize_t sk_psock_msg_inq(struct sock *sk)
> > +{
> > + struct sk_psock *psock;
> > + ssize_t inq = 0;
> > +
> > + psock = sk_psock_get(sk);
> > + if (likely(psock)) {
> > + inq = sk_psock_get_msg_size_nolock(psock);
> > + sk_psock_put(sk, psock);
> > + }
> > + return inq;
> > +}
> > +
> > +/* for udp only, sk is not locked */
> > +static inline ssize_t sk_msg_first_length(struct sock *sk)
> >
> s/_length/_len/
>
> >
> > +{
> > + struct sk_psock *psock;
> > + struct sk_msg *msg;
> > + ssize_t inq = 0;
> > +
> > + psock = sk_psock_get(sk);
> > + if (likely(psock)) {
> > + spin_lock_bh(&psock->ingress_lock);
> > + msg = sk_psock_peek_msg_locked(psock);
> > + if (msg)
> > + inq = msg->sg.size;
> > + spin_unlock_bh(&psock->ingress_lock);
> > + sk_psock_put(sk, psock);
> > + }
> > + return inq;
> > +}
> > +
> > #if IS_ENABLED(CONFIG_NET_SOCK_MSG)
> >
> > #define BPF_F_STRPARSER (1UL << 1)
> > diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> > index d73e03f7713a..c959d52a62b2 100644
> > --- a/net/core/skmsg.c
> > +++ b/net/core/skmsg.c
> > @@ -455,6 +455,7 @@ int __sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg
> > atomic_sub(copy, &sk->sk_rmem_alloc);
> > }
> > msg_rx->sg.size -= copy;
> > + sk_psock_inc_msg_size(psock, -copy);
> >
> > if (!sge->length) {
> > sk_msg_iter_var_next(i);
> > @@ -819,9 +820,11 @@ static void __sk_psock_purge_ingress_msg(struct sk_psock *psock)
> > list_del(&msg->list);
> > if (!msg->skb)
> > atomic_sub(msg->sg.size, &psock->sk->sk_rmem_alloc);
> > + sk_psock_inc_msg_size(psock, -((ssize_t)msg->sg.size));
> >
> Cast won't be needed after you switch param type to `int`.
>
> >
> > sk_msg_free(psock->sk, msg);
> > kfree(msg);
> > }
> > + WARN_ON_ONCE(psock->ingress_size);
> > }
> >
> > static void __sk_psock_zap_ingress(struct sk_psock *psock)
> > diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
> > index 6332fc36ffe6..a9c758868f13 100644
> > --- a/net/ipv4/tcp_bpf.c
> > +++ b/net/ipv4/tcp_bpf.c
> > @@ -10,6 +10,7 @@
> >
> > #include <net/inet_common.h>
> > #include <net/tls.h>
> > +#include <asm/ioctls.h>
> >
> > void tcp_eat_skb(struct sock *sk, struct sk_buff *skb)
> > {
> > @@ -332,6 +333,25 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk,
> > return copied;
> > }
> >
> > +static int tcp_bpf_ioctl(struct sock *sk, int cmd, int *karg)
> > +{
> > + bool slow;
> > +
> > + /* we only care about FIONREAD */
> >
> Nit: This comment seems redundant. The expression is obvious.
>
> >
> > + if (cmd != SIOCINQ)
> > + return tcp_ioctl(sk, cmd, karg);
> > +
> > + /* works similar as tcp_ioctl */
> > + if (sk->sk_state == TCP_LISTEN)
> > + return -EINVAL;
> > +
> > + slow = lock_sock_fast(sk);
> > + *karg = sk_psock_msg_inq(sk);
> > + unlock_sock_fast(sk, slow);
> > +
> > + return 0;
> > +}
> > +
> > static int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
> > int flags, int *addr_len)
> > {
> > @@ -610,6 +630,7 @@ static void tcp_bpf_rebuild_protos(struct proto prot[TCP_BPF_NUM_CFGS],
> > prot[TCP_BPF_BASE].close = sock_map_close;
> > prot[TCP_BPF_BASE].recvmsg = tcp_bpf_recvmsg;
> > prot[TCP_BPF_BASE].sock_is_readable = sk_msg_is_readable;
> > + prot[TCP_BPF_BASE].ioctl = tcp_bpf_ioctl;
> >
> > prot[TCP_BPF_TX] = prot[TCP_BPF_BASE];
> > prot[TCP_BPF_TX].sendmsg = tcp_bpf_sendmsg;
> > diff --git a/net/ipv4/udp_bpf.c b/net/ipv4/udp_bpf.c
> > index 0735d820e413..cc1156aef14d 100644
> > --- a/net/ipv4/udp_bpf.c
> > +++ b/net/ipv4/udp_bpf.c
> > @@ -5,6 +5,7 @@
> > #include <net/sock.h>
> > #include <net/udp.h>
> > #include <net/inet_common.h>
> > +#include <asm/ioctls.h>
> >
> > #include "udp_impl.h"
> >
> > @@ -111,12 +112,28 @@ enum {
> > static DEFINE_SPINLOCK(udpv6_prot_lock);
> > static struct proto udp_bpf_prots[UDP_BPF_NUM_PROTS];
> >
> > +static int udp_bpf_ioctl(struct sock *sk, int cmd, int *karg)
> > +{
> > + /* we only care about FIONREAD */
> > + if (cmd != SIOCINQ)
> > + return udp_ioctl(sk, cmd, karg);
> > +
> > + /* works similar as udp_ioctl.
> > + * man udp(7): "FIONREAD (SIOCINQ): Returns the size of the next
> > + * pending datagram in the integer in bytes, or 0 when no datagram
> > + * is pending."
> > + */
> >
> Not sure we need to quote man pages here.
>
> >
> > + *karg = sk_msg_first_length(sk);
> > + return 0;
> > +}
> > +
> > static void udp_bpf_rebuild_protos(struct proto *prot, const struct proto *base)
> > {
> > - *prot = *base;
> > - prot->close = sock_map_close;
> > - prot->recvmsg = udp_bpf_recvmsg;
> > - prot->sock_is_readable = sk_msg_is_readable;
> > + *prot = *base;
> > + prot->close = sock_map_close;
> > + prot->recvmsg = udp_bpf_recvmsg;
> > + prot->sock_is_readable = sk_msg_is_readable;
> > + prot->ioctl = udp_bpf_ioctl;
> > }
> >
> > static void udp_bpf_check_v6_needs_rebuild(struct proto *ops)
> >
>
Thanks Jakub. All good suggestions. I've applied them in v6.
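
For reference, the user-visible interface this patch is about is just the
plain FIONREAD ioctl. A minimal illustration of the scenario from the commit
message (not a complete program; fd is assumed to be a TCP socket already
added to a sockmap that has received redirected data):

	#include <stdio.h>
	#include <sys/ioctl.h>

	/* With the fix, FIONREAD reports the bytes sitting in the psock's
	 * ingress_msg queue instead of a value derived only from
	 * copied_seq/rcv_nxt.
	 */
	int avail = 0;

	if (ioctl(fd, FIONREAD, &avail) == 0)
		printf("bytes readable: %d\n", avail);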