On 3/6/26 06:01, Jiayuan Chen wrote:
> 
> On 3/6/26 7:30 AM, Michal Luczaj wrote:
>> unix_stream_connect() sets sk_state (`WRITE_ONCE(sk->sk_state,
>> TCP_ESTABLISHED)`) _before_ it assigns a peer (`unix_peer(sk) = newsk`).
>> sk_state == TCP_ESTABLISHED makes sock_map_sk_state_allowed() believe that
>> the socket is properly set up, which would include having a defined peer.
>> IOW, there's a window when unix_stream_bpf_update_proto() can be called on
>> a socket which still has unix_peer(sk) == NULL.
>>
>>            T0 bpf                            T1 connect
>>            ------                            ----------
>>
>>                                  WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
>> sock_map_sk_state_allowed(sk)
>> ...
>> sk_pair = unix_peer(sk)
>> sock_hold(sk_pair)
>>                                  sock_hold(newsk)
>>                                  smp_mb__after_atomic()
>>                                  unix_peer(sk) = newsk
>>
>> BUG: kernel NULL pointer dereference, address: 0000000000000080
>> RIP: 0010:unix_stream_bpf_update_proto+0xa0/0x1b0
>> Call Trace:
>>    sock_map_link+0x564/0x8b0
>>    sock_map_update_common+0x6e/0x340
>>    sock_map_update_elem_sys+0x17d/0x240
>>    __sys_bpf+0x26db/0x3250
>>    __x64_sys_bpf+0x21/0x30
>>    do_syscall_64+0x6b/0x3a0
>>    entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>
>> Initial idea was to move peer assignment _before_ the sk_state update[1],
>> but that involved an additional memory barrier, and changing the hot path
>> was rejected. Then a check during proto update was considered[2], but a
>> follow-up discussion[3] concluded the root cause is sockmap taking a wrong
>> lock. Or, more specifically, an insufficient lock[4].
>>
>> Thus, teach sockmap about the af_unix-specific locking: af_unix protects
>> critical sections under unix_state_lock() operating on unix_sock::lock.
>>
>> [1]: https://lore.kernel.org/netdev/[email protected]/
>> [2]: https://lore.kernel.org/netdev/[email protected]/
>> [3]: https://lore.kernel.org/netdev/[email protected]/
>> [4]: https://lore.kernel.org/netdev/CAAVpQUA+8GL_j63CaKb8hbxoL21izD58yr1NvhOhU=j+35+...@mail.gmail.com/
>>
>> Suggested-by: Kuniyuki Iwashima <[email protected]>
>> Suggested-by: Martin KaFai Lau <[email protected]>
>> Fixes: c63829182c37 ("af_unix: Implement ->psock_update_sk_prot()")
>> Signed-off-by: Michal Luczaj <[email protected]>
>> ---
>>   net/core/sock_map.c | 35 +++++++++++++++++++++++++++++------
>>   1 file changed, 29 insertions(+), 6 deletions(-)
>>
>> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
>> index 7ba6a7f24ccd..6109fbe6f99c 100644
>> --- a/net/core/sock_map.c
>> +++ b/net/core/sock_map.c
>> @@ -12,6 +12,7 @@
>>   #include <linux/list.h>
>>   #include <linux/jhash.h>
>>   #include <linux/sock_diag.h>
>> +#include <net/af_unix.h>
>>   #include <net/udp.h>
>>   
>>   struct bpf_stab {
>> @@ -115,19 +116,43 @@ int sock_map_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype)
>>   }
>>   
>>   static void sock_map_sk_acquire(struct sock *sk)
>> -    __acquires(&sk->sk_lock.slock)
>>   {
>>      lock_sock(sk);
>> +
>> +    if (sk_is_unix(sk))
>> +            unix_state_lock(sk);
>> +
>>      rcu_read_lock();
>>   }
>>   
> 
> This introduces a new ordering constraint: lock_sock() before
> unix_state_lock(). Kuniyuki flagged in the v2 review that taking
> lock_sock() inside unix_state_lock() in the future would create an
> ABBA deadlock, which is exactly why the order was settled this way. However,
> the thread did not reach a conclusion on whether that constraint should be
> documented in the code.
> 
> Since there is nothing enforcing this ordering mechanically, a brief comment
> at sock_map_sk_acquire() would help future readers avoid introducing the
> inverse ordering.

Sure, will do.
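
Something along these lines, perhaps (exact wording is my own, the thread
didn't settle on any):

	/* Lock ordering: lock_sock() before unix_state_lock().
	 *
	 * Nothing enforces this mechanically. Taking lock_sock() while
	 * holding unix_state_lock() anywhere else would create an ABBA
	 * deadlock against this path.
	 */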

>>   static void sock_map_sk_release(struct sock *sk)
>> -    __releases(&sk->sk_lock.slock)
>>   {
>>      rcu_read_unlock();
>> +
>> +    if (sk_is_unix(sk))
>> +            unix_state_unlock(sk);
>> +
>>      release_sock(sk);
>>   }
>>   
>> +static inline void sock_map_sk_acquire_fast(struct sock *sk)
>> +{
>> +    local_bh_disable();
>> +    bh_lock_sock(sk);
>> +
>> +    if (sk_is_unix(sk))
>> +            unix_state_lock(sk);
>> +}
>> +
> 
> 
> v2 and v3 differ here in a way that deserves a closer look. In v2, AF_UNIX
> sockets took only unix_state_lock() in the fast path, skipping
> bh_lock_sock() entirely:
> 
> /* v2 */
> if (sk_is_unix(sk))
>    unix_state_lock(sk);
> else
>    bh_lock_sock(sk);
> 
> v3 takes both for AF_UNIX sockets.
> 
> bh_lock_sock() protects sock::sk_lock.slock, whereas the state that
> actually needs protection here — sk_state and unix_peer() — lives under
> unix_sock::lock. Since unix_state_lock() is already sufficient to close
> the race against unix_stream_connect(), is bh_lock_sock() still doing
> anything useful for AF_UNIX sockets in this path?

Yeah, good point. I think I was just trying to keep the _fast variant
aligned with the sleepy one, which I agree might be unnecessary.
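
So the _fast pair could branch instead of stacking the locks, something
like the sketch below (untested; keeps local_bh_disable() in both branches
as in v3, and the release side is my guess at the matching counterpart):

	static inline void sock_map_sk_acquire_fast(struct sock *sk)
	{
		local_bh_disable();

		/* unix_state_lock() alone closes the race against
		 * unix_stream_connect(); bh_lock_sock() guards
		 * sock::sk_lock.slock, which isn't what's racing here.
		 */
		if (sk_is_unix(sk))
			unix_state_lock(sk);
		else
			bh_lock_sock(sk);
	}

	static inline void sock_map_sk_release_fast(struct sock *sk)
	{
		if (sk_is_unix(sk))
			unix_state_unlock(sk);
		else
			bh_unlock_sock(sk);

		local_bh_enable();
	}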

In a parallel thread I've asked Kuniyuki whether it might be better to
completely drop patch 2/5, which would change how we interact with
sock_map_close(). Let's see how it goes.

