On Mon Jun 22, 2026 at 12:16 AM PDT, Gyutae Bae wrote:
> From: Gyutae Bae <[email protected]>
>
> This series adds an atomic compare-and-delete primitive to BPF hash
> maps, motivated by a TOCTOU race in Cilium's conntrack GC [1]: the
> batched GC snapshots CT entries, decides which expired, then deletes
> them by key in a later syscall; between snapshot and delete the
> datapath can refresh the same entry, so a live entry is deleted. A
> userspace re-check before delete can't close it (lookup and delete are
> separate, individually bucket-locked calls).
>
> BPF_F_COMPARE lets userspace delete a key only if a chosen value region
> is unchanged, with the compare and the delete done atomically under the
> hash bucket lock:
>
>     attr.flags |= BPF_F_COMPARE;
>     attr.compare = <expected>;
>     attr.compare_offset = <off>;
>     attr.compare_size = <len>;
>
> mismatch -> -EBUSY, absent -> -ENOENT, unsupported map -> -EOPNOTSUPP.
> The compare* fields without the flag are rejected (-EINVAL) so a dropped
> flag can't silently become an unconditional delete; maps whose value
> carries BTF-managed fields (spin_lock/timer/kptr/...) are rejected
> (-EOPNOTSUPP) since those bytes are sanitised on lookup.
>
> Atomicity boundary (please scrutinise): the compare is atomic vs every
> bucket-lock holder, but NOT vs a BPF program writing the value in place
> via the pointer from bpf_map_lookup_elem() (no bucket lock). It
> collapses the race window from the whole GC batch to one bucket-locked
> critical section; full closure wants the compared region treated as a
> synchronization variable (e.g. a monotonic revision). The selftest
> models this.
>
> Scope of this RFC: per-element compare-and-delete on BPF_MAP_TYPE_HASH
> only. Deferred (will follow once the approach is agreed): batch delete +
> its attr fields, a libbpf wrapper, LRU-hash and other map types, a
> compare-and-swap *update*.
>
> Open questions:
>   - flag name: BPF_F_COMPARE vs something else?
>   - mismatch errno: -EBUSY vs -EAGAIN?
>   - new ->map_delete_elem_cmp() op vs extending ->map_delete_elem?

Sorry, this is a no-go.
There is bpf_spin_lock that you can use to synchronize access
between bpf progs and user space.
lookup_and_delete with BPF_F_LOCK uses the same lock.
Or add another syscall program that is triggered from user space
that operates on the same map.
Or convert everything to arena and use whatever algorithm you prefer.
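
To make the locking suggestion concrete, here is a rough userspace model,
not BPF code. It assumes the map value embeds a lock and that both sides
take it: the datapath when refreshing, and the GC when it copies the value
out and removes the entry in one critical section, which is the behaviour
lookup_and_delete with BPF_F_LOCK is described as providing above. The
struct layout, the last_seen field, the present flag and all function names
are hypothetical, and a C11 atomic_flag stands in for struct bpf_spin_lock.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace model only: atomic_flag stands in for struct bpf_spin_lock. */
struct ct_entry {
	atomic_flag lock;
	bool present;        /* models "the key is still in the map" */
	uint64_t last_seen;  /* hypothetical field refreshed by the datapath */
};

static void entry_lock(struct ct_entry *e)
{
	while (atomic_flag_test_and_set_explicit(&e->lock, memory_order_acquire))
		; /* spin until the flag is released */
}

static void entry_unlock(struct ct_entry *e)
{
	atomic_flag_clear_explicit(&e->lock, memory_order_release);
}

/* Datapath side: refresh under the lock. Returns false if the entry is gone. */
static bool datapath_refresh(struct ct_entry *e, uint64_t now)
{
	bool ok;

	entry_lock(e);
	ok = e->present;
	if (ok)
		e->last_seen = now;
	entry_unlock(e);
	return ok;
}

/* GC side: copy the value out and remove the entry in one critical section. */
static bool gc_lookup_and_delete(struct ct_entry *e, uint64_t *last_seen)
{
	bool ok;

	entry_lock(e);
	ok = e->present;
	if (ok) {
		*last_seen = e->last_seen;
		e->present = false;
	}
	entry_unlock(e);
	return ok;
}
```

In this model the snapshot returned to the GC is the entry's final state, so
a refresh cannot slip in between the copy and the removal. What the GC does
when that snapshot turns out to be live (for example, re-inserting it) is a
policy choice this sketch leaves open, and the model ignores details such as
a program still holding a pointer to an element after it has been deleted.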

