Hello oss-security, I am disclosing a Linux kernel vulnerability in the TLS ULP subsystem.

Affected component: Linux kernel TLS ULP
File: net/tls/tls_main.c
Function: tls_sk_proto_close()
Vulnerability type: Use-after-free / race condition

Summary:
There is a race between close() and setsockopt(SOL_TLS, TLS_TX) in the Linux kernel TLS ULP subsystem. Under certain interleavings, one thread can close a TLS socket while another thread is still operating on TLS-related socket state through setsockopt(). This can lead to a use-after-free in the TLS socket teardown path.

Impact:
A local unprivileged user may be able to trigger kernel heap memory corruption. Based on my analysis, this may potentially be exploitable for local privilege escalation, although I do not have a confirmed full privilege escalation exploit.

Affected versions:
The affected version range is still being confirmed.

Kernel configuration:
The issue affects systems with CONFIG_TLS enabled. On many distributions this is built as a module.

Status:
This issue was reported to linux-distros on 2026-05-16. I incorrectly contacted linux-distros before first getting a fix accepted by the Linux kernel maintainers. The latest proposed public disclosure date was 2026-05-30, and this oss-security posting is being made late. As of this posting, I do not have an accepted upstream fix commit to cite. I am available to work with the kernel TLS/networking maintainers to validate the issue and test a fix.

Reproducer:
I have a reproducer for the race. I am not including it in this initial public posting to avoid unnecessarily increasing harm before a fix is available, but I can share it with kernel maintainers on request.

CVE:
No CVE has been assigned as far as I know. I understand that the Linux kernel CNA generally assigns CVEs after a fix commit is available.

AI disclosure:
AI assistance was used during analysis and report preparation. Specifically, OpenAI Codex was used to help inspect the relevant code path, reason about the race condition, and draft portions of the vulnerability report. I reviewed and take responsibility for the report contents.

References:
Linux kernel security bug process: https://docs.kernel.org/process/security-bugs.html
Related but different public KTLS issue: https://www.openwall.com/lists/oss-security/2026/05/07/1

Timeline:
2026-05-16: Reported to linux-distros
2026-05-30: Latest agreed public disclosure date
2026-06-02: Public disclosure to oss-security

Regards,
Oleg Sevostyanov

# Use-After-Free via TOCTOU Race in net/tls: tls_sk_proto_close() reads tx_conf without lock_sock

**Reporter:** Oleg Sevostyanov <[email protected]>
**Date:** 2026-05-16
**Kernel version:** 7.1-rc3 (confirmed; likely present since ~v4.13 when TLS ULP was introduced)
**Subsystem:** net/tls
**Files:**
- `net/tls/tls_main.c`: vulnerable read at line 372
- `net/tls/tls_sw.c`: UAF sites in tx_work_handler (line 2637), tls_encrypt_done (line 467)

**CWE:** CWE-416 (Use After Free), CWE-362 (Race Condition)
**Severity:** High: potential local privilege escalation; no privileges required

---

## Summary

`tls_sk_proto_close()` in `net/tls/tls_main.c` reads the field `ctx->tx_conf` at line 372 **without holding `lock_sock`**. A concurrent `setsockopt(SOL_TLS, TLS_TX, ...)` call writes `ctx->tx_conf = TLS_SW` **inside** `lock_sock`. When `setsockopt` wins the race, the following sequence occurs:

1. The close path **skips** `tls_sw_cancel_work_tx()` (which would set `BIT_TX_CLOSING` and call `disable_delayed_work_sync`) because it saw `TLS_BASE` at line 372.
2. The close path **calls** `tls_sw_free_ctx_tx()` → `kfree(tls_sw_context_tx)` at line 390 because the second read at line 389 sees `TLS_SW`.
3. A delayed workqueue item (`tx_work_handler`, scheduled 1 jiffy earlier by `tls_encrypt_done` or `tls_sw_write_space`) fires after the `kfree`, producing a **use-after-free** on the freed `tls_sw_context_tx` object.

No special privileges are required: any unprivileged user with a TCP socket can trigger the race.

---

## Affected Kernel Versions

The unlocked read of `tx_conf` before `lock_sock` in `tls_sk_proto_close` has been present since the TLS ULP was introduced (~v4.13). All kernels with `CONFIG_TLS` enabled (built in or as a module) in the v4.13–v7.1 range are likely affected, subject to confirmation against each stable branch.

Earliest introducing commit (approximate):

```
e8f69799810c ("net/tls: Add generic NIC offload infrastructure", 2018-07-13)
```

or the commit that split `tls_sk_proto_close` into its current form.

---

## Exact Vulnerable Code

### net/tls/tls_main.c: unlocked read + free

```c
/* Lines 365-399 (Linux 7.1-rc3) */
static void tls_sk_proto_close(struct sock *sk, long timeout)
{
	struct inet_connection_sock *icsk = inet_csk(sk);
	struct tls_context *ctx = tls_get_ctx(sk);
	long timeo = sock_sndtimeo(sk, 0);
	bool free_ctx;

	if (ctx->tx_conf == TLS_SW)		/* ← L372: READ WITHOUT lock_sock (BUG) */
		tls_sw_cancel_work_tx(ctx);	/* ← L373: SKIPPED when race wins */

	lock_sock(sk);				/* ← L375: lock acquired too late */
	free_ctx = ctx->tx_conf != TLS_HW && ctx->rx_conf != TLS_HW;

	if (ctx->tx_conf != TLS_BASE || ctx->rx_conf != TLS_BASE)
		tls_sk_proto_cleanup(sk, ctx, timeo);

	write_lock_bh(&sk->sk_callback_lock);
	if (free_ctx)
		rcu_assign_pointer(icsk->icsk_ulp_data, NULL);
	WRITE_ONCE(sk->sk_prot, ctx->sk_proto);
	if (sk->sk_write_space == tls_write_space)
		sk->sk_write_space = ctx->sk_write_space;
	write_unlock_bh(&sk->sk_callback_lock);
	release_sock(sk);
	if (ctx->tx_conf == TLS_SW)		/* ← L389: second read (sees TLS_SW when race is won) */
		tls_sw_free_ctx_tx(ctx);	/* ← L390: kfree(tls_sw_context_tx) FREE */
	...
}
```

### net/tls/tls_main.c: setsockopt sets tx_conf under lock

```c
/* Lines 757-758, inside do_tls_setsockopt_conf(), which holds lock_sock */
if (tx)
	ctx->tx_conf = conf;	/* ← sets TLS_SW under lock_sock */
```

### net/tls/tls_sw.c: cancel_work_tx, the call that is skipped

```c
/* Lines 2539-2546 */
void tls_sw_cancel_work_tx(struct tls_context *tls_ctx)
{
	struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx);

	set_bit(BIT_TX_CLOSING, &ctx->tx_bitmask);	/* prevent new work */
	set_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask);
	disable_delayed_work_sync(&ctx->tx_work.work);	/* wait for in-flight */
}
```

Without this call, `BIT_TX_CLOSING` is never set, so `tx_work_handler` does not return early at line 2650 and proceeds to access freed memory.

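The gating role of the two bits can be illustrated with a small user-space model. This is only a sketch: the `model_*` names are hypothetical, the bit positions are illustrative rather than the kernel's values, and C11 atomics stand in for the kernel's `set_bit`/`test_and_set_bit`/`test_and_clear_bit` helpers. It models the bitmask logic only, not the workqueue or `disable_delayed_work_sync()`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative bit positions; not taken from the kernel headers. */
enum { MODEL_BIT_TX_SCHEDULED = 0, MODEL_BIT_TX_CLOSING = 1 };

struct model_tx_ctx {
	atomic_ulong tx_bitmask;
	int records_sent;	/* stands in for tls_tx_records() touching ctx */
};

/* Returns the previous value of the bit after setting it. */
static bool model_test_and_set_bit(int bit, atomic_ulong *mask)
{
	unsigned long old = atomic_fetch_or(mask, 1UL << bit);
	return (old >> bit) & 1UL;
}

/* Returns the previous value of the bit after clearing it. */
static bool model_test_and_clear_bit(int bit, atomic_ulong *mask)
{
	unsigned long old = atomic_fetch_and(mask, ~(1UL << bit));
	return (old >> bit) & 1UL;
}

/* Mirrors the two set_bit() calls in tls_sw_cancel_work_tx(). */
static void model_cancel_work_tx(struct model_tx_ctx *ctx)
{
	atomic_fetch_or(&ctx->tx_bitmask, 1UL << MODEL_BIT_TX_CLOSING);
	atomic_fetch_or(&ctx->tx_bitmask, 1UL << MODEL_BIT_TX_SCHEDULED);
}

/* Mirrors the scheduling test in tls_encrypt_done(): true if work is queued. */
static bool model_try_schedule(struct model_tx_ctx *ctx)
{
	return !model_test_and_set_bit(MODEL_BIT_TX_SCHEDULED, &ctx->tx_bitmask);
}

/* Mirrors the early returns in tx_work_handler(): true if ctx was used. */
static bool model_work_handler(struct model_tx_ctx *ctx)
{
	if ((atomic_load(&ctx->tx_bitmask) >> MODEL_BIT_TX_CLOSING) & 1UL)
		return false;
	if (!model_test_and_clear_bit(MODEL_BIT_TX_SCHEDULED, &ctx->tx_bitmask))
		return false;
	ctx->records_sent++;
	return true;
}

/* Work is queued, then close runs with or without the cancel step. */
bool model_handler_touches_ctx(bool cancel_called)
{
	struct model_tx_ctx ctx;

	atomic_init(&ctx.tx_bitmask, 0);
	ctx.records_sent = 0;
	bool queued = model_try_schedule(&ctx);
	assert(queued);		/* first scheduling attempt always succeeds */
	if (cancel_called)
		model_cancel_work_tx(&ctx);
	return model_work_handler(&ctx);
}

/* After cancel, a late completion callback can no longer queue new work. */
bool model_schedule_after_cancel(void)
{
	struct model_tx_ctx ctx;

	atomic_init(&ctx.tx_bitmask, 0);
	ctx.records_sent = 0;
	model_cancel_work_tx(&ctx);
	return model_try_schedule(&ctx);
}
```

In this model, `model_handler_touches_ctx(false)` corresponds to the racy close path: the handler passes both checks and uses the context. With the cancel step, the handler returns at the first check, and setting `BIT_TX_SCHEDULED` in advance also keeps a late callback from queueing new work.
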
### net/tls/tls_sw.c: delayed work scheduler (1 jiffy after crypto callback)

```c
/* Lines 515-517, tls_encrypt_done() */
if (!test_and_set_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask))
	schedule_delayed_work(&ctx->tx_work.work, 1);	/* 1 jiffy delay */

/* Lines 521-522, tls_encrypt_done() */
if (atomic_dec_and_test(&ctx->encrypt_pending))
	complete(&ctx->async_wait.completion);	/* wakes tls_encrypt_async_wait */
```

`tls_encrypt_async_wait` returns first (completion fires before the 1-jiffy delay), so `tls_sw_free_ctx_tx` at L390 can race with the pending delayed work.

### net/tls/tls_sw.c: UAF sites in tx_work_handler

```c
/* Lines 2637-2668 */
static void tx_work_handler(struct work_struct *work)
{
	struct delayed_work *delayed_work = to_delayed_work(work);
	struct tx_work *tx_work = container_of(delayed_work,
					       struct tx_work, work);
	struct sock *sk = tx_work->sk;
	struct tls_context *tls_ctx = tls_get_ctx(sk);
	struct tls_sw_context_tx *ctx;

	if (unlikely(!tls_ctx))
		return;

	ctx = tls_sw_ctx_tx(tls_ctx);	/* freed pointer */
	if (test_bit(BIT_TX_CLOSING, &ctx->tx_bitmask))	/* UAF READ L2650 */
		return;

	if (!test_and_clear_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask))	/* UAF READ/WRITE */
		return;

	if (mutex_trylock(&tls_ctx->tx_lock)) {
		lock_sock(sk);
		tls_tx_records(sk, -1);	/* UAF: tx_list */
		release_sock(sk);
		mutex_unlock(&tls_ctx->tx_lock);
	} else if (!test_and_set_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask)) {	/* UAF WRITE */
		schedule_delayed_work(&ctx->tx_work.work,	/* func ptr on freed */
				      msecs_to_jiffies(10));
	}
}
```

### net/tls/tls_sw.c: UAF sites in tls_encrypt_done

```c
/* Lines 467-522 */
static void tls_encrypt_done(void *data, int err)
{
	...
	ctx = tls_sw_ctx_tx(tls_ctx);	/* freed pointer */
	...
	ctx->async_wait.err = err;	/* UAF WRITE L497 */
	...
	first_rec = list_first_entry(&ctx->tx_list,	/* UAF READ L511 */
				     struct tls_rec, list);

	if (!test_and_set_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask))	/* UAF READ/WRITE L515 */
		schedule_delayed_work(&ctx->tx_work.work, 1);

	if (atomic_dec_and_test(&ctx->encrypt_pending))	/* UAF READ/WRITE L521 */
		complete(&ctx->async_wait.completion);	/* UAF WRITE */
}
```

---

## Race Condition Timeline

```
Thread A: close(fd)                          Thread B: setsockopt(fd, SOL_TLS, TLS_TX)
══════════════════════════════════════════════════════════════════════════════════
                                             setsockopt(fd, SOL_TLS, TLS_TX, &info)
                                               do_tls_setsockopt_conf()
                                                 lock_sock(sk)
                                                 tls_set_sw_offload(sk, tx=1)
                                                   kzalloc_obj(*sw_ctx_tx) → alloc
                                                   INIT_DELAYED_WORK(&sw_ctx_tx->tx_work)
                                                   crypto_aead_encrypt() → -EINPROGRESS
                                                   atomic_inc(&ctx->encrypt_pending)
                                                 ctx->tx_conf = TLS_SW      ← L758
                                                 release_sock(sk)
close(fd)
  tls_sk_proto_close(sk)
    READ ctx->tx_conf → TLS_BASE             ← race window: setsockopt set it after this read!
    (tls_sw_cancel_work_tx NOT called; BIT_TX_CLOSING never set)

                                             [async encrypt callback fires]
                                             tls_encrypt_done():
                                               schedule_delayed_work(..., 1)   ← +1 jiffy
                                               complete(&ctx->async_wait.completion)

    lock_sock(sk)                            ← L375
    tls_sk_proto_cleanup(sk):
      tls_sw_release_resources_tx():
        tls_encrypt_async_wait()             ← returns (completion already fired)
        crypto_free_aead(ctx->aead_send)
    release_sock(sk)
    READ ctx->tx_conf → TLS_SW               ← L389: now sees TLS_SW
    tls_sw_free_ctx_tx():
      kfree(ctx)                             ← tls_sw_context_tx FREED ──────┐
                                                                             │
                                             [1 jiffy later, workqueue]      │
                                             tx_work_handler():              │
                                               ctx = tls_sw_ctx_tx(tls_ctx) ←┘ FREED
                                               test_bit(BIT_TX_CLOSING, ...)   ← UAF READ
                                               tls_tx_records(sk, -1)          ← UAF
```

---

## Freed Object

```c
/* include/net/tls.h */
struct tls_sw_context_tx {
	struct crypto_aead *aead_send;	/* offset 0x00 */
	struct crypto_wait async_wait;	/* offset 0x08 */
	struct tx_work tx_work;		/* offset 0x28, contains delayed_work */
	struct tls_rec *open_rec;	/* offset 0x50 */
	struct list_head tx_list;	/* offset 0x58 */
	atomic_t encrypt_pending;	/* offset 0x68 */
	u8 async_capable:1;
	unsigned long tx_bitmask;	/* BIT_TX_SCHEDULED, BIT_TX_CLOSING */
};
/* allocated via kzalloc_obj(*sw_ctx_tx) → kmalloc-256 slab */
```

`tx_work.work` (a `struct delayed_work`) is at a fixed offset within the freed chunk. Its embedded `work_struct.func` is the function pointer called by the workqueue.

---

## Privilege Requirements

| Requirement | Value |
|---|---|
| Root / CAP_NET_ADMIN | Not required |
| CAP_NET_RAW | Not required |
| Network namespace | Default (init_net) |
| Minimum privilege | Unprivileged user with TCP socket access |
| Kernel config | `CONFIG_TLS=y` or `=m` (enabled on most distros) |
| Async crypto | Required for the 1-jiffy UAF window; synchronous crypto still triggers the state inconsistency |

---

## Exploitation Scenarios

### Scenario 1: Crash / DoS (reliability: high)

Even without a controlled allocation, `tx_work_handler` traversing the freed `ctx->tx_list` will likely corrupt memory and trigger a kernel BUG/oops within seconds of the race firing.

### Scenario 2: Information Leak / KASLR Defeat

1. Win the race → `tls_sw_context_tx` (kmalloc-256) is freed.
2. Spray `kmalloc-256` objects from user space before the 1-jiffy deadline:
   - `msg_msg` bodies (via `msgsnd()`)
   - `pipe_buffer` structures
   - `sk_buff` headers
3. `tls_encrypt_done()` fires and reads from the reclaimed chunk:
   - `list_first_entry(&ctx->tx_list, ...)` → follows attacker-controlled pointer
   - Returned pointer is dereferenced as a `tls_rec *`
4. Any kernel pointer stored by the spray object in that slot leaks to the attacker via timing or error paths → KASLR broken.

### Scenario 3: Arbitrary Write

`complete(&ctx->async_wait.completion)` calls `wake_up_process()` on `x->wait.task_list.next`. If the freed chunk is reclaimed with a controlled `swait_queue_head`, `wake_up_process()` writes to an attacker-controlled `task_struct` pointer.

### Scenario 4: Local Privilege Escalation (LPE), Full Root

1. KASLR defeated (Scenario 2 first).
2. Spray the freed 256-byte slot so that `ctx->tx_work.work.func` (at a known offset within the freed chunk) contains the address of a kernel ROP gadget or directly `commit_creds(prepare_kernel_cred(0))`.
3. When `schedule_delayed_work(&ctx->tx_work.work, 10ms)` is called by `tx_work_handler` on the reclaimed chunk, the workqueue executes the attacker's function in kernel context (a kworker thread).
4. Overwrite `current->cred` → uid=gid=0 → root shell.

---

## Reproducer

### Build

```bash
gcc -O2 -pthread -o poc-tls-uaf-race poc-tls-uaf-race.c
```

### Run

```bash
sudo modprobe tls        # ensure TLS ULP module is loaded
./poc-tls-uaf-race       # run race loop
sudo dmesg | grep -A 40 "BUG: KASAN: use-after-free"
```

### Expected KASAN output (CONFIG_KASAN=y kernel)

```
==================================================================
BUG: KASAN: use-after-free in tx_work_handler+0x.../net/tls/tls_sw.c:2649
Read of size 8 at addr ffff... by task kworker/...

CPU: 1 PID: ... Comm: kworker/...
Call Trace:
 tx_work_handler
 process_one_work
 worker_thread
 kthread
 ret_from_fork
...
Freed by task ...:
 tls_sw_free_ctx_tx
 tls_sk_proto_close
 inet_release
 sock_close
==================================================================
```

### Verifying the race without KASAN

Use `ftrace` to log `tx_conf` values at close entry and compare:

```bash
echo 'p:probe_close tls_sk_proto_close ctx->tx_conf=%cx' > \
    /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/events/kprobes/probe_close/enable
./poc-tls-uaf-race
grep "tx_conf=0" /sys/kernel/debug/tracing/trace   # 0=TLS_BASE, race hit
```

A `tx_conf=0` at `tls_sk_proto_close` entry while `tx_conf` later becomes 1 (TLS_SW) before `kfree` confirms the race window.

---

## Proposed Fix

Move the `tx_conf` check and `tls_sw_cancel_work_tx()` call to **after** `lock_sock()` so that the read is protected by the same lock that `setsockopt` uses when writing `tx_conf`:

```diff
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -365,10 +365,10 @@ static void tls_sk_proto_close(struct sock *sk, long timeout)
 	long timeo = sock_sndtimeo(sk, 0);
 	bool free_ctx;
 
-	if (ctx->tx_conf == TLS_SW)
-		tls_sw_cancel_work_tx(ctx);
-
 	lock_sock(sk);
+	/* tx_conf must be read under lock_sock to avoid TOCTOU with setsockopt */
+	if (ctx->tx_conf == TLS_SW)
+		tls_sw_cancel_work_tx(ctx);
+
 	free_ctx = ctx->tx_conf != TLS_HW && ctx->rx_conf != TLS_HW;
```

This one-block move ensures that `tls_sw_cancel_work_tx()` is always called before any cleanup when `tx_conf` is `TLS_SW`, regardless of concurrent `setsockopt`. One interaction still needs validation before the patch can be considered complete: `tx_work_handler` itself takes `lock_sock(sk)`, so waiting for it in `disable_delayed_work_sync()` while the socket lock is held could deadlock.

---

## References

- Subsystem maintainers: Jakub Kicinski <[email protected]>, John Fastabend <[email protected]>
- Related prior work: CVE-2023-0461 (different TLS UAF, listening socket context)
- Slab cache: `kmalloc-256`
- PoC file: `poc-tls-uaf-race.c` (attached)
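
The ordering argument behind the race and the proposed fix can be checked with a deterministic user-space model. This is a minimal sketch under stated assumptions: all `model_*` and `SLOT_*` names are hypothetical, a concurrent `setsockopt` is assumed to leave delayed TX work pending, and only the position of the `tx_conf` check relative to `lock_sock()` is modelled. It does not model the locking between `tx_work_handler` and the socket lock, so it says nothing about deadlock behaviour.

```c
#include <assert.h>
#include <stdbool.h>

enum model_conf { MODEL_TLS_BASE = 0, MODEL_TLS_SW = 1 };

/* Where the concurrent setsockopt() completes relative to close(). */
enum model_slot {
	SLOT_BEFORE_CLOSE,	/* tx_conf is already TLS_SW when close() starts */
	SLOT_IN_RACE_WINDOW,	/* after the unlocked read, before lock_sock() */
	SLOT_NEVER,		/* TLS_TX is never configured */
	SLOT_COUNT
};

struct model_state {
	enum model_conf tx_conf;
	bool work_pending;	/* delayed TX work is queued */
	bool ctx_freed;		/* tls_sw_free_ctx_tx() has run */
};

static void model_setsockopt(struct model_state *s)
{
	s->tx_conf = MODEL_TLS_SW;
	s->work_pending = true;	/* assumption: async completion queued work */
}

/* Stands in for tls_sw_cancel_work_tx(): pending work is flushed. */
static void model_cancel(struct model_state *s)
{
	s->work_pending = false;
}

/* Returns true if queued work outlives the free (the UAF condition). */
bool model_close(enum model_slot slot, bool check_under_lock)
{
	struct model_state s = { MODEL_TLS_BASE, false, false };

	assert(slot < SLOT_COUNT);

	if (slot == SLOT_BEFORE_CLOSE)
		model_setsockopt(&s);

	/* Current ordering: tx_conf is sampled before lock_sock(). */
	if (!check_under_lock && s.tx_conf == MODEL_TLS_SW)
		model_cancel(&s);

	if (slot == SLOT_IN_RACE_WINDOW)
		model_setsockopt(&s);

	/* lock_sock() is taken here; proposed ordering checks tx_conf now. */
	if (check_under_lock && s.tx_conf == MODEL_TLS_SW)
		model_cancel(&s);

	/* release_sock(); the second read decides whether to free. */
	if (s.tx_conf == MODEL_TLS_SW)
		s.ctx_freed = true;

	return s.ctx_freed && s.work_pending;
}

/* Counts how many setsockopt() placements end in the UAF condition. */
int model_count_uaf(bool check_under_lock)
{
	int n = 0;

	for (int slot = 0; slot < SLOT_COUNT; slot++)
		n += model_close((enum model_slot)slot, check_under_lock);
	return n;
}
```

With the current ordering (`check_under_lock == false`), exactly one placement, `SLOT_IN_RACE_WINDOW`, frees the context while work is still pending. With the check moved under the lock, none of the modelled placements does.
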
