Hello oss-security,

I am disclosing a Linux kernel vulnerability in the TLS ULP subsystem.

Affected component:
  Linux kernel TLS ULP
  File: net/tls/tls_main.c
  Function: tls_sk_proto_close()

Vulnerability type:
  Use-after-free / race condition

Summary:
  There is a race between close() and setsockopt(SOL_TLS, TLS_TX) in the
  Linux kernel TLS ULP subsystem. Under certain interleavings, one thread
  can close a TLS socket while another thread is still operating on
  TLS-related socket state through setsockopt(). This can lead to a
  use-after-free in the TLS socket teardown path.

Impact:
  A local unprivileged user may be able to trigger kernel heap memory
  corruption. Based on my analysis, this may be exploitable for local
  privilege escalation, although I do not have a confirmed full privilege
  escalation exploit.

Affected versions:
  The affected version range is still being confirmed.

Kernel configuration:
  The issue affects systems with CONFIG_TLS enabled. On many distributions
  this is built as a module.

Status:
  This issue was reported to linux-distros on 2026-05-16. I incorrectly
  contacted linux-distros before first getting a fix accepted by the Linux
  kernel maintainers. The latest proposed public disclosure date was
  2026-05-30, and this oss-security posting is being made late.

  As of this posting, I do not have an accepted upstream fix commit to cite.
  I am available to work with the kernel TLS/networking maintainers to
  validate the issue and test a fix.

Reproducer:
  I have a reproducer for the race. I am not including it in this initial
  public posting to avoid unnecessarily increasing harm before a fix is
  available, but I can share it with kernel maintainers on request.

CVE:
  No CVE has been assigned as far as I know. I understand that the Linux
  kernel CNA generally assigns CVEs after a fix commit is available.

AI disclosure:
  AI assistance was used during analysis and report preparation.
  Specifically, OpenAI Codex was used to help inspect the relevant code
  path, reason about the race condition, and draft portions of the
  vulnerability report. I reviewed and take responsibility for the report
  contents.

References:
  Linux kernel security bug process:
  https://docs.kernel.org/process/security-bugs.html

  Related but different public KTLS issue:
  https://www.openwall.com/lists/oss-security/2026/05/07/1

Timeline:
  2026-05-16: Reported to linux-distros
  2026-05-30: Latest agreed public disclosure date
  2026-06-02: Public disclosure to oss-security

Regards,
Oleg Sevostyanov

# Use-After-Free via TOCTOU Race in net/tls: tls_sk_proto_close() reads tx_conf without lock_sock

**Reporter:** Oleg Sevostyanov <[email protected]>
**Date:** 2026-05-16
**Kernel version:** 7.1-rc3 (confirmed; likely present since ~v4.13 when TLS ULP was introduced)
**Subsystem:** net/tls
**Files:**
- `net/tls/tls_main.c` — vulnerable read at line 372
- `net/tls/tls_sw.c`   — UAF sites in tx_work_handler (line 2637), tls_encrypt_done (line 467)

**CWE:** CWE-416 (Use After Free), CWE-362 (Race Condition)
**Severity:** High: potential local privilege escalation (no confirmed end-to-end exploit); no privileges required

---

## Summary

`tls_sk_proto_close()` in `net/tls/tls_main.c` reads the field `ctx->tx_conf` at line 372
**without holding `lock_sock`**.  A concurrent `setsockopt(SOL_TLS, TLS_TX, ...)` call
writes `ctx->tx_conf = TLS_SW` **inside** `lock_sock`.

When the race is won by `setsockopt`, the close path:
1. **Skips** `tls_sw_cancel_work_tx()` (which would set `BIT_TX_CLOSING` and call
   `disable_delayed_work_sync`) because it saw `TLS_BASE` at line 372.
2. **Calls** `tls_sw_free_ctx_tx()` → `kfree(tls_sw_context_tx)` at line 390 because
   the second read at line 389, made after `release_sock()`, now sees `TLS_SW`.
3. A delayed workqueue item (`tx_work_handler`, scheduled 1 jiffy earlier by
   `tls_encrypt_done` or `tls_sw_write_space`) fires after the `kfree`, producing a
   **use-after-free** on the freed `tls_sw_context_tx` object.

No special privileges are required — any unprivileged user with a TCP socket can
trigger the race.

---

## Affected Kernel Versions

The unlocked read of `tx_conf` before `lock_sock` in `tls_sk_proto_close` has been
present since the TLS ULP was introduced (~v4.13).  All kernels with `CONFIG_TLS`
enabled (built in or as a module) in the v4.13–v7.1 range are likely affected,
subject to confirmation against each stable branch.

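Candidate introducing commit (approximate, not yet verified):
```
e8f69799810c ("net/tls: Add generic NIC offload infrastructure", 2018-07-13)
```
or the commit that split `tls_sk_proto_close` into its current form. This candidate
is dated after v4.13 (released September 2017), so the lower bound of the affected
range still needs to be confirmed.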

---

## Exact Vulnerable Code

### net/tls/tls_main.c — unlocked read + free

```c
/* Line 365–399 (Linux 7.1-rc3) */
static void tls_sk_proto_close(struct sock *sk, long timeout)
{
    struct inet_connection_sock *icsk = inet_csk(sk);
    struct tls_context *ctx = tls_get_ctx(sk);
    long timeo = sock_sndtimeo(sk, 0);
    bool free_ctx;

    if (ctx->tx_conf == TLS_SW)          /* ← L372: READ WITHOUT lock_sock  BUG */
        tls_sw_cancel_work_tx(ctx);       /* ← L373: SKIPPED when race wins       */

    lock_sock(sk);                        /* ← L375: lock acquired too late        */
    free_ctx = ctx->tx_conf != TLS_HW && ctx->rx_conf != TLS_HW;

    if (ctx->tx_conf != TLS_BASE || ctx->rx_conf != TLS_BASE)
        tls_sk_proto_cleanup(sk, ctx, timeo);

    write_lock_bh(&sk->sk_callback_lock);
    if (free_ctx)
        rcu_assign_pointer(icsk->icsk_ulp_data, NULL);
    WRITE_ONCE(sk->sk_prot, ctx->sk_proto);
    if (sk->sk_write_space == tls_write_space)
        sk->sk_write_space = ctx->sk_write_space;
    write_unlock_bh(&sk->sk_callback_lock);
    release_sock(sk);

    if (ctx->tx_conf == TLS_SW)          /* ← L389: second read, now sees TLS_SW  */
        tls_sw_free_ctx_tx(ctx);         /* ← L390: kfree(tls_sw_context_tx) FREE */
    ...
}
```

### net/tls/tls_main.c — setsockopt sets tx_conf under lock

```c
/* Line 757–758 — inside do_tls_setsockopt_conf(), which holds lock_sock */
    if (tx)
        ctx->tx_conf = conf;             /* ← sets TLS_SW under lock_sock         */
```

### net/tls/tls_sw.c — cancel_work_tx: what is skipped

```c
/* Line 2539–2546 */
void tls_sw_cancel_work_tx(struct tls_context *tls_ctx)
{
    struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx);

    set_bit(BIT_TX_CLOSING, &ctx->tx_bitmask);      /* prevent new work */
    set_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask);
    disable_delayed_work_sync(&ctx->tx_work.work);  /* wait for in-flight */
}
```

Without this call, `BIT_TX_CLOSING` is never set → `tx_work_handler` does not
return early at line 2650 and proceeds to access freed memory.

### net/tls/tls_sw.c — delayed work scheduler (1 jiffy after crypto callback)

```c
/* Line 515–517 — tls_encrypt_done() */
    if (!test_and_set_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask))
        schedule_delayed_work(&ctx->tx_work.work, 1);   /* 1 jiffy delay */

/* Line 521–522 — tls_encrypt_done() */
    if (atomic_dec_and_test(&ctx->encrypt_pending))
        complete(&ctx->async_wait.completion);  /* wakes tls_encrypt_async_wait */
```

`tls_encrypt_async_wait` returns first (completion fires before the 1-jiffy delay),
so `tls_sw_free_ctx_tx` at L390 can race with the pending delayed work.

### net/tls/tls_sw.c — UAF sites in tx_work_handler

```c
/* Line 2637–2668 */
static void tx_work_handler(struct work_struct *work)
{
    struct delayed_work *delayed_work = to_delayed_work(work);
    struct tx_work *tx_work = container_of(delayed_work,
                             struct tx_work, work);
    struct sock *sk = tx_work->sk;
    struct tls_context *tls_ctx = tls_get_ctx(sk);
    struct tls_sw_context_tx *ctx;

    if (unlikely(!tls_ctx))
        return;

    ctx = tls_sw_ctx_tx(tls_ctx);                       /* freed pointer     */
    if (test_bit(BIT_TX_CLOSING, &ctx->tx_bitmask))     /* UAF READ  L2650  */
        return;

    if (!test_and_clear_bit(BIT_TX_SCHEDULED,
                            &ctx->tx_bitmask))           /* UAF READ/WRITE   */
        return;

    if (mutex_trylock(&tls_ctx->tx_lock)) {
        lock_sock(sk);
        tls_tx_records(sk, -1);                          /* UAF — tx_list    */
        release_sock(sk);
        mutex_unlock(&tls_ctx->tx_lock);
    } else if (!test_and_set_bit(BIT_TX_SCHEDULED,
                                  &ctx->tx_bitmask)) {   /* UAF WRITE        */
        schedule_delayed_work(&ctx->tx_work.work,        /* func ptr on freed*/
                              msecs_to_jiffies(10));
    }
}
```

### net/tls/tls_sw.c — UAF sites in tls_encrypt_done

```c
/* Line 467–522 */
static void tls_encrypt_done(void *data, int err)
{
    ...
    ctx = tls_sw_ctx_tx(tls_ctx);               /* freed pointer            */
    ...
    ctx->async_wait.err = err;                  /* UAF WRITE  L497          */
    ...
    first_rec = list_first_entry(&ctx->tx_list, /* UAF READ   L511          */
                                 struct tls_rec, list);
    if (!test_and_set_bit(BIT_TX_SCHEDULED,
                          &ctx->tx_bitmask))     /* UAF READ/WRITE L515     */
        schedule_delayed_work(&ctx->tx_work.work, 1);
    if (atomic_dec_and_test(&ctx->encrypt_pending)) /* UAF READ/WRITE L521  */
        complete(&ctx->async_wait.completion);   /* UAF WRITE               */
}
```

---

## Race Condition Timeline

```
Thread A — close(fd)                    Thread B — setsockopt(fd, SOL_TLS, TLS_TX)
══════════════════════════════════════════════════════════════════════════════════

close(fd)
  tls_sk_proto_close(sk)
    READ ctx->tx_conf → TLS_BASE        ← L372: unlocked read, before Thread B's write
    (tls_sw_cancel_work_tx NOT called; BIT_TX_CLOSING never set)

                                        setsockopt(fd, SOL_TLS, TLS_TX, &info)
                                          do_tls_setsockopt_conf()
                                          lock_sock(sk)
                                          tls_set_sw_offload(sk, tx=1)
                                            kzalloc_obj(*sw_ctx_tx)       → alloc
                                            INIT_DELAYED_WORK(&sw_ctx_tx->tx_work)
                                            crypto_aead_encrypt()         → -EINPROGRESS
                                            atomic_inc(&ctx->encrypt_pending)
                                          ctx->tx_conf = TLS_SW           ← L758
                                          release_sock(sk)

                                        [async encrypt callback fires]
                                        tls_encrypt_done():
                                          schedule_delayed_work(..., 1)   ← +1 jiffy
                                          complete(&ctx->async_wait.completion)

    lock_sock(sk)                       ← L375
    tls_sk_proto_cleanup(sk):
      tls_sw_release_resources_tx():
        tls_encrypt_async_wait()        ← returns (completion already fired)
        crypto_free_aead(ctx->aead_send)
    release_sock(sk)
    READ ctx->tx_conf → TLS_SW         ← L389: now sees TLS_SW
    tls_sw_free_ctx_tx():
      kfree(ctx)                        ← tls_sw_context_tx FREED ──────┐

                                        [1 jiffy later — workqueue]         │
                                        tx_work_handler():                  │
                                          ctx = tls_sw_ctx_tx(tls_ctx)  ←──┘ FREED
                                          test_bit(BIT_TX_CLOSING, ...)  ← UAF READ
                                          tls_tx_records(sk, -1)         ← UAF
```

---

## Freed Object

```c
/* include/net/tls.h */
struct tls_sw_context_tx {
    struct crypto_aead  *aead_send;     /* offset  0x00 */
    struct crypto_wait   async_wait;    /* offset  0x08 */
    struct tx_work       tx_work;       /* offset  0x28 — contains delayed_work */
    struct tls_rec      *open_rec;      /* offset  0x50 */
    struct list_head     tx_list;       /* offset  0x58 */
    atomic_t             encrypt_pending; /* offset 0x68 */
    u8                   async_capable:1;
    unsigned long        tx_bitmask;    /* BIT_TX_SCHEDULED, BIT_TX_CLOSING */
};
/* allocated via kzalloc_obj(*sw_ctx_tx) → kmalloc-256 slab;
 * offsets above are approximate and config-dependent, confirm with pahole */
```

`tx_work.work` (a `struct delayed_work`) is at a fixed offset within the freed chunk.
Its embedded `work_struct.func` is the function pointer called by the workqueue.

---

## Privilege Requirements

| Requirement | Value |
|---|---|
| Root / CAP_NET_ADMIN | Not required |
| CAP_NET_RAW | Not required |
| Network namespace | Default (init_net) |
| Minimum privilege | Unprivileged user with TCP socket access |
| Kernel config | CONFIG_TLS=y or =m (often built as a module on distributions) |
| Async crypto | Required for the 1-jiffy UAF window; synchronous crypto still triggers the state inconsistency |

---

## Exploitation Scenarios

### Scenario 1 — Crash / DoS (reliability: high)

Even without a controlled allocation, `tx_work_handler` traversing the freed
`ctx->tx_list` will likely corrupt memory and trigger a kernel BUG/oops within
seconds of the race firing.

### Scenario 2 — Information Leak / KASLR Defeat

1. Win the race → `tls_sw_context_tx` (kmalloc-256) is freed.
2. Spray `kmalloc-256` objects from user space before the 1-jiffy deadline:
   - `msg_msg` bodies (via `msgsnd()`)
   - `pipe_buffer` structures
   - `sk_buff` headers
3. `tls_encrypt_done()` fires and reads from the reclaimed chunk:
   - `list_first_entry(&ctx->tx_list, ...)` → follows attacker-controlled pointer
   - Returned pointer is dereferenced as a `tls_rec *`
4. Any kernel pointer stored by the spray object in that slot leaks to attacker
   via timing or error paths → KASLR broken.

### Scenario 3 — Arbitrary Write

`complete(&ctx->async_wait.completion)` calls `wake_up_process()` on
`x->wait.task_list.next`.  If the freed chunk is reclaimed with a controlled
`swait_queue_head`, `wake_up_process()` writes to an attacker-controlled
`task_struct` pointer.

### Scenario 4 — Local Privilege Escalation (LPE) — Full Root

1. KASLR defeated (Scenario 2 first).
2. Spray the freed 256-byte slot so that `ctx->tx_work.work.func` (at a known
   offset within the freed chunk) holds the address of a kernel ROP gadget or
   pivot that ultimately reaches `commit_creds(prepare_kernel_cred(0))`.
3. When `tx_work_handler` re-arms the work on the reclaimed chunk with
   `schedule_delayed_work(&ctx->tx_work.work, msecs_to_jiffies(10))`, the
   workqueue later executes the attacker-chosen function in kworker (process) context.
4. Overwrite `current->cred` → uid=gid=0 → root shell.

---

## Reproducer

### Build

```bash
gcc -O2 -pthread -o poc-tls-uaf-race poc-tls-uaf-race.c
```

### Run

```bash
sudo modprobe tls           # ensure TLS ULP module is loaded
./poc-tls-uaf-race          # run race loop
sudo dmesg | grep -E -A 40 "BUG: KASAN: (slab-)?use-after-free"
```

### Expected KASAN output (CONFIG_KASAN=y kernel)

```
==================================================================
BUG: KASAN: use-after-free in tx_work_handler+0x.../net/tls/tls_sw.c:2650
Read of size 8 at addr ffff... by task kworker/...

CPU: 1 PID: ... Comm: kworker/...
Call Trace:
 tx_work_handler
 process_one_work
 worker_thread
 kthread
 ret_from_fork
...
Freed by task ...:
 tls_sw_free_ctx_tx
 tls_sk_proto_close
 inet_release
 sock_close
==================================================================
```

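### Verifying the race without KASAN

Use a kprobe event (via `ftrace`) to log the byte holding `tx_conf` at close entry:

```bash
# x86-64 sketch: %di holds the first argument (sk). ULP_OFF and CONF_OFF are
# build-specific byte offsets (icsk_ulp_data within the socket, and the byte
# holding the tx_conf bitfield within struct tls_context); set both from
# pahole output before running.
echo "p:probe_close tls_sk_proto_close conf=+${CONF_OFF}(+${ULP_OFF}(%di)):x8" > \
    /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/events/kprobes/probe_close/enable
./poc-tls-uaf-race
# tx_conf is assumed to be the low 3 bits of conf; 0 there means TLS_BASE
grep "probe_close" /sys/kernel/debug/tracing/trace
```

A `tx_conf` value of 0 (`TLS_BASE`) at `tls_sk_proto_close` entry on a socket whose
`tx_conf` later becomes 1 (`TLS_SW`) before the `kfree` confirms the race window.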

---

## Proposed Fix

Move the `tx_conf` check and `tls_sw_cancel_work_tx()` call to **after**
`lock_sock()` so that the read is protected by the same lock that `setsockopt`
uses when writing `tx_conf`:

```diff
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -365,10 +365,10 @@ static void tls_sk_proto_close(struct sock *sk, long timeout)
        long timeo = sock_sndtimeo(sk, 0);
        bool free_ctx;

-       if (ctx->tx_conf == TLS_SW)
-               tls_sw_cancel_work_tx(ctx);
-
        lock_sock(sk);
+       /* tx_conf must be read under lock_sock to avoid TOCTOU with setsockopt */
+       if (ctx->tx_conf == TLS_SW)
+               tls_sw_cancel_work_tx(ctx);
+
        free_ctx = ctx->tx_conf != TLS_HW && ctx->rx_conf != TLS_HW;
```

This one-block move ensures that `tls_sw_cancel_work_tx()` is always called before
any cleanup when `tx_conf` is `TLS_SW`, regardless of concurrent `setsockopt`.

---

## References

- Subsystem maintainers: Jakub Kicinski <[email protected]>,
  John Fastabend <[email protected]>
- Related prior work: CVE-2023-0461 (different TLS UAF — listening socket context)
- Slab cache: `kmalloc-256`
- PoC file: `poc-tls-uaf-race.c` (attached)
