Hi! I just published a blogpost (https://googleprojectzero.blogspot.com/2025/08/from-chrome-renderer-code-exec-to-kernel.html) about how I wrote a kernel exploit that uses an AF_UNIX bug to go from "attacker can run arbitrary native code in a sandboxed Chrome renderer" to kernel page table control.
One aspect that I think I should call out in particular is that CONFIG_RANDOMIZE_KSTACK_OFFSET was actually helpful for this exploit - when I was at a point where I already had a (semi-)arbitrary read primitive, I could use the combination of CONFIG_RANDOMIZE_KSTACK_OFFSET and the read primitive to line up things on the stack that would otherwise never have been in the right spot. Quoting two sections from my linked blogpost that are directly relevant for CONFIG_RANDOMIZE_KSTACK_OFFSET:

<<<
## Finding a reallocation target: The magic of `CONFIG_RANDOMIZE_KSTACK_OFFSET`

[...]

I went looking for some other allocation which would place an object such that incrementing the value at address 0x...44 leads to a nice primitive. It would be nice to have something there like an important flags field, or a length specifying the size of a pointer array, or something like that. I spent a lot of time looking at various object types that can be allocated on the kernel heap from inside the Chrome sandbox, but found nothing great.

Eventually, I realized that I had been going down the wrong path. Clearly trying to target a heap object was foolish, because there is something much better: It is possible to reallocate the target page as the topmost page of a kernel stack! That might initially sound like a silly idea; but Debian's kernel config enables `CONFIG_RANDOMIZE_KSTACK_OFFSET=y` and `CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT=y`, **causing each syscall invocation to randomly shift the stack pointer down by up to 0x3f0 bytes, with 0x10-byte granularity**.
That is supposed to be a security mitigation, but it works to my advantage when I already have an arbitrary read: instead of having to find an overwrite target that is at a 0x44-byte distance from the preceding 0x100-byte boundary, I effectively just have to find an overwrite target that is at a 0x4-byte distance from the preceding 0x10-byte boundary, and then keep doing syscalls and checking at what stack depth they execute until I randomly get lucky and the stack lands in the right position.

With that in mind, I went looking for an overwrite target on the stack, strongly inspired by [Seth's exploit that overwrote a spilled register containing a length used in `copy_from_user`](https://googleprojectzero.blogspot.com/2022/12/exploiting-CVE-2022-42703-bringing-back-the-stack-attack.html). Targeting a normal `copy_from_user()` directly wouldn't work here - if I incremented the 64-bit length used inside `copy_from_user()` by 4 GiB, then even if the copy failed midway through due to a userspace fault, `copy_from_user()` would try to `memset()` the remaining kernel memory to zero. I discovered that, on the codepath `pipe_write -> copy_page_from_iter -> copy_from_iter`, the 64-bit length variable `bytes` of `copy_page_from_iter()` is stored in register `R14`, which is spilled to the stack frame of `copy_from_iter()`; and this stack spill is in a stack location where I can clobber it.

When userspace calls `write()` on a pipe, the kernel constructs an iterator (`struct iov_iter`) that encapsulates the userspace memory range passed to `write()`. (There are different types of iterators that can encapsulate a single userspace range, a set of userspace ranges, or various types of kernel memory.)
Then, `pipe_write()` (which is called `anon_pipe_write()` in newer kernels) essentially runs a loop which allocates a new `pipe_buffer` slot in the pipe, places a new page allocation in this pipe buffer slot, and copies up to a page worth of data (`PAGE_SIZE` bytes) from the `iov_iter` to the pipe buffer slot's page using `copy_page_from_iter()`.

`copy_page_from_iter()` effectively receives two length values: the number of bytes that fit into the caller-provided page (`bytes`, initially set to `PAGE_SIZE` here) and the number of bytes available in the `struct iov_iter` encapsulating the userspace memory range (`i->count`). The amount of data that will actually be copied is limited by both.

If I manage to increment the spilled register `R14`, which contains `bytes`, by 4 GiB while `copy_from_iter()` is busy copying data into the kernel, then after `copy_from_iter()` returns, `copy_page_from_iter()` will effectively no longer be bounded by `bytes`, only by `i->count` (based on the length userspace passed to `write()`); so it will do a second iteration, which copies into out-of-bounds memory behind the pipe buffer page. For example, if userspace calls `write(fd, buf, 0x3000)`, and the overwrite happens in the middle of copying bytes 0x1000-0x1fff of the userspace buffer into the second pipe buffer page, then bytes 0x2000-0x2fff will be written out-of-bounds behind the second pipe buffer page, at which point `i->count` will drop to 0, terminating the operation.
>>>

<<<
# Takeaway: probabilistic mitigations against attackers with arbitrary read

When faced with an attacker who already has an arbitrary read primitive, probabilistic mitigations that randomize something differently on every operation can be ineffective, because the attacker can keep retrying until the arbitrary read confirms that the randomization picked a suitable value. They can even work to the attacker's advantage by lining up memory locations that could otherwise never overlap, as done here using the kernel stack randomization feature.

Picking per-syscall random stack offsets once at boot time might avoid this issue, since to retry with different offsets, the attacker would have to wait for the machine to reboot or try again on another machine. However, that would break the protection for cases where the attacker wants to line up two syscalls that use the same syscall number (such as different `ioctl()` calls); and it could also weaken the protection in cases where the attacker just needs to know what the randomization offset for some syscall will be.

Somewhat relatedly, [Blindside](https://www.vusec.net/projects/blindside/) demonstrated that this style of attack can be pulled off without a normal arbitrary read primitive, by “exploiting” a real kernel memory corruption bug during speculative execution in order to leak information needed for subsequently exploiting the same memory corruption bug for real.
>>>