Hi! I just published a blogpost (https://googleprojectzero.blogspot.com/2025/08/from-chrome-renderer-code-exec-to-kernel.html) about how I wrote a kernel exploit that uses an AF_UNIX bug to go from "attacker can run arbitrary native code in a sandboxed Chrome renderer" to kernel page table control.
One aspect that I think I should call out in particular is that CONFIG_RANDOMIZE_KSTACK_OFFSET was actually helpful for this exploit - when I was at a point where I already had a (semi-)arbitrary read primitive, I could use the combination of CONFIG_RANDOMIZE_KSTACK_OFFSET and the read primitive to line up things on the stack that would otherwise never have been in the right spot. Quoting two sections from my linked blogpost that are directly relevant for CONFIG_RANDOMIZE_KSTACK_OFFSET:

<<<
## Finding a reallocation target: The magic of `CONFIG_RANDOMIZE_KSTACK_OFFSET`

[...]

I went looking for some other allocation which would place an object such that incrementing the value at address 0x...44 leads to a nice primitive. It would be nice to have something there like an important flags field, or a length specifying the size of a pointer array, or something like that. I spent a lot of time looking at various object types that can be allocated on the kernel heap from inside the Chrome sandbox, but found nothing great.

Eventually, I realized that I had been going down the wrong path. Clearly trying to target a heap object was foolish, because there is something much better: It is possible to reallocate the target page as the topmost page of a kernel stack! That might initially sound like a silly idea; but Debian's kernel config enables `CONFIG_RANDOMIZE_KSTACK_OFFSET=y` and `CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT=y`, **causing each syscall invocation to randomly shift the stack pointer down by up to 0x3f0 bytes, with 0x10-byte granularity**.
That is supposed to be a security mitigation, but it works to my advantage when I already have an arbitrary read: instead of having to find an overwrite target that is at a 0x44-byte distance from the preceding 0x100-byte boundary, I effectively just have to find an overwrite target that is at a 0x4-byte distance from the preceding 0x10-byte boundary, and then keep doing syscalls and checking at what stack depth they execute until I randomly get lucky and the stack lands in the right position.

With that in mind, I went looking for an overwrite target on the stack, strongly inspired by [Seth's exploit that overwrote a spilled register containing a length used in `copy_from_user`](https://googleprojectzero.blogspot.com/2022/12/exploiting-CVE-2022-42703-bringing-back-the-stack-attack.html). Targeting a normal `copy_from_user()` directly wouldn't work here - if I incremented the 64-bit length used inside `copy_from_user()` by 4 GiB, then even if the copy failed midway through due to a userspace fault, `copy_from_user()` would try to `memset()` the remaining kernel memory to zero. I discovered that, on the codepath `pipe_write -> copy_page_from_iter -> copy_from_iter`, the 64-bit length variable `bytes` of `copy_page_from_iter()` is stored in register `R14`, which is spilled to the stack frame of `copy_from_iter()`; and this stack spill is in a stack location where I can clobber it.

When userspace calls `write()` on a pipe, the kernel constructs an iterator (`struct iov_iter`) that encapsulates the userspace memory range passed to `write()`. (There are different types of iterators that can encapsulate a single userspace range, a set of userspace ranges, or various types of kernel memory.)
Then, `pipe_write()` (which is called `anon_pipe_write()` in newer kernels) essentially runs a loop which allocates a new `pipe_buffer` slot in the pipe, places a new page allocation in this pipe buffer slot, and copies up to a page worth of data (`PAGE_SIZE` bytes) from the `iov_iter` to the pipe buffer slot's page using `copy_page_from_iter()`.

`copy_page_from_iter()` effectively receives two length values: the number of bytes that fit into the caller-provided page (`bytes`, initially set to `PAGE_SIZE` here) and the number of bytes available in the `struct iov_iter` encapsulating the userspace memory range (`i->count`). The amount of data that will actually be copied is limited by both.

If I manage to increment the spilled register `R14`, which contains `bytes`, by 4 GiB while `copy_from_iter()` is busy copying data into the kernel, then after `copy_from_iter()` returns, `copy_page_from_iter()` will effectively no longer be bounded by `bytes`, only by `i->count` (based on the length userspace passed to `write()`); so it will do a second iteration, which copies into out-of-bounds memory behind the pipe buffer page. For example, if userspace calls `write(fd, buf, 0x3000)`, and the overwrite happens in the middle of copying bytes 0x1000-0x1fff of the userspace buffer into the second pipe buffer page, then bytes 0x2000-0x2fff will be written out-of-bounds behind the second pipe buffer page, at which point `i->count` will drop to 0, terminating the operation.
>>>

<<<
# Takeaway: probabilistic mitigations against attackers with arbitrary read

When faced with an attacker who already has an arbitrary read primitive, probabilistic mitigations that randomize something differently on every operation can be ineffective, because the attacker can keep retrying until the arbitrary read confirms that the randomization picked a suitable value. They can even work to the attacker's advantage by lining up memory locations that could otherwise never overlap, as done here using the kernel stack randomization feature.

Picking per-syscall random stack offsets once at boot time might avoid this issue, since to retry with different offsets, the attacker would have to wait for the machine to reboot or try again on another machine. However, that would break the protection for cases where the attacker wants to line up two syscalls that use the same syscall number (such as different `ioctl()` calls); and it could also weaken the protection in cases where the attacker just needs to know what the randomization offset for some syscall will be.

Somewhat relatedly, [Blindside](https://www.vusec.net/projects/blindside/) demonstrated that this style of attack can be pulled off without a normal arbitrary read primitive, by “exploiting” a real kernel memory corruption bug during speculative execution in order to leak information needed for subsequently exploiting the same memory corruption bug for real.
>>>