On Mon, Mar 09, 2026, Ackerley Tng wrote:
> Set release always on guest_memfd mappings to enable the use of
> .invalidate_folio, which performs inode accounting for guest_memfd.
>
> Signed-off-by: Ackerley Tng <[email protected]>
> ---
> virt/kvm/guest_memfd.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 77219551056a7..8246b9fbcf832 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -607,6 +607,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
> mapping_set_inaccessible(inode->i_mapping);
> /* Unmovable mappings are supposed to be marked unevictable as well. */
> WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
> + mapping_set_release_always(inode->i_mapping);
*sigh*
So... an internal AI review bot flagged setting AS_RELEASE_ALWAYS as being
potentially problematic, and I started poking around, mostly because I was
curious. I'm pretty sure the exact scenario painted by the bot isn't possible,
but I do think a similar issue exists in at least truncate_error_folio(). Or at
least, *should* exist, but doesn't because of a different bug.
On memory error, kvm_gmem_error_folio() will get invoked via this code. Note
the "err != 0" check. kvm_gmem_error_folio() returns MF_DELAYED, which has an
arbitrary value of '2', and so KVM is always signalling "failure".
	int err = mapping->a_ops->error_remove_folio(mapping, folio);
	if (err != 0)
		pr_info("%#lx: Failed to punch page: %d\n", pfn, err);
	else if (!filemap_release_folio(folio, GFP_NOIO))
		pr_info("%#lx: failed to release buffers\n", pfn);
I _think_ that's bad? On x86, if I'm following the breadcrumbs correctly, we'll
end up in this code in kill_me_maybe():
		pr_err("Memory error not recovered");
		kill_me_now(cb);
and send what I assume is a relatively useless SIGBUS and likely kill the VM.
	struct task_struct *p = container_of(ch, struct task_struct,
					     mce_kill_me);

	p->mce_count = 0;
	force_sig(SIGBUS);
But even if that's somehow the "right" behavior, we're doing it purely by
accident.
As for this patch, if we fix that bug by returning 0, then
filemap_release_folio() is definitely reachable by at least one flow, so I
think guest_memfd also needs to implement release_folio()?
Full AI bot text:
--
Setting the AS_RELEASE_ALWAYS flag causes folio_needs_release() to return
true. This correctly triggers .invalidate_folio during truncation, but does
it also unintentionally expose guest_memfd folios to eviction via
posix_fadvise(POSIX_FADV_DONTNEED)?
If userspace calls posix_fadvise() on a guest_memfd file, the core mm
calls mapping_evict_folio(). Because folio_needs_release() is true, it
calls filemap_release_folio().
Since guest_memfd does not implement a .release_folio address space
operation, filemap_release_folio() falls back to calling
try_to_free_buffers(). Could this fallback cause a warning?
fs/buffer.c:try_to_free_buffers() {
	...
	/* Misconfigured folio check */
	if (WARN_ON_ONCE(!folio_buffers(folio)))
		return true;
	...
}
Because the guest_memfd folio has no private data, folio_buffers()
is NULL, which will trigger this WARN_ON_ONCE.
Furthermore, try_to_free_buffers() returns true, allowing the folio to be
removed from the page cache. Because this eviction path bypasses
truncate_cleanup_folio(), it never calls .invalidate_folio.
Does this mean inode_sub_bytes() is skipped, leaking the inode block
accounting?
Userspace could potentially trigger the warning and infinitely inflate the
inode's block count with:
struct kvm_create_guest_memfd args = { .size = 4096 };
int fd = ioctl(kvm_vm_fd, KVM_CREATE_GUEST_MEMFD, &args);
fallocate(fd, 0, 0, 4096);
posix_fadvise(fd, 0, 4096, POSIX_FADV_DONTNEED);
Should guest_memfd implement a .release_folio callback that simply
returns false to prevent these folios from being evicted?
--