On 2025-10-06 09:21, Jason Gunthorpe wrote:
On Fri, Oct 03, 2025 at 06:16:14PM -0400, Felix Kuehling wrote:
It sounds like zone_device_page_init is just unsafe to use in
general.
It can only be used if you know the page is freed.
It assumes that pages have a 0 refcount.
Yes
But I don't see a good way for drivers to guarantee that, because
they are not in control of when the page refcounts for their
zone-device pages get decremented.
?? Drivers are supposed to hoook pgmap->ops->page_free() and keep
track.
Right, we have that callback, it's currently only used to track
references to our buffer objects used to back device pages.
There is no way to write a driver without calling
zone_device_page_init() as there is no other defined way to re-use a
page that has been returned through page_free().
It is completely wrong to call get_page() on a 0 refcount folio, we
don't have a debugging crash for this, but we really should. If you
think the refcount could be 0 you have to use a try_get().
So this patch looks wrong to me, I see a page_free() implementation
and this is the only call to zone_device_page_init(). If you remove it
the driver is absolutely broken.
I would expect migration should be writing to freed memory and
zone_device_page_init() is the correct and only way to make freed
memory usable again.
OK. We made an incorrect assumption that we can reuse a page if the
driver isn't tracking it as allocated to any of our SVM ranges (i.e.,
after dev_pagemap_ops.migrate_to_ram() migrated all data out of the
page). However, we neglected that other parts of the kernel can still
hold references to a page even after that.
Would something like this work:
static void
svm_migrate_get_vram_page(struct svm_range *prange, unsigned long pfn)
{
...
if (!try_get_page(page)) {
page->zone_device_data = prange->svm_bo;
zone_device_page_init(page);
}
}
Therefore, I expect the refcount to be 0 when
svm_migrate_ram_to_vram() picks a dst.
If it is not true, and you are tring to migrate to already allocated
VRAM, then WTF?
As I understand it, it's a race condition. The driver is done with the
page and its migrate_to_ram() call has completed. But do_swap_page
hasn't called put_page yet. At the same time, another thread is trying
to reuse the page, migrating data back to VRAM.
Regards,
Felix
And if you really want to do that then yes you need to use get_page
but you need a different path to handle already allocated vs
page_free() called. get_page() MUST NOT be used to unfree page_free'd
memory.
The explanation in the commit doesn't really have enough detail:
1. CPU page fault handler get vram page, migrate the vram page to
system page
2. GPU page fault migrate to the vram page, set page refcount to 1
So why is the same vram page being used for both? For #1 the VRAM page
is installed in a swap entry so it is has an elevated refcount.
The implication is that #2 is targeting already allocated VRAM memory
that is NOT FREE.
Jason