On Thu, Jun 25, 2026 at 05:07:23PM -0700, Ackerley Tng wrote: > Yan Zhao <[email protected]> writes: > > > On Wed, Jun 24, 2026 at 04:00:32PM -0700, Ackerley Tng wrote: > >> Sean Christopherson <[email protected]> writes: > >> > >> > On Tue, Jun 23, 2026, Yan Zhao wrote: > >> >> On Tue, Jun 23, 2026 at 01:16:14PM +0800, Yan Zhao wrote: > >> >> > On Mon, Jun 22, 2026 at 06:22:45PM -0700, Sean Christopherson wrote: > >> >> > > On Mon, Jun 22, 2026, Yan Zhao wrote: > >> >> > > > On Thu, Jun 18, 2026 at 05:32:00PM -0700, Ackerley Tng via B4 > >> >> > > > Relay wrote: > >> >> > > > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c > >> >> > > > > index ffe9d0db58c59..56d10333c61a7 100644 > >> >> > > > > --- a/arch/x86/kvm/vmx/tdx.c > >> >> > > > > +++ b/arch/x86/kvm/vmx/tdx.c > >> >> > > > > @@ -3198,8 +3198,12 @@ static int tdx_gmem_post_populate(struct > >> >> > > > > kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, > >> >> > > > > if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm)) > >> >> > > > > return -EIO; > >> >> > > > > > >> >> > > > > - if (!src_page) > >> >> > > > > - return -EOPNOTSUPP; > >> >> > > > > + if (!src_page) { > >> >> > > > > + if (!gmem_in_place_conversion) > >> >> > > > When userspace turns on gmem_in_place_conversion while creating > >> >> > > > guest_memfd > >> >> > > > without the MMAP flag, the absence of src_page should still be > >> >> > > > treated as an > >> >> > > > error. > >> >> > > > >> >> > > Why MMAP? > >> >> > Hmm, I was showing a scenario that in-place conversion couldn't occur. > >> >> > I didn't mean that with the MMAP flag, mmap() and user write must > >> >> > occur. > >> >> > > >> >> > > Shouldn't this be a general "if (!src_page && !up-to-date)"? Just > >> >> > > because userspace _can_ mmap() the memory doesn't mean userspace > >> >> > > _has_ mmap()'d > >> >> > > and written memory. And when write() lands, MMAP wouldn't be > >> >> > > necessary to > >> >> > > initialize the memory. > >> >> > Do you mean using up-to-date flag as below? > >> > > >> > Yes? I didn't actually look at the implementation details. > >> > > >> >> > if (!src_page) { > >> >> > src_page = pfn_to_page(pfn); > >> >> > if (!folio_test_uptodate(page_folio(src_page))) > >> >> > return -EOPNOTSUPP; > >> >> > } > >> > >> Yan is right that with the earlier patch "Zero page while getting pfn", > >> folio_test_uptodate() here will always return true. > >> > >> Actually, this is an alternative fix for the issue Sashiko pointed out > >> on v7 where userspace can do a populate() (either TDX or SNP) without > >> first allocating the page, with src_address == NULL, and leak > >> uninitialized memory into the guest. > >> > >> Advantage of using the uptodate check in populate: if the host never > >> allocates the page, populate doesn't incur zeroing before writing the > >> page anyway in populate(). > >> > >> Disadvantage: Both TDX and SNP will have to implement this uptodate > >> check. guest_memfd can't check centrally because for SNP, for a > >> PAGE_TYPE_ZERO, !src_page should be allowed with a !uptodate page since > >> firmware will zero and there's no leakage of uninitialized host memory? > > Another disadvantage: the uptodate flag is per-folio. What if the folio > > is only partially initialized by the userspace especially after huge page is > > supported? > > > > Good point on huge pages! > > The uptodate flag on the folio in guest_memfd means "this folio has been > written to". As of now (before patch at [1]), this happens when > > + folio is zeroed on first use by userspace > + folio is zeroed on first use of the guest > + folio is populated > > When huge pages are supported, the folio can't partially be initialized? > > On allocation, if any part is shared, we split the page. The parts are > separate folios that have their own uptodate flags. > > On splitting, if the huge page is uptodate, the split pages will also be > uptodate. If the huge page is not uptodate, the split pages won't be > uptodate, but that's ok since they will be marked uptodate on first use. > > On merging, the non-uptodate parts have to be zeroed and then marked If that's true, it would be good.
> uptodate. Any parts that are in use would have been marked uptodate > already, so there's no overwriting data that is in use. I'll need to > think more about when it's safe to zero. > > I'm still on the fence between the two options > > 1. Using uptodate check in populate to reject src_pages that have never > been written to or > 2. Always zero before populate 2 does not work? The flow is 1. mmap gmem_fd, make GFN shared, and write initial content. 2. convert GFN to private 3. invoke ioctl to trigger populate. > but whether the uptodate flag is per-folio or not doesn't affect these > two options in terms of fixing the leak of uninitialized host memory, > right? yes, provided "On merging, the non-uptodate parts have to be zeroed and then marked uptodate". > > > >> >> Another concern with this fix is that: > >> >> commit "KVM: guest_memfd: Zero page while getting pfn" [1] always marks > >> >> the > >> >> folio uptodate before reaching post_populate(). > >> >> > >> >> [1] > >> >> https://lore.kernel.org/all/[email protected]/ > >> >> > >> >> > One concern is that TDX now does not much care about the up-to-date > >> >> > flag since > >> >> > TDX doesn't rely on the flag to clear pages on conversions. > >> >> > I'm not sure if the flag can be reliably checked in this case. e.g., > >> >> > now the whole folio is marked up-to-date even if only part of it is > >> >> > faulted by > >> >> > user access. > >> >> > Ensuring that the up-to-date flag works correctly with huge page > >> >> > support seems > >> >> > to have more effort than introducing a dedicated flag for TDX. > >> >> > > >> >> > > > Additionally, to properly enable in-place copying for the TDX > >> >> > > > initial memory > >> >> > > > region, userspace must not only specify source_addr to NULL, but > >> >> > > > also follow > >> >> > > > a specific sequence (where steps 1/2/3/7 are required only for > >> >> > > > in-place copy): > >> >> > > > 1. create guest_memfd with MMAP flag > >> >> > > > 2. mmap the guest_memfd. > >> >> > > > 3. convert the initial memory range to shared. > >> >> > > > 4. copy initial content to the source page. > >> >> > > > 5. convert the initial memory range to private > >> >> > > > 6. invoke ioctl KVM_TDX_INIT_MEM_REGION. > >> >> > > > 7. do not unmap the source backend. > >> >> > > > > >> >> > > > So, would it be reasonable to introduce a dedicated flag that > >> >> > > > allows userspace > >> >> > > > to explicitly opt into the in-place copy functionality? e.g., > >> >> > > > >> >> > > Why? It's userspace's responsibility to get the above right. If > >> >> > > userspace fails > >> >> > > to provide a src_page when it doesn't want in-place copy, that's a > >> >> > > userspace bug. > >> > >> Yan, is your concern that userspace forgot to update the code and > >> forgets to provide a src_page, and if we keep the "Zero page while > > Yes. Previously, it would be rejected after GUP fails. > > > > I see, didn't realize previously it would be rejected because GUP > fails. GUP failed because it wasn't faulted into the host? GUP fails if 0 is not a valid user address. But GUP would not fail if 0 is a valid address. e.g., in below scenario: #include <sys/mman.h> #include <stdio.h> int main(void) { void *p=mmap((void*)0,4096,PROT_READ|PROT_WRITE, MAP_FIXED|MAP_PRIVATE|MAP_ANONYMOUS,-1,0); if (p==MAP_FAILED) { perror("mmap"); return 1; } *(char*)0='Y'; printf("addr0=%p val=%c\n",p,*(char*)0); return 0; } > That's kind of orthogonal, I don't think GUP fail leading to rejecting > populate was meant to help userspace catch these issues. GUP would also > fail if the user did mmap(), write to it, unmap using > madvise(MADV_DONTNEED), then forget and pass 0 as src_address. The original uAPI did not explicitly define 0 as an invalid uaddr. Whether 0 was rejected depended on whether the user mmap()'d address 0. If 0 was a valid mapping, populate() could proceed. commit 2a62345b3052 ("KVM: guest_memfd: GUP source pages prior to populating guest memory") changed the behavior though. It would return -EOPNOTSUPP for a 0 uaddr. But if a user configures 0 uaddr as valid, writes to it, and then passes 0 as source_addr(not from gmem), I'm not sure if it's good for the kernel to silently treat 0 uaddr as an identifier for in-place copy from the private PFN in gmem. > >> getting pfn" patch, ends up with the guest silently having a zero page? > >> I think that would be found quite early in userspace VMM testing... > > I actually encountered this during testing this patch. > > I update most code path to follow this sequence. However, still some corner > > ones > > for TDVF HOB, which are less obvious and harder to update. > > The TD just booted up and hang silently. > > > > I think this is just the life of a close-to-hardware software engineer > :P no errors, got stuck somewhere, root cause is some unitialized > thing. > > >> >> > I mean if userspace specifies a NULL source_addr by mistake, it's > >> >> > better for > >> >> > kernel to detect this mistake, similar to how it validates whether > >> >> > source_addr > >> >> > is PAGE_ALIGNED. > >> > > >> > The alignment case is different. If userspace provides an unaligned > >> > value, KVM > >> > *can't* do what userspace is asking because hardware and thus KVM only > >> > supports > >> > converting on page boundaries. > >> > > >> > For a NULL source, KVM can still do what userspace is asking. Rejecting > >> > userspace's > >> > request would then be making assumptions about what userspace wants. > >> > > >> > >> Also, +1 on this, what if userspace, knowing that pages are zeroed on > >> allocation, actually wants to rely on that to get a zero page in the guest? > > What if 0 uaddr is a valid address? :) > > > >> >> > Since userspace already needs to perform additional steps to enable > >> >> > in-place > >> >> > copy, specifying a dedicated flag to indicate that the NULL > >> >> > source_addr is > >> >> > intentional seems like a reasonable burden. > >> > > >> > I don't see how it adds any value. I wouldn't be at all surprised if > >> > most VMMs > >> > just wen up with code that does: > >> > > >> > if (in-place) { > >> > src = NULL; > >> > flags |= KVM_TDX_IN_PLACE_COPY_INITIAL_MEMORY_REGION; > >> > } > >>
