On 03/12/2025 09:23, David Hildenbrand (Red Hat) wrote:
On 12/2/25 12:50, Nikita Kalyazin wrote:
On 01/12/2025 20:57, Peter Xu wrote:
On Mon, Dec 01, 2025 at 08:12:38PM +0000, Nikita Kalyazin wrote:
On 01/12/2025 18:35, Peter Xu wrote:
On Mon, Dec 01, 2025 at 04:48:22PM +0000, Nikita Kalyazin wrote:
I believe I found the precise point where we convinced ourselves
that minor
support was sufficient: [1]. If at this moment we don't find that
reasoning
valid anymore, then indeed implementing missing is the only option.
[1] https://lore.kernel.org/kvm/[email protected]
Now after I re-read the discussion, I may have made a wrong statement
there, sorry. I could have got slightly confused on when the write()
syscall can be involved.
I agree if you want to get an event when cache missed with the
current uffd
definitions and when pre-population is forbidden, then MISSING trap is
required. That is, with/without the need of UFFDIO_COPY being
available.
Do I understand it right that UFFDIO_COPY is not allowed in your
case, but
only write()?
No, UFFDIO_COPY would work perfectly fine. We will still use write()
whenever we resolve stage-2 faults as they aren't visible to UFFD.
When a
userfault occurs at an offset that already has a page in the cache,
we will
have to keep using UFFDIO_CONTINUE so it looks like both will be
required:
- user mapping major fault -> UFFDIO_COPY (fills the cache and
sets up
userspace PT)
- user mapping minor fault -> UFFDIO_CONTINUE (only sets up
userspace PT)
- stage-2 fault -> write() (only fills the cache)
Is stage-2 fault about KVM_MEMORY_EXIT_FLAG_USERFAULT, per James's
series?
Yes, that's the one ([1]).
[1]
https://lore.kernel.org/kvm/[email protected]
It looks fine indeed, but it looks slightly weird then, as you'll
have two
ways to populate the page cache. Logically here atomicity is indeed not
needed when you trap both MISSING + MINOR.
I reran the test based on the UFFDIO_COPY prototype I had using your
series [2], and UFFDIO_COPY is slower than write() to populate 512 MiB:
237 vs 202 ms (+17%). Even though UFFDIO_COPY alone is functionally
sufficient, I would prefer to have an option to use write() where
possible and only falling back to UFFDIO_COPY for userspace faults to
have better performance.
Just so I understand correctly: we could even do without UFFDIO_COPY for
that scenario by using write() + minor faults?
We still need major fault notifications as well (which we were
accidentally generating until this version). But we can resolve them
with write() + UFFDIO_CONTINUE instead of UFFDIO_COPY.
But what you are saying is that there might be a performance benefit in
using UFFDIO_COPY for userspace faults, to avoid the write()+minor fault
overhead?
UFFDIO_COPY _may_ be faster to resolve userspace faults because it's a
single syscall instead of two, but the amount of userspace faults, at
least in our scenario, is negligible compared to the amount of stage-2
faults, so I wouldn't use it as an argument for supporting UFFDIO_COPY
if it can be avoided.
--
Cheers
David