On 07/11/2025 15:29, Ackerley Tng wrote:
Patrick Roy <[email protected]> writes:

Hey all,

sorry it took me a while to get back to this, turns out moving
internationally is more time consuming than I expected.

On Mon, 2025-09-29 at 12:20 +0200, David Hildenbrand wrote:
On 27.09.25 09:38, Patrick Roy wrote:
On Fri, 2025-09-26 at 21:09 +0100, David Hildenbrand wrote:
On 26.09.25 12:53, Will Deacon wrote:
On Fri, Sep 26, 2025 at 10:46:15AM +0100, Patrick Roy wrote:
On Thu, 2025-09-25 at 21:13 +0100, David Hildenbrand wrote:
On 25.09.25 21:59, Dave Hansen wrote:
On 9/25/25 12:20, David Hildenbrand wrote:
On 25.09.25 20:27, Dave Hansen wrote:
On 9/24/25 08:22, Roy, Patrick wrote:
Add an option to not perform TLB flushes after direct map manipulations.

I'd really prefer this be left out for now. It's a massive can of worms.
Let's agree on something that works and has well-defined behavior before
we go breaking it on purpose.

May I ask what the big concern here is?

It's not a _big_ concern.

Oh, I read "can of worms" and thought there is something seriously problematic 
:)

I just think we want to start on something
like this as simple, secure, and deterministic as possible.

Yes, I agree. And it should be the default. Less secure would have to be opt-in 
and documented thoroughly.

Yes, I am definitely happy to have the 100% secure behavior be the
default, and the skipping of TLB flushes be an opt-in, with thorough
documentation!

But I would like to include the "skip tlb flushes" option as part of
this patch series straight away, because as I was alluding to in the
commit message, with TLB flushes this is not usable for Firecracker for
performance reasons :(

I really don't want that option for arm64. If we're going to bother
unmapping from the linear map, we should invalidate the TLB.

Reading "TLB flushes result in an up to 40x elongation of page faults in
guest_memfd (scaling with the number of CPU cores), or a 5x elongation
of memory population", I can understand why one would want that optimization :)

@Patrick, couldn't we use fallocate() to preallocate memory and batch the TLB 
flush within such an operation?

That is, we wouldn't flush after each individual direct-map modification but 
after multiple ones part of a single operation like fallocate of a larger range.

Likely wouldn't make all use cases happy.


For Firecracker, we rely a lot on not preallocating _all_ VM memory, and
trying to ensure only the actual "working set" of a VM is faulted in (we
pack a lot more VMs onto a physical host than there is actual physical
memory available). For VMs that are restored from a snapshot, we know
pretty well what memory needs to be faulted in (that's where @Nikita's
write syscall comes in), so there we could try such an optimization. But
for everything else we very much rely on the on-demand nature of guest
memory allocation (and hence direct map removal). And even right now,
the long pole performance-wise are these on-demand faults, so really, we
don't want them to become even slower :(

Makes sense. I guess even without support for large folios one could implement a kind of 
fault-around: for example, on access to one address, allocate+prepare all pages 
in the same 2M chunk, flushing the TLB only once after adjusting all the direct map 
entries.


Also, can we really batch multiple TLB flushes as you suggest? Even if
pages are at consecutive indices in guest_memfd, they're not guaranteed
to be continguous physically, e.g. we couldn't just coalesce multiple
TLB flushes into a single TLB flush of a larger range.

Well, there is the option of just flushing the complete TLB of course :) 
When trying to flush a range you would indeed run into the problem of flushing 
an ever-growing range.

In the last guest_memfd upstream call (over a week ago now), we've
discussed the option of batching and deferring TLB flushes, while
providing a sort of "deadline" at which a TLB flush will
deterministically be done.  E.g. guest_memfd would keep a counter of how
many pages got direct map zapped, and do a flush of a range that
contains all zapped pages every 512 allocated pages (and to ensure the
flushes even happen in a timely manner if no allocations happen for a
long time, also every, say, 5 seconds or something like that). Would
that work for everyone? I briefly tested the performance of
batch-flushes with secretmem in QEMU, and it's within 30% of the "no
TLB flushes at all" solution in a simple benchmark that just memsets
2GiB of memory.

I think something like this, together with the batch-flushing at the end
of fallocate() / write() as David suggested above should work for
Firecracker.

There's probably other things we can try. Backing guest_memfd with
hugepages would reduce the number of TLB flushes by 512x (although not
all users of Firecracker at Amazon [can] use hugepages).

Right.


And I do still wonder if it's possible to have "async TLB flushes" where
we simply don't wait for the IPI (x86 terminology, not sure what the
mechanism on arm64 is). Looking at
smp_call_function_many_cond()/invlpgb_kernel_range_flush() on x86, it
seems so? Although it seems like on arm64 it's actually just handled by
a single instruction (TLBI) and not some inter-processor interrupt.
Maybe there's a variant that's faster / better for this use case?

Right, some architectures (and IIRC also x86 with some extension) are able to 
flush remote TLBs without IPIs.

Doing a quick search, there seems to be some research on async TLB flushing, 
e.g., [1].

In the context here, I wonder whether an async TLB flush would be
significantly better than not doing an explicit TLB flush: in both
cases, it's not really deterministic when the relevant TLB entries
will vanish: with the async variant it might happen faster on average
I guess.

I actually did end up playing around with this a while ago, and it made
things slightly better performance wise, but it was still too bad to be
useful :(


Does it help if we add a guest_memfd ioctl that allows userspace to zap
from the direct map to batch TLB flushes?

Could usage be something like:

0. Create guest_memfd with GUEST_MEMFD_FLAG_NO_DIRECT_MAP.
1. write() entire VM memory to guest_memfd.

Hi Ackerley,

This doesn't fully cover our use case. We are not always able to populate the entire guest memory proactively with a write(). Specifically, 1) the memory content may not be available on the host by the time the vCPU accesses the page, and 2) since we don't want to populate zero pages in advance (to save memory on the host), faults on those pages will occur unpredictably and we would have to pay the TLB flush cost on every such fault.

2. ioctl(guest_memfd, KVM_GUEST_MEMFD_ZAP_DIRECT_MAP, { offset, len })
3. vcpu_run()

This way, we could flush the TLB once for the entire range of { offset,
len } instead of zapping once per fault.

For not-yet-allocated folios, those will get zapped once per fault
though.

Maybe this won't help much if the intention is to allow on-demand
loading of memory, since the demands will come to guest_memfd on a
per-folio basis.

Yes, in our setup we rely on both write() + on-demand faulting working concurrently and we can't always predict which of them will handle a specific page.

Nikita



[1] https://cs.yale.edu/homes/abhishek/kumar-taco20.pdf


Best,
Patrick

