Andrea Arcangeli wrote:
On Fri, Jun 06, 2008 at 11:49:23AM +0300, Avi Kivity wrote:
Andrea Arcangeli wrote:
One day we want to sort the slots according to size. We'll need better locking then (rcu, likely).
I think it's more interesting to sort them on their start/end gfn
address, prio_tree would be the optimal structure for me to use in the
mmu notifier invalidates as I could ask the prio tree to show me in
O(log(N)) (N number of slots) all the slots that overlap with the
invalidated start/end range. However I'm afraid there aren't enough
slots for this to matter... but OTOH there aren't frequent
modifications either, so it may be a microoptimization (if there were
frequent modifications with such a small number of slots it likely
would be slower than a list).
There are only two interesting slots, 1G-end_of_mem and 0-pci_hole. A sorted array gives an average of less than two lookups to find the slot, with the smallest possible constant. Any other algorithm will give more lookups with a far higher constant.

Sometimes linear search is the best algorithm.
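As a sketch of the linear lookup under discussion (the slot layout and field names here are illustrative, not KVM's exact structures):

```c
#include <stddef.h>

/* Hypothetical, simplified memslot: base gfn plus page count. */
struct memslot {
    unsigned long base_gfn;
    unsigned long npages;
};

/* Linear scan: with only ~2 interesting slots this beats any
 * pointer-chasing tree, and the first match wins for gfn->hva. */
static struct memslot *gfn_to_memslot(struct memslot *slots, size_t n,
                                      unsigned long gfn)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (gfn >= slots[i].base_gfn &&
            gfn < slots[i].base_gfn + slots[i].npages)
            return &slots[i];
    return NULL;
}
```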

But we can't break the loop after the first match! I think you're not
taking into account the aliasing done with overlapping gfns in the
memslots (which apparently isn't forbidden, and which you planned to
use to obsolete the explicit aliasing logic)

For gfn->hva (the common case) we can break immediately. For hva->gfn, in general we cannot, but we can add an "unaliased" flag to the memslot and set it for slots which have no aliases. That makes the loop terminate early again.
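A sketch of that early-termination idea, simplified to return a single gfn (real code would act on every aliased match, e.g. clear each spte; the structure and flag names are assumptions):

```c
#include <stddef.h>

/* Hypothetical slot with a host-virtual base and an "unaliased" hint. */
struct memslot {
    unsigned long base_gfn;
    unsigned long npages;
    unsigned long userspace_addr;   /* hva of the first page */
    int unaliased;                  /* no other slot maps these pages */
};

/* hva -> gfn: in general every slot must be checked because aliased
 * slots can map the same host pages under different gfns, but a slot
 * flagged unaliased lets the loop stop after the first match. */
static long hva_to_gfn(struct memslot *slots, size_t n, unsigned long hva)
{
    long gfn = -1;
    size_t i;

    for (i = 0; i < n; i++) {
        unsigned long start = slots[i].userspace_addr;
        unsigned long end = start + (slots[i].npages << 12);

        if (hva >= start && hva < end) {
            gfn = slots[i].base_gfn + ((hva - start) >> 12);
            if (slots[i].unaliased)
                break;  /* no alias can also map this hva */
        }
    }
    return gfn;
}
```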

Only a prio_tree can reduce the number of walks optimally, given the
gfn overlaps and the search by start/end address. Sorting by size
won't help because we can't break the loop after the first match. If
a prio tree is too heavyweight, the next best option is a
single-index tree like an rbtree sorted by start address, as it at
least eliminates all the memslots whose start address is higher than
the end of the invalidate range. But a prio_tree eliminates them all
in O(log(N)), so on paper prio_tree is best (though perhaps in
practice rbtree is better, dunno).
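For concreteness, here is the linear form of the range walk a prio_tree would accelerate: the overlap test each slot must pass against the invalidated gfn range (again with an illustrative slot structure):

```c
#include <stddef.h>

struct memslot {
    unsigned long base_gfn;
    unsigned long npages;
};

/* Count slots overlapping the invalidated [start, end) gfn range.
 * This is the O(N) walk; a prio_tree would instead hand back just
 * the overlapping slots in O(log N + matches). */
static size_t slots_in_range(struct memslot *slots, size_t n,
                             unsigned long start, unsigned long end)
{
    size_t i, hits = 0;

    for (i = 0; i < n; i++) {
        unsigned long s = slots[i].base_gfn;
        unsigned long e = s + slots[i].npages;

        if (s < end && start < e)   /* half-open interval overlap */
            hits++;
    }
    return hits;
}
```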

Any pointer-based data structure is bound to be much slower than a list with such a small number of elements.

btw, on 64-bit userspace we can arrange the entire physical address space in a contiguous region (some of it unmapped) and have just one slot for the guest.

I forgot, last round: EPT doesn't have PT_ACCESSED_MASK, so you're reading (and clearing) some random bit. We can easily qualify the test by looking at shadow_accessed_mask; if zero, the hardware doesn't have a young bit.

There's a bigger question of what to do in the case of EPT. We can always return true, or return true and also unmap. The first means we lie to the vm, the second means we take unnecessary exits to maintain age information.
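A minimal sketch of qualifying the test by shadow_accessed_mask as suggested above (the global and function names are assumptions, not KVM's exact code):

```c
#include <stdint.h>

/* Illustrative: a mask the mmu code would set to PT_ACCESSED_MASK on
 * shadow/NPT paging, and to 0 on EPT, which has no accessed bit. */
static uint64_t shadow_accessed_mask;

/* Test-and-clear "young" on an spte. With no hardware accessed bit
 * (mask == 0) we report not-young, so the page stays evictable
 * instead of being pinned in host RAM by a lying "young" answer. */
static int spte_test_and_clear_young(uint64_t *spte)
{
    if (!shadow_accessed_mask)
        return 0;
    if (*spte & shadow_accessed_mask) {
        *spte &= ~shadow_accessed_mask;
        return 1;
    }
    return 0;
}
```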

If there's no accessed bit (and we can't use the guest pte accessed
bit, because an EPT entry represents a physical page and not a guest
pte anymore), we're forced to return 0 without doing anything at all.
Returning 1 is unsafe, as it would pin the page in host RAM.

If there's no accessed bit, all we can do is wait for the VM to unmap
the page, mark the spte not present during the invalidate_page mmu
notifier, and wait for the guest to trigger a minor fault from
swapcache.

Okay.  It's sad, but I don't see any choice.

If anyone from Intel is listening, please give us an accessed bit (and a dirty bit, too).

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.
