Peter Xu <pet...@redhat.com> writes:

> Currently, the postcopy blocktime feature maintains vCPU fault information
> using an array (vcpu_addr[]).  It has two issues.
>
> Issue 1: Performance Concern
> ============================
>
> The old algorithm was almost OK and fast on inserts, except that the lookup
> is slow and won't scale if there are a lot of vCPUs: when a page is copied
> during postcopy, mark_postcopy_blocktime_end() will walk the whole array
> trying to find which vCPUs are blocked by the address.  So it needs
> constant O(N) walk for each page resolution.
>
> Alexey (the author of postcopy blocktime) mentioned the perf issue and how
> to optimize it in a piece of comment in the page resolution path.  The
> comment was (interestingly..) not complete, but it's relatively clear what
> he wanted to say about this perf issue.
>
> Issue 2: Wrong Accounting on re-entrancies
> ==========================================
>
> People might think that each vCPU should only and always get one fault at a
> time, so that when the blocktime layer captured one fault on one vCPU, we
> should never see another fault message on this vCPU.
>
> It's almost correct, except in some extreme rare cases.
>
> Case 1: it's possible the fault thread processes the userfaultfd messages
> too fast so it can see >1 messages on one vCPU before the previous one was
> resolved.
>
> Case 2: it's theoretically also possible one vCPU can get even more than
> one message on the same fault address if a fault is retried by the
> kernel (e.g., handle_userfault() got interrupted before page resolution).
>
> As this info might be important, instead of using commit message, I put
> more details into the code as comment, when introducing an array
> maintaining concurrent faults on one vCPU.  Please refer to the comments
> for details on both cases, especially case 1 which can be tricky.
>
> Case 1 sounds rare, but it can be easily reproduced locally for me when we
> run blocktime together with the migration-test on the vanilla postcopy.
>
> New Design
> ==========
>
> This patch should do almost what Alexey mentioned, but slightly
> differently: instead of having an array to maintain vCPU fault addresses,
> for each of the fault message we push a message into a hash, indexed by the
> fault address.
>
> With the hash, it can replace the old two structs: both the vcpu_addr[]
> array, and also the array to store the start time of the fault.  However
> due to above we need one more counter array to account concurrent faults on
> the same vCPU - that should even be needed in the old code, it's just that
> the old code was buggy and it will blindly overwrite an existing
> entry.. now we'll start to really track everything.
>
> The hash structure might be more efficient than tree to maintain such
> addr->(cpu, fault_time) information, so that the insert() and lookup()
> paths should ideally both be ~O(1).  After all, we do not need to sort.
>
> Here we need to do one remove() though after the lookup().  It could be
> slow but only if many vCPUs faulted exactly on the same address (so when
> the list of cpu entries is long), which should be unlikely. Even with that,
> it's still a worst case O(N) (consider 400 vCPUs faulted on the same
> address and how likely is it..) rather than a constant O(N) complexity.
>
> When at it, touch up the tracepoints to make them slightly more useful.
> One tracepoint is added when walking all the fault entries.
>
> Signed-off-by: Peter Xu <pet...@redhat.com>

Reviewed-by: Fabiano Rosas <faro...@suse.de>

Reply via email to