Peter Xu <pet...@redhat.com> writes: > Currently, the postcopy blocktime feature maintains vCPU fault information > using an array (vcpu_addr[]). It has two issues. > > Issue 1: Performance Concern > ============================ > > The old algorithm was almost OK and fast on inserts, except that the lookup > is slow and won't scale if there are a lot of vCPUs: when a page is copied > during postcopy, mark_postcopy_blocktime_end() will walk the whole array > trying to find which vCPUs are blocked by the address. So it needs > constant O(N) walk for each page resolution. > > Alexey (the author of postcopy blocktime) mentioned the perf issue and how > to optimize it in a piece of comment in the page resolution path. The > comment was (interestingly..) not complete, but it's relatively clear what > he wanted to say about this perf issue. > > Issue 2: Wrong Accounting on re-entrancies > ========================================== > > People might think that each vCPU should only and always get one fault at a > time, so that when the blocktime layer captured one fault on one vCPU, we > should never see another fault message on this vCPU. > > It's almost correct, except in some extreme rare cases. > > Case 1: it's possible the fault thread processes the userfaultfd messages > too fast so it can see >1 messages on one vCPU before the previous one was > resolved. > > Case 2: it's theoretically also possible one vCPU can get even more than > one message on the same fault address if a fault is retried by the > kernel (e.g., handle_userfault() got interrupted before page resolution). > > As this info might be important, instead of using commit message, I put > more details into the code as comment, when introducing an array > maintaining concurrent faults on one vCPU. Please refer to the comments > for details on both cases, especially case 1 which can be tricky. > > Case 1 sounds rare, but it can be easily reproduced locally for me when we > run blocktime together with the migration-test on the vanilla postcopy. > > New Design > ========== > > This patch should do almost what Alexey mentioned, but slightly > differently: instead of having an array to maintain vCPU fault addresses, > for each of the fault message we push a message into a hash, indexed by the > fault address. > > With the hash, it can replace the old two structs: both the vcpu_addr[] > array, and also the array to store the start time of the fault. However > due to above we need one more counter array to account concurrent faults on > the same vCPU - that should even be needed in the old code, it's just that > the old code was buggy and it will blindly overwrite an existing > entry.. now we'll start to really track everything. > > The hash structure might be more efficient than tree to maintain such > addr->(cpu, fault_time) information, so that the insert() and lookup() > paths should ideally both be ~O(1). After all, we do not need to sort. > > Here we need to do one remove() though after the lookup(). It could be > slow but only if many vCPUs faulted exactly on the same address (so when > the list of cpu entries is long), which should be unlikely. Even with that, > it's still a worst case O(N) (consider 400 vCPUs faulted on the same > address and how likely is it..) rather than a constant O(N) complexity. > > When at it, touch up the tracepoints to make them slightly more useful. > One tracepoint is added when walking all the fault entries. > > Signed-off-by: Peter Xu <pet...@redhat.com>
Reviewed-by: Fabiano Rosas <faro...@suse.de>