Hi Marc,

I measured the time from vcpu_load() (inclusive) to __guest_enter() on Kunpeng 920. On average, it takes 2.55 microseconds (not the first run, and with the VPT empty). So waiting for 10 microseconds in vcpu scheduling really hurts performance.

And I agree that delaying the execution of its_wait_vpt_parse_complete() might be a viable solution.

-----Original Message-----
From: Marc Zyngier [mailto:m...@kernel.org]
Sent: 2020-09-16 16:40
To: lushenming <lushenm...@huawei.com>
Cc: Thomas Gleixner <t...@linutronix.de>; Jason Cooper <ja...@lakedaemon.net>; linux-kernel@vger.kernel.org; Wanghaibin (D) <wanghaibin.w...@huawei.com>; yuzenghui <yuzeng...@huawei.com>
Subject: Re: [PATCH] irqchip/gic-v4.1: Optimize the delay time of the poll on the GICR_VPENDBASER.Dirty bit

On 2020-09-16 08:04, lushenming wrote:
> Hi,
>
> Our team just discussed this issue again and consulted our GIC
> hardware design team. They think the RD can afford busy waiting. So we
> still think maybe 0 is better, at least for our hardware.
>
> In addition, if not 0, as I said before, in our measurement, it takes
> only hundreds of nanoseconds, or 1~2 microseconds, to finish parsing
> the VPT in most cases. So maybe 1 microseconds, or smaller, is more
> appropriate.
> Anyway, 10 microseconds is too much.
>
> But it has to be said that it does depend on the hardware
> implementation.

Exactly. And given that the only publicly available implementation is a
software model, I am reluctant to change "performance" related things
based on benchmarks that can't be verified and appears to me as a micro
optimization.

> Besides, I'm not sure where are the start and end point of the total
> scheduling latency of a vcpu you said, which includes many events. Is
> the parse time of the VPT not clear enough?

Measure the time it takes from kvm_vcpu_load() to the point where the vcpu
enters the guest. How much, in proportion, do these 1/2/10ms represent?

Also, a better(?) course of action would maybe to consider whether we
should split the its_vpe_schedule() call into two distinct operations:
one that programs the VPE to be resident, and another that poll the Dirty
bit *much later* on the entry path, giving the GIC a chance to work in
parallel with the CPU on the entry path.

If your HW is a quick as you say it is, it would pretty much guarantee
a clear read of GICR_VPENDBASER without waiting.

        M.
-- 
Jazz is not dead. It just smells funny...
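
For readers following the thread, the split proposed above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel's irq-gic-v3-its.c code and not a patch from this thread: struct fake_rd, the model_*() helpers and enter_guest_model() are hypothetical names, the redistributor is simulated by a counter of "ticks" of VPT parsing work, and the Valid/Dirty bit positions are what I understand the GICv4.1 layout of GICR_VPENDBASER to be. It shows that if enough unrelated entry work happens between making the VPE resident and polling Dirty, the poll finishes on its first read.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit positions for GICR_VPENDBASER (GICv4.1); illustration only. */
#define VPENDBASER_VALID (1ULL << 63)
#define VPENDBASER_DIRTY (1ULL << 60)

/* Hypothetical stand-in for a redistributor that is parsing the VPT. */
struct fake_rd {
	uint64_t vpendbaser;
	unsigned int parse_ticks_left;	/* simulated VPT parsing work */
};

/* Step 1: make the VPE resident; the "hardware" starts parsing the VPT. */
static void model_make_resident(struct fake_rd *rd, unsigned int parse_ticks)
{
	rd->vpendbaser = VPENDBASER_VALID;
	rd->parse_ticks_left = parse_ticks;
	if (parse_ticks > 0)
		rd->vpendbaser |= VPENDBASER_DIRTY;
}

/* One unit of time passes; the GIC parses in parallel with the CPU. */
static void model_tick(struct fake_rd *rd)
{
	if (rd->parse_ticks_left > 0)
		rd->parse_ticks_left--;
	if (rd->parse_ticks_left == 0)
		rd->vpendbaser &= ~VPENDBASER_DIRTY;
}

/* Step 2: poll Dirty; returns how many polls were spent busy waiting. */
static unsigned int model_wait_parse_complete(struct fake_rd *rd)
{
	unsigned int polls = 0;

	/* Polling only makes sense while the VPE is resident. */
	assert(rd->vpendbaser & VPENDBASER_VALID);

	while (rd->vpendbaser & VPENDBASER_DIRTY) {
		model_tick(rd);	/* each poll costs one tick of waiting */
		polls++;
	}
	return polls;
}

/*
 * Model of the entry path: make resident, do other_work_ticks of
 * unrelated entry work, then poll. Returns the polls spent waiting.
 */
unsigned int enter_guest_model(unsigned int parse_ticks,
			       unsigned int other_work_ticks)
{
	struct fake_rd rd = { 0 };

	model_make_resident(&rd, parse_ticks);
	for (unsigned int i = 0; i < other_work_ticks; i++)
		model_tick(&rd);
	return model_wait_parse_complete(&rd);
}
```

In this model, enter_guest_model(3, 0) corresponds to polling immediately after making the VPE resident and spends three polls waiting, while enter_guest_model(3, 5) corresponds to polling late on the entry path and spends none. How many "ticks" real hardware needs depends on the implementation, as noted earlier in the thread.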