On 01/07/26 11:43, Shrikanth Hegde wrote:
Hi Vishal,
Hi Shrikanth, Thanks for looking into it.
On 6/26/26 4:23 PM, Vishal Chourasia wrote:
Use the generic infrastructure to check for and handle pending work
before transitioning into guest mode, replacing the open-coded
need_resched() and cond_resched() checks.
This picks up handling for TIF_NOTIFY_RESUME, which was previously
ignored, meaning task work will now be correctly handled on every
guest re-entry.
Yes
It would indeed be good if powerpc moves go generic
VIRT_XFER_TO_GUEST_WORK.
It does take care of RESUME.
In addition today, I doubt powerpc kvm works well for LAZY preemption.
generic infra will take care of it too.
nice
Signed-off-by: Vishal Chourasia <[email protected]>
---
arch/powerpc/kvm/Kconfig | 1 +
arch/powerpc/kvm/book3s_hv.c | 58 +++++++++++++++++++++++++++++++-----
2 files changed, 52 insertions(+), 7 deletions(-)
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 9a0d1c1aca6c..36aec58c5f22 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -81,6 +81,7 @@ config KVM_BOOK3S_64_HV
depends on KVM_BOOK3S_64 && PPC_POWERNV
select KVM_BOOK3S_HV_POSSIBLE
select KVM_BOOK3S_HV_PMU
+ select VIRT_XFER_TO_GUEST_WORK
This takes care of HV only.
Does PR/booke run into the same problem?
Yes, I would think so, if the KVM vCPU runs inside system.
Does anyone out there who runs them still and care about cgroup
bandwidth control mechanism on it?
It may be worth adding comment that PR still runs into the same issue.
select CMA
help
Support running unmodified book3s_64 guest kernels in
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 61dbeea317f3..b012512342e6 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3850,10 +3850,20 @@ static noinline void kvmppc_run_core(struct
kvmppc_vcore *vc)
* and return without going into the guest(s).
* If the mmu_ready flag has been cleared, don't go into the
* guest because that means a HPT resize operation is in progress.
+ *
+ * xfer_to_guest_mode_work_pending() is the IRQs-disabled
recheck for
+ * pending guest-mode work (reschedule, signals, and
TIF_NOTIFY_RESUME
+ * task_work such as the deferred CFS throttle). It is the
pre-POWER9
+ * analog of the final gate in kvmhv_run_single_vcpu(), and a
superset
+ * of the old need_resched() check: it catches work that raced
in after
+ * the drain in kvmppc_run_vcpu(), so a CPU-bound vCPU is
throttled here
+ * instead of running one more guest dispatch past its quota.
IRQs are
+ * hard-disabled just above, so the non-__ variant (which
asserts that)
+ * is the correct one.
*/
local_irq_disable();
hard_irq_disable();
- if (lazy_irq_pending() || need_resched() ||
+ if (lazy_irq_pending() || xfer_to_guest_mode_work_pending() ||
recheck_signals_and_mmu(&core_info)) {
local_irq_enable();
vc->vcore_state = VCORE_INACTIVE;
@@ -4824,10 +4834,24 @@ static int kvmppc_run_vcpu(struct kvm_vcpu
*vcpu)
vc->runner = vcpu;
if (n_ceded == vc->n_runnable) {
kvmppc_vcore_blocked(vc);
- } else if (need_resched()) {
+ } else if (__xfer_to_guest_mode_work_pending()) {
kvmppc_vcore_preempt(vc);
- /* Let something else run */
- cond_resched_lock(&vc->lock);
+ /*
+ * Let something else run, and run pending guest-mode
+ * work (reschedule, and TIF_NOTIFY_RESUME task_work such
+ * as the deferred CFS throttle) before we would re-enter
+ * the guest, so a CPU-bound vCPU is actually throttled
+ * here instead of running past its quota. This is a
+ * superset of the old need_resched() check. Use the raw
+ * helper, not the kvm_ wrapper: signals (KVM_EXIT_INTR
+ * and the signal_exits stat) are accounted by this path's
+ * existing handling below, so going through the wrapper
+ * here would double-count them. The helper may schedule(),
+ * so the vcore lock is dropped around it.
+ */
+ spin_unlock(&vc->lock);
+ xfer_to_guest_mode_handle_work();
+ spin_lock(&vc->lock);
if (vc->vcore_state == VCORE_PREEMPT)
kvmppc_vcore_end_preempt(vc);
} else {
@@ -4899,8 +4923,21 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu
*vcpu, u64 time_limit,
}
}
- if (need_resched())
- cond_resched();
+ /*
+ * Run pending work before (re-)entering the guest, most
importantly
+ * task_work queued via TWA_RESUME (e.g. the deferred CFS bandwidth
+ * throttle, which only sets TIF_NOTIFY_RESUME). Without this a
CPU-bound
+ * vCPU that keeps returning RESUME_GUEST never reaches an
exit-to-user
+ * point, so the throttle is never enforced and the task runs
far beyond
+ * its quota. The helper also handles reschedule and signals,
replacing
+ * the cond_resched() that was here. It may schedule(), so it
runs before
+ * preemption and IRQs are disabled, with no vcore/KVM locks
held. This
+ * is the per-reentry site shared by the bare-metal and pseries
(nested)
+ * paths, so both are covered.
+ */
+ r = kvm_xfer_to_guest_mode_handle_work(vcpu);
+ if (r) /* -EINTR: signal pending, exit to userspace
(KVM_EXIT_INTR) */
+ return r;
kvmppc_update_vpas(vcpu);
@@ -4916,7 +4953,14 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu
*vcpu, u64 time_limit,
if (signal_pending(current))
goto sigpend;
xfer_to_guest_mode_work_pending checks for signals too right?
#define XFER_TO_GUEST_MODE_WORK \
(_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY | _TIF_SIGPENDING | \
_TIF_NOTIFY_SIGNAL | _TIF_NOTIFY_RESUME | \
ARCH_XFER_TO_GUEST_MODE_WORK)
Yes, I see originally we only had need_resched() check. will remove in v2
- if (need_resched() || !kvm->arch.mmu_ready)
+ /*
+ * Re-check for pending guest-mode work with IRQs disabled, to
catch
+ * anything (e.g. a TIF_NOTIFY_RESUME task_work such as the
deferred CFS
+ * throttle) that raced in after the check above. Bail back to
the outer
+ * loop, which re-enters here and runs the work. This is a
superset of
+ * the previous need_resched() check.
+ */
+ if (xfer_to_guest_mode_work_pending() || !kvm->arch.mmu_ready)
goto out;
vcpu->cpu = pcpu;
Copying from the discussion thread at that time,
"""
on x86:
kvm_arch_vcpu_ioctl_run
vcpu_run
for () {
.. run guest..
xfer_to_guest_mode_handle_work
schedule
}
on Powerpc: ( taking book3s_hv flavour):
kvm_arch_vcpu_ioctl_run
kvmppc_vcpu_run_hv *1
do while() {
kvmhv_run_single_vcpu or kvmppc_run_vcpu
-- checking for need_resched and signals and bails out *2
}
*1 - checks for need resched and signals before entering guest
*2 - checks for need resched and signals while running the guest
This patch is addressing only *1 but it needs to address *2 as well
using generic framework.
Not sure about *2 (while running the guest) part. Here the patch checks
twice before entering the guest.
Once when IRQs are not disabled and once after disabling IRQs.
I think it is doable for books3s_hv atleast. (though might need rewrite)
do what exactly?
"""
A few questions that comes to my mind.
1. I think you are doing for both *1 and *2 right? Is *1 necessary
still for powerpc?
yes, before entering the guest we check twice. Once when IRQs are not
disabled and once after disabling IRQs.
2. There is a for loop. What is it doing? Does this still need similar
checks?
if (is_kvmppc_resume_guest(r) &&
!kvmppc_vcpu_check_block(vcpu)) {
kvmppc_set_timer(vcpu);
prepare_to_rcuwait(wait);
for (;;) {
set_current_state(TASK_INTERRUPTIBLE);
if (signal_pending(current)) {
vcpu->stat.signal_exits++;
run->exit_reason = KVM_EXIT_INTR;
vcpu->arch.ret = -EINTR;
break;
}
if (kvmppc_vcpu_check_block(vcpu))
break;
trace_kvmppc_vcore_blocked(vcpu, 0);
schedule();
trace_kvmppc_vcore_blocked(vcpu, 1);
}
finish_rcuwait(wait);
}
vcpu->arch.ceded = 0;
IIUC, the vCPU has voluntarily went idle in the guest and it's
about to re-enter, but there's genuinely nothing to deliver yet.
So, it parks itself here until, something actually wake this vCPU.
I am not quite sure, if signal_pending() should be replaced with
xfer_*_pending() check, because vCPU is not re-entering the guest.
Adding the xfer_*_pending() check here will break the loop even
when other flags causing busy spins (cede/re-enter).
PS: cc'ing me would have helped me to see the mail earlier since you
had referred
to the patch I had sent.
Yes. I missed it.