book3s_hv: Use generic xfer to guest work function

Vishal Chourasia Wed, 01 Jul 2026 11:05:29 -0700


On 01/07/26 11:43, Shrikanth Hegde wrote:

Hi Vishal,

Hi Shrikanth, Thanks for looking into it.


On 6/26/26 4:23 PM, Vishal Chourasia wrote:

Use the generic infrastructure to check for and handle pending work
before transitioning into guest mode, replacing the open-coded
need_resched() and cond_resched() checks.

This picks up handling for TIF_NOTIFY_RESUME, which was previously
ignored, meaning task work will now be correctly handled on every
guest re-entry.

Yes

It would indeed be good if powerpc moves go genericVIRT_XFER_TO_GUEST_WORK.
It does take care of RESUME.

In addition today, I doubt powerpc kvm works well for LAZY preemption.
generic infra will take care of it too.

nice


Signed-off-by: Vishal Chourasia <[email protected]>
---
  arch/powerpc/kvm/Kconfig     |  1 +
  arch/powerpc/kvm/book3s_hv.c | 58 +++++++++++++++++++++++++++++++-----
  2 files changed, 52 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 9a0d1c1aca6c..36aec58c5f22 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -81,6 +81,7 @@ config KVM_BOOK3S_64_HV
      depends on KVM_BOOK3S_64 && PPC_POWERNV
      select KVM_BOOK3S_HV_POSSIBLE
      select KVM_BOOK3S_HV_PMU
+    select VIRT_XFER_TO_GUEST_WORK


This takes care of HV only.
Does PR/booke run into the same problem?

Yes, I would think so, if the KVM vCPU runs inside system.

Does anyone out there who runs them still and care about cgroupbandwidth control mechanism on it?
It may be worth adding comment that PR still runs into the same issue.
      select CMA
      help
        Support running unmodified book3s_64 guest kernels in
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 61dbeea317f3..b012512342e6 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3850,10 +3850,20 @@ static noinline void kvmppc_run_core(structkvmppc_vcore *vc)
       * and return without going into the guest(s).
       * If the mmu_ready flag has been cleared, don't go into the
       * guest because that means a HPT resize operation is in progress.
+     *
+ * xfer_to_guest_mode_work_pending() is the IRQs-disabledrecheck for+ * pending guest-mode work (reschedule, signals, andTIF_NOTIFY_RESUME+ * task_work such as the deferred CFS throttle). It is thepre-POWER9+ * analog of the final gate in kvmhv_run_single_vcpu(), and asuperset+ * of the old need_resched() check: it catches work that racedin after+ * the drain in kvmppc_run_vcpu(), so a CPU-bound vCPU isthrottled here+ * instead of running one more guest dispatch past its quota.IRQs are+ * hard-disabled just above, so the non-__ variant (whichasserts that)
+     * is the correct one.
       */
      local_irq_disable();
      hard_irq_disable();
-    if (lazy_irq_pending() || need_resched() ||
+    if (lazy_irq_pending() || xfer_to_guest_mode_work_pending() ||
          recheck_signals_and_mmu(&core_info)) {
          local_irq_enable();
          vc->vcore_state = VCORE_INACTIVE;
@@ -4824,10 +4834,24 @@ static int kvmppc_run_vcpu(struct kvm_vcpu*vcpu)
          vc->runner = vcpu;
          if (n_ceded == vc->n_runnable) {
              kvmppc_vcore_blocked(vc);
-        } else if (need_resched()) {
+        } else if (__xfer_to_guest_mode_work_pending()) {
              kvmppc_vcore_preempt(vc);
-            /* Let something else run */
-            cond_resched_lock(&vc->lock);
+            /*
+             * Let something else run, and run pending guest-mode
+             * work (reschedule, and TIF_NOTIFY_RESUME task_work such
+             * as the deferred CFS throttle) before we would re-enter
+             * the guest, so a CPU-bound vCPU is actually throttled
+             * here instead of running past its quota. This is a
+             * superset of the old need_resched() check. Use the raw
+             * helper, not the kvm_ wrapper: signals (KVM_EXIT_INTR
+             * and the signal_exits stat) are accounted by this path's
+             * existing handling below, so going through the wrapper
+             * here would double-count them. The helper may schedule(),
+             * so the vcore lock is dropped around it.
+             */
+            spin_unlock(&vc->lock);
+            xfer_to_guest_mode_handle_work();
+            spin_lock(&vc->lock);
              if (vc->vcore_state == VCORE_PREEMPT)
                  kvmppc_vcore_end_preempt(vc);
          } else {
@@ -4899,8 +4923,21 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu*vcpu, u64 time_limit,
          }
      }
  -    if (need_resched())
-        cond_resched();
+    /*
+ * Run pending work before (re-)entering the guest, mostimportantly
+     * task_work queued via TWA_RESUME (e.g. the deferred CFS bandwidth
+ * throttle, which only sets TIF_NOTIFY_RESUME). Without this aCPU-bound+ * vCPU that keeps returning RESUME_GUEST never reaches anexit-to-user+ * point, so the throttle is never enforced and the task runsfar beyond+ * its quota. The helper also handles reschedule and signals,replacing+ * the cond_resched() that was here. It may schedule(), so itruns before+ * preemption and IRQs are disabled, with no vcore/KVM locksheld. This+ * is the per-reentry site shared by the bare-metal and pseries(nested)
+     * paths, so both are covered.
+     */
+    r = kvm_xfer_to_guest_mode_handle_work(vcpu);
+ if (r) /* -EINTR: signal pending, exit to userspace(KVM_EXIT_INTR) */
+        return r;
        kvmppc_update_vpas(vcpu);
@@ -4916,7 +4953,14 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu*vcpu, u64 time_limit,
        if (signal_pending(current))
          goto sigpend;
xfer_to_guest_mode_work_pending checks for signals too right?

#define XFER_TO_GUEST_MODE_WORK                                         \
        (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY | _TIF_SIGPENDING | \
         _TIF_NOTIFY_SIGNAL | _TIF_NOTIFY_RESUME |                      \
         ARCH_XFER_TO_GUEST_MODE_WORK)

Yes, I see originally we only had need_resched() check. will remove in v2

-    if (need_resched() || !kvm->arch.mmu_ready)
+    /*
+ * Re-check for pending guest-mode work with IRQs disabled, tocatch+ * anything (e.g. a TIF_NOTIFY_RESUME task_work such as thedeferred CFS+ * throttle) that raced in after the check above. Bail back tothe outer+ * loop, which re-enters here and runs the work. This is asuperset of
+     * the previous need_resched() check.
+     */
+    if (xfer_to_guest_mode_work_pending() || !kvm->arch.mmu_ready)
          goto out;
        vcpu->cpu = pcpu;
Copying from the discussion thread at that time,

"""
on x86:
kvm_arch_vcpu_ioctl_run
    vcpu_run
        for () {
            .. run guest..
            xfer_to_guest_mode_handle_work
                schedule
        }


on Powerpc:  ( taking book3s_hv flavour):
kvm_arch_vcpu_ioctl_run
kvmppc_vcpu_run_hv  *1
    do while() {
        kvmhv_run_single_vcpu or kvmppc_run_vcpu
            -- checking for need_resched and signals and bails out *2
    }


*1 - checks for need resched and signals before entering guest
*2 - checks for need resched and signals while running the guest
This patch is addressing only *1 but it needs to address *2 as wellusing generic framework.

Not sure about *2 (while running the guest) part. Here the patch checkstwice before entering the guest.

Once when IRQs are not disabled and once after disabling IRQs.

I think it is doable for books3s_hv atleast. (though might need rewrite)

do what exactly?

"""


A few questions that comes to my mind.
1. I think you are doing for both *1 and *2 right? Is *1 necessarystill for powerpc?

yes, before entering the guest we check twice. Once when IRQs are notdisabled and once after disabling IRQs.

2. There is a for loop. What is it doing? Does this still need similarchecks?

if (is_kvmppc_resume_guest(r) &&!kvmppc_vcpu_check_block(vcpu)) {

                kvmppc_set_timer(vcpu);

                prepare_to_rcuwait(wait);
                for (;;) {
                        set_current_state(TASK_INTERRUPTIBLE);
                        if (signal_pending(current)) {
                                vcpu->stat.signal_exits++;
                                run->exit_reason = KVM_EXIT_INTR;
                                vcpu->arch.ret = -EINTR;
                                break;
                        }

                        if (kvmppc_vcpu_check_block(vcpu))
                                break;

                        trace_kvmppc_vcore_blocked(vcpu, 0);
                        schedule();
                        trace_kvmppc_vcore_blocked(vcpu, 1);
                }
                finish_rcuwait(wait);
        }
        vcpu->arch.ceded = 0;


IIUC, the vCPU has voluntarily went idle in the guest and it's
about to re-enter, but there's genuinely nothing to deliver yet.
So, it parks itself here until, something actually wake this vCPU.

I am not quite sure, if signal_pending() should be replaced with
xfer_*_pending() check, because vCPU is not re-entering the guest.

Adding the xfer_*_pending() check here will break the loop even
when other flags causing busy spins (cede/re-enter).

PS: cc'ing me would have helped me to see the mail earlier since youhad referred
to the patch I had sent.


Yes. I missed it.

Re: [PATCH 1/1] KVM: powerpc/book3s_hv: Use generic xfer to guest work function

Reply via email to