On Fri, Jan 30, 2026 at 08:22:34PM +0000, Anirudh Rayabharam wrote:
> On Fri, Jan 30, 2026 at 10:51:10AM -0800, Stanislav Kinsburskii wrote:
> > On Fri, Jan 30, 2026 at 06:43:09PM +0000, Anirudh Rayabharam wrote:
> > > On Fri, Jan 30, 2026 at 10:37:38AM -0800, Stanislav Kinsburskii wrote:
> > > > On Fri, Jan 30, 2026 at 05:30:25PM +0000, Anirudh Rayabharam wrote:
> > > > > On Thu, Jan 29, 2026 at 11:09:46AM -0800, Stanislav Kinsburskii wrote:
> > > > > > On Thu, Jan 29, 2026 at 05:47:02PM +0000, Michael Kelley wrote:
> > > > > > > From: Stanislav Kinsburskii <[email protected]> 
> > > > > > > Sent: Wednesday, January 21, 2026 2:36 PM
> > > > > > > > 
> > > > > > > > From: Andreea Pintilie <[email protected]>
> > > > > > > > 
> > > > > > > > Query the hypervisor for integrated scheduler support and use
> > > > > > > > it if configured.
> > > > > > > > 
> > > > > > > > Microsoft Hypervisor originally provided two schedulers: root
> > > > > > > > and core. The root scheduler allows the root partition to
> > > > > > > > schedule guest vCPUs across physical cores, supporting both
> > > > > > > > time slicing and CPU affinity (e.g., via cgroups). In contrast,
> > > > > > > > the core scheduler delegates vCPU-to-physical-core scheduling
> > > > > > > > entirely to the hypervisor.
> > > > > > > > 
> > > > > > > > Direct virtualization introduces a new privileged guest
> > > > > > > > partition type - L1 Virtual Host (L1VH) - which can create
> > > > > > > > child partitions from its own resources. These child partitions
> > > > > > > > are effectively siblings, scheduled by the hypervisor's core
> > > > > > > > scheduler. This prevents the L1VH parent from setting affinity
> > > > > > > > or time slicing for its own processes or guest VPs. While
> > > > > > > > cgroups, CFS, and cpuset controllers can still be used, their
> > > > > > > > effectiveness is unpredictable, as the core scheduler swaps
> > > > > > > > vCPUs according to its own logic (typically round-robin across
> > > > > > > > all allocated physical CPUs). As a result, the system may
> > > > > > > > appear to "steal" time from the L1VH and its children.
> > > > > > > > 
> > > > > > > > To address this, Microsoft Hypervisor introduces the
> > > > > > > > integrated scheduler. This allows an L1VH partition to
> > > > > > > > schedule its own vCPUs and those of its guests across its
> > > > > > > > "physical" cores, effectively emulating root scheduler
> > > > > > > > behavior within the L1VH, while retaining core scheduler
> > > > > > > > behavior for the rest of the system.
> > > > > > > > 
> > > > > > > > The integrated scheduler is controlled by the root partition
> > > > > > > > and gated by the vmm_enable_integrated_scheduler capability
> > > > > > > > bit. If set, the hypervisor supports the integrated scheduler.
> > > > > > > > The L1VH partition must then check if it is enabled by querying
> > > > > > > > the corresponding extended partition property. If this property
> > > > > > > > is true, the L1VH partition must use the root scheduler logic;
> > > > > > > > otherwise, it must use the core scheduler.
> > > > > > > > 
> > > > > > > > Signed-off-by: Andreea Pintilie <[email protected]>
> > > > > > > > Signed-off-by: Stanislav Kinsburskii <[email protected]>
> > > > > > > > ---
> > > > > > > >  drivers/hv/mshv_root_main.c |   79 +++++++++++++++++++++++++++++--------------
> > > > > > > >  include/hyperv/hvhdk_mini.h |    6 +++
> > > > > > > >  2 files changed, 58 insertions(+), 27 deletions(-)
> > > > > > > > 
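
For anyone following along, the discovery flow the commit message
describes could look roughly like the sketch below. Only
hv_call_get_partition_property_ex(), HV_PARTITION_PROPERTY_VMM_CAPABILITIES
and mshv_root.vmm_caps appear in the patch; the integrated-scheduler
property constant, the vmm_caps bit accessor and the *_scheduler_init()
helpers are placeholders, not the actual identifiers.

/*
 * Sketch of the L1VH scheduler discovery flow described in the commit
 * message. HV_PARTITION_PROPERTY_INTEGRATED_SCHEDULER and the
 * *_scheduler_init() helpers are placeholder names.
 */
static int mshv_pick_scheduler(struct device *dev)
{
	u64 enabled = 0;
	int ret;

	/*
	 * Capability bit clear: the hypervisor has no integrated
	 * scheduler, so plain core scheduler logic is safe.
	 */
	if (!mshv_root.vmm_caps.vmm_enable_integrated_scheduler)
		return core_scheduler_init();

	/*
	 * Capability advertised: query the extended partition property
	 * to learn whether it is actually enabled for this partition.
	 */
	ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
				HV_PARTITION_PROPERTY_INTEGRATED_SCHEDULER,
				0, &enabled, sizeof(enabled));
	if (ret) {
		dev_err(dev, "Failed to query integrated scheduler: %d\n", ret);
		return ret;
	}

	/* Enabled: emulate root scheduler behavior inside the L1VH. */
	return enabled ? root_scheduler_init() : core_scheduler_init();
}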
> > > > 
> > > >  <snip>
> > > > 
> > > > > > > > -root_sched_deinit:
> > > > > > > > -       root_scheduler_deinit();
> > > > > > > > -       return err;
> > > > > > > >  }
> > > > > > > > 
> > > > > > > > -static void mshv_init_vmm_caps(struct device *dev)
> > > > > > > > +static int mshv_init_vmm_caps(struct device *dev)
> > > > > > > >  {
> > > > > > > > -       /*
> > > > > > > > -        * This can only fail here if HVCALL_GET_PARTITION_PROPERTY_EX or
> > > > > > > > -        * HV_PARTITION_PROPERTY_VMM_CAPABILITIES are not supported. In that
> > > > > > > > -        * case it's valid to proceed as if all vmm_caps are disabled (zero).
> > > > > > > > -        */
> > > > > > > > -       if (hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> > > > > > > > -                                             HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> > > > > > > > -                                             0, &mshv_root.vmm_caps,
> > > > > > > > -                                             sizeof(mshv_root.vmm_caps)))
> > > > > > > > -               dev_warn(dev, "Unable to get VMM capabilities\n");
> > > > > > > > +       int ret;
> > > > > > > > +
> > > > > > > > +       ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> > > > > > > > +                                               HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> > > > > > > > +                                               0, &mshv_root.vmm_caps,
> > > > > > > > +                                               sizeof(mshv_root.vmm_caps));
> > > > > > > > +       if (ret) {
> > > > > > > > +               dev_err(dev, "Failed to get VMM capabilities: %d\n", ret);
> > > > > > > > +               return ret;
> > > > > > > > +       }
> > > > > > > 
> > > > > > > This is a functional change that isn't mentioned in the commit
> > > > > > > message. Why is it now appropriate to fail instead of treating
> > > > > > > the VMM capabilities as all disabled? Presumably there are older
> > > > > > > versions of the hypervisor that don't support the requirements
> > > > > > > described in the original comment, but perhaps they are no
> > > > > > > longer relevant?
> > > > > > > 
> > > > > > 
> > > > > > To fail is now the only option for the L1VH partition. It must
> > > > > > discover the scheduler type. Without this information, the
> > > > > > partition cannot operate. The core scheduler logic will not work
> > > > > > with an integrated scheduler, and vice versa.
> > > > > 
> > > > > I don't think we need to fail here. If we don't find vmm caps, that
> > > > > means we are on an older hypervisor that supports L1VH but not
> > > > > integrated scheduler (yes, such a version exists). In this case since
> > > > > integrated scheduler is not supported by the hypervisor, the core
> > > > > scheduler logic will work.
> > > > > 
> > > > 
> > > > The older hypervisor version won't have the integrated scheduler
> > > > capability bit. And we can't operate in core scheduler mode if the
> > > > integrated scheduler is enabled underneath us.
> > > 
> > > The older hypervisor won't have the integrated scheduler capability bit.
> > > This means that the older hypervisor doesn't support integrated
> > > scheduler (this is how vmm caps work: if the bit doesn't exist or
> > > vmm caps themselves don't exist the feature should be assumed as not
> > > available). If the hypervisor doesn't support integrated scheduler in the
> > > first place, it can't be enabled underneath us. So, it is safe to
> > > operate in core scheduler mode.
> > > 
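
(For reference, the interpretation rule described above boils down to
something like the sketch below; the vmm_caps_valid flag is a
placeholder for however the fetch result would be tracked, not an
existing field.)

/*
 * Sketch of the stated vmm caps rule: a feature is available only if
 * the property was fetched successfully AND its bit is set. A missing
 * property and a missing bit both mean "not available".
 */
static bool mshv_integrated_sched_supported(void)
{
	if (!mshv_root.vmm_caps_valid)	/* property fetch failed */
		return false;

	return mshv_root.vmm_caps.vmm_enable_integrated_scheduler;
}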
> > 
> > We can’t tell whether the hypervisor is older and simply doesn’t have
> > the VMM caps bit, or whether we just failed to fetch the VMM caps.
> 
> If we failed to fetch the VMM caps i.e. the hypervisor doesn't support
> the vmm caps property, we must assume that all the bits in vmm caps are
> 0 (i.e. no features are available). This is how vmm capabilities are
> supposed to be interpreted. This is something I checked with the
> hypervisor team some time back.
> 
> > 
> > In other words, we can’t distinguish between “an older hypervisor
> > without integrated scheduler support” and “a newer hypervisor with an
> > integrated scheduler, but we failed to fetch the VMM caps”.
> > 
> > But for completeness: are you saying there is an older hypervisor
> > version that supports L1VH, but does not support VMM caps?
> 
> I don't know how much of the Azure fleet still runs it but yes such a
> hypervisor version exists.
> 

We don't need to support interim hypervisor versions in the upstream
kernel: these versions will go away, and then this logic will become
not only a dead code path but also incorrect.

We can keep the existing logic that treats failure to fetch the VMM
caps as normal internally until required.

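Until then, that lenient path could stay close to the code this patch
removes, e.g. in the body of mshv_init_vmm_caps() (a sketch only; the
memset and the message wording are mine):

	ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
						HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
						0, &mshv_root.vmm_caps,
						sizeof(mshv_root.vmm_caps));
	if (ret) {
		/*
		 * Older hypervisor: proceed as if every VMM capability,
		 * including the integrated scheduler, is disabled.
		 */
		memset(&mshv_root.vmm_caps, 0, sizeof(mshv_root.vmm_caps));
		dev_warn(dev, "Unable to get VMM capabilities: %d\n", ret);
	}
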
Thanks,
Stanislav

> Thanks,
> Anirudh
> 
> > 
> > Thanks,
> > Stanislav
> > 
> > > Thanks,
> > > Anirudh.
