On Fri, Jan 30, 2026 at 08:22:34PM +0000, Anirudh Rayabharam wrote:
> On Fri, Jan 30, 2026 at 10:51:10AM -0800, Stanislav Kinsburskii wrote:
> > On Fri, Jan 30, 2026 at 06:43:09PM +0000, Anirudh Rayabharam wrote:
> > > On Fri, Jan 30, 2026 at 10:37:38AM -0800, Stanislav Kinsburskii wrote:
> > > > On Fri, Jan 30, 2026 at 05:30:25PM +0000, Anirudh Rayabharam wrote:
> > > > > On Thu, Jan 29, 2026 at 11:09:46AM -0800, Stanislav Kinsburskii wrote:
> > > > > > On Thu, Jan 29, 2026 at 05:47:02PM +0000, Michael Kelley wrote:
> > > > > > > From: Stanislav Kinsburskii <[email protected]>
> > > > > > > Sent: Wednesday, January 21, 2026 2:36 PM
> > > > > > > >
> > > > > > > > From: Andreea Pintilie <[email protected]>
> > > > > > > >
> > > > > > > > Query the hypervisor for integrated scheduler support and use it if
> > > > > > > > configured.
> > > > > > > >
> > > > > > > > Microsoft Hypervisor originally provided two schedulers: root and core.
> > > > > > > > The root scheduler allows the root partition to schedule guest vCPUs
> > > > > > > > across physical cores, supporting both time slicing and CPU affinity
> > > > > > > > (e.g., via cgroups). In contrast, the core scheduler delegates
> > > > > > > > vCPU-to-physical-core scheduling entirely to the hypervisor.
> > > > > > > >
> > > > > > > > Direct virtualization introduces a new privileged guest partition type,
> > > > > > > > L1 Virtual Host (L1VH), which can create child partitions from its own
> > > > > > > > resources. These child partitions are effectively siblings, scheduled by
> > > > > > > > the hypervisor's core scheduler. This prevents the L1VH parent from
> > > > > > > > setting affinity or time slicing for its own processes or guest VPs.
> > > > > > > > While cgroups, CFS, and cpuset controllers can still be used, their
> > > > > > > > effectiveness is unpredictable, as the core scheduler swaps vCPUs
> > > > > > > > according to its own logic (typically round-robin across all allocated
> > > > > > > > physical CPUs). As a result, the system may appear to "steal" time
> > > > > > > > from the L1VH and its children.
> > > > > > > >
> > > > > > > > To address this, Microsoft Hypervisor introduces the integrated
> > > > > > > > scheduler. This allows an L1VH partition to schedule its own vCPUs and
> > > > > > > > those of its guests across its "physical" cores, effectively emulating
> > > > > > > > root scheduler behavior within the L1VH, while retaining core scheduler
> > > > > > > > behavior for the rest of the system.
> > > > > > > >
> > > > > > > > The integrated scheduler is controlled by the root partition and gated
> > > > > > > > by the vmm_enable_integrated_scheduler capability bit. If set, the
> > > > > > > > hypervisor supports the integrated scheduler. The L1VH partition must
> > > > > > > > then check if it is enabled by querying the corresponding extended
> > > > > > > > partition property. If this property is true, the L1VH partition must
> > > > > > > > use the root scheduler logic; otherwise, it must use the core scheduler.
> > > > > > > >
> > > > > > > > Signed-off-by: Andreea Pintilie <[email protected]>
> > > > > > > > Signed-off-by: Stanislav Kinsburskii <[email protected]>
> > > > > > > > ---
> > > > > > > >  drivers/hv/mshv_root_main.c | 79 +++++++++++++++++++++++++++++--------------
> > > > > > > >  include/hyperv/hvhdk_mini.h |  6 +++
> > > > > > > >  2 files changed, 58 insertions(+), 27 deletions(-)
> > > > > > > >
> > > > > > > > <snip>
> > > > > > > >
> > > > > > > > -root_sched_deinit:
> > > > > > > > -	root_scheduler_deinit();
> > > > > > > > -	return err;
> > > > > > > >  }
> > > > > > > >
> > > > > > > > -static void mshv_init_vmm_caps(struct device *dev)
> > > > > > > > +static int mshv_init_vmm_caps(struct device *dev)
> > > > > > > >  {
> > > > > > > > -	/*
> > > > > > > > -	 * This can only fail here if HVCALL_GET_PARTITION_PROPERTY_EX or
> > > > > > > > -	 * HV_PARTITION_PROPERTY_VMM_CAPABILITIES are not supported. In that
> > > > > > > > -	 * case it's valid to proceed as if all vmm_caps are disabled (zero).
> > > > > > > > -	 */
> > > > > > > > -	if (hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> > > > > > > > -					      HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> > > > > > > > -					      0, &mshv_root.vmm_caps,
> > > > > > > > -					      sizeof(mshv_root.vmm_caps)))
> > > > > > > > -		dev_warn(dev, "Unable to get VMM capabilities\n");
> > > > > > > > +	int ret;
> > > > > > > > +
> > > > > > > > +	ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> > > > > > > > +						HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> > > > > > > > +						0, &mshv_root.vmm_caps,
> > > > > > > > +						sizeof(mshv_root.vmm_caps));
> > > > > > > > +	if (ret) {
> > > > > > > > +		dev_err(dev, "Failed to get VMM capabilities: %d\n", ret);
> > > > > > > > +		return ret;
> > > > > > > > +	}
> > > > > > >
> > > > > > > This is a functional change that isn't mentioned in the commit message.
> > > > > > > Why is it now appropriate to fail instead of treating the VMM capabilities
> > > > > > > as all disabled? Presumably there are older versions of the hypervisor that
> > > > > > > don't support the requirements described in the original comment, but
> > > > > > > perhaps they are no longer relevant?
> > > > > > >
> > > > > >
> > > > > > To fail is now the only option for the L1VH partition. It must discover
> > > > > > the scheduler type. Without this information, the partition cannot
> > > > > > operate. The core scheduler logic will not work with an integrated
> > > > > > scheduler, and vice versa.
> > > > >
> > > > > I don't think we need to fail here. If we don't find vmm caps, that
> > > > > means we are on an older hypervisor that supports l1vh but not
> > > > > integrated scheduler (yes, such a version exists). In this case, since
> > > > > the integrated scheduler is not supported by the hypervisor, the core
> > > > > scheduler logic will work.
> > > > >
> > > >
> > > > The older hypervisor version won't have the integrated scheduler
> > > > capability bit. And we can't operate in core scheduler mode if the
> > > > integrated scheduler is enabled underneath us.
> > >
> > > The older hypervisor won't have the integrated scheduler capability bit.
> > > This means that the older hypervisor doesn't support the integrated
> > > scheduler (this is how vmm caps work: if the bit doesn't exist, or
> > > vmm caps themselves don't exist, the feature should be assumed to be
> > > unavailable). If the hypervisor doesn't support the integrated scheduler
> > > in the first place, it can't be enabled underneath us. So, it is safe to
> > > operate in core scheduler mode.
> > >
> >
> > We can't tell whether the hypervisor is older and simply doesn't have
> > the VMM caps bit, or whether we just failed to fetch the VMM caps.
>
> If we failed to fetch the VMM caps, i.e. the hypervisor doesn't support
> the vmm caps property, we must assume that all the bits in vmm caps are
> 0 (i.e. no features are available). This is how vmm capabilities are
> supposed to be interpreted. This is something I checked with the
> hypervisor team some time back.
>
> > In other words, we can't distinguish between "an older hypervisor
> > without integrated scheduler support" and "a newer hypervisor with an
> > integrated scheduler, but we failed to fetch the VMM caps".
> >
> > But for completeness: are you saying there is an older hypervisor
> > version that supports L1VH, but does not support VMM caps?
>
> I don't know how much of the Azure fleet still runs it but yes, such a
> hypervisor version exists.
>
We don't need to support interim hypervisor versions in the upstream
kernel: these versions will go away, and then this logic will become not
only a dead code path but also incorrect. We can keep the existing logic
that treats failure to fetch the VMM caps as normal internally until
required.

Thanks,
Stanislav

> Thanks,
> Anirudh
>
> > Thanks,
> > Stanislav
> >
> > > Thanks,
> > > Anirudh.
