On 2/27/2025 8:02 PM, Sean Christopherson wrote:
> On Thu, Feb 27, 2025, Manali Shukla wrote:
>> On 2/26/2025 3:37 AM, Sean Christopherson wrote:
>>>> @@ -5225,6 +5230,8 @@ static __init void svm_set_cpu_caps(void)
>>>>  	if (vnmi)
>>>>  		kvm_cpu_cap_set(X86_FEATURE_VNMI);
>>>>
>>>> +	kvm_cpu_cap_check_and_set(X86_FEATURE_IDLE_HLT);
>>>
>>> I am 99% certain this is wrong. Or at the very least, severely lacking an
>>> explanation of why it's correct. If L1 enables Idle HLT but not HLT
>>> interception, then it is KVM's responsibility to NOT exit to L1 if there
>>> is a pending V_IRQ or V_NMI.
>>>
>>> Yeah, it's buggy. But, it's buggy in part because *existing* KVM support
>>> is buggy. If L1 disables HLT exiting, but it's enabled in KVM, then KVM
>>> will run L2 with HLT exiting and so it becomes KVM's responsibility to
>>> check for valid L2 wake events prior to scheduling out the vCPU if L2
>>> executes HLT. E.g. nVMX handles this by reading
>>> vmcs02.GUEST_INTERRUPT_STATUS.RVI as part of vmx_has_nested_events().
>>> I don't see the equivalent in nSVM.
>>>
>>> Amusingly, that means Idle HLT is actually a bug fix to some extent.
>>> E.g. if there is a pending V_IRQ/V_NMI in vmcb02, then running with
>>> Idle HLT will naturally do the right thing, i.e. not hang the vCPU.
>>>
>>> Anyways, for now, I think the easiest and best option is to simply skip
>>> full nested support for the moment.
>>>
>>
>> Got it, I see the issue you're talking about. I'll need to look into it a
>> bit more to fully understand it. So yeah, we can hold off on full nested
>> support for the Idle HLT intercept for now.
>>
>> Since we are planning to disable Idle HLT support on nested guests, should
>> we do something like this?
>>
>> @@ -167,10 +167,15 @@ void recalc_intercepts(struct vcpu_svm *svm)
>>  	if (!nested_svm_l2_tlb_flush_enabled(&svm->vcpu))
>>  		vmcb_clr_intercept(c, INTERCEPT_VMMCALL);
>>
>> +	if (!guest_cpu_cap_has(&svm->vcpu, X86_FEATURE_IDLE_HLT))
>> +		vmcb_clr_intercept(c, INTERCEPT_IDLE_HLT);
>> +
>>
>> When recalc_intercepts() copies the intercept values from vmcb01 to
>> vmcb02, it also copies the Idle HLT intercept bit, which is set to 1 in
>> vmcb01. Normally, this isn't a problem because the HLT intercept takes
>> priority when it's on. But if the HLT intercept gets turned off for some
>> reason, the Idle HLT intercept will stay on, which is not what we want.
>
> Why don't we want that?
The Idle HLT intercept remains set to '1' for the L2 guest. So when L2
executes HLT and there is no pending event, it still takes an Idle HLT
VM-exit, even though the Idle HLT intercept was never explicitly enabled
for the L2 guest.

I found this behavior by modifying the ipi_hlt_test a bit:
static void l2_guest_code(void)
{
	uint64_t icr_val;
	int i;

	x2apic_enable();
	icr_val = (APIC_DEST_SELF | APIC_INT_ASSERT | INTR_VECTOR);

	for (i = 0; i < NUM_ITERATIONS; i++) {
		cli();
		x2apic_write_reg(APIC_ICR, icr_val);
		safe_halt();
		GUEST_ASSERT(READ_ONCE(irq_received));
		WRITE_ONCE(irq_received, false);
		asm volatile("hlt");
	}

	GUEST_DONE();
}
static void l1_svm_code(struct svm_test_data *svm)
{
	unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
	struct vmcb *vmcb = svm->vmcb;

	generic_svm_setup(svm, l2_guest_code,
			  &l2_guest_stack[L2_GUEST_STACK_SIZE]);
	vmcb->control.intercept &= ~BIT(INTERCEPT_HLT);
	run_guest(vmcb, svm->vmcb_gpa);
	GUEST_DONE();
}
Let me know if I am missing something.
-Manali