Re: [edk2-devel] CPU hotplug using SMM with QEMU+OVMF
On 15/08/19 11:55, Yao, Jiewen wrote: > Hi Paolo > I am not sure what do you mean - "You do not need a reset vector ...". > If so, where is the first instruction of the new CPU in the virtualization > environment? > Please help me understand that at first. Then we can continue the discussion. The BSP starts running from 0xFFF0. APs do not start running at all and just sit waiting for an INIT-SIPI-SIPI sequence. Please see my proposal in the reply to Laszlo. Paolo -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#45741): https://edk2.groups.io/g/devel/message/45741 Mute This Topic: https://groups.io/mt/32852911/21656 Group Owner: devel+ow...@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-
Re: [edk2-devel] CPU hotplug using SMM with QEMU+OVMF
On 19/08/19 01:00, Yao, Jiewen wrote: > in real world, we deprecate AB-seg usage because they are vulnerable > to smm cache poison attack. I assume cache poison is out of scope in > the virtual world, or there is a way to prevent ABseg cache poison. Indeed the SMRR would not cover the A-seg on real hardware. However, if the chipset allowed aliasing A-seg SMRAM to 0x3, it would only be used for SMBASE relocation of hotplugged CPU. The firmware would still keep low SMRAM disabled, *except around SMBASE relocation of hotplugged CPUs*. To avoid cache poisoning attacks, you only have to issue a WBINVD before enabling low SMRAM and before disabling it. Hotplug SMI is not a performance-sensitive path, so it's not a big deal. So I guess you agree that PCI DMA attacks are a potential vector also on real hardware. As Alex pointed out, VT-d is not a solution because there could be legitimate DMA happening during CPU hotplug. For OVMF we'll probably go with Igor's idea, it would be nice if Intel chipsets supported it too. :) Paolo -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#46043): https://edk2.groups.io/g/devel/message/46043 Mute This Topic: https://groups.io/mt/32852911/21656 Group Owner: devel+ow...@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-
Re: [edk2-devel] CPU hotplug using SMM with QEMU+OVMF
On 14/08/19 15:20, Yao, Jiewen wrote: >> - Does this part require a new branch somewhere in the OVMF SEC code? >> How do we determine whether the CPU executing SEC is BSP or >> hot-plugged AP? > [Jiewen] I think this is blocked from hardware perspective, since the first > instruction. > There are some hardware specific registers can be used to determine if the > CPU is new added. > I don’t think this must be same as the real hardware. > You are free to invent some registers in device model to be used in OVMF hot > plug driver. Yes, this would be a new operation mode for QEMU, that only applies to hot-plugged CPUs. In this mode the AP doesn't reply to INIT or SMI, in fact it doesn't reply to anything at all. >> - How do we tell the hot-plugged AP where to start execution? (I.e. that >> it should execute code at a particular pflash location.) > [Jiewen] Same real mode reset vector at :FFF0. You do not need a reset vector or INIT/SIPI/SIPI sequence at all in QEMU. The AP does not start execution at all when it is unplugged, so no cache-as-RAM etc. We only need to modify QEMU so that hot-plugged APIs do not reply to INIT/SIPI/SMI. > I don’t think there is problem for real hardware, who always has CAR. > Can QEMU provide some CPU specific space, such as MMIO region? Why is a CPU-specific region needed if every other processor is in SMM and thus trusted. >> Does CPU hotplug apply only at the socket level? If the CPU is >> multi-core, what is responsible for hot-plugging all cores present in >> the socket? I can answer this: the SMM handler would interact with the hotplug controller in the same way that ACPI DSDT does normally. This supports multiple hotplugs already. Writes to the hotplug controller from outside SMM would be ignored. >>> (03) New CPU: (Flash) send board message to tell host CPU (GPIO->SCI) >>> -- I am waiting for hot-add message. >> >> Maybe we can simplify this in QEMU by broadcasting an SMI to existent >> processors immediately upon plugging the new CPU. The QEMU DSDT could be modified (when secure boot is in effect) to OUT to 0xB2 when hotplug happens. It could write a well-known value to 0xB2, to be read by an SMI handler in edk2. >> >>>(NOTE: Host CPU can only >> send >>> instruction in SMM mode. -- The register is SMM only) >> >> Sorry, I don't follow -- what register are we talking about here, and >> why is the BSP needed to send anything at all? What "instruction" do you >> have in mind? > [Jiewen] The new CPU does not enable SMI at reset. > At some point of time later, the CPU need enable SMI, right? > The "instruction" here means, the host CPUs need tell to CPU to enable SMI. Right, this would be a write to the CPU hotplug controller >>> (04) Host CPU: (OS) get message from board that a new CPU is added. >>> (GPIO -> SCI) >>> >>> (05) Host CPU: (OS) All CPUs enter SMM (SCI->SWSMI) (NOTE: New CPU >>> will not enter CPU because SMI is disabled) >> >> I don't understand the OS involvement here. But, again, perhaps QEMU can >> force all existent CPUs into SMM immediately upon adding the new CPU. > [Jiewen] OS here means the Host CPU running code in OS environment, not in > SMM environment. See above. >>> (06) Host CPU: (SMM) Save 38000, Update 38000 -- fill simple SMM >>> rebase code. >>> >>> (07) Host CPU: (SMM) Send message to New CPU to Enable SMI. >> >> Aha, so this is the SMM-only register you mention in step (03). Is the >> register specified in the Intel SDM? > [Jiewen] Right. That is the register to let host CPU tell new CPU to enable > SMI. > It is platform specific register. Not defined in SDM. > You may invent one in device model. See above. >>> (10) New CPU: (SMM) Response first SMI at 38000, and rebase SMBASE to >>> TSEG. >> >> What code does the new CPU execute after it completes step (10)? Does it >> halt? > > [Jiewen] The new CPU exits SMM and return to original place - where it is > interrupted to enter SMM - running code on the flash. So in our case we'd need an INIT/SIPI/SIPI sequence between (06) and (07). >>> (11) Host CPU: (SMM) Restore 38000. >> >> These steps (i.e., (06) through (11)) don't appear RAS-specific. The >> only platform-specific feature seems to be SMI masking register, which >> could be extracted into a new SmmCpuFeaturesLib API. >> >> Thus, would you please consider open sourcing firmware code for steps >> (06) through (11)? >> >> Alternatively -- and in particular because the stack for step (01) >> concerns me --, we could approach this from a high-level, functional >> perspective. The states that really matter are the relocated SMBASE for >> the new CPU, and the state of the full system, right at the end of step >> (11). >> >> When the SMM setup quiesces during normal firmware boot, OVMF could >> use >> existent (finalized) SMBASE infomation to *pre-program* some virtual >> QEMU hardware, with such state that would be expected, as
Re: [edk2-devel] CPU hotplug using SMM with QEMU+OVMF
On 15/08/19 17:00, Laszlo Ersek wrote: > On 08/14/19 16:04, Paolo Bonzini wrote: >> On 14/08/19 15:20, Yao, Jiewen wrote: >>>> - Does this part require a new branch somewhere in the OVMF SEC code? >>>> How do we determine whether the CPU executing SEC is BSP or >>>> hot-plugged AP? >>> [Jiewen] I think this is blocked from hardware perspective, since the first >>> instruction. >>> There are some hardware specific registers can be used to determine if the >>> CPU is new added. >>> I don’t think this must be same as the real hardware. >>> You are free to invent some registers in device model to be used in OVMF >>> hot plug driver. >> >> Yes, this would be a new operation mode for QEMU, that only applies to >> hot-plugged CPUs. In this mode the AP doesn't reply to INIT or SMI, in >> fact it doesn't reply to anything at all. >> >>>> - How do we tell the hot-plugged AP where to start execution? (I.e. that >>>> it should execute code at a particular pflash location.) >>> [Jiewen] Same real mode reset vector at :FFF0. >> >> You do not need a reset vector or INIT/SIPI/SIPI sequence at all in >> QEMU. The AP does not start execution at all when it is unplugged, so >> no cache-as-RAM etc. >> >> We only need to modify QEMU so that hot-plugged APIs do not reply to >> INIT/SIPI/SMI. >> >>> I don’t think there is problem for real hardware, who always has CAR. >>> Can QEMU provide some CPU specific space, such as MMIO region? >> >> Why is a CPU-specific region needed if every other processor is in SMM >> and thus trusted. > > I was going through the steps Jiewen and Yingwen recommended. > > In step (02), the new CPU is expected to set up RAM access. In step > (03), the new CPU, executing code from flash, is expected to "send board > message to tell host CPU (GPIO->SCI) -- I am waiting for hot-add > message." For that action, the new CPU may need a stack (minimally if we > want to use C function calls). > > Until step (03), there had been no word about any other (= pre-plugged) > CPUs (more precisely, Jiewen even confirmed "No impact to other > processors"), so I didn't assume that other CPUs had entered SMM. > > Paolo, I've attempted to read Jiewen's response, and yours, as carefully > as I can. I'm still very confused. If you have a better understanding, > could you please write up the 15-step process from the thread starter > again, with all QEMU customizations applied? Such as, unnecessary steps > removed, and platform specifics filled in. Sure. (01a) QEMU: create new CPU. The CPU already exists, but it does not start running code until unparked by the CPU hotplug controller. (01b) QEMU: trigger SCI (02-03) no equivalent (04) Host CPU: (OS) execute GPE handler from DSDT (05) Host CPU: (OS) Port 0xB2 write, all CPUs enter SMM (NOTE: New CPU will not enter CPU because SMI is disabled) (06) Host CPU: (SMM) Save 38000, Update 38000 -- fill simple SMM rebase code. (07a) Host CPU: (SMM) Write to CPU hotplug controller to enable new CPU (07b) Host CPU: (SMM) Send INIT/SIPI/SIPI to new CPU. (08a) New CPU: (Low RAM) Enter protected mode. (08b) New CPU: (Flash) Signals host CPU to proceed and enter cli;hlt loop. (09) Host CPU: (SMM) Send SMI to the new CPU only. (10) New CPU: (SMM) Run SMM code at 38000, and rebase SMBASE to TSEG. (11) Host CPU: (SMM) Restore 38000. (12) Host CPU: (SMM) Update located data structure to add the new CPU information. (This step will involve CPU_SERVICE protocol) (13) New CPU: (Flash) do whatever other initialization is needed (14) New CPU: (Flash) Deadloop, and wait for INIT-SIPI-SIPI. (15) Host CPU: (OS) Send INIT-SIPI-SIPI to pull new CPU in.. In other words, the cache-as-RAM phase of 02-03 is replaced by the INIT-SIPI-SIPI sequence of 07b-08a-08b. >> The QEMU DSDT could be modified (when secure boot is in effect) to OUT >> to 0xB2 when hotplug happens. It could write a well-known value to >> 0xB2, to be read by an SMI handler in edk2. > > I dislike involving QEMU's generated DSDT in anything SMM (even > injecting the SMI), because the AML interpreter runs in the OS. > > If a malicious OS kernel is a bit too enlightened about the DSDT, it > could willfully diverge from the process that we design. If QEMU > broadcast the SMI internally, the guest OS could not interfere with that. > > If the purpose of the SMI is specifically to force all CPUs into SMM > (and thereby force them into trusted state), then the OS would be > explicitly counter-interested in carrying out the AML operations from > QEMU's DSDT. But since the hotplug controller
Re: [edk2-devel] CPU hotplug using SMM with QEMU+OVMF
On 16/08/19 04:46, Yao, Jiewen wrote: > Comment below: > > >> -Original Message- >> From: Paolo Bonzini [mailto:pbonz...@redhat.com] >> Sent: Friday, August 16, 2019 12:21 AM >> To: Laszlo Ersek ; devel@edk2.groups.io; Yao, Jiewen >> >> Cc: edk2-rfc-groups-io ; qemu devel list >> ; Igor Mammedov ; >> Chen, Yingwen ; Nakajima, Jun >> ; Boris Ostrovsky ; >> Joao Marcal Lemos Martins ; Phillip Goerl >> >> Subject: Re: [edk2-devel] CPU hotplug using SMM with QEMU+OVMF >> >> On 15/08/19 17:00, Laszlo Ersek wrote: >>> On 08/14/19 16:04, Paolo Bonzini wrote: >>>> On 14/08/19 15:20, Yao, Jiewen wrote: >>>>>> - Does this part require a new branch somewhere in the OVMF SEC >> code? >>>>>> How do we determine whether the CPU executing SEC is BSP or >>>>>> hot-plugged AP? >>>>> [Jiewen] I think this is blocked from hardware perspective, since the >>>>> first >> instruction. >>>>> There are some hardware specific registers can be used to determine if >> the CPU is new added. >>>>> I don’t think this must be same as the real hardware. >>>>> You are free to invent some registers in device model to be used in >> OVMF hot plug driver. >>>> >>>> Yes, this would be a new operation mode for QEMU, that only applies to >>>> hot-plugged CPUs. In this mode the AP doesn't reply to INIT or SMI, in >>>> fact it doesn't reply to anything at all. >>>> >>>>>> - How do we tell the hot-plugged AP where to start execution? (I.e. >> that >>>>>> it should execute code at a particular pflash location.) >>>>> [Jiewen] Same real mode reset vector at :FFF0. >>>> >>>> You do not need a reset vector or INIT/SIPI/SIPI sequence at all in >>>> QEMU. The AP does not start execution at all when it is unplugged, so >>>> no cache-as-RAM etc. >>>> >>>> We only need to modify QEMU so that hot-plugged APIs do not reply to >>>> INIT/SIPI/SMI. >>>> >>>>> I don’t think there is problem for real hardware, who always has CAR. >>>>> Can QEMU provide some CPU specific space, such as MMIO region? >>>> >>>> Why is a CPU-specific region needed if every other processor is in SMM >>>> and thus trusted. >>> >>> I was going through the steps Jiewen and Yingwen recommended. >>> >>> In step (02), the new CPU is expected to set up RAM access. In step >>> (03), the new CPU, executing code from flash, is expected to "send board >>> message to tell host CPU (GPIO->SCI) -- I am waiting for hot-add >>> message." For that action, the new CPU may need a stack (minimally if we >>> want to use C function calls). >>> >>> Until step (03), there had been no word about any other (= pre-plugged) >>> CPUs (more precisely, Jiewen even confirmed "No impact to other >>> processors"), so I didn't assume that other CPUs had entered SMM. >>> >>> Paolo, I've attempted to read Jiewen's response, and yours, as carefully >>> as I can. I'm still very confused. If you have a better understanding, >>> could you please write up the 15-step process from the thread starter >>> again, with all QEMU customizations applied? Such as, unnecessary steps >>> removed, and platform specifics filled in. >> >> Sure. >> >> (01a) QEMU: create new CPU. The CPU already exists, but it does not >> start running code until unparked by the CPU hotplug controller. >> >> (01b) QEMU: trigger SCI >> >> (02-03) no equivalent >> >> (04) Host CPU: (OS) execute GPE handler from DSDT >> >> (05) Host CPU: (OS) Port 0xB2 write, all CPUs enter SMM (NOTE: New CPU >> will not enter CPU because SMI is disabled) >> >> (06) Host CPU: (SMM) Save 38000, Update 38000 -- fill simple SMM >> rebase code. >> >> (07a) Host CPU: (SMM) Write to CPU hotplug controller to enable >> new CPU >> >> (07b) Host CPU: (SMM) Send INIT/SIPI/SIPI to new CPU. > [Jiewen] NOTE: INIT/SIPI/SIPI can be sent by a malicious CPU. There is no > restriction that INIT/SIPI/SIPI can only be sent in SMM. All of the CPUs are now in SMM, and INIT/SIPI/SIPI will be discarded before 07a, so this is okay. However I do see a problem, because a PCI device's DMA could overwrite 0x38000 between (06) and (10) and hijack the code that is executed in SMM. How is this avoided on re
Re: [edk2-devel] CPU hotplug using SMM with QEMU+OVMF
On 15/08/19 18:07, Igor Mammedov wrote: > Looking at Q35 code and Seabios SMM relocation as example, if I see it > right QEMU has: > - SMRAM is aliased from DRAM at 0xa > - and TSEG steals from the top of low RAM when configured > > Now problem is that default SMBASE at 0x3 isn't backed by anything > in SMRAM address space and default SMI entry falls-through to the same > location in System address space. > > The later is not trusted and entry into SMM mode will corrupt area + might > jump to 'random' SMI handler (hence save/restore code in Seabios). > > Here is an idea, can we map a memory region at 0x3 in SMRAM address > space with relocation space/code reserved. It could be a part of TSEG > (so we don't have to invent ABI to configure that)? No, there could be real mode code using it. What we _could_ do is initialize SMBASE to 0xa, but I think it's better to not deviate too much from processor behavior (even if it's admittedly a 20-years legacy that doesn't make any sense). Paolo -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#45808): https://edk2.groups.io/g/devel/message/45808 Mute This Topic: https://groups.io/mt/32852911/21656 Group Owner: devel+ow...@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-
Re: [edk2-devel] CPU hotplug using SMM with QEMU+OVMF
On 17/08/19 02:20, Yao, Jiewen wrote: > [Jiewen] That is OK. Then we MUST add the third adversary. > -- Adversary: Simple hardware attacker, who can use device to perform DMA > attack in the virtual world. > NOTE: The DMA attack in the real world is out of scope. That is be handled by > IOMMU in the real world, such as VTd. -- Please do clarify if this is TRUE. > > In the real world: > #1: the SMM MUST be non-DMA capable region. > #2: the MMIO MUST be non-DMA capable region. > #3: the stolen memory MIGHT be DMA capable region or non-DMA capable > region. It depends upon the silicon design. > #4: the normal OS accessible memory - including ACPI reclaim, ACPI > NVS, and reserved memory not included by #3 - MUST be DMA capable region. > As such, IOMMU protection is NOT required for #1 and #2. IOMMU > protection MIGHT be required for #3 and MUST be required for #4. > I assume the virtual environment is designed in the same way. Please > correct me if I am wrong. > Correct. The 0x3...0x3 area is the only problematic one; Igor's idea (or a variant, for example optionally remapping 0xa..0xa SMRAM to 0x3) is becoming more and more attractive. Paolo -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#46039): https://edk2.groups.io/g/devel/message/46039 Mute This Topic: https://groups.io/mt/32852911/21656 Group Owner: devel+ow...@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-
Re: [edk2-rfc] [edk2-devel] CPU hotplug using SMM with QEMU+OVMF
On 21/08/19 17:48, Kinney, Michael D wrote: > Perhaps there is a way to avoid the 3000:8000 startup > vector. > > If a CPU is added after a cold reset, it is already in a > different state because one of the active CPUs needs to > release it by interacting with the hot plug controller. > > Can the SMRR for CPUs in that state be pre-programmed to > match the SMRR in the rest of the active CPUs? > > For OVMF we expect all the active CPUs to use the same > SMRR value, so a check can be made to verify that all > the active CPUs have the same SMRR value. If they do, > then any CPU released through the hot plug controller > can have its SMRR pre-programmed and the initial SMI > will start within TSEG. > > We just need to decide what to do in the unexpected > case where all the active CPUs do not have the same > SMRR value. > > This should also reduce the total number of steps. The problem is not the SMRR but the SMBASE. If the SMBASE area is outside TSEG, it is vulnerable to DMA attacks independent of the SMRR. SMBASE is also different for all CPUs, so it cannot be preprogrammed. (As an aside, virt platforms are also immune to cache poisoning so they don't have SMRR yet - we could use them for SMM_CODE_CHK_EN and block execution outside SMRR but we never got round to it). An even simpler alternative would be to make Ah the initial SMBASE. However, I would like to understand what hardware platforms plan to do, if anything. Paolo > Mike > >> -Original Message- >> From: r...@edk2.groups.io [mailto:r...@edk2.groups.io] On >> Behalf Of Yao, Jiewen >> Sent: Sunday, August 18, 2019 4:01 PM >> To: Paolo Bonzini >> Cc: Alex Williamson ; Laszlo >> Ersek ; devel@edk2.groups.io; edk2- >> rfc-groups-io ; qemu devel list >> ; Igor Mammedov >> ; Chen, Yingwen >> ; Nakajima, Jun >> ; Boris Ostrovsky >> ; Joao Marcal Lemos Martins >> ; Phillip Goerl >> >> Subject: Re: [edk2-rfc] [edk2-devel] CPU hotplug using >> SMM with QEMU+OVMF >> >> in real world, we deprecate AB-seg usage because they >> are vulnerable to smm cache poison attack. >> I assume cache poison is out of scope in the virtual >> world, or there is a way to prevent ABseg cache poison. >> >> thank you! >> Yao, Jiewen >> >> >>> 在 2019年8月19日,上午3:50,Paolo Bonzini >> 写道: >>> >>>> On 17/08/19 02:20, Yao, Jiewen wrote: >>>> [Jiewen] That is OK. Then we MUST add the third >> adversary. >>>> -- Adversary: Simple hardware attacker, who can use >> device to perform DMA attack in the virtual world. >>>> NOTE: The DMA attack in the real world is out of >> scope. That is be handled by IOMMU in the real world, >> such as VTd. -- Please do clarify if this is TRUE. >>>> >>>> In the real world: >>>> #1: the SMM MUST be non-DMA capable region. >>>> #2: the MMIO MUST be non-DMA capable region. >>>> #3: the stolen memory MIGHT be DMA capable region or >> non-DMA capable >>>> region. It depends upon the silicon design. >>>> #4: the normal OS accessible memory - including ACPI >> reclaim, ACPI >>>> NVS, and reserved memory not included by #3 - MUST be >> DMA capable region. >>>> As such, IOMMU protection is NOT required for #1 and >> #2. IOMMU >>>> protection MIGHT be required for #3 and MUST be >> required for #4. >>>> I assume the virtual environment is designed in the >> same way. Please >>>> correct me if I am wrong. >>>> >>> >>> Correct. The 0x3...0x3 area is the only >> problematic one; >>> Igor's idea (or a variant, for example optionally >> remapping >>> 0xa..0xa SMRAM to 0x3) is becoming more >> and more attractive. >>> >>> Paolo >> >> > -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#46171): https://edk2.groups.io/g/devel/message/46171 Mute This Topic: https://groups.io/mt/32979681/21656 Group Owner: devel+ow...@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-
Re: [edk2-rfc] [edk2-devel] CPU hotplug using SMM with QEMU+OVMF
On 21/08/19 19:25, Kinney, Michael D wrote: > Could we have an initial SMBASE that is within TSEG. > > If we bring in hot plug CPUs one at a time, then initial > SMBASE in TSEG can reprogram the SMBASE to the correct > value for that CPU. > > Can we add a register to the hot plug controller that > allows the BSP to set the initial SMBASE value for > a hot added CPU? The default can be 3000:8000 for > compatibility. > > Another idea is when the SMI handler runs for a hot add > CPU event, the SMM monarch programs the hot plug controller > register with the SMBASE to use for the CPU that is being > added. As each CPU is added, a different SMBASE value can > be programmed by the SMM Monarch. Yes, all of these would work. Again, I'm interested in having something that has a hope of being implemented in real hardware. Another, far easier to implement possibility could be a lockable MSR (could be the existing MSR_SMM_FEATURE_CONTROL) that allows programming the SMBASE outside SMM. It would be nice if such a bit could be defined by Intel. Paolo -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#46203): https://edk2.groups.io/g/devel/message/46203 Mute This Topic: https://groups.io/mt/32979681/21656 Group Owner: devel+ow...@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-
Re: [edk2-rfc] [edk2-devel] CPU hotplug using SMM with QEMU+OVMF
On 21/08/19 22:17, Kinney, Michael D wrote: > Paolo, > > It makes sense to match real HW. Note that it'd also be fine to match some kind of official Intel specification even if no processor (currently?) supports it. > That puts us back to > the reset vector and handling the initial SMI at > 3000:8000. That is all workable from a FW implementation > perspective. It look like the only issue left is DMA. > > DMA protection of memory ranges is a chipset feature. > For the current QEMU implementation, what ranges of > memory are guaranteed to be protected from DMA? Is > it only A/B seg and TSEG? Yes. Paolo >> Yes, all of these would work. Again, I'm interested in >> having something that has a hope of being implemented in >> real hardware. >> >> Another, far easier to implement possibility could be a >> lockable MSR (could be the existing >> MSR_SMM_FEATURE_CONTROL) that allows programming the >> SMBASE outside SMM. It would be nice if such a bit >> could be defined by Intel. >> >> Paolo -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#46204): https://edk2.groups.io/g/devel/message/46204 Mute This Topic: https://groups.io/mt/32979681/21656 Group Owner: devel+ow...@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-
Re: [edk2-rfc] [edk2-devel] CPU hotplug using SMM with QEMU+OVMF
On 22/08/19 20:29, Laszlo Ersek wrote: > On 08/22/19 08:18, Paolo Bonzini wrote: >> On 21/08/19 22:17, Kinney, Michael D wrote: >>> DMA protection of memory ranges is a chipset feature. For the current >>> QEMU implementation, what ranges of memory are guaranteed to be >>> protected from DMA? Is it only A/B seg and TSEG? >> >> Yes. > > This thread (esp. Jiewen's and Mike's messages) are the first time that > I've heard about the *existence* of such RAM ranges / the chipset > feature. :) > > Out of interest (independently of virtualization), how is a general > purpose OS informed by the firmware, "never try to set up DMA to this > RAM area"? Is this communicated through ACPI _CRS perhaps? > > ... Ah, almost: ACPI 6.2 specifies _DMA, in "6.2.4 _DMA (Direct Memory > Access)". It writes, > > For example, if a platform implements a PCI bus that cannot access > all of physical memory, it has a _DMA object under that PCI bus that > describes the ranges of physical memory that can be accessed by > devices on that bus. > > Sorry about the digression, and also about being late to this thread, > continually -- I'm primarily following and learning. It's much simpler: these ranges are not in e820, for example kernel: BIOS-e820: [mem 0x00059000-0x0008bfff] usable kernel: BIOS-e820: [mem 0x0008c000-0x000f] reserved The ranges are not special-cased in any way by QEMU. Simply, AB-segs and TSEG RAM are not part of the address space except when in SMM. Therefore, DMA to those ranges ends up respectively to low VGA RAM[1] and to the bit bucket. When AB-segs are open, for example, DMA to that area becomes possible. Paolo [1] old timers may remember DEF SEG=: BLOAD "foo.img",0. It still works with some disk device models. -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#46233): https://edk2.groups.io/g/devel/message/46233 Mute This Topic: https://groups.io/mt/32979681/21656 Group Owner: devel+ow...@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-
Re: [edk2-rfc] [edk2-devel] CPU hotplug using SMM with QEMU+OVMF
On 22/08/19 19:59, Laszlo Ersek wrote: > The firmware and QEMU could agree on a formula, which would compute the > CPU-specific SMBASE from a value pre-programmed by the firmware, and the > initial APIC ID of the hot-added CPU. > > Yes, it would duplicate code -- the calculation -- between QEMU and > edk2. While that's not optimal, it wouldn't be a first. No, that would be unmaintainable. The best solution to me seems to be to make SMBASE programmable from non-SMM code if some special conditions hold. Michael, would it be possible to get in contact with the Intel architects? Paolo -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#46232): https://edk2.groups.io/g/devel/message/46232 Mute This Topic: https://groups.io/mt/32979681/21656 Group Owner: devel+ow...@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-
Re: [edk2-rfc] [edk2-devel] CPU hotplug using SMM with QEMU+OVMF
On 23/08/19 00:32, Kinney, Michael D wrote: > Paolo, > > It is my understanding that real HW hot plug uses the SDM defined > methods. Meaning the initial SMI is to 3000:8000 and they rebase > to TSEG in the first SMI. They must have chipset specific methods > to protect 3000:8000 from DMA. It would be great if you could check. > Can we add a chipset feature to prevent DMA to 64KB range from > 0x3-0x3 and the UEFI Memory Map and ACPI content can be > updated so the Guest OS knows to not use that range for DMA? If real hardware does it at the chipset level, we will probably use Igor's suggestion of aliasing A-seg to 3000:. Before starting the new CPU, the SMI handler can prepare the SMBASE relocation trampoline at A000:8000 and the hot-plugged CPU will find it at 3000:8000 when it receives the initial SMI. Because this is backed by RAM at 0xA-0xA, DMA cannot access it and would still go through to RAM at 0x3. Paolo -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#46306): https://edk2.groups.io/g/devel/message/46306 Mute This Topic: https://groups.io/mt/32979681/21656 Group Owner: devel+ow...@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-
Re: [edk2-rfc] [edk2-devel] CPU hotplug using SMM with QEMU+OVMF
On 22/08/19 22:06, Kinney, Michael D wrote: > The SMBASE register is internal and cannot be directly accessed > by any CPU. There is an SMBASE field that is member of the SMM Save > State area and can only be modified from SMM and requires the > execution of an RSM instruction from SMM for the SMBASE register to > be updated from the current SMBASE field value. The new SMBASE > register value is only used on the next SMI. Actually there is also an SMBASE MSR, even though in current silicon it's read-only and its use is theoretically limited to SMM-transfer monitors. If that MSR could be made accessible somehow outside SMM, that would be great. > Once all the CPUs have been initialized for SMM, the CPUs that are not needed > can be hot removed. As noted above, the SMBASE value does not change on > an INIT. So as long as the hot add operation does not do a RESET, the > SMBASE value must be preserved. IIRC, hot-remove + hot-add will unplugs/plugs a completely different CPU. > Another idea is to emulate this behavior. If the hot plug controller > provide registers (only accessible from SMM) to assign the SMBASE address > for every CPU. When a CPU is hot added, QEMU can set the internal SMBASE > register value from the hot plug controller register value. If the SMM > Monarch sends an INIT or an SMI from the Local APIC to the hot added CPU, > then the SMBASE register should not be modified and the CPU starts execution > within TSEG the first time it receives an SMI. Yes, this would work. But again---if the issue is real on current hardware too, I'd rather have a matching solution for virtual platforms. If the current hardware for example remembers INIT-preserved across hot-remove/hot-add, we could emulate that. I guess the fundamental question is: how do bare metal platforms avoid this issue, or plan to avoid this issue? Once we know that, we can use that information to find a way to implement it in KVM. Only if it is impossible we'll have a different strategy that is specific to our platform. Paolo > Jiewen and I can collect specific questions on this topic and continue > the discussion here. For example, I do not think there is any method > other than what I referenced above to program the SMBASE register, but > I can ask if there are any other methods. -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#46307): https://edk2.groups.io/g/devel/message/46307 Mute This Topic: https://groups.io/mt/32979681/21656 Group Owner: devel+ow...@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-
Re: [edk2-devel] [Qemu-devel] [PATCH 1/2] q35: implement 128K SMRAM at default SMBASE address
On 20/09/19 11:28, Laszlo Ersek wrote: >> On QEMU side, we can drop black-hole approach and allocate >> dedicated SMRAM region, which explicitly gets mapped into >> RAM address space and after SMI hanlder initialization, gets >> unmapped (locked). So that SMRAM would be accessible only >> from SMM context. That way RAM at 0x3 could be used as >> normal when SMRAM is unmapped. > > I prefer the black-hole approach, introduced in your current patch > series, if it can work. Way less opportunity for confusion. Another possibility would be to alias the 0xA..0xB SMRAM to 0x3..0x4 (only when in SMM). I'm not super enthusiastic about adding this kind of QEMU-only feature. The alternative would be to implement VT-d range locking through the intel-iommu device's PCI configuration space (which includes _adding_ the configuration space, i.e. making the IOMMU a PCI device in the first place, and the support to the firmware for configuring the VT-d BAR at 0xfed9). This would be the right way to do it, but it would entail a lot of work throughout the stack. :( So I guess some variant of this would be okay, as long as it's peppered with "this is not how real hardware does it" comments in both QEMU and EDK2. Thanks, Paolo > I've started work on the counterpart OVMF patches; I'll report back. -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#47939): https://edk2.groups.io/g/devel/message/47939 Mute This Topic: https://groups.io/mt/34201782/21656 Group Owner: devel+ow...@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-
Re: [edk2-devel] privileged entropy sources in QEMU/KVM guests
On 07/11/19 12:55, Daniel P. Berrangé wrote: >> Yes, I would make SMM use a cryptographic pseudo-random number generator >> and seed it from virtio-rng from DXE, way before the OS starts and can >> "attack" it. >> >> Once you've gotten a seed, you can create a CSPRNG with a stream cipher >> such as ChaCha20, which is literally 30 lines of code. > If all we need is a one-time seed then virtio-rng is possibly overkill as > that provides a continuous stream. Instead could QEMU read a few bytes > from the host's /dev/urandom and pass it to EDK via fw_cfg, which can > use it for the CSPRNG seed. EDK would have to erase the fw_cfg field > to prevent the seed value leaking to the guest OS, but other than that > its quite straightforward. That would need anyway a change to the emulated hardware. If the guest is able to use virtio-rng after the firmware exits (which is the case is all the firmware needs is a one-time seed), then using virtio-rng is the simplest alternative as it needs no change at all outside the firmware. Paolo -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#50211): https://edk2.groups.io/g/devel/message/50211 Mute This Topic: https://groups.io/mt/45640732/21656 Group Owner: devel+ow...@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-
Re: [edk2-devel] privileged entropy sources in QEMU/KVM guests
On 07/11/19 12:52, Daniel P. Berrangé wrote: > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bb5530e4082446aac3a3d69780cd4dbfa4520013 > > Is it practical to provide a jitter entropy source for EDK2 > too ? The hard part is not collecting jitter (though the firmware might be too deterministic for that), but rather turning it into a random number seed (mixing data from various sources, crediting entropy, etc.). Paolo -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#50210): https://edk2.groups.io/g/devel/message/50210 Mute This Topic: https://groups.io/mt/45640732/21656 Group Owner: devel+ow...@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-
Re: [edk2-devel] privileged entropy sources in QEMU/KVM guests
On 07/11/19 11:25, Ard Biesheuvel wrote: >> This looks problematic on QEMU. Entropy is a valuable resource, and >> whatever resource SMM drivers depend on, should not be possible for e.g. >> a 3rd party UEFI driver (or even for the runtime OS) to exhaust. >> Therefore, it's not *only* the case that SMM drivers must not consume >> EFI_RNG_PROTOCOL (which exists at a less critical privilege level, i.e. >> outside of SMM/SMRAM), but also that SMM drivers must not depend on the >> same piece of *hardware* that feeds EFI_RNG_PROTOCOL. >> > The typical model is to seed a DRBG [deterministic pseudorandom > sequence generator] using a sufficient amount of high quality entropy. > Once you have done that, it is rather hard to exhaust a DRBG - it is a > mathematical construction that is designed to last for a long time (<= > 2^48 invocations [not bytes] according to the NIST spec), after which > it does not degrade although it may have generated so much output that > its internal state may be inferred if you have captured enough of it > (which is a rather theoretical issue IMHO) > > The problem is that using the output of a DRBG as a seed is > non-trivial - the spec describes ways to do this, but wiring > virtio-rng to a DRBG in the host and using its output to seed a DRBG > in the guest is slighly problematic. > > So it seems to me that the correct way to model this is to make the > host's true entropy source a shared resource like any other. > Yes, I would make SMM use a cryptographic pseudo-random number generator and seed it from virtio-rng from DXE, way before the OS starts and can "attack" it. Once you've gotten a seed, you can create a CSPRNG with a stream cipher such as ChaCha20, which is literally 30 lines of code. Paolo #define ROTL(a,b) (((a) << (b)) | ((a) >> (32 - (b #define QR(a, b, c, d) (\ a += b, d ^= a, d = ROTL(d,16), \ c += d, b ^= c, b = ROTL(b,12), \ a += b, d ^= a, d = ROTL(d, 8), \ c += d, b ^= c, b = ROTL(b, 7)) #define ROUNDS 20 // initial state: // in[0] = 0x65787061 // in[1] = 0x6e642033 // in[2] = 0x322d6279 // in[3] = 0x7465206b // in[4]..in[11] = key (seed) // in[12]..in[13] = 0 // in[14]..in[15] = nonce, can probably use RDTSC? static uint32_t in[16]; uint32_t chacha_rng(void) { int i; static uint32_t x[16], p; if (p < 16) return in[p++] + x[p++]; if (++in[12] == 0) ++in[13]; for (i = 0; i < 16; ++i) x[i] = in[i]; // 10 loops × 2 rounds/loop = 20 rounds for (i = 0; i < ROUNDS; i += 2) { // Odd round QR(x[0], x[4], x[ 8], x[12]); // column 0 QR(x[1], x[5], x[ 9], x[13]); // column 1 QR(x[2], x[6], x[10], x[14]); // column 2 QR(x[3], x[7], x[11], x[15]); // column 3 // Even round QR(x[0], x[5], x[10], x[15]); // diagonal 1 (main diagonal) QR(x[1], x[6], x[11], x[12]); // diagonal 2 QR(x[2], x[7], x[ 8], x[13]); // diagonal 3 QR(x[3], x[4], x[ 9], x[14]); // diagonal 4 } p = 1; return in[0] + x[0]; } -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#50203): https://edk2.groups.io/g/devel/message/50203 Mute This Topic: https://groups.io/mt/45640732/21656 Group Owner: devel+ow...@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-
Re: [edk2-devel] privileged entropy sources in QEMU/KVM guests
On 07/11/19 14:27, Laszlo Ersek wrote: > The VirtioRngDxe driver is a UEFI driver that follows the UEFI driver > model. Meaning (in this context), it is connected to the virtio-rng > device in the BDS phase, by platform BDS code. > > Put differently, the non-privileged driver that's the source of the > sensitive data would have to be a "platform DXE driver". The virtio > drivers are not such drivers however. Yes, it would have to be a platform DXE driver. What is it that limits virtio to the BDS phase? Paolo -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#50229): https://edk2.groups.io/g/devel/message/50229 Mute This Topic: https://groups.io/mt/45640732/21656 Group Owner: devel+ow...@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-
Re: [edk2-devel] [PATCH] OvmfPkg: End timer interrupt later to avoid stack overflow under load
On 17/06/20 17:46, Laszlo Ersek wrote: >> That said, Igor's patch seems correct to me. In fact, I'd even move >> DisableInterrupts before gBS->RestoreTPL unless there's a good reason >> not to do so. > OK, thank you! > > Igor, please confirm if you'd like to submit v2 with the update > suggested by Paolo, or if you prefer the current version. We're at the > beginning of the current development cycle, so I guess we can apply the > patch and see how it works in practice. If it ends up wreaking havoc on > some platforms, we can always revert the patch in time for the next > stable tag (edk2-stable202008). For what it's worth "correct" means that I don't see anything that could break and in fact I find it good policy hygienic not to allow recursive interrupts. v1 is okay for me too, so: Reviewed-by: Paolo Bonzini Paolo -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#61432): https://edk2.groups.io/g/devel/message/61432 Mute This Topic: https://groups.io/mt/74913405/21656 Group Owner: devel+ow...@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-
Re: [edk2-devel] [PATCH] OvmfPkg: End timer interrupt later to avoid stack overflow under load
On 16/06/20 20:42, Laszlo Ersek wrote: > (Hmmm... maybe the hypervisor *has* to queue the timer interrupts, > otherwise some of them would simply be lost, and the guest would lose > track of time.) Yes, there are various kinds of coalescing of interrupts that hypervisors perform to help the guest keep track of time. This is especially true of the PIT and RTC; newer OSes track time directly from the TSC, the HPET or the APIC timer so they tolerate lost ticks much better. That said, Igor's patch seems correct to me. In fact, I'd even move DisableInterrupts before gBS->RestoreTPL unless there's a good reason not to do so. Paolo -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#61416): https://edk2.groups.io/g/devel/message/61416 Mute This Topic: https://groups.io/mt/74913405/21656 Group Owner: devel+ow...@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-
Re: [edk2-rfc] [edk2-devel] RFC: design review for TDVF in OVMF
On 09/06/21 16:28, James Bottomley wrote: That would cut across the ApEntrypoint and the guidedStructureEnd. However, nothing says anything in the reset vector guided structure has to be data ... so it could equally well be code. That means we can do guid based entries that contain the 32 bit real and 64 bit entry points. This would also come with the added advantage that we can scan the OVMF binary to see what entry points it supports. Isn't the initial state included in the save area just like for SEV-ES? So it's not even QEMU, but rather some external tool that builds the encrypted image, that needs to understand that GUIDed structure. The GUIDed structure can either include the entry point code; or it could have room for a couple 8-byte pointers since any fixed-size area in the GUIDed structure would be just a jump anyway. Paolo -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#76282): https://edk2.groups.io/g/devel/message/76282 Mute This Topic: https://groups.io/mt/83283616/21656 Group Owner: devel+ow...@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-
Re: [edk2-devel] [RFC PATCH 00/14] Firmware Support for Fast Live Migration for AMD SEV
Hi Tobin, as mentioned in the reply to the QEMU patches posted by Tobin, I think the firmware helper approach is very good, but there are some disadvantages in the idea of auxiliary vCPUs. These are especially true in the VMM, where it's much nicer to have a separate VM that goes through a specialized run loop; however, even in the firmware level there are some complications (as you pointed out) in letting MpService workers run after ExitBootServices. My idea would be that the firmware would start the VM as usual using the same launch data; then, the firmware would detect it was running as a migration helper VM during the SEC or PEI phases (for example via the GHCB or some other unencrypted communication area), and divert execution to the migration helper instead of proceeding to the next boot phase. This would be somewhat similar in spirit to how edk2 performs S3 resume, if my memory serves correctly. What do you think? Thanks, Paolo -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#72431): https://edk2.groups.io/g/devel/message/72431 Mute This Topic: https://groups.io/mt/81036365/21656 Group Owner: devel+ow...@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-
Re: [edk2-devel] [PATCH v8 07/10] OvmfPkg/SmmCpuFeaturesLib: call CPU hot-eject handler
On 23/02/21 18:06, Laszlo Ersek wrote: On 02/23/21 08:45, Paolo Bonzini wrote: On 22/02/21 15:53, Laszlo Ersek wrote: + + if (mCpuHotEjectData != NULL) { + CPU_HOT_EJECT_HANDLER Handler; + + Handler = mCpuHotEjectData->Handler; This patch looks otherwise OK to me, but: In patch v8 08/10, we have a ReleaseMemoryFence(). (For now, it is only expressed as a MemoryFence() call; we'll make that more precise later.) (1) I think that should be paired with an AcquireMemoryFence() call, just before loading "mCpuHotEjectData->Handler" above -- for now, also expressed as a MemoryFence() call only. In Linux terms, there is a control dependency here. However, it should at least be a separate statement to load mCpuHotEjectData (which from my EDK2 reminiscences should be a global) into a local variable. So EjectData = mCPUHotEjectData; // Optional AcquireMemoryFence here if (EjectData != NULL) { CPU_HOT_EJECT_HANDLER Handler; Handler = EjectData->Handler; if (Handler != NULL) { Handler (CpuIndex); } } Yes, "mCPUHotEjectData" is a global. "mCpuHotEjectData" itself is set up on the BSP (from the entry point function of the PiSmmCpuSmmDxe driver), before any other APs have a chance to execute any SMM-related code at all. Furthermore, once set up, mCpuHotEjectData never changes -- it remains set to a particular non-NULL value forever, or it remains NULL forever. (The latter case applies when the possible CPU count is 1; IOW, then there is no AP at all.) Ok, that's what I was missing. However, your code below has *two* loads of mCpuHotEjectData and the fence would have to go after the second (between the load of mCpuHotEjectData and the load of the Handler field). Therefore I would still use a local variable even if you decide to put the fence inside the "if", which I agree is the clearest. Paolo Because of that, I thought that the first comparison (mCpuHotEjectData != NULL) would not need any fence -- I thought it was similar to a userspace program that (a) set a global variable in the "main" thread, before calling pthread_create(), (b) treated the global variable as a constant, ever after (meaning all threads). However, mCpuHotEjectData->Handler is changed regularly (modified by the BSP, and read "later" by all processors). That's why I thought the acquire fence was needed in the following location: if (mCpuHotEjectData != NULL) { CPU_HOT_EJECT_HANDLER Handler; // // HERE -- AcquireMemoryFence() // Handler = mCpuHotEjectData->Handler; if (Handler != NULL) { Handler (CpuIndex); } } Thanks! Laszlo -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#72094): https://edk2.groups.io/g/devel/message/72094 Mute This Topic: https://groups.io/mt/80819862/21656 Group Owner: devel+ow...@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-
Re: [edk2-devel] [PATCH v8 07/10] OvmfPkg/SmmCpuFeaturesLib: call CPU hot-eject handler
On 22/02/21 15:53, Laszlo Ersek wrote: + + if (mCpuHotEjectData != NULL) { +CPU_HOT_EJECT_HANDLER Handler; + +Handler = mCpuHotEjectData->Handler; This patch looks otherwise OK to me, but: In patch v8 08/10, we have a ReleaseMemoryFence(). (For now, it is only expressed as a MemoryFence() call; we'll make that more precise later.) (1) I think that should be paired with an AcquireMemoryFence() call, just before loading "mCpuHotEjectData->Handler" above -- for now, also expressed as a MemoryFence() call only. In Linux terms, there is a control dependency here. However, it should at least be a separate statement to load mCpuHotEjectData (which from my EDK2 reminiscences should be a global) into a local variable. So EjectData = mCPUHotEjectData; // Optional AcquireMemoryFence here if (EjectData != NULL) { CPU_HOT_EJECT_HANDLER Handler; Handler = EjectData->Handler; if (Handler != NULL) { Handler (CpuIndex); } } Thanks, Paolo -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#72038): https://edk2.groups.io/g/devel/message/72038 Mute This Topic: https://groups.io/mt/80819862/21656 Group Owner: devel+ow...@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-
Re: [edk2-devel] [RFC PATCH 00/14] Firmware Support for Fast Live Migration for AMD SEV
On 04/03/21 21:45, Laszlo Ersek wrote: On 03/04/21 10:21, Paolo Bonzini wrote: Hi Tobin, as mentioned in the reply to the QEMU patches posted by Tobin, I think the firmware helper approach is very good, but there are some disadvantages in the idea of auxiliary vCPUs. These are especially true in the VMM, where it's much nicer to have a separate VM that goes through a specialized run loop; however, even in the firmware level there are some complications (as you pointed out) in letting MpService workers run after ExitBootServices. My idea would be that the firmware would start the VM as usual using the same launch data; then, the firmware would detect it was running as a migration helper VM during the SEC or PEI phases (for example via the GHCB or some other unencrypted communication area), and divert execution to the migration helper instead of proceeding to the next boot phase. This would be somewhat similar in spirit to how edk2 performs S3 resume, if my memory serves correctly. Very cool. You'd basically warm-reboot the virtual machine into a new boot mode (cf. BOOT_WITH_FULL_CONFIGURATION vs. BOOT_ON_S3_RESUME in OvmfPkg/PlatformPei). To me that's much more attractive than a "background job". The S3 parallel is great. What I'm missing is: - Is it possible to warm-reboot an SEV VM? (I vaguely recall that it's not possible for SEV-ES at least.) Because, that's how we'd transfer control to the early parts of the firmware again, IIUC your idea, while preserving the memory contents. It's not exactly a warm reboot. It's two VMs booted at the same time, with exactly the same contents as far as encrypted RAM goes, but different unencrypted RAM. The difference makes one VM boot regularly and the other end up in the migration helper. The migration helper can be entirely contained in PEI, or it can even be its own OS, stored as a flat binary in the firmware. Whatever is easier. The divergence would happen much earlier than S3 though. It would have to happen before the APs are brought up, for example, and essentially before the first fw_cfg access if (as is likely) the migration helper VM does not have fw_cfg at all. That's why I brought up the possibility of diverging as soon as SEC. - Who would initiate this process? S3 suspend is guest-initiated. (Not that we couldn't use the guest agent, if needed.) (In case the idea is really about a separate VM, and not about rebooting the already running VM, then I don't understand -- how would a separate VM access the guest RAM that needs to be migrated?) Answering the other message: (For some unsolicited personal information, now I feel less bad about this idea never occurring to me -- I never knew about the KVM patch set that would enable encryption context sharing. (TBH I thought that was prevented, by design, in the SEV hardware...)) As far as the SEV hardware is concerned, a "VM" is defined by the ASID. The VM would be separate at the KVM level, but it would share the ASID (and thus the guest RAM) with the primary VM. So as far as the SEV hardware and the processor are concerned, the separate VM would be just one more VMCB that runs with that ASID. Only KVM knows that they are backed by different file descriptors etc. In fact, another advantage is that it would be much easier to scale the migration helper to multiple vCPUs. This is probably also a case for diverging much earlier than PEI, because a multi-processor migration helper running in PEI or DXE would require ACPI tables and a lot of infrastructure that is probably undesirable. Paolo -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#72486): https://edk2.groups.io/g/devel/message/72486 Mute This Topic: https://groups.io/mt/81036365/21656 Group Owner: devel+ow...@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-
Re: [edk2-devel] separate OVMF binary for TDX? [was: OvmfPkg: Reserve the Secrets and Cpuid page for the SEV-SNP guest]
On 15/04/21 01:34, Erdem Aktas wrote: We do not want to generate different binaries for AMD, Intel, Intel with TDX, AMD with SEV/SNP etc My question is why the user would want a single binary for VMs with and without TDX/SNP. I know there is attestation, but why would you even want the _possibility_ that your guest starts running without TDX or SNP protection, and only find out later via attestation? For a similar reason, OVMF already supports shipping a binary that fails to boot if SMM is not available to the firmware, because then secure boot would be trivially circumvented. I can understand having a single binary for both TDX or SNP. That's not a problem since you can set up the SEV startup VMSA to 32-bit protected mode just like TDX wants. therefore we were expecting the TDX changes to be part of the upstream code. Having 1 or more binaries should be unrelated to the changes being upstream (or more likely, I am misunderstanding you). Thanks, Paolo -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#74118): https://edk2.groups.io/g/devel/message/74118 Mute This Topic: https://groups.io/mt/81969494/21656 Group Owner: devel+ow...@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-
Re: [edk2-devel] [PATCH v2] UefiCpuPkg/MmSaveStateLib: Remove checking Smm Rev ID in AMD MmSaveStateLib
On Tue, Oct 31, 2023 at 10:16 AM Attar, AbdulLateef (Abdul Lateef) wrote: > > [Public] > > +Laszlo, +Gerd, +Paolo > PR: https://github.com/tianocore/edk2/pull/4982 I left a comment in the PR. The patch is only correct if this code is only ever used on 64-bit processors. Note that KVM uses the legacy 32-bit format of the SMRAM state save area, if the virtual machine is configured to clear the LM bit of CPUID. Second, the commit message does not explain why you are doing this. Without such an explanation, it is impossible to provide more constructive feedback. Paolo > -Original Message- > From: Lin, Jacque > Sent: Tuesday, October 31, 2023 11:07 AM > To: devel@edk2.groups.io > Cc: Lin, Jacque ; Attar, AbdulLateef (Abdul Lateef) > ; Chang, Abner > Subject: [PATCH v2] UefiCpuPkg/MmSaveStateLib: Remove checking Smm Rev ID in > AMD MmSaveStateLib > > Remove checking SMM Rev ID in AMD Save State lib when reading Save State > Register EFI_MM_SAVE_STATE_REGISTER_IO. > For AMD, it is not necessary to check SmmRevId when reading Save State > Register EFI_MM_SAVE_STATE_REGISTER_IO. > > Cc: Abdul Lateef Attar > Cc: Abner Chang > Signed-off-by: Jacque Lin > --- > UefiCpuPkg/Library/MmSaveStateLib/AmdMmSaveState.c | 13 - > 1 file changed, 13 deletions(-) > > diff --git a/UefiCpuPkg/Library/MmSaveStateLib/AmdMmSaveState.c > b/UefiCpuPkg/Library/MmSaveStateLib/AmdMmSaveState.c > index 3315a6cc44..c4bf6ad4bb 100644 > --- a/UefiCpuPkg/Library/MmSaveStateLib/AmdMmSaveState.c > +++ b/UefiCpuPkg/Library/MmSaveStateLib/AmdMmSaveState.c > @@ -102,7 +102,6 @@ MmSaveStateReadRegister ( >OUT VOID*Buffer ) {- UINT32 > SmmRevId; EFI_MM_SAVE_STATE_IO_INFO *IoInfo; AMD_SMRAM_SAVE_STATE_MAP > *CpuSaveState; UINT8 DataWidth;@@ -124,18 +123,6 @@ > MmSaveStateReadRegister ( > // Check for special EFI_MM_SAVE_STATE_REGISTER_IO if (Register == > EFI_MM_SAVE_STATE_REGISTER_IO) {-//-// Get SMM Revision ID-//- > MmSaveStateReadRegisterByIndex (CpuIndex, > AMD_MM_SAVE_STATE_REGISTER_SMMREVID_INDEX, sizeof (SmmRevId), );-- > //-// See if the CPU supports the IOMisc register in the save state- > //-if (SmmRevId < AMD_SMM_MIN_REV_ID_X64) {- return EFI_NOT_FOUND;- > }- // Check if IO Restart Dword [IO Trap] is valid or not using bit 1. >if (!(CpuSaveState->x64.IO_DWord & 0x02u)) { return EFI_NOT_FOUND;-- > 2.36.1.windows.1 > -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#110402): https://edk2.groups.io/g/devel/message/110402 Mute This Topic: https://groups.io/mt/102292190/21656 Group Owner: devel+ow...@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-
Re: [edk2-devel] GuestPhysAddrSize questions
On Mon, Mar 4, 2024 at 6:24 PM Tom Lendacky wrote: > > On 3/4/24 07:09, Gerd Hoffmann wrote: > >Hi, > > > >>> 23:16 GuestPhysAddrSize Maximum guest physical address size in bits. > >>> This number applies only to guests using > >>> nested > >>> paging. When this field is zero, refer to the > >>> PhysAddrSize field for the maximum guest > >>> physical address size. See “Secure Virtual > >>> Machine” in APM Volume 2. > > > >> I believe the main purpose of GuestPhysAddrSize was for software use (for > >> nested virtualization) and that the hardware itself has always returned > >> zero > >> for that value. So you should be able to use that field. Adding @Paolo for > >> his thoughts. > > > > Reviewers mentioned this is meant for nested guests, i.e. (if I > > understand this correctly) the l0 hypervisor can use that to tell > > the l1 hypervisor what the l2 guest phys-bits should be. > > > > Is this nested virtualization use documented somewhere? Tried to > > search for GuestPhysAddrSize or Fn8000_0008_EAX in APM Volume 2, > > found nothing. > > Right, and I don't think you'll see anything added to the APM that will > state how it can be used by software. The APM is an architectural > definition and won't talk about hypervisors and using nested paging, etc. I don't think that's a problem. The problem is that the definition in the APM is ambiguous and can mean one of three things: 1) it can be a suggested value for PhysAddrSize (bits 7:0) of guests that use nested paging. This would imply that in a nested page guest, with GuestPhysAddrSize=48, setting bits 51:48 would cause a #PF(reserved) exception. 2) it can be equivalent to LinAddrSize (bits 15:8) but for nested page tables. This would imply that, with GuestPhysAddrSize=48, VMRUN would fail if hCR4.LA57=1. 3) it can be what I propose above: the architecture defined a situation that can only happen when using nested paging (on AMD: host_CR4.LA57=0, PhysAddrSize=52), and GuestPhysAddrSize is an architectural way to explain the situation to guests. The message above suggests that the intended meaning is (1). That is because "the l0 hypervisor can use that to tell the l1 hypervisor what the l2 guest phys-bits should be" is exactly the same as "the processor can use that to tell the hypervisor what the guest phys-bits should be" (just shifted one level down). However, there are no processors that implement (1) or (2), so my suggestion is to clarify that the intended meaning is (3). Do you agree that the above proposal is a plausible interpretation of what is already in the APM, but clearer? And do you think there is a way for the clarification to make it into the APM? Paolo -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#116465): https://edk2.groups.io/g/devel/message/116465 Mute This Topic: https://groups.io/mt/104510523/21656 Group Owner: devel+ow...@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-
Re: [edk2-devel] GuestPhysAddrSize questions
On 2/22/24 16:44, Tom Lendacky wrote: On 2/22/24 05:24, Gerd Hoffmann wrote: Hi, + if (Cr4.Bits.LA57) { + if (PhysBits > 48) { + /* + * Some Intel CPUs support 5-level paging, have more than 48 + * phys-bits but support only 4-level EPT, which effectively + * limits guest phys-bits to 48. + * + * AMD Processors have a different but somewhat related + * problem: They can handle guest phys-bits larger than 48 + * only in case the host runs in 5-level paging mode. + * + * Until we have some way to communicate that kind of + * limitations from hypervisor to guest, limit phys-bits + * to 48 unconditionally. + */ So I'm looking for some communication path. One option would be to use some bits in the KVM cpuid leaves. Another possible candidate is cpuid leaf 0x8008. From the AMD APM (revision 3.35): CPUID Fn8000_0008_EAX Long Mode Size Identifiers The value returned in EAX provides information about the maximum host and guest physical and linear address width (in bits) supported by the processor. Bits FieldName Description 31:24 — Reserved 23:16 GuestPhysAddrSize Maximum guest physical address size in bits. This number applies only to guests using nested paging. When this field is zero, refer to the PhysAddrSize field for the maximum guest physical address size. See “Secure Virtual Machine” in APM Volume 2. 15:8 LinAddrSize Maximum linear address size in bits. 7:0 PhysAddrSize Maximum physical address size in bits. When GuestPhysAddrSize is zero, this field also indicates the maximum guest physical address size. The description of the GuestPhysAddrSize is somewhat vague. Is this a value the hypervisor should use to figure how much address space it can give to guests? Or is this a value the hypervisor can set to inform the guest about the available address space (which would be a solution to the problem outlined in the comment above)? Or both? I believe the main purpose of GuestPhysAddrSize was for software use (for nested virtualization) and that the hardware itself has always returned zero for that value. So you should be able to use that field. Adding @Paolo for his thoughts. I have already discussed this with Gerd, so I don't really have many thoughts to add. Anyhow, basically we would like GuestPhysAddrSize to be "redefined" as Maximum usable physical address size in bits. Physical addresses above this size should not be used, but will not produce a "reserved" page fault. When this field is zero, all bits up to PhysAddrSize are usable. This field is expected to be nonzero only on guests where the hypervisor is using nested paging. Also, to clarify the hardware behavior, if hCR4.LA57=0 and host PhysAddrSize==52, then will guest physical addresses above 2^48 1) cause a reserved #PF in the guest, or 2) cause a non-present NPF exit in the hypervisor? I remember that several years ago we had a discussion on hCR4.LA57=0 reducing the address space compared to MAXPHYADDR, but I cannot find the emails and also at the time I didn't notice GuestPhysAddrSize. Paolo -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#115832): https://edk2.groups.io/g/devel/message/115832 Mute This Topic: https://groups.io/mt/104510523/21656 Group Owner: devel+ow...@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-
Re: [edk2-devel] GuestPhysAddrSize questions
On Thu, Feb 22, 2024 at 5:13 PM Paolo Bonzini wrote: > Also, to clarify the hardware behavior, if hCR4.LA57=0 and host > PhysAddrSize==52, then will guest physical addresses above 2^48 > > 1) cause a reserved #PF in the guest, or > > 2) cause a non-present NPF exit in the hypervisor? > > I remember that several years ago we had a discussion on hCR4.LA57=0 > reducing the address space compared to MAXPHYADDR, but I cannot find the > emails and also at the time I didn't notice GuestPhysAddrSize. Found them! They say that "according to the design document, CPU will report #NPF if the guest references a PA that is greater than 48 bits while the hypervisor is in 4-level nested paging mode". That's nice, because it's the same behavior as the affected Intel processors. Paolo > Anyhow, basically we would like GuestPhysAddrSize to be "redefined" as > > Maximum usable physical address size in bits. Physical addresses > above this size should not be used, but will not produce a "reserved" > page fault. When this field is zero, all bits up to PhysAddrSize are > usable. This field is expected to be nonzero only on guests where > the hypervisor is using nested paging. -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#115859): https://edk2.groups.io/g/devel/message/115859 Mute This Topic: https://groups.io/mt/104510523/21656 Group Owner: devel+ow...@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-
Re: [edk2-devel] [PATCH 2/2] MdeModulePkg/DxeCore: Fix stack overflow issue due to nested interrupts
On Fri, Mar 1, 2024 at 4:08 AM Ni, Ray wrote: > @@ -161,5 +191,46 @@ CoreRestoreTpl ( >IN EFI_TPL NewTpl >) > { > + BOOLEAN InInterruptHandler = FALSE; > + > + // > + // Unwind the nested interrupt handlers up to the required > + // TPL, paying attention not to overflow the stack. While > + // not strictly necessary according to the specification, > + // accept the possibility that multiple RaiseTPL calls are > + // undone by a single RestoreTPL > + // > + while ((INTN)NewTpl <= HighBitSet64 (mInterruptedTplMask)) { > 1. why "<="? I thought when RestoreTPL() is called there are only two cases: >a. NewTpl == HighBitSet64 (...) >b. NewTpl > HighBitSet64 (...) > 1.a is the case when TimerInterruptHandler() or CoreTimerTick() restores > TPL from HIGH to non-HIGH. > 1.b is the case when the pending event backs call RaiseTPL/RestoreTPL(). > Because only pending events whose TPL > "Interrupted TPL" can run, the > RestoreTPL() call from the event callbacks cannot change the TPL to a value > less than or equal to "Interrupted TPL". > So, I think "<=" can be "==". > > 2. can you explain a bit more about the reason of "while"? Both are just for extra safety. The required invariant is that all bits at or below current TPL are cleared, and using "while (... <= ...)" makes it more robust to incorrect usage of gBS->RestoreTPL(). Indeed, the patch at the top of thread also uses "(INTN)gEfiCurrentTpl > HighBitSet64 (mInterruptedTplMask)", which is <= when you reverse the condition. It then asserts inside the conditional that "==" would be enough. So I am starting to see more and more similarities between the two approaches. I went a step further with fresh mind, removing the while loop... and basically reinvented your and Michael's patch. :) The only difference in the logic is a slightly different handling of mInterruptedTplMask in CoreRestoreTpl(), which is a bit safer in my case. However, my roundabout way of getting to the same patch resulted in very different comments. Personally, I found the large text at the head of mInterruptedTplMask a bit too much, and the ones inside the function too focused on "how" and not "why". Maybe it's my exposure to NestedInterruptTplLib, but I find that a much smaller text can achieve the same purpose, by explaining the logic instead of the individual steps. My version is attached, feel free to reuse it (either entirely or partially) for a hypothetical v2. Apologies to you and Mike K for the confusion! > + > +if (InterruptedTpl == NewTpl) { > + break; > 3. "break" or "return"? I think we should exit from this function. Indeed, this should have been a return. Paolo -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#116224): https://edk2.groups.io/g/devel/message/116224 Mute This Topic: https://groups.io/mt/104642317/21656 Group Owner: devel+ow...@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=- From b9f0bc3ef83b40c29e093acda1d0741b8f5610e5 Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Fri, 1 Mar 2024 09:11:48 +0100 Subject: [PATCH] MdeModulePkg: fix stack overflow issue due to nested interrupts Content-Type: text/plain; charset=UTF-8 This is a heavily simplified version of the fix in the OvmfPkg NestedInterruptTplLib. Putting it in DXE core allows CoreRestoreTpl() to lower the TPL while keeping interrupts disabled, removing the need for either DisableInterruptsOnIret() or the complex deferred execution mechanism. Instead, CoreRaiseTpl() uses the current state of the interrupt flag to second guess whether it's being called from an interrupt handler; when restoring the outer TPL at the end of the handler, interrupts remain disabled until IRET. This eliminates the possibility that a nested invocation of the interrupt handler has the same TPL as the outer one. Signed-off-by: Michael D Kinney Signed-off-by: Ray Ni [Rewrote the CoreRestoreTpl part and the commit message. - Paolo] Signed-off-by: Paolo Bonzini --- MdeModulePkg/Core/Dxe/Event/Tpl.c | 85 ++- 1 file changed, 72 insertions(+), 13 deletions(-) diff --git a/MdeModulePkg/Core/Dxe/Event/Tpl.c b/MdeModulePkg/Core/Dxe/Event/Tpl.c index b33f80573c..0a4f99521c 100644 --- a/MdeModulePkg/Core/Dxe/Event/Tpl.c +++ b/MdeModulePkg/Core/Dxe/Event/Tpl.c @@ -9,6 +9,16 @@ SPDX-License-Identifier: BSD-2-Clause-Patent #include "DxeMain.h" #include "Event.h" +/// +/// Bit mask of TPLs that were interrupted (typically during RestoreTPL's +/// event dispatching, though there are reports that the Windows boot loader +/// executes stray STIs at TPL_HI
Re: [edk2-devel] [PATCH 2/2] MdeModulePkg/DxeCore: Fix stack overflow issue due to nested interrupts
On Fri, Mar 1, 2024 at 10:27 AM Michael Brown wrote: > > My version is attached, feel free to reuse it (either entirely or > > partially) for a hypothetical v2. Apologies to you and Mike K for the > > confusion! > > I much prefer this version of the patch, because the explanations are > easier to follow and to reason about. Thanks! I find the new logic to be quite appealing... now that I understand it. Hopefully this provides a way to find the best of both worlds. > There is an implicit assumption that if interrupts are disabled when > RaiseTPL() is called then we must have been called from an interrupt > handler. How sure are we that this assumption is correct? > > It's possible that it doesn't matter. The new logic will effectively > mean that RestoreTPL() will restore not only the TPL but also the > interrupts-enabled state to whatever existed at the time of the > corresponding RaiseTPL(). Right: that's what my comment says + // However, when the handler calls RestoreTPL + // before returning, we want to keep interrupts disabled. This + // restores the exact state at the beginning of the handler, + // before the call to RaiseTPL(): low TPL and interrupts disabled. but indeed it applies beyond interrupt handlers. It might even be a bugfix. > Maybe this is what we want? If so, then we > should probably phrase the comments in these terms instead of in terms > of being called from an interrupt handler. I think phrasing the comments with reference to interrupt handlers is fine, but it may be worth adding a comment to either mInterruptedTplMask or CoreRestoreTpl(), like +/// NOTE: Strictly speaking, this applies to anything that +/// calls RaiseTPL() with interrupts disabled, not just +/// interrupt handlers. Interrupt handlers are just the case +/// that we care the most about, because of the potential +/// for stack overflow. Paolo -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#116228): https://edk2.groups.io/g/devel/message/116228 Mute This Topic: https://groups.io/mt/104642317/21656 Group Owner: devel+ow...@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-
Re: [edk2-devel] [PATCH 2/2] MdeModulePkg/DxeCore: Fix stack overflow issue due to nested interrupts
One fix is needed in the code. On Thu, Feb 29, 2024 at 2:04 PM Ray Ni wrote: > + // > + // Save the "Interrupted TPL" (TPL that was interrupted). > + // > + mInterruptedTplMask |= (UINTN)(1 << gEfiCurrentTpl); > +} >} > + // > + // Clear interrupted TPL level mask, but do not re-enable interrupts > here > + // This will return to CoreTimerTick() and interrupts will be > re-enabled > + // when the timer interrupt handlers returns from interrupt context. > + // > + ASSERT ((INTN)gEfiCurrentTpl == HighBitSet64 (mInterruptedTplMask)); > + mInterruptedTplMask &= ~(UINTN)(1 << gEfiCurrentTpl); > +} >} Both of these need to use "1U" to avoid sign extending bit 31 into bits 31..63. The same issue is (in three places) present in my own version of the patch. :( Paolo -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#116225): https://edk2.groups.io/g/devel/message/116225 Mute This Topic: https://groups.io/mt/104642317/21656 Group Owner: devel+ow...@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-
Re: [edk2-devel] [PATCH 2/2] MdeModulePkg/DxeCore: Fix stack overflow issue due to nested interrupts
On Thu, Feb 29, 2024 at 2:04 PM Ray Ni wrote: > @@ -134,9 +262,9 @@ CoreRestoreTpl ( >} > >// > - // Set the new value > + // Set the new TPL with interrupt disabled. >// > - > + CoreSetInterruptState (FALSE); >gEfiCurrentTpl = NewTpl; > >// > @@ -144,7 +272,22 @@ CoreRestoreTpl ( >// interrupts are enabled >// >if (gEfiCurrentTpl < TPL_HIGH_LEVEL) { > -CoreSetInterruptState (TRUE); > +if ((INTN)gEfiCurrentTpl > HighBitSet64 (mInterruptedTplMask)) { > + // > + // Only enable interrupts if restoring to a level above the highest > + // interrupted TPL level. This allows interrupt nesting, but only for > + // events at higher TPL level than the current TPL level. > + // > + CoreSetInterruptState (TRUE); > +} else { > + // > + // Clear interrupted TPL level mask, but do not re-enable interrupts > here > + // This will return to CoreTimerTick() and interrupts will be > re-enabled > + // when the timer interrupt handlers returns from interrupt context. > + // > + ASSERT ((INTN)gEfiCurrentTpl == HighBitSet64 (mInterruptedTplMask)); > + mInterruptedTplMask &= ~(UINTN)(1 << gEfiCurrentTpl); > +} >} Ok, now I understand what's going on and it's indeed the same logic as NestedInterruptTplLib, with DisableInterruptsOnIret() replaced by skipping CoreSetInterruptState(TRUE). It's similar to what I proposed elsewhere in the thread, just written differenty. I agree with Michael Brown that the spec is unclear on the state of the interrupt flag on exit from gBS->RestoreTPL(), but perhaps this change is feasible if the interrupt handlers just raise the TPL first and restore it last. Just as an exercise for me to understand the code better, I tried rewriting the code in terms of the CoreRestoreTplInternal() function that I proposed. I find it easier to read, but I guess that's a bit in the eye of the beholder, and it is a little more defensively coded. I attach it (untested beyond compilation) for reference. Paolo -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#116192): https://edk2.groups.io/g/devel/message/116192 Mute This Topic: https://groups.io/mt/104642317/21656 Group Owner: devel+ow...@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=- From 23c4f60cf79f29ab5eff55a02c72bb504804d02a Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Thu, 29 Feb 2024 23:54:50 +0100 Subject: [PATCH 1/2] MdeModulePkg: introduce CoreRestoreTplInternal() Content-Type: text/plain; charset=UTF-8 Introduce a function that restores the TPL just like gBS->RestoreTPL(), but can optionally return with interrupts disabled. This can be used from interrupt handlers, so that any nested interrupt handler will only see an elevated TPL and not the value on entry to the interrupt handler. Signed-off-by: Paolo Bonzini --- MdeModulePkg/Core/Dxe/Event/Tpl.c | 46 +-- 1 file changed, 31 insertions(+), 15 deletions(-) diff --git a/MdeModulePkg/Core/Dxe/Event/Tpl.c b/MdeModulePkg/Core/Dxe/Event/Tpl.c index b33f80573c..fe95ea3896 100644 --- a/MdeModulePkg/Core/Dxe/Event/Tpl.c +++ b/MdeModulePkg/Core/Dxe/Event/Tpl.c @@ -90,10 +90,11 @@ CoreRaiseTpl ( @param NewTpl New, lower, task priority **/ -VOID +static VOID EFIAPI -CoreRestoreTpl ( - IN EFI_TPL NewTpl +CoreRestoreTplInternal ( + IN EFI_TPL NewTpl, + IN BOOLEAN DesiredInterruptState ) { EFI_TPL OldTpl; @@ -107,11 +108,6 @@ CoreRestoreTpl ( ASSERT (VALID_TPL (NewTpl)); - // - // If lowering below HIGH_LEVEL, make sure - // interrupts are enabled - // - if ((OldTpl >= TPL_HIGH_LEVEL) && (NewTpl < TPL_HIGH_LEVEL)) { gEfiCurrentTpl = TPL_HIGH_LEVEL; } @@ -126,6 +122,11 @@ CoreRestoreTpl ( } gEfiCurrentTpl = PendingTpl; + +// +// If lowering below HIGH_LEVEL, make sure +// interrupts are enabled +// if (gEfiCurrentTpl < TPL_HIGH_LEVEL) { CoreSetInterruptState (TRUE); } @@ -134,16 +135,31 @@ CoreRestoreTpl ( } // - // Set the new value + // Set the new TPL with interrupt disabled. If DesiredInterruptState + // is FALSE, this ensures that any nested interrupt handler will only + // see an elevated TPL and not NewTpl. // - + CoreSetInterruptState (FALSE); gEfiCurrentTpl = NewTpl; - // - // If lowering below HIGH_LEVEL, make sure - // interrupts are enabled - // - if (gEfiCurrentTpl < TPL_HIGH_LEVEL) { + if (DesiredInterruptState) { +ASSERT(gEfiCurrentTpl < TPL_HIGH_LEVEL); CoreSetInterruptState (TRUE); } } + +/** + Lowers the task priority to the previous value. If the new + priority unmasks events at a higher priority, they are dispatched. + +
Re: [edk2-devel] [PATCH 2/2] MdeModulePkg/DxeCore: Fix stack overflow issue due to nested interrupts
Il ven 1 mar 2024, 12:10 Michael Brown ha scritto: > It feels as though this should be able to be cleanly modelled with a > single global state array > >BOOLEAN mSavedInterruptState[TPL_HIGH_LEVEL] > Pretty much, yes. But you only have to write it when the interrupts are already disabled, so the bitmask works and makes it easier to clear "all values at NewTpl and above" with just an AND. > > (or possibly a bitmask, though using the array avoids having to disable > interrupts just to write a value). > > I still need to think through the subtleties, to make sure it could cope > with pathological edge cases such as > >OldTpl = gBS->RaiseTPL (TPL_HIGH_LEVEL); > >... > >gBS->RestoreTPL (OldTpl); >gBS->RestoreTPL (OldTpl); > > or > >OldTpl = gBS->RaiseTPL (TPL_HIGH_LEVEL - 1); >gBS->RaiseTPL (TPL_HIGH_LEVEL); > >.. > >gBS->RestoreTPL (OldTpl); > > I think that at least one of the above pathological usage patterns would > break the existing mInterruptedTplMask patches, since they currently > clear state in RestoreTPL() and so will not correctly handle a duplicate > call to RestoreTPL(). > I think my patch works (modulo the 1 vs. 1U issue) for the second. Declaring the first to be invalid is a good idea IMO. Also it would only break in interrupt handlers and would revert to just causing a stack overflow in the interrupt storm scenario, so it wouldn't be too bad... Paolo > I'll try to get a patch put together over the weekend. > > Thanks, > > Michael > > -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#116238): https://edk2.groups.io/g/devel/message/116238 Mute This Topic: https://groups.io/mt/104642317/21656 Group Owner: devel+ow...@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-
Re: [edk2-devel] [PATCH 2/2] MdeModulePkg/DxeCore: Fix stack overflow issue due to nested interrupts
Il gio 29 feb 2024, 17:45 Kinney, Michael D ha scritto: > Hi Michael, > > Can you provide a pointer to the UEFI Spec statement this breaks? > The spec does say that interrupts are disabled for TPL_HIGH_LEVEL, but indeed it doesn't say they are always enabled at lower levels. However, if the interrupts aren't always enabled whenever you're below TPL_HIGH_LEVEL, you get priority inversions (and deadlocks). For example, if you end up running with interrupts disabled at TPL_CALLBACK, you are disabling the dispatching of timers at TPL_NOTIFY. I guess this can be deduced from these two passages: - "The functions in these queues are invoked in FIFO order, starting with the highest priority level queue and proceeding to the lowest priority queue that is unmasked by the current TPL" - "If Type is TimerRelative and TriggerTime is 0, then the timer event will be signaled on the next timer tick" (in the description of gBS->SetTimer) Paolo > Thanks, > > Mike > > > -Original Message- > > From: Michael Brown > > Sent: Thursday, February 29, 2024 5:23 AM > > To: devel@edk2.groups.io; Ni, Ray > > Cc: Kinney, Michael D ; Liming Gao > > ; Laszlo Ersek ; Paolo > > Bonzini > > Subject: Re: [edk2-devel] [PATCH 2/2] MdeModulePkg/DxeCore: Fix stack > > overflow issue due to nested interrupts > > > > On 29/02/2024 13:02, Ni, Ray wrote: > > > A ideal solution is to not keep the interrupt disabled when > > > RestoreTPL(TPL_HIGH -> not TPL_HIGH) is executed in the timer > > interrupt > > > context because the interrupt handler will re-enable the interrupt > > with > > > arch specific instructions (e.g.: IRET for x86). > > > > > > The patch introduces mInterruptedTplMask which tells RestoreTPL() if > > > it's called in the interrupt context and whether it should defer > > enabling > > > the interrupt. > > > > NACK. This breaks the specification-defined behaviour for > > RestoreTPL(). > > > > What guarantees do we have that there is no code anywhere in the world > > that relies upon RestoreTPL() unconditionally re-enabling interrupts. > > > > I also find this code substantially harder to follow than > > NestedInterruptTplLib (which does not break any specified behaviour). > > > > Thanks, > > > > Michael > > -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#116179): https://edk2.groups.io/g/devel/message/116179 Mute This Topic: https://groups.io/mt/104642317/21656 Group Owner: devel+ow...@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-
Re: [edk2-devel] [PATCH 2/2] MdeModulePkg/DxeCore: Fix stack overflow issue due to nested interrupts
On 2/29/24 20:22, Michael Brown wrote: The design of NestedInterruptTplLib is that each nested interrupt must increase the TPL, but if I understand correctly there is a hole here: // // Call RestoreTPL() to allow event notifications to be // dispatched. This will implicitly re-enable interrupts. // gBS->RestoreTPL (InterruptedTPL); // // Re-disable interrupts after the call to RestoreTPL() to ensure // that we have exclusive access to the shared state. // DisableInterrupts (); because gBS->RestoreTPL will unconditionally enable interrupts if InterruptedTPL < TPL_HIGH_LEVEL. There's no hole there. Yes, what I meant is that the whole of NestedInterruptTplLib is designed around plugging that hole. But if you can avoid re-enabling interrupts at the end of CoreRestoreTpl(), you don't need DisableInterruptsOnIret() anymore. See the comments in the code for further details - I made them fairly extensive. Yup, I remember. :) If possible, the easiest solution would be to merge NestedInterruptTplLib into Core DXE. The question with that approach would be how to cleanly violate the abstraction layer that separates the timer interrupt handler (existing in a separate DXE driver executable) from the implementation of CoreRestoreTplInternal() (existing in core DXE and not exposed via the boot services table). See outline in the other mail I have sent. Paolo -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#116190): https://edk2.groups.io/g/devel/message/116190 Mute This Topic: https://groups.io/mt/104642317/21656 Group Owner: devel+ow...@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-
Re: [edk2-devel] [PATCH 2/2] MdeModulePkg/DxeCore: Fix stack overflow issue due to nested interrupts
On 2/29/24 14:02, Ray Ni wrote: In the end, it will lower the TPL to TPL_APPLICATION with interrupt enabled. However, it's possible that another timer interrupt happens just in the end of RestoreTPL() function when TPL is TPL_APPLICATION. How do non-OVMF platforms solve the issue? Do they just have the same bug as in https://bugzilla.tianocore.org/show_bug.cgi?id=4162 ? The design of NestedInterruptTplLib is that each nested interrupt must increase the TPL, but if I understand correctly there is a hole here: // // Call RestoreTPL() to allow event notifications to be // dispatched. This will implicitly re-enable interrupts. // gBS->RestoreTPL (InterruptedTPL); // // Re-disable interrupts after the call to RestoreTPL() to ensure // that we have exclusive access to the shared state. // DisableInterrupts (); because gBS->RestoreTPL will unconditionally enable interrupts if InterruptedTPL < TPL_HIGH_LEVEL. If possible, the easiest solution would be to merge NestedInterruptTplLib into Core DXE. This way, instead of calling gBS->RestoreTPL, NestedInterruptTplLib can call a custom version of CoreRestoreTpl that exits with interrupts disabled. That is, something like VOID EFIAPI CoreRestoreTplInternal(IN EFI_TPL NewTpl, IN BOOLEAN InterruptState) { // // The caller can request disabled interrupts to access shared // state, but TPL_HIGH_LEVEL must *not* have them enabled. // ASSERT(!(NewTpl == TPL_HIGH_LEVEL && InterruptState)); // ... gEfiCurrentTpl = NewTpl; CoreSetInterruptState (InterruptState); } Now, CoreRestoreTpl is just // // If lowering below HIGH_LEVEL, make sure // interrupts are enabled // CoreRestoreTplInternal(NewTpl, NewTpl < TPL_HIGH_LEVEL); whereas NestedInterruptRestoreTPL can do // // Call RestoreTPL() to allow event notifications to be // dispatched. This will implicitly re-enable interrupts, // but only if events have to be dispatched. // CoreRestoreTplInternal(InterruptedTPL, FALSE); // // Interrupts are now disabled, so we can access shared state. // This avoids the unlimited nesting of interrupts because each stack frame will indeed have a higher TPL than the outer version. Paolo -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#116181): https://edk2.groups.io/g/devel/message/116181 Mute This Topic: https://groups.io/mt/104642317/21656 Group Owner: devel+ow...@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-
Re: [edk2-devel] [PATCH 2/2] MdeModulePkg/DxeCore: Fix stack overflow issue due to nested interrupts
On 2/29/24 20:16, Kinney, Michael D wrote: -Original Message- From: Paolo Bonzini Sent: Thursday, February 29, 2024 11:04 AM To: Ni, Ray ; devel@edk2.groups.io Cc: Kinney, Michael D ; Liming Gao ; Laszlo Ersek ; Michael Brown Subject: Re: [PATCH 2/2] MdeModulePkg/DxeCore: Fix stack overflow issue due to nested interrupts On 2/29/24 14:02, Ray Ni wrote: In the end, it will lower the TPL to TPL_APPLICATION with interrupt enabled. However, it's possible that another timer interrupt happens just in the end of RestoreTPL() function when TPL is TPL_APPLICATION. How do non-OVMF platforms solve the issue? Do they just have the same bug as in https://bugzilla.tianocore.org/show_bug.cgi?id=4162 ? Yes. This same issue can be reproduced on non-OVMF platforms. This proposal here is an attempt to integrate a common fix into the DXE Core. I would agree conceptually that integrating the NestedInterruptTplLib work into the DXE Core is another option. I believe the root cause of all of these scenarios is enabling interrupts in RestoreTPL() when processing a timer interrupt between the last processed event and the return from the interrupt handler. Ther are some instances of the Timer Arch Protocol implementation that call Raise/Restore TPL, so we want a DXE Core change that is compatible with the DXE Core doing Raise/Restore when processing a timer interrupt and the Timer Arch Protocol implementation also doing the Raise/Restore TPL. Ok, now I understand better. The reason why the NestedInterruptTplLib was introduced (as opposed to doing it in core DXE) was to enable returning with disabled interrupts from the nested interrupt handler, but I think it can be done with a function like the CoreRestoreTplInternal() I outlined in the previous email, which is the same as current CoreRestoreTpl() but finishes with if (!DesiredInterruptState) { CoreSetInterruptState (FALSE); } gEfiCurrentTpl = NewTpl; if (DesiredInterruptState) { ASSERT (gEfiCurrentTpl < TPL_HIGH_LEVEL); CoreSetInterruptState (TRUE); } The new CoreRaiseTpl would be the same as in Ray and your patch, while the CoreRestoreTpl would be something like this: if (NewTpl == HighBitSet64 (mInterruptedTplMask)) { static NESTED_INTERRUPT_STATE NestedInterruptState; mInterruptedTplMask &= ~(UINTN)(1 << NewTpl); // // Use the deferred invocation logic that is currently // in NestedInterruptTplLib. // // But unlike current NestedInterruptRestoreTPL(), if the logic // is part of core DXE, the // //gBS->RestoreTPL (InterruptedTPL); //DisableInterrupts (); // // pair that requires "disable interrupts on IRET" logic can // be done without ever enabling interrupts, with // CoreRestoreTplInternal(InterruptedTPL, FALSE) // // As an aside, NestedInterruptState might as well become a // pair of globals. // NestedInterruptRestoreTPL (NewTpl, ); } else { CoreRestoreTplInternal(NewTpl, NewTpl < TPL_HIGH_LEVEL); } Requiring matching raise/restore pairs is a bit scary. It can be avoided by changing the "if" to a while (NewTpl >= HighBitSet64 (mInterruptedTplMask)) mInterruptedTplMask &= ~(UINTN)(1 << HighBitSet64 (mInterruptedTplMask)); Then, if inlining NestedInterruptRestoreTPL() allows simplifications, they can be done on top after the merge of NestedInterruptTplLib. In particular, I suspect that the while loop above can be unified with the loop in NestedInterruptRestoreTPL(). But again, that would be best reviewed as a separate change. All this, as Michael said, is however conditional on being able to deal with the TPL_HIGH_LEVEL+STI shenanigans that Windows does. Paolo The design of NestedInterruptTplLib is that each nested interrupt must increase the TPL, but if I understand correctly there is a hole here: // // Call RestoreTPL() to allow event notifications to be // dispatched. This will implicitly re-enable interrupts. // gBS->RestoreTPL (InterruptedTPL); // // Re-disable interrupts after the call to RestoreTPL() to ensure // that we have exclusive access to the shared state. // DisableInterrupts (); because gBS->RestoreTPL will unconditionally enable interrupts if InterruptedTPL < TPL_HIGH_LEVEL. If possible, the easiest solution would be to merge NestedInterruptTplLib into Core DXE. This way, instead of calling gBS->RestoreTPL, NestedInterruptTplLib can call a custom version of CoreRestoreTpl that exits with interrupts disabled. That is, something like VOID EFIAPI CoreRestoreTplInternal(IN EFI_TPL NewTpl, IN BOOLEAN InterruptState) { // // The caller can request disabled interrupts to access shared // state, but TPL_HIGH_LEVEL must *not* have them enabled