On Fri, Mar 11, 2016 at 04:46:49PM +0100, Laszlo Ersek wrote: > On 03/11/16 05:26, Gary Lin wrote: > > On Thu, Mar 10, 2016 at 12:31:18PM +0100, Laszlo Ersek wrote: > >> On 03/10/16 11:02, Gary Lin wrote: > >>> On Thu, Mar 10, 2016 at 10:20:12AM +0100, Laszlo Ersek wrote: > >>>> On 03/10/16 08:49, Gary Lin wrote: > >> > >>>>> I found that it's related to iPXE. If I disable iPXE with > >>>>> > >>>>> "-netdev user,id=hostnet0 -device > >>>>> virtio-net-pci,romfile=,netdev=hostnet0" > >>>>> > >>>>> then everything works as expected. I'll try to dig deeper to find more > >>>>> information. > >>>> > >>>> How fresh is your ipxe? And did you build it with CONFIG=qemu? > >>>> > >>> It comes from the qemu 2.5.0 tarball and I just installed it from > >>> openSUSE Virtualization repo. > >>> https://build.opensuse.org/package/show/Virtualization/qemu > >> > >> QEMU 2.5 bundles iPXE binaries built with CONFIG=qemu, but that > >> configuration doesn't enable IPv6 for the moment. > >> > >> Apparently, combining iPXE (CONFIG=qemu) with NETWORK_IP6_ENABLE (set in > >> OVMF) causes problems. I don't understand how this causes side effects, > >> because CONFIG=qemu instructs iPXE to provide SNP drivers only. SNP is > >> independent of IP version. > >> > >> With this combination, IPv6 PXE should work (not be absent), and IPv6 > >> HTTP Boot should work too (not blow up). > >> > >> Can you perhaps remove grub2 from the equation? What if you play with > >> disconnect / reconnect / connect in the UEFI shell? > >> > > hmmm I found another way to crash the system with a different assert. > > > > Just disconnect/reconnect the handle providing HttpServiceBinding, and > > OVMF crashed with/without iPXE: > > > > ASSERT /home/gary/git/edk2/NetworkPkg/HttpBootDxe/HttpBootDxe.c(796): CR > > has Bad Signature > > > > Here is the function that issues the assert. > > > > EFIAPI HttpBootIp6DxeDriverBindingStart () > > ... > > if (!EFI_ERROR (Status)) { > > Private = HTTP_BOOT_PRIVATE_DATA_FROM_ID(Id); <---- CRASH > > } else { > > ... > > > > This also only happens after fa848a4048943251fc057fe8d6c5a82e01d2ffb6. > > I found the bug. There are two key UEFI facts that are necessary for > understanding it. > > (1) The first fact is that for any given handle, at most one instance of > a given protocol interface (= GUID) can be installed on it. In other > words, if you pick a handle, and pick a GUID, the handle either has the > protocol interface installed, or not. The handle cannot have two or more > instances of the same protocol. > > (2) The second fact is how the DisconnectController() boot service > works, in case the DriverImageHandle parameter is specified as non-NULL, > and the ChildHandle parameter is NULL. (This case means "disconnect the > given driver from the given controller".) Let me quote the spec: > > A driver is disconnected from a controller by calling the Stop() > service of the EFI_DRIVER_BINDING_PROTOCOL. The > EFI_DRIVER_BINDING_PROTOCOL is on the driver image handle, and the > handle of the controller is passed into the Stop() service. > > Now put the two facts together. You have a driver. For the driver, you > have one ImageHandle. On that handle, you can install one instance of > the driver binding protocol (due to fact (1)). So, when someone wants to > disconnect a given controller from your driver, and calls > gBS->DisconnectController() accordingly, exactly *one* Stop() function > in your driver will be called. After that, the DisconnectController() > boot service will return, and a complete successful disconnection > between your driver and the controller will be assumed by the system. > > The HttpBootDxe driver violates this. > > If you look at its entry point function (called > HttpBootDxeDriverEntryPoint()), you see that it installs *two* instances > of the driver binding protocol -- but only the first (the IPv4 one) goes > on the driver's genuine ImageHandle. The second instance (the IPv6 one) > goes on to a NULL handle. > > The EfiLibInstallDriverBindingComponentName2() utility function seems to > support this. A new handle is created by the system, and the > "gHttpBootIp6DxeDriverBinding" protocol instance is installed on that > separate, new handle. > > Now let us consider the HttpBootConfigFormInit() and > HttpBootConfigFormUnload() functions that were introduced by the commit > you found. The HttpBootConfigFormInit() function adds a permanent > reference to the parent handle's HTTP SB protocol, using the attribute > EFI_OPEN_PROTOCOL_BY_CHILD_CONTROLLER. The HttpBootConfigFormUnload() > function removes this reference. > > The IPv4 and IPv6 variants of the driver binding start & stop functions > implement a kind of reference counting between each other. Namely, > regardless of which one of the IPv4 or IPv6 binding start functions is > called first, *that* first one will call HttpBootConfigFormInit(), and > the second one will not call it. The resources set up by the > HttpBootConfigFormInit() function (including the child reference to HTTP > SB) are shared between IPv4 and IPv6. > > Similarly, when the stop functions are called, their ordering is > irrelevant; it is always the second one that is supposed to call > HttpBootConfigFormUnload(). > > Now, assume that the IPv4 start function is called first. The > HttpBootConfigFormInit() function is invoked, and it creates the > by-child reference to the parent's HTTB SB protocol interface. Also, > this reference is tagged with the image handle on which the IPv4 driver > binding protocol is installed -- the real image handle. > > Assume the IPv6 start function is called second. The shared data is > already in place. > > In my testing, when I call "disconnect B2" at this point in the UEFI > shell (0xB2 being the handle of the virtio-net NIC), the HTTPv4 Stop > function is called first. It does not call HttpBootConfigFormUnload(), > because the IPv6 side of the driver still uses the shared data. > > The problem is, the by-child reference left behind (for the IPv6 side to > release later) is *still* associated with the IPv4 image handle. This > violates the DisconnectController() contract, and things go downhill > from there. The CoreDisconnectControllersUsingProtocolInterface() > function in "MdeModulePkg/Core/Dxe/Hand/Handle.c" finds the lost > reference (OpenCount > 0), and it tries to reconnect the drivers that > have been already (partially) disconnected -- including the HTTPv4 start > function. That's when things completely blow up, and an ASSERT() is > triggered. > > I don't think it is possible to share such a by-child reference. I think > each half should have its own reference, and that reference should be > dropped when the matching Stop function runs. > > I think there were problems with HttpBootConfigFormInit() during > development -- a field called "Initilized" is checked at the beginning > of the function, but it isn't used anywhere else in the driver. I think > it is residue from an earlier attempt to get this sharing right. > > It is also clear why the bug only hits with IPv6 enabled in OVMF. If > IPv6 is disabled, then the v6 Start() function never runs / succeeds (so > no sharing takes place actually), and then the IPv4 Stop function > releases everything, including the by-child reference. > Wow, thanks for the thorough analysis!
I thought it caused by accidental double-free in the commit but it turned out the problem is more than that. Glad to see the root cause is identified. Thanks, Gary Lin _______________________________________________ edk2-devel mailing list edk2-devel@lists.01.org https://lists.01.org/mailman/listinfo/edk2-devel