On Fri 21 Sep 2018 at 16:57, Andrew Fish <[email protected]> wrote: > > > > On Sep 21, 2018, at 4:24 PM, Vladimir Olovyannikov < > [email protected]> wrote: > > > > On Thu, Sep 20, 2018 at 2:52 PM Vladimir Olovyannikov > > <[email protected]> wrote: > >> > >> On Wed, Sep 19, 2018 at 5:21 PM Bill Paul <[email protected]> wrote: > >>> > >>> Of all the gin joints in all the towns in all the world, Vladimir > >>> Olovyannikov > >>> had to walk into mine at 16:58 on Wednesday 19 September 2018 and say: > >>> > >>>>> From: Ard Biesheuvel [mailto:[email protected]] > >>>>> Sent: Wednesday, September 19, 2018 4:38 PM > >>>>> To: Vladimir Olovyannikov > >>>>> Cc: [email protected] > >>>>> Subject: Re: Stack issue after warm UEFI reset and MMU enabling on an > >>>>> Armv8 platform > >>>>> > >>>>> > >>>>> On 19 September 2018 at 15:55, Vladimir Olovyannikov > >>>>> > >>>>> <[email protected]> wrote: > >>>>>> Hi All, > >>>>>> > >>>>>> I need UEFI experts help on the problem with Armv8 board on warm > UEFI > >>>>>> reset. > >>>>>> Cold reset works fine. > >>>>>> > >>>>>> Here is how I set up a warm reset: > >>>>>> > >>>>>> STATIC > >>>>>> EFI_STATUS > >>>>>> ShutdownUefiBootServices ( > >>>>>> > >>>>>> VOID > >>>>>> ) > >>>>>> > >>>>>> { > >>>>>> > >>>>>> EFI_STATUS Status; > >>>>>> UINTN MemoryMapSize; > >>>>>> EFI_MEMORY_DESCRIPTOR *MemoryMap; > >>>>>> UINTN MapKey; > >>>>>> UINTN DescriptorSize; > >>>>>> UINT32 DescriptorVersion; > >>>>>> UINTN Pages; > >>>>>> > >>>>>> MemoryMap = NULL; > >>>>>> MemoryMapSize = 0; > >>>>>> Pages = 0; > >>>>>> > >>>>>> do { > >>>>>> > >>>>>> Status = gBS->GetMemoryMap ( > >>>>>> > >>>>>> &MemoryMapSize, > >>>>>> MemoryMap, > >>>>>> &MapKey, > >>>>>> &DescriptorSize, > >>>>>> &DescriptorVersion > >>>>>> ); > >>>>>> > >>>>>> if (Status == EFI_BUFFER_TOO_SMALL) { > >>>>>> > >>>>>> Pages = EFI_SIZE_TO_PAGES (MemoryMapSize) + 1; > >>>>>> MemoryMap = AllocatePages (Pages); > >>>>>> > >>>>>> // > >>>>>> // Get System MemoryMap > >>>>>> // > >>>>>> Status = gBS->GetMemoryMap ( > >>>>>> > >>>>>> &MemoryMapSize, > >>>>>> MemoryMap, > >>>>>> &MapKey, > >>>>>> &DescriptorSize, > >>>>>> &DescriptorVersion > >>>>>> ); > >>>>>> > >>>>>> } > >>>>>> > >>>>>> // Don't do anything between the GetMemoryMap() and > >>>>>> ExitBootServices() if (!EFI_ERROR(Status)) { > >>>>>> > >>>>>> Status = gBS->ExitBootServices (gImageHandle, MapKey); > >>>>>> if (EFI_ERROR(Status)) { > >>>>>> > >>>>>> FreePages (MemoryMap, Pages); > >>>>>> MemoryMap = NULL; > >>>>>> MemoryMapSize = 0; > >>>>>> > >>>>>> } > >>>>>> > >>>>>> } > >>>>>> > >>>>>> } while (EFI_ERROR(Status)); > >>>>>> > >>>>>> return Status; > >>>>>> > >>>>>> } > >>>>>> > >>>>>> Then perform > >>>>>> ArmCleanDataCache (); > >>>>>> ArmInvalidateDataCache (); > >>>>>> ArmDisableInstructionCache (); > >>>>>> ArmInvalidateInstructionCache (); > >>>>> > >>>>> These don't do anything useful on ARM. You can only reliably perform > >>>>> cache > >>>>> maintenance by virtual address. > >>>> > >>>> So, should I just remove them altogether? > >>>> > >>>>>> ArmDisableMmu (); > >>>>> > >>>>> ... so after this call returns, all bets are off with regards to > >>>>> whether > >>>>> what is popped from the stack is actually what we pushed when we > >>>>> entered > >>>>> the function. > >>>> > >>>> OK, thank you for explanation. > >>>> But this call returns back into ResetLib implementation as it should, > >>>> and > >>>> then there is a direct jump to the start of FV. > >>>> Am I doing anything wrong here? > >>>> Then, up to the point of enabling of MMU the stack is OK. But right > >>>> after > >>>> enabling MMU it points at _ModuleEntryPoint end of function in > >>>> DxeCoreEntryPoint.c > >>>> Am I missing anything? Maybe some stack cleanup before jumping to the > >>>> start > >>>> of FV? > >>> > >>> When the MMU is enabled, does the mapping for the stack pages change? > That > >>> is, > >>> could the stack now be mapped to different physical page now? > >> Thanks for ideas Bill, > >> No, the mapping stays the same. > >> The issue is only with warm reset, and only on an A72 board. > >> There is another platform on A53 sharing the same code, which has no > issues > >> with warm reset. > >> I cannot explain why. > >>> > >>> Instead of showing a stack trace, can you dump the stack pages and > compare > >>> the > >>> before and after contents? > >> I can clearly see that before and after contents are different. > >>> > >>> Assuming the same physical memory pages are still being used, then > there > >>> could > >>> be a cache flushing problem. What could happen is: > >>> > >>> - some stack memory has been touched recently and is now in the data > cache > >>> - changes are made, which are written to the cache, but not yet flushed > >>> out to > >>> RAM > >>> - enabling the MMU causes a full invalidate of the cache > >>> > >>> Now when you look at the stack, you see the earlier contents that were > in > >>> RAM > >>> -- the changes previously only written to the cache have been lost. > >>> > >>> Enabling/disabling caches and MMU is always tricky. I can't say for > sure, > >>> but > >>> I wouldn't be surprised if there's some subtle bug that causes a flush > >>> operation to be missed and things may just work by coincidence in the > cold > >>> start case. > >> I might be missing something preparing for warm reset. > >> Disabling interrupts does not help though. > >> Ard, I switched off all DMA-capable devices, so am just booting into > UEFI > >> with no disks or network, > >> disabled interrupts. The issue is here. Any ideas on how to debug it and > >> fix? > > Update: when I add this as an experiment: > > UINTN StackBottom; > > > > __asm__ volatile ("mov %0,SP" : "=r"(StackBottom)); > > WriteBackInvalidateDataCacheRange((VOID *)StackBottom, > > PcdGet64(PcdSystemMemorySize) + > > PcdGet64(PcdSystemMemoryBase) - StackBottom); > > then the stack data were not corrupted anymore on the next > ArmEnableMmu() call, > > and warm reset worked (though unreliably, can throw exception on > > memcpy down the road on UEFI boot; probably because I invalidated only > > the stack area in the experiment). > > Considering that A53 board does not have issues, does this means that > > ArmInvalidateDataCache() implementation is useless for A72? > > Based on this, should the approach be "find all data regions and > > invalidate them using InvalidateDataCacheRange()"? > > Vladimir, > > We hit an issue like this a while back on x86 and it turned out our > sequence was dependent on the C compiler code generation not touching a > specific register. It might be worth while to disassemble this code and > take a look at the assembler. I'd also point out the __asm__ volatile is a > serializing event to the C memory model so it could change how the C code > was optimized. You could try __asm__ volatile with a no-op instruction to > see if it really is the SP read at this point that fixes the issue. But it > is likely the bug will be more obvious if you look at the assembly. >
Is your primary FV hosted in DRAM? Are you sure it has not been corrupted by the time you attempt your warm reboot? > > > >>> > >>> -Bill > >>> > >>>>>> Then jump to start of FV: > >>>>>> > >>>>>> typedef > >>>>>> VOID > >>>>>> > >>>>>> (EFIAPI *START_FV)( > >>>>>> > >>>>>> VOID > >>>>>> > >>>>>> ); > >>>>>> StartOfFv = (START_FV)(UINTN)PcdGet64(PcdFvBaseAddress); > >>>>>> StartOfFv (); > >>>>>> > >>>>>> Now this is what happens on warm reset: > >>>>>> reset -c warm > >>>>>> 1. Until ArmEnableMmu() gets called, everything works as expected. > >>>>>> > >>>>>> Here is the stack right before ArmEnableMmu() is called: > >>>>>> ArmConfigureMmu+0x4f8 > >>>>>> InitMmu+0x24 > >>>>>> MemoryPeim+0x440 > >>>>>> PrePiMain+0x114 > >>>>>> PrimaryMain+0x68 > >>>>>> CEntryPoint+0xC4 > >>>>>> EL2:0x00000000800008BC > >>>>>> ----- End of stack info ----- > >>>>>> > >>>>>> 2. Here is the stack as soon as Mmu is enabled with ArmEnableMmu() : > >>>>>> ArmConfigureMmu+0x4fc <-- This one is correct, at line 745 in > >>>>>> > >>>>>> ArmConfigureMmu() in > ArmPkg/Library/ArmMmuLib/AArch64/ArmMmuLibCore.c > >>>>>> (return EFI_SUCCESS) > >>>>>> > >>>>>> _ModuleEntryPoint+0x24 <-- Wrong. This points directly to > >>>>>> > >>>>>> ASSERT(FALSE); and to CpuDeadLoop() in DxeCoreEntryPoint.c, lines > >>>>>> 59-60. > >>>>>> > >>>>>> El2:0x000000008E5E8300 <-- Absolutely bogus > >>>>>> > >>>>>> --- End of stack info --- > >>>>>> > >>>>>> So, as soon as ArmEnableMmu() exits, execution jumps directly to > >>>>>> CpuDeadLoop() in DxeCoreEntryPoint of _ModuleEntryPoint(). > >>>>>> > >>>>>> Would be grateful for any advice. > >>>>>> > >>>>>> Thank you, > >>>>>> Vladimir > >>>> > >>>> _______________________________________________ > >>>> edk2-devel mailing list > >>>> [email protected] > >>>> https://lists.01.org/mailman/listinfo/edk2-devel > >>> -- > >>> > ============================================================================= > >>> -Bill Paul (510) 749-2329 | Senior Member of Technical > Staff, > >>> [email protected] | Master of Unix-Fu - Wind River > >>> Systems > >>> > ============================================================================= > >>> "I put a dollar in a change machine. Nothing changed." - George > Carlin > >>> > ============================================================================= > > _______________________________________________ > > edk2-devel mailing list > > [email protected] > > https://lists.01.org/mailman/listinfo/edk2-devel > > _______________________________________________ > edk2-devel mailing list > [email protected] > https://lists.01.org/mailman/listinfo/edk2-devel > _______________________________________________ edk2-devel mailing list [email protected] https://lists.01.org/mailman/listinfo/edk2-devel

