On 01/28/14 21:55, Bill Paul wrote:

> I think part of my problem is that I don't quite understand how the firmware image volumes work either. You start out with one OVMF.fd image, which contains all of the firmware in compressed form. I'm assuming this image is mapped by QEMU into the address space such that there's some initial bootstrap code placed at the reset vector so that the CPU hits it at power-up/reset, and from there it extracts the contents into RAM.
Correct. Regarding the on-disk format of the flash device, please see the commit message at https://github.com/tianocore/edk2/commit/b36f701d (the "OVMF.fd after" part). The image is mapped just below 4GB (*) by qemu; see pc_system_firmware_init() in "hw/i386/pc_sysfw.c". We mostly care about pc_system_flash_init() there.

(*) The size of OVMF.fd is normally 2MB for debug builds, and 1MB for release builds. You can ask for the other size in both cases with -D FD_SIZE_1MB and -D FD_SIZE_2MB. (See <https://github.com/tianocore/edk2/commit/8184a764>.)

The reset vector code and the SEC code are uncompressed. OVMF's reset vector is located in OvmfPkg/ResetVector. It reuses the "generic" edk2 reset vector when SEC+PEI are 32-bit (Ia32). When SEC+PEI are 64-bit (X64), the reset vector sets up the initial page tables too.

(We used to keep prebuilt page tables in read-only flash as well, but KVM didn't really like having them there, because it wanted to write the Accessed bits in the page table entries, even though they were all pre-set to 1. I can't recall the exact circumstances, but I believe it was only a problem when nested paging was supported and enabled on the host. See <https://github.com/tianocore/edk2/commit/c90e37b5>.)

The SEC code is entered at SecCoreStartupWithStack(), called from "OvmfPkg/Sec/X64/SecEntry.S". The C code is in "SecMain.c". It sets up a temporary stack and heap near SEC_TOP_OF_STACK, and decompresses the one FV FFS (Firmware Volume / Firmware File System) file from the flash (located below 4GB) to a temporary RAM buffer (starting at 9MB). It then finds the firmware volume headers in the decompressed output: one chunk corresponds to PEIFV, the other to DXEFV. These are copied to their final places, and later on control is transferred to PEI.

The last phase of PEI will key off the S3 status (cold boot or resume). In the former case, it will start DXE. In the latter case, it will jump to the OS's resume vector.

At S3 resume, the reset vector and the SEC code run just the same from the flash below 4GB. The SEC code will determine whether we're cold booting or resuming from S3 sleep. In the former case, see above. In the latter case, we won't decompress anything: first, we won't need DXE at all; second, we'll need PEI, but that has been decompressed before, and protected from the OS as ACPI NVS, so we'll just jump to it.

> What I don't know is just where everything ends up in RAM.

Well, in my versions of the patchset :), i.e. up to v3, the series used to start with a text file documenting the final RAM layout. Of course that's completely obsolete now, so I'm not giving you any link lest it confuse you. We're between v4 and v5 now (an initial sequence from v4 has been pushed, and AFAIK Jordan is about to post v5 of the rest).

You *can* glean the final layout from the FDF files (at the end of Jordan's series), precisely from the spot where you've been looking anyway:

https://github.com/jljusten/edk2/blob/ovmf-s3/OvmfPkg/OvmfPkgX64.fdf

> [FD.MEMFD]
> BaseAddress   = 0x800000
> Size          = 0x800000
> ErasePolarity = 1
> BlockSize     = 0x10000
> NumBlocks     = 0x80

So, we're basing it at 8MB, and the size is also 8MB. Within that range, with relative start addresses:

> 0x000000|0x006000
> gUefiOvmfPkgTokenSpaceGuid.PcdOvmfSecPageTablesBase|gUefiOvmfPkgTokenSpaceGuid.PcdOvmfSecPageTablesSize

These are the initial page tables built by the reset vector code, identity-mapping the first 4GB. They comprise six 4K pages. The first two pages host the page directories, the four other pages host the page tables. The PTEs in there map 4GB with 2MB pages. (If I recall correctly: 4GB/2MB == 2048 PTEs needed, 4*4KB == 16384 bytes available for PTEs, 16384/2048 == 8 bytes per PTE.) So this is at 0x800000 + 0x000000 == 8MB.
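To make the arithmetic a bit more concrete, here is a purely illustrative C sketch of mine (the real code is assembly, under OvmfPkg/ResetVector, and may differ in details): in x86-64 terms, the six pages are one PML4 page, one PDPT page, and four page directory pages, whose 4 * 512 == 2048 eight-byte entries each map a 2MB page, i.e. 4GB in total.

#include <Base.h>
#include <Library/BaseMemoryLib.h>

#define PT_BASE    0x800000ULL            // PcdOvmfSecPageTablesBase
#define PAGE_ATTR  0x23ULL                // present | writable | accessed
#define PDE_2MB    (PAGE_ATTR | 0x80ULL)  // plus the 2 MB page-size bit

VOID
BuildIdentityMap4GbSketch (
  VOID
  )
{
  UINT64  *Pml4;
  UINT64  *Pdpt;
  UINT64  *Pd;
  UINTN   Index;

  Pml4 = (UINT64 *)(UINTN)PT_BASE;                   // page 0
  Pdpt = (UINT64 *)(UINTN)(PT_BASE + SIZE_4KB);      // page 1
  Pd   = (UINT64 *)(UINTN)(PT_BASE + 2 * SIZE_4KB);  // pages 2..5

  ZeroMem ((VOID *)(UINTN)PT_BASE, 6 * SIZE_4KB);

  Pml4[0] = (UINT64)(UINTN)Pdpt | PAGE_ATTR;         // lowest 512 GB slot
  for (Index = 0; Index < 4; Index++) {              // four 1 GB slots
    Pdpt[Index] = (UINT64)(UINTN)&Pd[Index * 512] | PAGE_ATTR;
  }
  for (Index = 0; Index < 2048; Index++) {           // 2048 x 2 MB == 4 GB
    Pd[Index] = ((UINT64)Index << 21) | PDE_2MB;
  }
}

Note that the Accessed bit is pre-set in every entry here; as mentioned above, that pre-setting is exactly what didn't help when the tables lived in read-only flash.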
> 0x006000|0x001000
> gUefiOvmfPkgTokenSpaceGuid.PcdOvmfLockBoxStorageBase|gUefiOvmfPkgTokenSpaceGuid.PcdOvmfLockBoxStorageSize

This chunk (1 page) will be needed for internal purposes. Some of the data to be saved across S3 sleep are prepared during DXE, before booting the OS. Those data are separately allocated and saved in ACPI NVS regions (as high as possible below the end of 32-bit RAM, i.e. below the 32-bit PCI hole), and they are linked into this small administrative range (which hosts basically a linked list of pointers and sizes). Range: 8MB+24KB to 8MB+28KB.

> 0x010000|0x008000
> gUefiOvmfPkgTokenSpaceGuid.PcdOvmfSecPeiTempRamBase|gUefiOvmfPkgTokenSpaceGuid.PcdOvmfSecPeiTempRamSize

This area hosts the initial (temporary) heap and stack for SEC and PEI that I mentioned above. After PEI detects the size of available RAM later on, it informs the PEI core about it ("installs permanent system memory"), and then this heap and stack are dynamically relocated higher. Range: 8MB+64KB to 8MB+96KB.

> 0x018000|0x008000
> gUefiOvmfPkgTokenSpaceGuid.PcdS3AcpiReservedMemoryBase|gEfiIntelFrameworkModulePkgTokenSpaceGuid.PcdS3AcpiReservedMemorySize

This range is not used for anything (other than reserving it from the OS) during cold boot. During S3 resume, the temporary stack and heap are *not* migrated to some dynamic place in the full system memory (because that's already in use by the OS). Instead, the "permanent" PEI stack and heap are relocated to this region (which has been kept away from the OS). Range: 8MB+96KB to 8MB+128KB.

> 0x020000|0x0E0000
> gUefiOvmfPkgTokenSpaceGuid.PcdOvmfPeiMemFvBase|gUefiOvmfPkgTokenSpaceGuid.PcdOvmfPeiMemFvSize
> FV = PEIFV

This region hosts the PEI modules (after decompression), i.e. it's the final place for PEIFV. Range: 8MB+128KB to 9MB.

> 0x100000|0x700000
> gUefiOvmfPkgTokenSpaceGuid.PcdOvmfDxeMemFvBase|gUefiOvmfPkgTokenSpaceGuid.PcdOvmfDxeMemFvSize
> FV = DXEFV

This region hosts the DXE modules (after decompression), i.e. it's the final place for DXEFV. Range: 9MB to 16MB.

For S3 purposes, we must reserve all of these as ACPI NVS, except the last one (i.e. the DXE modules), because DXE is not run/reached during S3 resume.

> You have sections marked BS_Code, BS_Data, RT_Code, RT_Data and LoaderCode. Is LoaderCode the guts of the firmware?

Hmmm, I don't think so. The type "EfiLoaderCode" normally stands for "The code portions of a loaded application. (Note that UEFI OS loaders are UEFI applications.)" -- see Table 25 in the UEFI spec. For example, the "grub2-efi" binary qualifies. The OS can release/repurpose ranges of this type (see Table 26).

> Are you saying the PEIFV area contains yet more guts?

Internally, yes (it contains a bunch of PEI drivers), but the OS doesn't need to know. Same for the DXEFV range.

> And that at the time you have to decide where to put it, you don't know how much RAM is available yet and/or the code isn't relocatable?

Correct. When we decompress the "nameless" FV FFS file in SEC, and copy PEIFV and DXEFV to their "final" places from the decompressed output, we don't yet know how much RAM is available. We only determine that in one of the PEI modules (OvmfPkg/PlatformPei/), which is code located inside PEIFV. (At that point we will also install the "permanent PEI memory", triggering the temporary-to-permanent stack/heap migration.)
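For reference, here is the quoted [FD.MEMFD] layout consolidated as absolute guest-physical addresses (the macro names below are made up purely for this summary; the authoritative values are the PCDs and the FDF):

//
// Hypothetical names; values are just the 0x800000 MEMFD base plus the
// relative offsets quoted above.
//
#define MEMFD_BASE                 0x00800000  //  8 MB
#define SEC_PAGE_TABLES_BASE       0x00800000  //  size 0x006000 (6 pages)
#define LOCKBOX_STORAGE_BASE       0x00806000  //  size 0x001000 (1 page)
#define SEC_PEI_TEMP_RAM_BASE      0x00810000  //  size 0x008000
#define S3_ACPI_RESERVED_MEM_BASE  0x00818000  //  size 0x008000
#define PEI_MEM_FV_BASE            0x00820000  //  size 0x0E0000 (PEIFV)
#define DXE_MEM_FV_BASE            0x00900000  //  size 0x700000 (DXEFV)
#define MEMFD_END                  0x01000000  // 16 MB

(The quoted entries leave the 0x807000..0x80FFFF span unaccounted for; and, as noted above, everything except the DXEFV range must survive into S3 resume.)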
In theory, we could perhaps fetch the amount of RAM from the CMOS in SEC too, and use an 8MB range somewhere below the PCI hole rather than the fixed 8MB..16MB window. We certainly need Jordan to chime in here. The 8MB base address dates back to a time before I was around; moving it to the other end of guest RAM could regress stuff that I'm not aware of.

>> How large a contiguous range would you need from 1MB upwards? (Because the address that we'd shift this up to would likely directly impact the minimum qemu guest memory requirements.)
>
> Unfortunately I'm not sure I have a good answer to that question. We typically load the VxWorks image at 0x408000, and I think out of the box the 32-bit build needs about 300MB. (Yes, I know: that doesn't sound very embedded, does it.)

Ouch!

> But I don't think this is the right way to approach the issue either. Something tells me there's a better way to do what you're trying to do, but I don't understand enough about the problem yet to offer an alternate solution.

Of course I can't *prove* that what OVMF does is the best way, but I'll note that you load the VxWorks kernel at a fixed address, with a fixed size requirement (same as we do in OVMF, basically), even though the VxWorks kernel is higher up on the abstraction ladder. I don't think we should even ask the question "who's right" here.

For example, I sometimes glance at #linaro-enterprise on FreeNode. The AArch64 Linux kernel being discussed there seems to put other (different) address restrictions on the UEFI firmware that loads it (<http://irclogs.linaro.org/2014/01/28/%23linaro-enterprise.html>). This suggests that firmware, OS boot loader, and OS should find some understanding, and that this understanding will be arbitrary (because it can't really be justified by anything other than "well, this is how our OS works").

I assume that you boot, from under OVMF, a VxWorks-specific boot loader (which is a UEFI application), which in turn pulls in the 300MB kernel image at 0x408000. Is that correct? Maybe the boot loader could "simply" call gBS->AllocatePages() with the appropriate address hints instead. Or, if loading occurs after ExitBootServices(), then the initial runtime code could iterate over the UEFI memory map, find a sufficiently large contiguous range that consists of EfiConventionalMemory only (plus whatever types Table 26 allows to be freed), and load the kernel there.
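To illustrate the first option, here is a rough, untested sketch (mine, not anything VxWorks-specific; the 0x408000 base and the ~300MB size are simply the figures you quoted): before ExitBootServices(), the boot loader can ask the firmware for the exact physical range it wants, instead of assuming that it is free.

#include <Uefi.h>
#include <Library/UefiBootServicesTableLib.h>

EFI_STATUS
ReserveVxWorksKernelRangeSketch (
  VOID
  )
{
  EFI_PHYSICAL_ADDRESS  Address;
  UINTN                 Pages;
  EFI_STATUS            Status;

  Address = 0x408000;                            // desired kernel load base
  Pages   = EFI_SIZE_TO_PAGES (300 * SIZE_1MB);  // roughly 300 MB

  Status = gBS->AllocatePages (
                  AllocateAddress,               // exactly at Address
                  EfiLoaderData,
                  Pages,
                  &Address
                  );
  //
  // On failure, something already overlaps the requested range, and the
  // loader has to relocate the image or bail out gracefully.
  //
  return Status;
}

If the call fails, the loader at least learns up front that the range isn't entirely free, rather than silently overwriting whatever sits there.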
>>> It is possible to tweak things in VxWorks to avoid this problem, but it's a pain. It's also not something we typically encounter on real hardware.
>>
>> I don't think we'd like to hard-wire a *very* different base address statically. Maybe we could add a build option, but that only moves the pain around.
>>
>> Re it being different from real hardware, the explanation is that most of OVMF's modules are stored compressed in the flash, and are decompressed to (and then run from) RAM at startup. I assume on real hardware the firmware simply runs from flash. (Hm, I guess it could be shadowed into RAM too, but I have no data about what addresses.)
>
> I think it equally likely that you'd have compressed flash images on real hardware too. (We actually offer a romCompressed option with VxWorks, where there's no firmware on the system: there's just VxWorks, and it disgorges itself into RAM to execute. There is also a romResident option if you have enough flash/ROM to hold the whole image and don't mind the performance hit.)
>
> But if it's a question of just having the executable code still around somewhere and you can't manage that with compressed images (we can), why not create an uncompressed build option too? Yes, I know it would take up some more address space, but that may be the only way to make it work.

If we kept the PEI and DXE modules uncompressed in flash, then:

- we could indeed execute them directly from below 4GB, probably,
- but we'd still need to reserve other areas,
- and the flash size would grow significantly.

In the past, concerns were raised on the mailing list about raising the default flash size from 1MB to 2MB (I wasn't, and still am not, aware of the exact reasons), but I do think such a jump in size would raise concerns again.

>>>> Additionally, after the full S3 support series is committed, further code will be added to honor the case when the user disables S3 on the qemu command line ("-global PIIX4_PM.disable_s3=1"). Then the memory allocation in question will be qualified as Boot Services Data (rather than ACPI NVS), and the OS will be able to drop it after transitioning to runtime.
>>>
>>> It appears I need a newer version of QEMU for that option:
>>>
>>> root@core:/home/wpaul/ovmf # qemu-system-x86_64 -global PIIX4_PM.disable_s3=1
>>> qemu-system-x86_64: Property '.disable_s3' not found
>>
>> Correct. This property was added in
>>
>>   commit 459ae5ea5ad682c2b3220beb244d4102c1a4e332
>>   Author: Gleb Natapov <g...@redhat.com>
>>   Date:   Mon Jun 4 14:31:55 2012 +0300
>>
>>       Add PIIX4 properties to control PM system states.
>>
>> first released in v1.2.0.
>>
>> I searched the FreeBSD ports repo for qemu, and it seems that the "qemu-devel" package is at 1.7.0. (Not sure if you can easily get it in 9.1-RELEASE.)
>
> I'm sure I can shoehorn it in somehow. :)

Please note though that the OVMF code to actually honor this setting will only be written/added once the "basic" S3 functionality is complete. (Which it is not, for the time being.) Naturally, you can try to convince Jordan to implement that ASAP :)

(My v3 contained those patches at the end of the series. Jordan has taken over for v4 and v5, among other things changing the memory map significantly (for the better, I have no problems admitting that), but we've also split off / postponed honoring the disable_s3 property for a future, separate series.)

>>> That aside, this would be an acceptable compromise, at least until VxWorks supports S3 resume on the Intel architecture. :)
>>>
>>> I still think the placement of the PEIFV block is much less than ideal, but for the time being I can deal with it.
>>
>> Alternatively, please propose the lowest address that would work out of the box for your use case, and then Jordan could decide if it was reasonable to re-wire the FDFs with that address.
>
> I don't think assuming 300MB of RAM is reasonable

Fully agreed :)

> so I don't think that will work. Maybe once I've read more of the code I can suggest a better idea.

The in-tree code that fetches the amount of RAM from the CMOS is in "OvmfPkg/PlatformPei/MemDetect.c", and as I said, it runs in PEI. The decompression code is in "OvmfPkg/Sec/SecMain.c"; it runs in SEC, before PEI.

I think any proposed changes should be synchronized with Jordan's S3 series (v5 is coming soon; see the "ovmf-s3" branch reference above), because I believe it's going to rework some of the MemDetect / reservation bits.
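In case it helps while reading MemDetect.c: the CMOS query it performs boils down to something like the following rough sketch (not a verbatim copy of the in-tree code; it relies on the standard PC/QEMU convention that CMOS registers 0x34/0x35 hold the amount of RAM above 16MB, in 64KB units, accessed via I/O ports 0x70/0x71).

#include <Base.h>
#include <Library/IoLib.h>

STATIC
UINT8
CmosRead8Sketch (
  IN UINT8  Index
  )
{
  IoWrite8 (0x70, Index);  // select the CMOS register
  return IoRead8 (0x71);   // read its contents
}

UINT32
SystemMemorySizeBelow4gbSketch (
  VOID
  )
{
  UINT32  Chunks64K;

  Chunks64K = ((UINT32)CmosRead8Sketch (0x35) << 8) | CmosRead8Sketch (0x34);
  return (Chunks64K << 16) + SIZE_16MB;  // add back the low 16 MB
}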
Thanks,
Laszlo