On Mon, Dec 29, 2025 at 4:03 PM Pratyush Yadav <[email protected]> wrote:
>
> On Tue, Dec 23 2025, Pasha Tatashin wrote:
>
> >> >         if (WARN_ON_ONCE(info.magic != KHO_PAGE_MAGIC || info.order >
> >> > MAX_PAGE_ORDER))
> >> >                 return NULL;
> >>
> >> See my patch that drops this restriction:
> >> https://lore.kernel.org/linux-mm/[email protected]/
> >>
> >> I think it was wrong to add it in the first place.
> >
> > Agree, the restriction can be removed. Indeed, it is wrong as it is
> > not enforced during preservation.
> >
> > However, I think we are going to be in a world of pain if we allow
> > preserving memory from different topologies within the same order. In
> > kho_preserve_pages(), we have to check if the first and last page are
> > from the same nid; if not, reduce the order by 1 and repeat until they
> > are. It is just wrong to intermix different memory into the same
> > order, so in addition to removing that restriction, I think we should
> > implement this enforcement.
>
> Sure, makes sense.
>
> >
> > Also, perhaps we should pass the NID in the Jason's radix tree
> > together with the order. We could have a single tree that encodes both
> > order and NID information in the top level, or we can have one tree
> > per NID. It does not really matter to me, but that should help us with
> > faster struct page initialization.
>
> Can we use NIDs in ABI? Do they stay stable across reboots? I never
> looked at how NIDs actually get assigned.
>
> Not sure if we should target it for the initial merge of the radix tree,
> but I think this is something we can try to figure out later down the
> line.
>
> >
> >> >> To get the nid, you would need to call early_pfn_to_nid(). This takes a
> >> >> spinlock and searches through all memblock memory regions. I don't think
> >> >> it is too expensive, but it isn't free either. And all this would be
> >> >> done serially. With the zone search, you at least have some room for
> >> >> concurrency.
> >> >>
> >> >> I think either approach only makes a difference when we have a large
> >> >> number of low-order preservations. If we have a handful of high-order
> >> >> preservations, I suppose the overhead of nid search would be negligible.
> >> >
> >> > We should be targeting a situation where the vast majority of the
> >> > preserved memory is HugeTLB, but I am still worried about lower order
> >> > preservation efficiency for IOMMU page tables, etc.
> >>
> >> Yep. Plus we might get VMMs stashing some of their state in a memfd too.
> >
> > Yes, that is true, but hopefully those are tiny compared to everything else.
> >
> >> >> Long term, I think we should hook this into page_alloc_init_late() so
> >> >> that all the KHO pages also get initialized along with all the other
> >> >> pages. This will result in better integration of KHO with rest of MM
> >> >> init, and also have more consistent page restore performance.
> >> >
> >> > But we keep KHO as reserved memory, and hooking it up into
> >> > page_alloc_init_late() would make it very different, since that memory
> >> > is part of the buddy allocator memory...
> >>
> >> The idea I have is to have a separate call in page_alloc_init_late()
> >> that initializes KHO pages. It would traverse the radix tree (probably in
> >> parallel by distributing the address space across multiple threads?) and
> >> initialize all the pages. Then kho_restore_page() would only have to
> >> double-check the magic and it can directly return the page.
> >
> > I kind of do not like relying on magic to decide whether to initialize
> > the struct page. I would prefer to avoid this magic marker altogether:
> > i.e. struct page is either initialized or not, not halfway
> > initialized, etc.
>
> The magic is purely sanity checking. It is not used to decide anything
> other than to make sure this is actually a KHO page. I don't intend to
> change that.
> My point is, if we make sure the KHO pages are properly
> initialized during MM init, then restoring can actually be a very cheap
> operation, where you only do the sanity checking. You can even put the
> magic check behind CONFIG_KEXEC_HANDOVER_DEBUG if you want, but I think
> it is useful enough to keep in production systems too.

It is part of a critical hot path during blackout, so it really should be
behind CONFIG_KEXEC_HANDOVER_DEBUG.

> > Magic is not reliable. During machine reset in many firmware
> > implementations, and in every kexec reboot, memory is not zeroed. The
> > kernel usually allocates vmemmap using exactly the same pages, so
> > there is just too high a chance of getting magic values accidentally
> > inherited from the previous boot.
>
> I don't think that can happen. All the pages are zeroed when
> initialized, which will clear the magic. We should only be setting the
> magic on an initialized struct page.

This can happen due to bugs once we start using a partially initialized
"struct page", which is something Mike has been looking into: passing
some information in a struct page before it is fully initialized.

> >> Radix tree makes parallelism easier than the linked lists we have now.
> >
> > Agree, radix tree can absolutely help with parallelism.
> >
> >> >> Jason's radix tree patches will make that a bit easier to do I think.
> >> >> The zone search will scale better I reckon.
> >> >
> >> > It could, perhaps early in boot we should reserve the radix tree, and
> >> > use it as a source of truth look-ups later in boot?
> >>
> >> Yep. I think the radix tree should mark its own pages as preserved too
> >> so they stick around later in boot.
> >
> > Unfortunately, this can only be done in the new kernel, not in the old
> > kernel; otherwise we can end up with a recursive dependency that may
> > never be satisfied.
>
> Right. It shouldn't be too hard to do in the new kernel though. We will
> walk the whole tree anyway.
>
> --
> Regards,
> Pratyush Yadav
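
For reference, the nid enforcement in kho_preserve_pages() discussed earlier in the thread (check whether the first and last page are on the same nid; if not, drop the order by 1 and retry) could look roughly like the untested userspace sketch below. Everything named sketch_* is hypothetical and made up for illustration, as are the two-node layout and the boundary pfn; real code would use the kernel's own pfn-to-nid lookup. The first/last comparison also assumes that a node's memory is contiguous within the block being checked.

```c
#include <assert.h>
#include <limits.h>

/*
 * Assumption for illustration only: two nodes, with node 1 starting at
 * pfn 768. This is NOT how the kernel describes its topology.
 */
#define SKETCH_NODE1_START_PFN 768UL

/* Hypothetical stand-in for a pfn -> nid lookup. */
static int sketch_pfn_to_nid(unsigned long pfn)
{
	return pfn < SKETCH_NODE1_START_PFN ? 0 : 1;
}

/*
 * Reduce @order until the first and last pfn of the block starting at
 * @pfn belong to the same nid. An order-0 block is a single page and
 * therefore always passes.
 */
static unsigned int sketch_clamp_order_to_nid(unsigned long pfn,
					      unsigned int order)
{
	/* Keep the shift below well defined. */
	assert(order < sizeof(unsigned long) * CHAR_BIT);

	while (order > 0) {
		unsigned long last = pfn + (1UL << order) - 1;

		if (sketch_pfn_to_nid(pfn) == sketch_pfn_to_nid(last))
			break;
		order--;
	}

	return order;
}
```

With this made-up layout, an order-10 request at pfn 0 straddles the node boundary and is reduced to order 9, while an order-8 request at pfn 768 sits entirely on node 1 and is left alone. A caller would then preserve the remaining pages of the original range as further, smaller blocks.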
