On Mon, Mar 03, 2025 at 09:25:30AM +0100, David Hildenbrand wrote:
> On 20.02.25 17:48, Jiri Bohac wrote:
> >
> > By reserving additional crashkernel memory from CMA, the main
> > crashkernel reservation can be just large enough to fit the
> > kernel and initrd image, minimizing the memory taken away from
> > the production system. Most of the run-time memory for the crash
> > kernel will be memory previously available to userspace in the
> > production system. As this memory is no longer wasted, the
> > reservation can be done with a generous margin, making kdump more
> > reliable. Kernel memory that we need to preserve for dumping is
> > never allocated from CMA. User data is typically not dumped by
> > makedumpfile. When dumping of user data is intended this new CMA
> > reservation cannot be used.
>
> I'll note that your comment about "user space" is currently the case, but
> will likely not hold in the long run. The assumption you are making is that
> only user-space memory will be allocated from MIGRATE_CMA, which is not
> necessarily the case. Any movable allocation will end up in there.
>
> Besides LRU folios (user space memory and the pagecache), we already support
> migration of some kernel allocations using the non-lru migration framework.
> Such allocations (which use __GFP_MOVABLE, see __SetPageMovable()) currently
> only include
> * memory balloon: pages we never want to dump either way
> * zsmalloc (->zpool): only used by zswap (-> compressed LRU pages)
> * z3fold (->zpool): only used by zswap (-> compressed LRU pages)
>
> Just imagine if we support migration of other kernel allocations, such as
> user page tables. The dump would be missing important information.
>
> Once that happens, it will become a lot harder to judge whether CMA can be
> used or not. At least, the kernel could bail out/warn for these kernel
> configs.
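(A quick recap of the intended usage, for anyone joining the thread: the
series pairs a small regular crashkernel reservation with a larger,
CMA-backed one, along the lines of the command line below. The sizes are
made-up examples, and the ",cma" suffix is the spelling used by the
patches, which may still change:

    crashkernel=128M crashkernel=512M,cma

The regular reservation only needs to hold the kdump kernel and initrd;
the CMA area provides the run-time memory and remains usable for movable
allocations until a crash actually happens.)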
Thanks for pointing this out. I still don't see this as a roadblock for
my primary use case of the CMA reservation: get at least some (less
reliable and potentially less useful) kdump where the user is not
prepared to sacrifice the memory needed for the standard reservation and
where the only other option is no kdump at all. A lot can still be
analyzed with a vmcore that is missing those __GFP_MOVABLE pages, even
if/when some user page tables are missing.

I'll send a v3 with the documentation part updated to better describe
this.

> > The fourth patch adds a short delay before booting the kdump
> > kernel, allowing pending DMA transfers to finish.
>
> What does "short" mean? At least in theory, long-term pinning is forbidden
> for MIGRATE_CMA, so we should not have such pages mapped into an iommu where
> DMA can happily keep going on for quite a while.

See patch 4/5 in the series: I propose 1 second, which is negligible
from the kdump POV but should, I assume, be more than enough for the
non-long-term pins allowed in MIGRATE_CMA.

> But that assumes that our old kernel is not buggy, and doesn't end up
> mapping these pages into an IOMMU where DMA will just continue. I recall
> that DRM might currently be a problem, described here [1].
>
> If kdump starts not working as expected in case our old kernel is buggy,
> doesn't that partially destroy the purpose of kdump (-> debug bugs in the
> old kernel)?

Again, this is meant as a kind of "lightweight, best-effort kdump". If
there is a bug that causes the crash _and_ a bug in a driver that hogs
MIGRATE_CMA and maps it into the IOMMU, then this lightweight kdump may
break. Then it's time to sacrifice more memory and use a normal
crashkernel reservation. It's not like any bug in the old kernel will
break it; only a very specific kind of bug potentially can.

I see this whole thing as particularly useful for VMs. Unlike big
physical machines, where taking away a couple of hundred MB of memory
for kdump does not really hurt, a VM can ideally be given just enough
memory for its particular task, which is often less than 1 GB. A proper
kdump reservation needs a couple of hundred MB, i.e. a very large
proportion of the VM's memory. On a virtualization host running hundreds
or thousands of such VMs, this adds up to a huge waste of memory. And
VMs typically use few drivers for real hardware, which further decreases
the risk of hitting a buggy driver like this.

Thanks,

-- 
Jiri Bohac <jbo...@suse.cz>
SUSE Labs, Prague, Czechia