On 04.03.25 05:20, Baoquan He wrote:
On 03/03/25 at 09:17am, Donald Dutile wrote:


On 3/3/25 3:25 AM, David Hildenbrand wrote:
On 20.02.25 17:48, Jiri Bohac wrote:
Hi,

this series implements a way to reserve additional crash kernel
memory using CMA.

Link to the v1 discussion:
https://lore.kernel.org/lkml/zwd_fapqewkfl...@dwarf.suse.cz/
See below for the changes since v1 and how concerns from the
discussion have been addressed.

Currently, none of the memory reserved for the crash kernel is usable
by the 1st (production) kernel. It is also unmapped so that it can't
be corrupted by the fault that will eventually trigger the crash.
This makes sense for the memory actually used by the kexec-loaded
crash kernel image and initrd and the data prepared during the
load (vmcoreinfo, ...). However, the reserved space needs to be
much larger than that to provide enough run-time memory for the
crash kernel and the kdump userspace. Estimating the amount of
memory to reserve is difficult. Being too careful makes kdump
likely to end in OOM; being too generous takes even more memory
from the production system. Also, the reservation only allows
reserving a single contiguous block (or two with the "low"
suffix). I've seen systems where this fails because the physical
memory is fragmented.
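
For reference, the existing reservation is requested on the kernel
command line; the sizes here are only illustrative:

    crashkernel=512M                             (one contiguous block)
    crashkernel=512M,high crashkernel=256M,low   (plus a separate "low" block)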

By reserving additional crashkernel memory from CMA, the main
crashkernel reservation can be just large enough to fit the
kernel and initrd image, minimizing the memory taken away from
the production system. Most of the run-time memory for the crash
kernel will be memory previously available to userspace in the
production system. As this memory is no longer wasted, the
reservation can be done with a generous margin, making kdump more
reliable. Kernel memory that we need to preserve for dumping is
never allocated from CMA. User data is typically not dumped by
makedumpfile. When dumping of user data is intended this new CMA
reservation cannot be used.
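
If I read the series right, the fixed reservation can then stay small
and the bulk come from CMA via a new suffix; the ",cma" spelling and the
sizes below are my reading of the patches, not confirmed syntax:

    crashkernel=128M crashkernel=512M,cma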


Hi,

I'll note that your comment about "user space" holds today, but will likely 
not hold in the long run. The assumption you are making is that only user-space 
memory will be allocated from MIGRATE_CMA, which is not necessarily the case. Any movable 
allocation will end up in there.

Besides LRU folios (user space memory and the pagecache), we already support 
migration of some kernel allocations using the non-lru migration framework. 
Such allocations (which use __GFP_MOVABLE, see __SetPageMovable()) currently 
only include
* memory balloon: pages we never want to dump either way
* zsmalloc (->zpool): only used by zswap (-> compressed LRU pages)
* z3fold (->zpool): only used by zswap (-> compressed LRU pages)

Just imagine if we support migration of other kernel allocations, such as user 
page tables. The dump would be missing important information.
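
To make the "non-lru migration" part concrete, here is a rough sketch of
how a kernel user opts its pages into that framework; the my_* names are
made up and the signatures follow the movable_operations API of recent
kernels, so details may differ between versions:

#include <linux/gfp.h>
#include <linux/migrate.h>
#include <linux/pagemap.h>

static bool my_isolate_page(struct page *page, isolate_mode_t mode)
{
	/* Take the page off our internal lists so compaction can migrate it. */
	return true;
}

static int my_migrate_page(struct page *dst, struct page *src,
			   enum migrate_mode mode)
{
	/* Copy the contents and switch our internal references from src to dst. */
	return MIGRATEPAGE_SUCCESS;
}

static void my_putback_page(struct page *page)
{
	/* Migration was aborted: put the page back on our internal lists. */
}

static const struct movable_operations my_mops = {
	.isolate_page	= my_isolate_page,
	.migrate_page	= my_migrate_page,
	.putback_page	= my_putback_page,
};

static struct page *my_alloc_movable_page(void)
{
	/* __GFP_MOVABLE allows placement on MIGRATE_CMA / ZONE_MOVABLE pageblocks. */
	struct page *page = alloc_page(GFP_KERNEL | __GFP_MOVABLE);

	if (!page)
		return NULL;

	/* Register the migration callbacks (real users do this under the page lock). */
	lock_page(page);
	__SetPageMovable(page, &my_mops);
	unlock_page(page);
	return page;
}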

IOMMUFD is a near-term candidate for user page tables with multi-stage iommu 
support, currently going through upstream review.
Just saying that David's case will be the norm in high-end VMs with 
performance-enhanced, guest-driven iommu support (for GPUs).

Thanks to both of you, David and Don, for the valuable input. I agree
that we can argue that not every system has a memory balloon or swap
enabled for now, while future extension of migration to other kernel
allocations could become an obstacle we can't work around.

If we knew for sure this feature would turn out to be a problem, we
might need to stop it in advance.

Sorry for the late reply.

I think we just have to be careful to document it properly -- especially the shortcomings and that this feature might become a problem in the future. Movable user-space page tables getting placed on CMA memory would probably not be a problem if we don't care about ... user-space data either way.

The whole "Direct I/O takes max 1s" part is a bit shaky. Maybe it could be configurable how long to wait? 10s is certainly "safer".

But maybe, in the target use case: VMs, direct I/O will not be that common.
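
If the wait were made configurable, a minimal sketch could be a boot
parameter parsed by the kdump kernel; the "crashkernel_cma_settle" name
and the helper below are made up for illustration, not taken from the
series:

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/delay.h>

static unsigned long cma_settle_ms = 1000;	/* default: the 1s from the series */

static int __init parse_cma_settle(char *arg)
{
	return kstrtoul(arg, 10, &cma_settle_ms);
}
early_param("crashkernel_cma_settle", parse_cma_settle);

static void crashkernel_cma_settle(void)
{
	/* Busy-wait so this also works before the scheduler is up. */
	mdelay(cma_settle_ms);
}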

--
Cheers,

David / dhildenb

