HugeTLB manages its own pages. It allocates them on boot and uses those to fulfill hugepage requests.
To support live update for a hugetlb-backed memfd, it is necessary to track how many pages of each hstate are coming from live update. This is needed to ensure the boot time allocations don't over-allocate huge pages, causing the rest of the system unexpected memory pressure. For example, say the system has 100G memory and it uses 90 1G huge pages, with 10G put aside for other processes. Now say 5 of those pages are preserved via KHO for live updating a huge memfd. But during boot, the system will still see that it needs 90 huge pages, so it will attempt to allocate those. When the file is later retrieved, those 5 pages also get added to the huge page pool, resulting in 95 total huge pages. This exceeds the original expectation of 90 pages, and ends up wasting memory. LUO has file-lifecycle-bound (FLB) data to keep track of global state of a subsystem. Use it to track how many huge pages are used up for each hstate. When a file is preserved, it will increment to the counter, and when it is unpreserved, it will decrement it. During boot time allocations, this data can be used to calculate how many hugepages actually need to be allocated. Design note: another way of doing this would be to preserve the entire set of hugepages using the FLB, skip boot time allocation, and restore them all on FLB retrieve. The pain problem with that approach is that it would need to freeze all hstates after serializing them. This will need a lot more invasive changes in hugetlb since there are many ways folios can be added to or removed from a hstate. Doing it this way is simpler and less invasive. Signed-off-by: Pratyush Yadav <[email protected]> --- Documentation/mm/memfd_preservation.rst | 9 ++ MAINTAINERS | 1 + include/linux/kho/abi/hugetlb.h | 66 +++++++++ kernel/liveupdate/Kconfig | 12 ++ mm/Makefile | 1 + mm/hugetlb.c | 1 + mm/hugetlb_internal.h | 15 ++ mm/hugetlb_luo.c | 179 ++++++++++++++++++++++++ 8 files changed, 284 insertions(+) create mode 100644 include/linux/kho/abi/hugetlb.h create mode 100644 mm/hugetlb_luo.c diff --git a/Documentation/mm/memfd_preservation.rst b/Documentation/mm/memfd_preservation.rst index 66e0fb6d5ef0..6068dd55f4fb 100644 --- a/Documentation/mm/memfd_preservation.rst +++ b/Documentation/mm/memfd_preservation.rst @@ -16,6 +16,15 @@ Memfd Preservation ABI .. kernel-doc:: include/linux/kho/abi/memfd.h :internal: +HugeTLB-backed memfd Preservation ABI +===================================== + +.. kernel-doc:: include/linux/kho/abi/hugetlb.h + :doc: hugetlb-backed memfd live update ABI + +.. kernel-doc:: include/linux/kho/abi/hugetlb.h + :internal: + See Also ======== diff --git a/MAINTAINERS b/MAINTAINERS index fc23a0381e19..55ef24e80ae5 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -14481,6 +14481,7 @@ F: include/linux/liveupdate/ F: include/uapi/linux/liveupdate.h F: kernel/liveupdate/ F: lib/tests/liveupdate.c +F: mm/hugetlb_luo.c F: mm/memfd_luo.c F: tools/testing/selftests/liveupdate/ diff --git a/include/linux/kho/abi/hugetlb.h b/include/linux/kho/abi/hugetlb.h new file mode 100644 index 000000000000..55e833569c48 --- /dev/null +++ b/include/linux/kho/abi/hugetlb.h @@ -0,0 +1,66 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +/* + * Copyright (C) 2025 Amazon.com Inc. or its affiliates. + * Pratyush Yadav <[email protected]> + */ + +#ifndef _LINUX_KHO_ABI_HUGETLB_H +#define _LINUX_KHO_ABI_HUGETLB_H + +#include <linux/hugetlb.h> + +/** + * DOC: hugetlb-backed memfd live update ABI + * + * This header defines the ABI for preserving the state of the hugetlb subsystem + * and a hugetlb-backed memfd across a kexec reboot using LUO. + * + * This interface is a contract. Any modification to the structure layout + * constitutes a breaking change. Such changes require incrementing the version + * number in the HUGETLB_FLB_COMPATIBLE or HUGE_MEMFD_COMPATIBLE strings for + * hugetlb FLB or hugetlb-backed memfd, respectively. + */ + +/* + * Keep the serialized max hstates separate from the kernel's HUGE_MAX_HSTATE to + * keep the value stable. + * + * Currently x86 and arm64 are supported. x86 has HUGE_MAX_HSTATE as 2 and arm64 + * has 4. Pick 4 as the number to start with. + */ +#define HUGETLB_SER_MAX_HSTATES 4 + +static_assert(HUGETLB_SER_MAX_HSTATES >= HUGE_MAX_HSTATE); + +/** + * struct hugetlb_hstate_ser: Serialized state of a hstate. + * @nr_pages: Number of preserved pages in the hstate. + * @order: Order of the hstate this struct describes. + * + * The only state needed for hstates is the number of pages that are preserved + * from this hstate. The preserved pages are added to the hstate when the file + * is retrieved. This information gets used in early boot to calculate the + * remaining pages that must be allocated by the normal path. + */ +struct hugetlb_hstate_ser { + /* Number of _preserved_ pages in the hstate. */ + u64 nr_pages; + u8 order; +} __packed; + +/** + * struct hugetlb_ser - The main serialization structure for HugeTLB FLB. + * @hstates: Array of serialized hstates. + * @nr_hstates: Number of serialized hstates in the array. + */ +struct hugetlb_ser { + struct hugetlb_hstate_ser hstates[HUGETLB_SER_MAX_HSTATES]; + u8 nr_hstates; +} __packed; + +static_assert(sizeof(struct hugetlb_ser) <= PAGE_SIZE); + +#define HUGETLB_FLB_COMPATIBLE "hugetlb-v1" + +#endif /* _LINUX_KHO_ABI_HUGETLB_H */ diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig index 9b2515f31afb..86e76aed8a93 100644 --- a/kernel/liveupdate/Kconfig +++ b/kernel/liveupdate/Kconfig @@ -72,4 +72,16 @@ config LIVEUPDATE If unsure, say N. +config LIVEUPDATE_HUGETLB + bool "Live update support for HugeTLB" + depends on LIVEUPDATE && HUGETLBFS + help + + Enable live update support for the HugeTLB subsystem. This allows live + updating memfd backed by huge pages. This can be used by hypervisors that + use hugetlb memfd to back VM memory, or for other user workloads needing + to live update huge pages. + + If unsure, say N. + endmenu diff --git a/mm/Makefile b/mm/Makefile index 7738ec416f00..753bc1e3f3fd 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -101,6 +101,7 @@ obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o obj-$(CONFIG_PAGE_COUNTER) += page_counter.o obj-$(CONFIG_LIVEUPDATE) += memfd_luo.o +obj-$(CONFIG_LIVEUPDATE_HUGETLB) += hugetlb_luo.o obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o ifdef CONFIG_SWAP diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 0f818086bf4f..ff90ceacf62c 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4702,6 +4702,7 @@ static int __init hugetlb_init(void) hugetlb_sysfs_init(); hugetlb_cgroup_file_init(); hugetlb_sysctl_init(); + hugetlb_luo_init(); #ifdef CONFIG_SMP num_fault_mutexes = roundup_pow_of_two(8 * num_possible_cpus()); diff --git a/mm/hugetlb_internal.h b/mm/hugetlb_internal.h index edfb4eb75828..b7b149c56567 100644 --- a/mm/hugetlb_internal.h +++ b/mm/hugetlb_internal.h @@ -9,6 +9,7 @@ #include <linux/hugetlb.h> #include <linux/hugetlb_cgroup.h> #include <linux/list.h> +#include <linux/liveupdate.h> void init_new_hugetlb_folio(struct folio *folio); void account_new_hugetlb_folio(struct hstate *h, struct folio *folio); @@ -32,4 +33,18 @@ static inline struct resv_map *inode_resv_map(struct inode *inode) return (struct resv_map *)(&inode->i_data)->i_private_data; } +#ifdef CONFIG_LIVEUPDATE_HUGETLB +void hugetlb_luo_init(void); +unsigned long hstate_liveupdate_pages(struct hstate *h); +#else +static inline void hugetlb_luo_init(void) +{ +} + +static inline unsigned long hstate_liveupdate_pages(struct hstate *h) +{ + return 0; +} +#endif /* CONFIG_LIVEUPDATE_HUGETLB */ + #endif /* __HUGETLB_INTERNAL_H */ diff --git a/mm/hugetlb_luo.c b/mm/hugetlb_luo.c new file mode 100644 index 000000000000..80e3e015eca5 --- /dev/null +++ b/mm/hugetlb_luo.c @@ -0,0 +1,179 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2025 Amazon.com Inc. or its affiliates. + * Copyright (C) 2025 Pratyush Yadav <[email protected]> + */ + +/* The documentation for this is in mm/memfd_luo.c */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include <linux/liveupdate.h> +#include <linux/kexec_handover.h> +#include <linux/hugetlb.h> +#include <linux/kho/abi/hugetlb.h> +#include <linux/spinlock.h> + +#include "hugetlb_internal.h" + +struct hugetlb_flb_obj { + /* Serializes access to ser and its hstates. */ + spinlock_t lock; + struct hugetlb_ser *ser; +}; + +static int hugetlb_flb_preserve(struct liveupdate_flb_op_args *args) +{ + struct hugetlb_ser *hugetlb_ser; + struct hugetlb_flb_obj *obj; + u8 nr_hstates = 0; + struct hstate *h; + + obj = kmalloc(sizeof(*obj), GFP_KERNEL); + if (!obj) + return -ENOMEM; + + hugetlb_ser = kho_alloc_preserve(sizeof(*hugetlb_ser)); + if (!hugetlb_ser) { + kfree(obj); + return -ENOMEM; + } + + spin_lock_init(&obj->lock); + obj->ser = hugetlb_ser; + + for_each_hstate(h) { + struct hugetlb_hstate_ser *hser = &hugetlb_ser->hstates[nr_hstates]; + + hser->nr_pages = 0; + hser->order = h->order; + nr_hstates++; + } + + hugetlb_ser->nr_hstates = nr_hstates; + + args->obj = obj; + args->data = virt_to_phys(hugetlb_ser); + + return 0; +} + +static void hugetlb_flb_unpreserve(struct liveupdate_flb_op_args *args) +{ + kho_unpreserve_free(phys_to_virt(args->data)); + kfree(args->obj); +} + +static void hugetlb_flb_finish(struct liveupdate_flb_op_args *args) +{ + /* No live state on the retrieve side. */ +} + +static int hugetlb_flb_retrieve(struct liveupdate_flb_op_args *args) +{ + /* + * The FLB is only needed for boot-time calculation of how many + * hugepages are needed. This is done by early boot handlers already. + * Free the serialized state now. + */ + kho_restore_free(phys_to_virt(args->data)); + + /* + * HACK: But since LUO FLB still needs an obj, use ZERO_SIZE_PTR to + * satisfy it. + */ + args->obj = ZERO_SIZE_PTR; + return 0; +} + +static struct liveupdate_flb_ops hugetlb_luo_flb_ops = { + .preserve = hugetlb_flb_preserve, + .unpreserve = hugetlb_flb_unpreserve, + .finish = hugetlb_flb_finish, + .retrieve = hugetlb_flb_retrieve, +}; + +static struct liveupdate_flb hugetlb_luo_flb = { + .ops = &hugetlb_luo_flb_ops, + .compatible = HUGETLB_FLB_COMPATIBLE, +}; + +static struct hugetlb_hstate_ser +*hugetlb_flb_get_hser(struct hugetlb_ser *hugetlb_ser, unsigned int order) +{ + for (u8 i = 0; i < hugetlb_ser->nr_hstates; i++) { + if (hugetlb_ser->hstates[i].order == order) + return &hugetlb_ser->hstates[i]; + } + + return NULL; +} + +static int hugetlb_flb_add_folio(struct hstate *h) +{ + struct hugetlb_ser *hugetlb_ser; + struct hugetlb_hstate_ser *hser; + struct hugetlb_flb_obj *obj; + int err; + + err = liveupdate_flb_get_outgoing(&hugetlb_luo_flb, (void **)&obj); + if (err) + return err; + + hugetlb_ser = obj->ser; + + guard(spinlock)(&obj->lock); + hser = hugetlb_flb_get_hser(hugetlb_ser, h->order); + if (!hser) + return -ENOENT; + + hser->nr_pages++; + return 0; +} + +static int hugetlb_flb_del_folio(struct hstate *h) +{ + struct hugetlb_ser *hugetlb_ser; + struct hugetlb_hstate_ser *hser; + struct hugetlb_flb_obj *obj; + int err; + + err = liveupdate_flb_get_outgoing(&hugetlb_luo_flb, (void **)&obj); + if (err) + return err; + + hugetlb_ser = obj->ser; + + guard(spinlock)(&obj->lock); + hser = hugetlb_flb_get_hser(hugetlb_ser, h->order); + if (!hser) + return -ENOENT; + + hser->nr_pages--; + return 0; +} + +unsigned long __init hstate_liveupdate_pages(struct hstate *h) +{ + struct hugetlb_hstate_ser *hser; + struct hugetlb_ser *hugetlb_ser; + u64 data; + int err; + + err = liveupdate_flb_incoming_early(&hugetlb_luo_flb, &data); + if (err) + /* If FLB can't be fetched, assume no pages from liveupdate. */ + return 0; + + hugetlb_ser = phys_to_virt(data); + + /* NOTE: No need for locking since this is read-only on incoming side. */ + hser = hugetlb_flb_get_hser(hugetlb_ser, h->order); + return hser ? hser->nr_pages : 0; +} + +void __init hugetlb_luo_init(void) +{ + if (!liveupdate_enabled()) + return; +} -- 2.43.0
