Hi Andrew,
On 13.05.25 01:59, Andrew Morton wrote:
On Mon, 12 May 2025 22:57:52 +0000 Alexander Graf <g...@amazon.com> wrote:
When booting a new kernel with kexec_file, the kernel picks a target
location that the new kernel should live at, then allocates random pages,
checks whether any of those pages happens to coincide with
a target address range and if so, uses them for that range.
For every page allocated this way, it then creates an entry in a page list
that the relocation code - code that executes while all CPUs are off and we
are just about to jump into the new kernel - walks to copy each page to its
final memory location. We cannot put the pages there beforehand, because
chances are pretty good that at least some page in the target range is
already in use by the currently running Linux environment.
All of this is inefficient.
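The copy-at-jump scheme described above can be sketched in plain userspace
C. This is only an illustration of the control flow, not the kernel's actual
code: the structure and function names here are made up, arrays stand in for
physical pages, and the toy page size is deliberately tiny.

```c
#include <string.h>

#define PAGE_SIZE 16 /* toy page size; real systems use 4 KiB or more */

/* One entry of the relocation page list: at jump time, copy the staging
 * page the running kernel could actually allocate (src) to the page the
 * payload must finally occupy (dst). */
struct reloc_entry {
	unsigned char *src;
	unsigned char *dst;
};

/* The "hot phase": runs while all other CPUs are off, immediately before
 * jumping into the new kernel, when the target pages are finally free. */
static void relocate(const struct reloc_entry *list, int n)
{
	for (int i = 0; i < n; i++)
		memcpy(list[i].dst, list[i].src, PAGE_SIZE);
}
```

With a CMA-backed segment, the payload already sits at its final physical
location, so the corresponding entries never enter this loop.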
Since kexec was introduced, Linux has gained the CMA framework, which
can hand out physically contiguous memory while keeping that memory
available for movable allocations whenever it is not needed for contiguous
ones. The default CMA allocator is used for DMA allocations.
This patch adds logic to the kexec file loader to attempt to place the
target payload at a location allocated from CMA. If successful, it uses
that memory range directly instead of creating copy instructions during
the hot phase. To ensure that there is a safety net in case anything goes
wrong with the CMA allocation, it also adds a flag that lets user space
forcibly disable CMA allocations.
Using CMA allocations has two advantages:
1) Faster. There is no more need to copy in the hot phase.
How much faster? Kinda matters as "fast" is the whole point of the patch!
It mostly depends on your memory bandwidth, cache size and the amount of
payload you kexec into. Today, we copy the kernel, initrd and dtb during
the kexec hot phase. With this patch, we copy none of them.
At a hypothetical payload size of 100MiB, on an AMLogic S905 we're
looking at roughly 50ms of copy time; on an Ice Lake system, at about 4ms.
As for real life benchmarks, on an Apple M1 VM with 53MiB payload I
measure a 2.2ms speedup.
That said, to me robustness is at least as important as speed here. The
current scheme creates memory allocation conflicts almost by design,
which increase the odds of something going wrong.
2) More robust. Even if by accident some page is still in use for DMA,
the new kernel image will be safe from that access because it resides
in a memory region that the old kernel considers allocated, and the
new kernel has a chance to reinitialize that component.
Is this known to be a problem in current code?
Yes, see David's email. I'll clarify in the patch description.
Some minor observations:
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
...
@@ -331,6 +336,7 @@ struct kimage {
*/
unsigned int hotplug_support:1;
#endif
+ unsigned int no_cma : 1;
"no_cma:1" is more conventional.
Fixed.
#ifdef ARCH_HAS_KIMAGE_ARCH
struct kimage_arch arch;
diff --git a/include/uapi/linux/kexec.h b/include/uapi/linux/kexec.h
index 5ae1741ea8ea..8958ebfcff94 100644
--- a/include/uapi/linux/kexec.h
+++ b/include/uapi/linux/kexec.h
...
+static int kimage_load_cma_segment(struct kimage *image, struct kexec_segment *segment)
+{
+ unsigned long maddr;
+ size_t ubytes, mbytes;
+ int result = 0;
+ unsigned char __user *buf = NULL;
+ unsigned char *kbuf = NULL;
+ char *ptr = page_address(segment->cma);
+
+ if (image->file_mode)
+ kbuf = segment->kbuf;
+ else
+ buf = segment->buf;
+ ubytes = segment->bufsz;
+ mbytes = segment->memsz;
+ maddr = segment->mem;
+
+ /* Initialize the buffer with zeros to allow for smaller input buffers */
+ memset(ptr, 0, mbytes);
Would it be more efficient to zero the remainder after performing the copy?
Yup. Changed :).
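The revised ordering - copy the payload first, then clear only the tail -
can be sketched as a small userspace helper. The function name and buffer
names are illustrative, not the kernel's; the point is just that the
destination bytes covered by the source are never written twice.

```c
#include <string.h>

/*
 * Fill dst (mbytes long) from src (ubytes long, ubytes <= mbytes):
 * copy the payload first, then zero only the remainder, instead of
 * zeroing the whole destination up front and copying over most of it.
 */
static void load_segment(unsigned char *dst, size_t mbytes,
			 const unsigned char *src, size_t ubytes)
{
	memcpy(dst, src, ubytes);
	memset(dst + ubytes, 0, mbytes - ubytes);
}
```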
+ /* Then copy from source buffer to the CMA one */
+ while (mbytes) {
+ size_t uchunk, mchunk;
+
+ ptr += maddr & ~PAGE_MASK;
+ mchunk = min_t(size_t, mbytes,
+ PAGE_SIZE - (maddr & ~PAGE_MASK));
+ uchunk = min(ubytes, mchunk);
+
+ if (uchunk) {
+ /* For file based kexec, source pages are in kernel memory */
+ if (image->file_mode)
+ memcpy(ptr, kbuf, uchunk);
+ else
+ result = copy_from_user(ptr, buf, uchunk);
+ ubytes -= uchunk;
+ if (image->file_mode)
+ kbuf += uchunk;
+ else
+ buf += uchunk;
+ }
+
+ if (result) {
+ result = -EFAULT;
+ goto out;
+ }
+
+ ptr += mchunk;
+ maddr += mchunk;
+ mbytes -= mchunk;
+
+ cond_resched();
+ }
+out:
+ return result;
+}
+
...
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
...
@@ -632,6 +635,38 @@ static int kexec_walk_resources(struct kexec_buf *kbuf,
return walk_system_ram_res(0, ULONG_MAX, kbuf, func);
}
+static int kexec_alloc_contig(struct kexec_buf *kbuf)
+{
+ u32 pages = (u32)(kbuf->memsz >> PAGE_SHIFT);
I don't think the cast is needed?
I added it to make it explicit that we're casting (and hence potentially
losing a few bits). With size_t, the cast is no longer necessary.
+ unsigned long mem;
+ struct page *p;
+
+ if (kbuf->image->no_cma)
+ return -EPERM;
+
+ p = dma_alloc_from_contiguous(NULL, pages, get_order(kbuf->buf_align), true);
dma_alloc_from_contiguous()'s `count' arg is size_t. Making `pages'
size_t seems best here. (And nr_pages would be a better identifier!)
+ if (!p)
+ return -EADDRNOTAVAIL;
EADDRNOTAVAIL is a networking thing. People will be surprised to see
kexec returning networking errors. Perhaps choose something more
appropriate?
I was surprised too when I saw it in the surrounding code. It's the
error code that locate_mem_hole() returns, so I copied it. Let me change
it at least in this function to something sensible.
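Putting the three observations together (size_t page count with a clearer
name, no u32 cast, and a conventional error code), the revised helper might
look roughly like the sketch below. To keep it self-contained in userspace,
dma_alloc_from_contiguous() is stubbed with malloc and the kexec structs are
reduced to the fields used here; -ENOMEM is just one plausible replacement
for -EADDRNOTAVAIL, not necessarily what the final patch picks.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define PAGE_SHIFT 12

struct page { void *mem; };

/* Stub standing in for the kernel's CMA allocator so this compiles in
 * userspace; the real function takes (dev, count, order, no_warn). */
static struct page *dma_alloc_from_contiguous(void *dev, size_t count,
					      unsigned int order, bool no_warn)
{
	struct page *p = malloc(sizeof(*p));

	(void)dev; (void)order; (void)no_warn;
	if (p)
		p->mem = calloc(count, 1u << PAGE_SHIFT);
	return p;
}

/* Reduced stand-in for struct kexec_buf, keeping only what we need. */
struct kexec_buf_sketch {
	size_t memsz;
	bool no_cma;
};

/* Revised per the review: size_t nr_pages, no explicit cast, and a
 * non-networking error code on allocation failure. */
static int kexec_alloc_contig_sketch(struct kexec_buf_sketch *kbuf,
				     struct page **out)
{
	size_t nr_pages = kbuf->memsz >> PAGE_SHIFT;
	struct page *p;

	if (kbuf->no_cma)
		return -EPERM;

	p = dma_alloc_from_contiguous(NULL, nr_pages, 0, true);
	if (!p)
		return -ENOMEM;

	*out = p;
	return 0;
}
```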
Alex
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597