On Mon, 12 May 2025 22:57:52 +0000 Alexander Graf <g...@amazon.com> wrote:
> When booting a new kernel with kexec_file, the kernel picks a target
> location that the kernel should live at, then allocates random pages,
> checks whether any of those pages magically happens to coincide with
> a target address range and if so, uses them for that range.
>
> For every page allocated this way, it then creates a page list that the
> relocation code - code that executes while all CPUs are off and we are
> just about to jump into the new kernel - copies to their final memory
> location. We can not put them there before, because chances are pretty
> good that at least some page in the target range is already in use by
> the currently running Linux environment.
>
> All of this is inefficient.
>
> Since kexec got introduced, Linux has gained the CMA framework which
> can perform physically contiguous memory mappings, while keeping that
> memory available for movable memory when it is not needed for contiguous
> allocations. The default CMA allocator is for DMA allocations.
>
> This patch adds logic to the kexec file loader to attempt to place the
> target payload at a location allocated from CMA. If successful, it uses
> that memory range directly instead of creating copy instructions during
> the hot phase. To ensure that there is a safety net in case anything goes
> wrong with the CMA allocation, it also adds a flag for user space to force
> disable CMA allocations.
>
> Using CMA allocations has two advantages:
>
>   1) Faster. There is no more need to copy in the hot phase.

How much faster?  Kinda matters as "fast" is the whole point of the
patch!

>   2) More robust. Even if by accident some page is still in use for DMA,
>      the new kernel image will be safe from that access because it resides
>      in a memory region that is considered allocated in the old kernel and
>      has a chance to reinitialize that component.

Is this known to be a problem in current code?

Some minor observations:

> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
>
> ...
>
> @@ -331,6 +336,7 @@ struct kimage {
>  	 */
>  	unsigned int hotplug_support:1;
>  #endif
> +	unsigned int no_cma : 1;

"no_cma:1" is more conventional.

>  #ifdef ARCH_HAS_KIMAGE_ARCH
>  	struct kimage_arch arch;
> diff --git a/include/uapi/linux/kexec.h b/include/uapi/linux/kexec.h
> index 5ae1741ea8ea..8958ebfcff94 100644
> --- a/include/uapi/linux/kexec.h
> +++ b/include/uapi/linux/kexec.h
>
> ...
>
> +static int kimage_load_cma_segment(struct kimage *image, struct kexec_segment *segment)
> +{
> +	unsigned long maddr;
> +	size_t ubytes, mbytes;
> +	int result = 0;
> +	unsigned char __user *buf = NULL;
> +	unsigned char *kbuf = NULL;
> +	char *ptr = page_address(segment->cma);
> +
> +	if (image->file_mode)
> +		kbuf = segment->kbuf;
> +	else
> +		buf = segment->buf;
> +	ubytes = segment->bufsz;
> +	mbytes = segment->memsz;
> +	maddr = segment->mem;
> +
> +	/* Initialize the buffer with zeros to allow for smaller input buffers */
> +	memset(ptr, 0, mbytes);

Would it be more efficient to zero the remainder after performing the
copy?

> +	/* Then copy from source buffer to the CMA one */
> +	while (mbytes) {
> +		size_t uchunk, mchunk;
> +
> +		ptr += maddr & ~PAGE_MASK;
> +		mchunk = min_t(size_t, mbytes,
> +			       PAGE_SIZE - (maddr & ~PAGE_MASK));
> +		uchunk = min(ubytes, mchunk);
> +
> +		if (uchunk) {
> +			/* For file based kexec, source pages are in kernel memory */
> +			if (image->file_mode)
> +				memcpy(ptr, kbuf, uchunk);
> +			else
> +				result = copy_from_user(ptr, buf, uchunk);
> +			ubytes -= uchunk;
> +			if (image->file_mode)
> +				kbuf += uchunk;
> +			else
> +				buf += uchunk;
> +		}
> +
> +		if (result) {
> +			result = -EFAULT;
> +			goto out;
> +		}
> +
> +		ptr += mchunk;
> +		maddr += mchunk;
> +		mbytes -= mchunk;
> +
> +		cond_resched();
> +	}
> +out:
> +	return result;
> +}
> +
>
> ...
>
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
>
> ...
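To illustrate what I mean: copy first, then zero only the tail that the
source buffer didn't cover, so the copied bytes are touched once instead
of twice.  A userspace sketch (load_segment_sketch is a made-up name,
not the kernel function):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Sketch of the suggested ordering: perform the copy first, then
 * memset() only the remainder of the destination.  The region covered
 * by the source buffer is written exactly once.
 */
static void load_segment_sketch(char *dst, size_t memsz,
				const char *src, size_t bufsz)
{
	size_t copy = bufsz < memsz ? bufsz : memsz;

	memcpy(dst, src, copy);
	/* Zero only the part the source did not cover */
	if (copy < memsz)
		memset(dst + copy, 0, memsz - copy);
}
```

In the patch the copy is chunked per page with cond_resched() in
between, but the same copy-then-zero ordering could be applied to the
tail after the loop.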
>
> @@ -632,6 +635,38 @@ static int kexec_walk_resources(struct kexec_buf *kbuf,
>  	return walk_system_ram_res(0, ULONG_MAX, kbuf, func);
>  }
>
> +static int kexec_alloc_contig(struct kexec_buf *kbuf)
> +{
> +	u32 pages = (u32)(kbuf->memsz >> PAGE_SHIFT);

I don't think the cast is needed?

> +	unsigned long mem;
> +	struct page *p;
> +
> +	if (kbuf->image->no_cma)
> +		return -EPERM;
> +
> +	p = dma_alloc_from_contiguous(NULL, pages, get_order(kbuf->buf_align), true);

dma_alloc_from_contiguous()'s `count' arg is size_t.  Making `pages'
size_t seems best here.  (And nr_pages would be a better identifier!)

> +	if (!p)
> +		return -EADDRNOTAVAIL;

EADDRNOTAVAIL is a networking thing.  People will be surprised to see
kexec returning networking errors.  Perhaps choose something more
appropriate?

> +	pr_debug("allocated %d DMA pages at 0x%lx", pages, page_to_boot_pfn(p));
> +
> +	mem = page_to_boot_pfn(p) << PAGE_SHIFT;
> +
> +	if (kimage_is_destination_range(kbuf->image, mem, mem + kbuf->memsz)) {
> +		/* Our region is already in use by a statically defined one. Bail out. */
> +		pr_debug("CMA overlaps existing mem: 0x%lx+0x%lx\n", mem, kbuf->memsz);
> +		dma_release_from_contiguous(NULL, p, pages);
> +		return -EADDRNOTAVAIL;
> +	}
> +
> +	kbuf->mem = page_to_boot_pfn(p) << PAGE_SHIFT;
> +	kbuf->cma = p;
> +
> +	arch_kexec_post_alloc_pages(page_address(p), pages, 0);
> +
> +	return 0;
> +}
> +
>
> ...
>
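On the size_t point, something like the below is what I had in mind:
kbuf->memsz is already size_t, so keeping the page count in size_t
drops the narrowing cast and matches dma_alloc_from_contiguous()'s
`count' type.  (Sketch only; PAGE_SHIFT hardcoded to an assumed 4 KiB
page size, and nr_pages_for is a made-up helper name.)

```c
#include <stddef.h>

#define PAGE_SHIFT 12	/* assumed 4 KiB pages, for illustration only */

/*
 * Compute the page count in size_t end to end: no (u32) cast, and the
 * result can be passed straight through as a size_t `count'.
 */
static size_t nr_pages_for(size_t memsz)
{
	return memsz >> PAGE_SHIFT;
}
```

And for the errno: -ENOMEM is the conventional return for an allocation
failure, which would fit both failure paths here better than a
networking errno.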