On Mon, 12 May 2025 22:57:52 +0000 Alexander Graf <g...@amazon.com> wrote:

> When booting a new kernel with kexec_file, the kernel picks a target
> location that the new kernel should live at, then allocates random pages,
> checks whether any of those pages magically happens to coincide with
> a target address range and, if so, uses them for that range.
> 
> For every page allocated this way, it then creates a page-list entry
> that the relocation code - code that executes while all CPUs are off
> and we are just about to jump into the new kernel - uses to copy the
> page to its final memory location. We cannot put the pages there
> beforehand, because chances are pretty good that at least some page in
> the target range is already in use by the currently running Linux
> environment.
> 
> All of this is inefficient.
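
(The hot-phase copy described above is conceptually an indirection-list
walk along these lines - a simplified sketch of kexec's IND_* entry
handling, not the actual relocate_kernel code; `list' here is a
hypothetical stand-in for the head of the kimage entry list:)

	unsigned long *ind = list;
	unsigned long dest = 0;

	for (;;) {
		unsigned long entry = *ind++;

		if (entry & IND_DESTINATION)		/* set next target page */
			dest = entry & PAGE_MASK;
		else if (entry & IND_INDIRECTION)	/* follow to next list page */
			ind = (unsigned long *)(entry & PAGE_MASK);
		else if (entry & IND_SOURCE) {		/* copy one source page */
			copy_page((void *)dest, (void *)(entry & PAGE_MASK));
			dest += PAGE_SIZE;
		} else if (entry & IND_DONE)		/* end of list */
			break;
	}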
> 
> Since kexec was introduced, Linux has gained the CMA framework, which
> can provide physically contiguous memory allocations while keeping that
> memory available for movable allocations when it is not needed for
> contiguous ones. The default CMA area is used for DMA allocations.
> 
> This patch adds logic to the kexec file loader to attempt to place the
> target payload at a location allocated from CMA. If successful, it uses
> that memory range directly instead of creating copy instructions during
> the hot phase. To provide a safety net in case anything goes wrong with
> the CMA allocation, it also adds a flag that lets user space forcibly
> disable CMA allocations.
> 
> Using CMA allocations has two advantages:
> 
>   1) Faster. There is no longer any need to copy in the hot phase.

How much faster?  Kinda matters as "fast" is the whole point of the patch!

>   2) More robust. Even if some page is accidentally still in use for DMA,
>      the new kernel image will be safe from that access, because it
>      resides in a memory region that the old kernel considers allocated,
>      and the new kernel has a chance to reinitialize the offending
>      component.

Is this known to be a problem in current code?

Some minor observations:

> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
>
> ...
>
> @@ -331,6 +336,7 @@ struct kimage {
>        */
>       unsigned int hotplug_support:1;
>  #endif
> +     unsigned int no_cma : 1;

"no_cma:1" is more conventional.

>  #ifdef ARCH_HAS_KIMAGE_ARCH
>       struct kimage_arch arch;
> diff --git a/include/uapi/linux/kexec.h b/include/uapi/linux/kexec.h
> index 5ae1741ea8ea..8958ebfcff94 100644
> --- a/include/uapi/linux/kexec.h
> +++ b/include/uapi/linux/kexec.h
>
> ...
>
> +static int kimage_load_cma_segment(struct kimage *image, struct kexec_segment *segment)
> +{
> +     unsigned long maddr;
> +     size_t ubytes, mbytes;
> +     int result = 0;
> +     unsigned char __user *buf = NULL;
> +     unsigned char *kbuf = NULL;
> +     char *ptr = page_address(segment->cma);
> +
> +     if (image->file_mode)
> +             kbuf = segment->kbuf;
> +     else
> +             buf = segment->buf;
> +     ubytes = segment->bufsz;
> +     mbytes = segment->memsz;
> +     maddr = segment->mem;
> +
> +     /* Initialize the buffer with zeros to allow for smaller input buffers */
> +     memset(ptr, 0, mbytes);

Would it be more efficient to zero the remainder after performing the copy?
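
Something along these lines, perhaps (an untested sketch: it assumes
maddr is page-aligned and leaves aside the per-page chunking and
cond_resched()):

	/* copy the payload first... */
	if (image->file_mode)
		memcpy(ptr, kbuf, ubytes);
	else if (copy_from_user(ptr, buf, ubytes))
		return -EFAULT;

	/* ...then zero only the remainder not covered by the copy */
	memset(ptr + ubytes, 0, mbytes - ubytes);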

> +     /* Then copy from source buffer to the CMA one */
> +     while (mbytes) {
> +             size_t uchunk, mchunk;
> +
> +             ptr += maddr & ~PAGE_MASK;
> +             mchunk = min_t(size_t, mbytes,
> +                             PAGE_SIZE - (maddr & ~PAGE_MASK));
> +             uchunk = min(ubytes, mchunk);
> +
> +             if (uchunk) {
> +                     /* For file based kexec, source pages are in kernel memory */
> +                     if (image->file_mode)
> +                             memcpy(ptr, kbuf, uchunk);
> +                     else
> +                             result = copy_from_user(ptr, buf, uchunk);
> +                     ubytes -= uchunk;
> +                     if (image->file_mode)
> +                             kbuf += uchunk;
> +                     else
> +                             buf += uchunk;
> +             }
> +
> +             if (result) {
> +                     result = -EFAULT;
> +                     goto out;
> +             }
> +
> +             ptr    += mchunk;
> +             maddr  += mchunk;
> +             mbytes -= mchunk;
> +
> +             cond_resched();
> +     }
> +out:
> +     return result;
> +}
> +
>
> ...
>
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
>
> ...
>
> @@ -632,6 +635,38 @@ static int kexec_walk_resources(struct kexec_buf *kbuf,
>               return walk_system_ram_res(0, ULONG_MAX, kbuf, func);
>  }
>  
> +static int kexec_alloc_contig(struct kexec_buf *kbuf)
> +{
> +     u32 pages = (u32)(kbuf->memsz >> PAGE_SHIFT);

I don't think the cast is needed?

> +     unsigned long mem;
> +     struct page *p;
> +
> +     if (kbuf->image->no_cma)
> +             return -EPERM;
> +
> +     p = dma_alloc_from_contiguous(NULL, pages, get_order(kbuf->buf_align), true);

dma_alloc_from_contiguous()'s `count' arg is size_t.  Making `pages'
size_t seems best here.  (And nr_pages would be a better identifier!)
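
i.e.:

	size_t nr_pages = kbuf->memsz >> PAGE_SHIFT;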


> +     if (!p)
> +             return -EADDRNOTAVAIL;

EADDRNOTAVAIL is a networking thing.  People will be surprised to see
kexec returning networking errors.  Perhaps choose something more
appropriate?
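
-ENOMEM would seem the natural fit for a failed allocation, e.g.:

	if (!p)
		return -ENOMEM;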

> +     pr_debug("allocated %d DMA pages at 0x%lx", pages, page_to_boot_pfn(p));
> +
> +     mem = page_to_boot_pfn(p) << PAGE_SHIFT;
> +
> +     if (kimage_is_destination_range(kbuf->image, mem, mem + kbuf->memsz)) {
> +             /* Our region is already in use by a statically defined one. Bail out. */
> +             pr_debug("CMA overlaps existing mem: 0x%lx+0x%lx\n", mem, kbuf->memsz);
> +             dma_release_from_contiguous(NULL, p, pages);
> +             return -EADDRNOTAVAIL;
> +     }
> +
> +     kbuf->mem = page_to_boot_pfn(p) << PAGE_SHIFT;
> +     kbuf->cma = p;
> +
> +     arch_kexec_post_alloc_pages(page_address(p), pages, 0);
> +
> +     return 0;
> +}
> +
>
> ...
>
