Hi Mike,

On Mon, Sep 15 2025, Mike Rapoport wrote:
> On Mon, Sep 08, 2025 at 08:12:59PM +0200, Pratyush Yadav wrote:
>> On Mon, Sep 08 2025, Mike Rapoport wrote:
>> 
>> > From: "Mike Rapoport (Microsoft)" <[email protected]>
>> >
>> > A vmalloc allocation is preserved using a binary structure similar
>> > to the global KHO memory tracker. It's a linked list of pages where
>> > each page is an array of physical addresses of pages in the vmalloc
>> > area.
>> >
>> > kho_preserve_vmalloc() hands out the physical address of the head
>> > page to the caller. This address is used as the argument to
>> > kho_vmalloc_restore() to restore the mapping in the vmalloc address
>> > space and populate it with the preserved pages.
>> >
>> > Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>

[...]

>> > +	while (chunk) {
>> > +		struct page *page;
>> > +
>> > +		for (int i = 0; i < chunk->hdr.num_elms; i++) {
>> > +			phys_addr_t phys = chunk->phys[i];
>> > +
>> > +			for (int j = 0; j < (1 << order); j++) {
>> > +				page = phys_to_page(phys);
>> > +				kho_restore_page(page, 0);
>> > +				pages[idx++] = page;
>> 
>> This can buffer-overflow if the previous kernel was buggy and added
>> too many pages. Perhaps keep a check for this?
> 
> You mean it added more than total_pages?
> But the preserve part adds exactly vm->nr_pages, so once we get it
> right what bugs do you expect here?

The main reason is defensive programming. My mental model is to treat
the data from the previous kernel as external input, and so we should
always validate it. The kernel that did the preserve and the kernel
that does the restore are very different, and between them either may
have a bug or some incompatibility/disagreement on the data format.
Catching these inconsistencies and failing gracefully is a lot better
than silently corrupting memory and having hard-to-catch bugs. No one
plans to add bugs, but they always show up somehow :-)

And we do a lot of that in KHO already. See the FDT parsing in memblock
preservation for example.
	p_start = fdt_getprop(fdt, offset, "start", &len_start);
	p_size = fdt_getprop(fdt, offset, "size", &len_size);
	if (!p_start || len_start != sizeof(*p_start) || !p_size ||
	    len_size != sizeof(*p_size)) {
		return false;
	}

[...]

It checks that the FDT contains sane data or errors out otherwise.
Similar checks exist in other places too.

> 
>> > +				phys += PAGE_SIZE;
>> > +			}
>> > +		}
>> > +
>> > +		page = virt_to_page(chunk);
>> > +		chunk = KHOSER_LOAD_PTR(chunk->hdr.next);
>> > +		kho_restore_page(page, 0);
>> > +		__free_page(page);
>> > +	}
>> > +
>> > +	area = __get_vm_area_node(nr * PAGE_SIZE, align, shift, flags,
>> > +				  VMALLOC_START, VMALLOC_END, NUMA_NO_NODE,
>> > +				  GFP_KERNEL, __builtin_return_address(0));
>> > +	if (!area)
>> > +		goto err_free_pages_array;
>> > +
>> > +	addr = (unsigned long)area->addr;
>> > +	size = get_vm_area_size(area);
>> > +	err = vmap_pages_range(addr, addr + size, PAGE_KERNEL, pages, shift);
>> > +	if (err)
>> > +		goto err_free_vm_area;
>> > +
>> > +	return area->addr;
>> 
>> You should free the pages array before returning here.
> 
> Why? They get into vm->pages.

Do they? I don't see any path in vmap_pages_range() taking a reference
to the pages array. They use the array, but never take a reference to
it.

The core of vmap_pages_range() is in __vmap_pages_range_noflush(). That
function has two paths:

	if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
			page_shift == PAGE_SHIFT)
		return vmap_small_pages_range_noflush(addr, end, prot, pages);

	for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
		int err;

		err = vmap_range_noflush(addr, addr + (1UL << page_shift),
					page_to_phys(pages[i]), prot,
					page_shift);
		if (err)
			return err;

		addr += 1UL << page_shift;
	}

	return 0;

The latter path (the for loop) _uses_ pages but never takes a reference
to it. The former, vmap_small_pages_range_noflush(), goes through the
sequence of functions vmap_pages_{p4d,pud,pmd}_range(), eventually
landing in vmap_pages_pte_range().
That function's only use of the pages array is the below:

	do {
		struct page *page = pages[*nr];
[...]
	} while (pte++, addr += PAGE_SIZE, addr != end);

So I don't see anything setting area->pages here.

Note that vmap() does set area->pages if given the VM_MAP_PUT_PAGES
flag. It also calls vmap_pages_range() first and then sets area->pages
and area->nr_pages. So if you want to mimic that behaviour, I think you
have to explicitly make these assignments in kho_restore_vmalloc().
Similarly, __vmalloc_area_node() sets area->pages via
__vmalloc_node_noprof() and then calls vmap_pages_range() (it also sets
area->nr_pages).

Side note: would it be a better idea to teach vmap() to take higher
order pages instead, to make this path simpler? I don't know the subtle
differences between these mapping primitives well enough to know if
that makes sense. It seems to do pretty much what this function is
doing, with the exception of only dealing with 0-order pages.

-- 
Regards,
Pratyush Yadav
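P.S. To illustrate the vmap()-style ownership handover I mean, here is
a tiny userspace model. Everything here is a stand-in with made-up
names (vm_struct_model, finish_restore(), map_pages_model()); the only
point is the ordering: map first, and only on success let the area take
ownership of the pages array, the way vmap() behaves under
VM_MAP_PUT_PAGES.

```c
#include <assert.h>
#include <stdlib.h>

struct page;			/* opaque stand-in */

/* stand-in for the relevant members of struct vm_struct */
struct vm_struct_model {
	struct page **pages;
	unsigned int nr_pages;
};

/* stand-in for a vmap_pages_range() call; always succeeds here */
static int map_pages_model(struct page **pages, unsigned int nr)
{
	(void)pages;
	(void)nr;
	return 0;
}

/*
 * Modeled tail of kho_restore_vmalloc(): once the mapping succeeds,
 * hand the pages array over to the area instead of leaking it. On
 * failure the caller keeps ownership and must free the array itself.
 */
static int finish_restore(struct vm_struct_model *area,
			  struct page **pages, unsigned int nr)
{
	int err = map_pages_model(pages, nr);

	if (err)
		return err;	/* caller still owns pages[] */

	/* like vmap() with VM_MAP_PUT_PAGES: area now owns the array */
	area->pages = pages;
	area->nr_pages = nr;
	return 0;
}
```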
