Hi Mike,

On Mon, Sep 15 2025, Mike Rapoport wrote:
> On Mon, Sep 08, 2025 at 08:12:59PM +0200, Pratyush Yadav wrote:
>> On Mon, Sep 08 2025, Mike Rapoport wrote:
>> 
>> > From: "Mike Rapoport (Microsoft)" <[email protected]>
>> >
>> > A vmalloc allocation is preserved using a binary structure similar
>> > to the global KHO memory tracker. It's a linked list of pages where
>> > each page is an array of physical addresses of pages in the vmalloc
>> > area.
>> >
>> > kho_preserve_vmalloc() hands out the physical address of the head
>> > page to the caller. This address is used as the argument to
>> > kho_vmalloc_restore() to restore the mapping in the vmalloc address
>> > space and populate it with the preserved pages.
>> >
>> > Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>

[...]

>> > +	while (chunk) {
>> > +		struct page *page;
>> > +
>> > +		for (int i = 0; i < chunk->hdr.num_elms; i++) {
>> > +			phys_addr_t phys = chunk->phys[i];
>> > +
>> > +			for (int j = 0; j < (1 << order); j++) {
>> > +				page = phys_to_page(phys);
>> > +				kho_restore_page(page, 0);
>> > +				pages[idx++] = page;
>> 
>> This can buffer-overflow if the previous kernel was buggy and added
>> too many pages. Perhaps keep a check for this?
> 
> You mean it added more than total_pages?
> But the preserve part adds exactly vm->nr_pages, so once we get it
> right what bugs do you expect here?

The main reason is defensive programming. My mental model is to treat
the data from the previous kernel as external input, and so we should
always validate it. The kernel that did the preserve and the kernel
that does the restore are very different, and between them either may
have a bug or some incompatibility/disagreement on the data format.
Catching these inconsistencies and failing gracefully is a lot better
than silently corrupting memory and having hard-to-catch bugs. No one
plans to add bugs, but they always show up somehow :-)

And we do a lot of that in KHO already. See the FDT parsing in memblock
preservation for example.
	p_start = fdt_getprop(fdt, offset, "start", &len_start);
	p_size = fdt_getprop(fdt, offset, "size", &len_size);
	if (!p_start || len_start != sizeof(*p_start) || !p_size ||
	    len_size != sizeof(*p_size)) {
		return false;
	}

[...]

It checks that the FDT contains sane data or errors out otherwise.
Similar checks exist in other places too.

> 
>> > +				phys += PAGE_SIZE;
>> > +			}
>> > +		}
>> > +
>> > +		page = virt_to_page(chunk);
>> > +		chunk = KHOSER_LOAD_PTR(chunk->hdr.next);
>> > +		kho_restore_page(page, 0);
>> > +		__free_page(page);
>> > +	}
>> > +
>> > +	area = __get_vm_area_node(nr * PAGE_SIZE, align, shift, flags,
>> > +				  VMALLOC_START, VMALLOC_END, NUMA_NO_NODE,
>> > +				  GFP_KERNEL, __builtin_return_address(0));
>> > +	if (!area)
>> > +		goto err_free_pages_array;
>> > +
>> > +	addr = (unsigned long)area->addr;
>> > +	size = get_vm_area_size(area);
>> > +	err = vmap_pages_range(addr, addr + size, PAGE_KERNEL, pages, shift);
>> > +	if (err)
>> > +		goto err_free_vm_area;
>> > +
>> > +	return area->addr;
>> 
>> You should free the pages array before returning here.
> 
> Why? They get into vm->pages.

Do they? I don't see any path in vmap_pages_range() taking a reference
to the pages array. They use the array, but never take a reference to
it.

The core of vmap_pages_range() is in __vmap_pages_range_noflush(). That
function has two paths:

	if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
			page_shift == PAGE_SHIFT)
		return vmap_small_pages_range_noflush(addr, end, prot, pages);

	for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
		int err;

		err = vmap_range_noflush(addr, addr + (1UL << page_shift),
					page_to_phys(pages[i]), prot,
					page_shift);
		if (err)
			return err;

		addr += 1UL << page_shift;
	}

	return 0;

The latter path (the for loop) _uses_ pages but never takes a reference
to it. The former, vmap_small_pages_range_noflush(), goes through the
sequence of functions vmap_pages_{p4d,pud,pmd}_range(), eventually
landing in vmap_pages_pte_range().
That function's only use of the pages array is the below:

	do {
		struct page *page = pages[*nr];
[...]
	} while (pte++, addr += PAGE_SIZE, addr != end);

So I don't see anything setting area->pages here.

Note that vmap() does set area->pages if given the VM_MAP_PUT_PAGES
flag. It also calls vmap_pages_range() first and then sets area->pages
and area->nr_pages. So if you want to mimic that behaviour, I think you
have to explicitly make these assignments in kho_restore_vmalloc().
Similarly, __vmalloc_area_node() sets area->pages via
__vmalloc_node_noprof() and then calls vmap_pages_range() (it also sets
area->nr_pages).

Side note: would it be a better idea to teach vmap() to take higher
order pages instead, to make this path simpler? I don't know the subtle
differences between these mapping primitives well enough to know if
that makes sense. It seems to do pretty much what this function is
doing, with the exception of only dealing with 0-order pages.

-- 
Regards,
Pratyush Yadav
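P.S. To illustrate the vmap()-style ownership handover I mean, here is
a tiny userspace model. Everything here is a stand-in with made-up
names (vm_struct_model, finish_restore(), map_pages_model()); the only
point is the ordering: map first, and only on success let the area take
ownership of the pages array, the way vmap() behaves under
VM_MAP_PUT_PAGES.

```c
#include <assert.h>
#include <stdlib.h>

struct page;			/* opaque stand-in */

/* stand-in for the relevant members of struct vm_struct */
struct vm_struct_model {
	struct page **pages;
	unsigned int nr_pages;
};

/* stand-in for a vmap_pages_range() call; always succeeds here */
static int map_pages_model(struct page **pages, unsigned int nr)
{
	(void)pages;
	(void)nr;
	return 0;
}

/*
 * Modeled tail of kho_restore_vmalloc(): once the mapping succeeds,
 * hand the pages array over to the area instead of leaking it. On
 * failure the caller keeps ownership and must free the array itself.
 */
static int finish_restore(struct vm_struct_model *area,
			  struct page **pages, unsigned int nr)
{
	int err = map_pages_model(pages, nr);

	if (err)
		return err;	/* caller still owns pages[] */

	/* like vmap() with VM_MAP_PUT_PAGES: area now owns the array */
	area->pages = pages;
	area->nr_pages = nr;
	return 0;
}
```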
