On Mon, 31 Mar 2025 at 10:38, Steven Rostedt <rost...@goodmis.org> wrote: > > I just did this to be robust in case what was passed in was not aligned. In > all my use cases, it is.
I really think that ALIGN() is absolutely the opposite of robust. It silently just makes incorrect values generate incorrect code. Robust is dealing correctly with incorrect values, not making it incorrect data just create even more incorrect data. > OK, so I did copy this from fs/pstore/ram_core.c as this does basically the > same thing as pstore. And it looks like pstore should be updated too. Please just stop copying code from random drivers or filesystems. Christ. I've said this before: core kernel code has higher quality requirements than random drivers (or even random filesystems). You can't copy crazy incorrect snippets from random sources and put it in core kernel code. > This is due to what the "reserve_mem" kernel command line feature gives > back. It reserves physical memory that is not part of the kernel memory > allocator (it also works with memmap ie. memmap=12M$0x284500000, which also > requires physical memory addresses that are not part of the memory > allocator). Yeah, so in that case turning those into "struct page *" is horrendously wrong, because you've literally taken it away from the memory management. If it's not part of the kernel memory allocator, it does not necessarily have a "struct page *" associated with it. Using a pointer to 'struct page' and passing it off is just fundamentally more than a bug: it's seriously and deeply broken. It's probably much more obviously broken if the physical addresses came from a PCI device, and this all just happens to work because it's actually real RAM and we ended up having a 'struct page []' for it for some random reason. But the generral rule is that if all you have physical RAM addresses, you can use them as phys_addr_t (or turn them into pfn's, which is just the page index of the physical address). I think that *completely* bogus code worked only entirely due to luck - the 'struct page *' was never actually dereferenced, and it got turned back into a pfn and then a page table entry by the vmap code using purely address arithmetic (page_to_phys() and page_to_pfn()). So it probably only ever used pointer calculations, although some of those actually can end up depending on dereferencing 'struct page' (to find the base zone of the page, to find the mapping). Add even with just pointer games, that should actually have triggered a warning in vmap_pages_pte_range() from this: if (WARN_ON(!pfn_valid(page_to_pfn(page)))) which I suspect means that we actually *do* end up having 'struct page' backing for those.. Looking closer... I think what happened is that reserve_mem() actually does end up giving you pages that have the 'struct page' backing store even if they aren't mapped. Which makes the 'struct page' pointer stuff work, and those pages end up working as 'reserved pages'. And I suspect that happens exactly because we had users that mis-used these 'struct page *' things, so it might even be intentional. Or it might just be because that memory *has* been in the E280 memory map before it was reserved, and the 'strict page' arrays may have been sized for the original E280 information, not the "after stealing" information. I didn't go look into that early memory reservation code, but I would not be surprised if there have been fixups to work around bugs like this. I haven't had to look at it in ages, but maybe the bootmem code knows that bootmem allocations then might be given back to the VM and need to have those 'struct page' backing store things Or - and this is also entirely possible - maybe you were just very lucky indeed because the code to allocate the 'struct page' regions ends up intentionally over-allocating and rounding things up. Because of how the buddy allocator works, each 'struct page' array needs to be rounded up to the maximum page allocation, so I think we always allocate the page arrays of a memory zone up to at least PAGE_SIZE << NR_PAGE_ORDERS boundaries (so 8MB-aligned, I think). So there's fundamnetally some slop in there. Anyway, if you allocated them as physical addresses outside of the regular MM code, you should not use 'struct page' in *any* form. You really can't just turn any random physical address into a page, unless it came from the page allocator. And yes, I'm not at all surprised that we'd have other mis-users of this. We have some very historical code that depends on reserved pages going back all the way to linux-0.01 I suspect, because things like the original VGA code knew that the physical addresses were in the BIOS hole region that was backed by 'struct page'. Linus