Hi Jason, Thank you very much for your feedback again.
On Mon, Oct 6, 2025 at 7:14 AM Jason Gunthorpe <[email protected]> wrote: > > #define KHO_FDT_COMPATIBLE "kho-v1" > > We don't bump this? Will do. Will be "kho-v2". > > > -#define PROP_PRESERVED_MEMORY_MAP "preserved-memory-map" > > +#define PROP_PRESERVED_PAGE_RADIX_TREE "preserved-page-radix-tree" > > #define PROP_SUB_FDT "fdt" > > I'de really like to see all of these sorts of definitions in some > structured ABI header not open coded all over the place.. Do you think `include/linux/kexec_handover.h` is the appropriate place, or would you prefer a new, dedicated ABI header (e.g., in `include/uapi/linux/`) for all KHO-related FDT constants? > > > */ > > +struct kho_radix_tree { > > + unsigned long table[PAGE_SIZE / sizeof(unsigned long)]; > > This should be phys_addr_t. > > > +}; > > You dropped the macros so now we don't know these are actually > pointers to 'struct kho_radix_tree' > Agreed. Will change `u64` according to Pasha's comment. And we use explicit casts like `(u64)virt_to_phys(new_tree)` and `(struct kho_radix_tree *)phys_to_virt(table_entry)` in the current series. I believe this, along with the `u64` type, makes it clear that the table stores physical addresses. > > +static int kho_radix_tree_max_depth(void) > > +{ > > + int page_offset_bit_num = BITS_PER_LONG - PAGE_SHIFT; > > + int order_bit_num = ilog2(__roundup_pow_of_two(page_offset_bit_num)); > > + int bitmap_bit_num = PAGE_SHIFT + ilog2(BITS_PER_BYTE); > > + int table_bit_num = ilog2(PAGE_SIZE / sizeof(unsigned long)); > > + int table_level_num = DIV_ROUND_UP(page_offset_bit_num - > > + bitmap_bit_num + order_bit_num, > > + table_bit_num); > > All should be unsigned int. Below I suggest to put it in an enum and > use different names.. And since the function is constant it can just > be an enum TOP_LEVEL too. > Yes I did think of returning a const for `kho_radix_tree_max_depth()`. I think using enums is a better idea and I can place all above values as enums. > > + */ > > +static unsigned long kho_radix_encode(unsigned long pa, unsigned int order) > > pa is phys_addr_t in the kernel, never unsigned long. > > If you want to make it all dynamic then this should be phys_addr_t Should this also be `u64`, or we stay with `phys_addr_t` for all page addresses? > > > +{ > > + unsigned long h = 1UL << (BITS_PER_LONG - PAGE_SHIFT - order); > > And this BITS_PER_LONG is confused, it is BITS_PER_PHYS_ADDR_T which > may not exist. > > Use an enum ORDER_0_LG2 maybe I prefer `KHO_RADIX_ORDER_0_BIT_POS` (defined as `BITS_PER_LONG - PAGE_SHIFT`) over `ORDER_0_LG2`, as I think the latter is a bit hard to understand, what do you think? This constant, along with others, will be placed in the enum. > > > + unsigned long l = pa >> (PAGE_SHIFT + order); > > > > + return h | l; > > +} > > > > +static unsigned long kho_radix_decode(unsigned long encoded, unsigned int > > *order) > > Returns phys_addr_t > > > { > > - void *elm, *res; > > + unsigned long order_bit = fls64(encoded); > > unsigned int > > > + unsigned long pa; > > phys_addr_t > > > + *order = BITS_PER_LONG - PAGE_SHIFT - order_bit + 1; > > ORDER_0_LG2 > > > + pa = encoded << (PAGE_SHIFT + *order); > > I'd add a comment that the shift always discards order. > > > + return pa; > > +} > > > > +static unsigned long kho_radix_get_index(unsigned long encoded, int level) > > unsigned int level > > > +{ > > + int table_bit_num = ilog2(PAGE_SIZE / sizeof(unsigned long)); > > + int bitmap_bit_num = PAGE_SHIFT + ilog2(BITS_PER_BYTE); > > Stick all the constants that kho_radix_tree_max_depth() are computing > in an enum instead of recomputing them.. > > > + unsigned long mask; > > + int s; > > unsigned for all of these. > > > + > > + if (level == 1) { > > I think the math is easier if level 0 == bitmap.. > > > + s = 0; > > + mask = (1UL << bitmap_bit_num) - 1; > > + } else { > > + s = ((level - 2) * table_bit_num) + bitmap_bit_num; > > eg here you are doing level-2 which is a bit weird only because of the > arbitary choice to make level=1 be the bitmap. > > I'd also use some different names > > table_bit_num == TABLE_SIZE_LG2 > BITMAP_BIT_NUM = BITMAP_SIZE_LG2 > > Log2 designates the value is 1<<LG2 Good point on the level numbering, we'll switch to 0-based where level 0 is the bitmap. The modulo operations you suggested play nicely with the 0-based numbering too, thanks. Will also update the names and put them in enum. > > > + mask = (1UL << table_bit_num) - 1; > > } > > > > - return elm; > > + return (encoded >> s) & mask; > > It is just: > > return encoded % (1 << BITMAP_SIZE_LG2); > return (encoded >> s) % (1 << TABLE_SIZE_LG2); > > The compiler is smart enough to choose bit logic if that is the > fastest option and the above is more readable. > > > +static int kho_radix_set_bitmap(struct kho_bitmap_table *bit_tlb, unsigned > > long offset) > > { > > + if (!bit_tlb || > > + offset >= PAGE_SIZE * BITS_PER_BYTE) > > + return -EINVAL; > > > > + set_bit(offset, bit_tlb->bitmaps); > > set_bit is an atomic, you want __set_bit() > > > + return 0; > > +} > > > > +static int kho_radix_preserve_page(unsigned long pa, unsigned int order) > > phys_addr_t > > > +{ > > + unsigned long encoded = kho_radix_encode(pa, order); > > + int num_tree_level = kho_radix_tree_max_depth(); > > kho_radix_tree_max_depth() is constant, stick it in an enum with the > rest of them. > > > + struct kho_radix_tree *current_tree, *new_tree; > > + struct kho_bitmap_table *bitmap_table; > > + int err = 0; > > + int i, idx; > > various unsigned int. > > > > > + down_write(&kho_radix_tree_root_sem); > > > > + current_tree = kho_radix_tree_root; > > > > + /* Go from high levels to low levels */ > > + for (i = num_tree_level; i >= 1; i--) { > > + idx = kho_radix_get_index(encoded, i); > > + > > + if (i == 1) { > > + bitmap_table = (struct kho_bitmap_table > > *)current_tree; > > + err = kho_radix_set_bitmap(bitmap_table, idx); > > + goto out; > > + } > > + > > + if (!current_tree->table[idx]) { > > + new_tree = kho_alloc_radix_tree(); > > + if (!new_tree) { > > + err = -ENOMEM; > > + goto out; > > + } > > + > > + current_tree->table[idx] = > > + (unsigned long)virt_to_phys(new_tree); > > current_tree = new_tree > > + } > > else > > > + > > + current_tree = (struct kho_radix_tree *) > > + phys_to_virt(current_tree->table[idx]); > > } > > + > > +out: > > + up_write(&kho_radix_tree_root_sem); > > + return err; > > } > > > > +static int kho_radix_walk_bitmaps(struct kho_bitmap_table *bit_tlb, > > + unsigned long offset, > > phys_addr_t > > > + kho_radix_tree_walk_callback_t cb) > > { > > + unsigned long encoded = offset << (PAGE_SHIFT + ilog2(BITS_PER_BYTE)); > > + unsigned long *bitmap = (unsigned long *)bit_tlb; > > + int err = 0; > > + int i; > > > > + for_each_set_bit(i, bitmap, PAGE_SIZE * BITS_PER_BYTE) { > > + err = cb(encoded | i); > > + if (err) > > + return err; > > + } > > > > + return 0; > > +} > > > > +static int kho_radix_walk_trees(struct kho_radix_tree *root, int level, > > unsigned int > > > + unsigned long offset, > > phys_addr_t. I would call this start not offset.. > > > + kho_radix_tree_walk_callback_t cb) > > +{ > > + int level_shift = ilog2(PAGE_SIZE / sizeof(unsigned long)); > > + struct kho_radix_tree *next_tree; > > + unsigned long encoded, i; > > + int err = 0; > > > > + if (level == 1) { > > + encoded = offset; > > + return kho_radix_walk_bitmaps((struct kho_bitmap_table *)root, > > + encoded, cb); > > Better to do this in the caller a few lines below But the caller is in a different tree level? Should we only walk the bitmaps at the lowest level? > > > + } > > > > > + for (i = 0; i < PAGE_SIZE / sizeof(unsigned long); i++) { > > + if (root->table[i]) { > > + encoded = offset << level_shift | i; > > This doesn't seem right.. > > The argument to the walker should be the starting encoded of the table > it is about to walk. > > Since everything always starts at 0 it should always be > start | (i << level_shift) > > ? You're right that this line might not be immediately intuitive. The var `level_shift` (which is constant value 9 here) is applied to the *accumulated* `offset` from the parent level. Let's consider an example of a preserved page at physical address `0x1000`, which encodes to `0x10000000000001` (bit 52 is set for order 0, bit 0 is set for page 1). If we were to use `start | (i << level_shift)` where `level_shift` is a constant 9, and `start` is the value from the parent call: - At Level 6, `start` is `0`. `i` is 2 as bit 51:59 = 2. Result: `0 | (2 << 9) = 0x400`. This is passed to Level 5. - At Level 5, `start` is `0x400`, `i` is 0 as bit 50:42 = 0. Result: `0x400 | (0 << 9) = 0x400`. This is passed to Level 4. - At Level 4, `start` is `0x400`, `i` is 0 as bit 33:41 = 0. Result: `0x400 | (0 << 9) = 0x400`. And so on. As you can see, the encoded value is not correctly reconstructed. This will work if the `level_shift` represents the total shift from the LSB for each specific level, but it is not the case here. I will, however, add a detailed comment to `kho_radix_walk_trees` to clarify this logic. I also agree to change the name of `offset` to make it more clearer, how about `base_encoded`, or do you still prefer `start`? > > > + next_tree = (struct kho_radix_tree *) > > + phys_to_virt(root->table[i]); > > + err = kho_radix_walk_trees(next_tree, level - 1, > > encoded, cb); > > if (err) > > return err; > > } > > } > > > > + return 0; > > +} > > > > +static int kho_memblock_reserve(phys_addr_t pa, int order) > > +{ > > + int sz = 1 << (order + PAGE_SHIFT); > > + struct page *page = phys_to_page(pa); > > + > > + memblock_reserve(pa, sz); > > + memblock_reserved_mark_noinit(pa, sz); > > + page->private = order; > > > > return 0; > > } > > > > +static int kho_radix_walk_trees_callback(unsigned long encoded) > > +{ > > + unsigned int order; > > + unsigned long pa; > > + > > + pa = kho_radix_decode(encoded, &order); > > + > > + return kho_memblock_reserve(pa, order); > > +} > > + > > +struct kho_serialization { > > + struct page *fdt; > > + struct list_head fdt_list; > > + struct dentry *sub_fdt_dir; > > +}; > > + > > +static int __kho_preserve_order(unsigned long pfn, unsigned int order) > > +{ > > + unsigned long pa = PFN_PHYS(pfn); > > phys_addr_t > > Jason Will do the update in the next patch version. Thanks again. -- Jason Miu
