On Thu, 21 May 2026 15:20:29 -0700, [email protected] wrote:
> On Thu, 21 May 2026 12:01:17 +0800 "Li Zhe" <[email protected]> wrote:
>
> > memmap_init_zone_device() can spend a substantial amount of time
> > initializing large ZONE_DEVICE ranges because it repeats nearly
> > identical struct page setup for every PFN.
> >
> > This series reduces that overhead in seven steps.
>
> Cool, thanks, we all love speedups.
>
> > The first patch factors the reusable pieces out of
> > __init_zone_device_page() so later patches can share the same logic
> > without changing the existing slow path.
> >
> > The second patch adds set_page_section_from_pfn(), so generic callers
> > can update section bits from a PFN without open-coding
> > SECTION_IN_PAGE_FLAGS checks.
> >
> > The third patch adds a template-based fast path for ZONE_DEVICE head
> > pages. Instead of rebuilding the same struct page state for every PFN,
> > it prepares a reusable page template once and copies it to each
> > destination page.
> >
> > The fourth patch extends the same template-based approach to compound
> > tails, so pfns_per_compound > 1 can also benefit from the fast path.
> >
> > The fifth patch introduces memcpy_streaming() and
> > memcpy_streaming_drain() as a generic interface for write-once
> > streaming copies, with a memcpy() fallback for architectures that do
> > not provide a specialized backend.
> >
> > The sixth patch extends x86 memcpy_flushcache() small fixed-size
> > fastpaths so struct-page-sized streaming copies can stay on the inline
> > path.
> >
> > The last patch switches the zone-device template-copy path over to
> > memcpy_streaming(). It refreshes PFN-dependent fields in the reusable
> > template before each copy, keeps pageblock-aligned PFNs on regular
> > memcpy(), and drains streaming stores before later normal stores update
> > overlapping or dependent metadata.
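
For readers skimming the thread, the template-copy idea in the quoted
description can be illustrated with a minimal user-space sketch. This is
not the kernel code: struct fake_page, fake_init_slow() and
fake_init_template() are hypothetical names invented for illustration,
and plain memcpy() stands in for the generic fallback that the cover
letter describes for memcpy_streaming().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct page; NOT the kernel layout. */
struct fake_page {
	uint64_t flags;     /* state identical for every page in the range */
	uint64_t pfn_bits;  /* stands in for PFN-dependent fields */
	uint64_t pgmap;     /* owner cookie, identical across the range */
	uint64_t refcount;
};

/* Slow path: rebuild every field from scratch for one PFN. */
static void fake_init_slow(struct fake_page *p, uint64_t pfn, uint64_t pgmap)
{
	memset(p, 0, sizeof(*p));
	p->flags = 0x1;
	p->pfn_bits = pfn;
	p->pgmap = pgmap;
	p->refcount = 1;
}

/*
 * Fast path: build the template once, refresh only the PFN-dependent
 * field before each copy, then copy the whole structure. Plain memcpy()
 * plays the role of the generic fallback here.
 */
static void fake_init_template(struct fake_page *pages, uint64_t start_pfn,
			       size_t nr, uint64_t pgmap)
{
	struct fake_page tmpl;
	size_t i;

	/*
	 * The series falls back to the slow path when the structure is not
	 * a whole number of u64 words; this sketch simply asserts instead.
	 */
	assert(sizeof(tmpl) % sizeof(uint64_t) == 0);

	fake_init_slow(&tmpl, start_pfn, pgmap);
	for (i = 0; i < nr; i++) {
		tmpl.pfn_bits = start_pfn + i;
		memcpy(&pages[i], &tmpl, sizeof(tmpl));
	}
}
```

Both paths should leave byte-identical results for the same PFN range;
the sketch only shows where the repeated per-PFN setup is avoided.
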
> >
> > The optimized path is disabled when the page_ref_set tracepoint is
> > enabled, sanitized builds remain on the slow path so their
> > instrumented stores are preserved, and the fast path falls back to the
> > existing slow path if sizeof(struct page) is not an integral number of
> > u64 words.
> >
> > Testing
> > =======
> >
> > Tests were run in a VM on an Intel Ice Lake server.
> >
> > Two PMEM configurations were used:
> > - a 100 GB fsdax namespace configured with map=dev, which exercises
> >   the nd_pmem rebind path (pfns_per_compound == 1)
> > - a 100 GB devdax namespace configured with align=2097152, which
> >   exercises the dax_pmem rebind path (pfns_per_compound > 1)
> >
> > For each configuration, the corresponding driver was unbound and
> > rebound 30 times. Memmap initialization latency was collected from the
> > pr_debug() output of memmap_init_zone_device().
> >
> > The first bind is reported separately, and the average of subsequent
> > rebinds is used as the steady-state result.
>
> How closely does this workload resemble any real-world user workload?

Not directly. The unbind/rebind loop is mainly a controlled and
repeatable way to measure the memmap_init_zone_device() path with
minimal unrelated noise.

> > Performance
> > ===========
> >
> > nd_pmem rebind, 100 GB fsdax namespace, map=dev
> >   Base(v7.1-rc3):
> >     First binding: 1486 ms
> >     Average of subsequent rebinds: 273.52 ms
> >   With patches 1-3 applied:
> >     First binding: 1422 ms
> >     Average of subsequent rebinds: 245.73 ms
> >   Full series:
> >     First binding: 1389 ms
> >     Average of subsequent rebinds: 111.08 ms
> >
> > dax_pmem rebind, 100 GB devdax namespace, align=2097152
> >   Base(v7.1-rc3):
> >     First binding: 1515 ms
> >     Average of subsequent rebinds: 313.45 ms
> >   With patches 1-4 applied:
> >     First binding: 1422 ms
> >     Average of subsequent rebinds: 256.56 ms
> >   Full series:
> >     First binding: 1294 ms
> >     Average of subsequent rebinds: 110.24 ms
>
> The improvements appear to range between "modest" and "large", but what
> I'd like to understand is how frequently real-world users are using
> these operations in real-world workloads.
>
> IOW, (and this is always the bottom line), how valuable is this
> patchset to our users?

This is not a steady-state data-path optimization. Its value is in pmem
bring-up paths, and in our deployment we do have scenarios where
multiple pmem devices are hotplugged, so reducing this latency is useful
in practice for us.

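To put the quoted numbers in relative terms: the full series cuts the
average rebind latency by roughly 59% for nd_pmem and 65% for dax_pmem,
and the first bind by about 6.5% and 14.6% respectively. The small
sketch below just redoes that arithmetic from the figures above; the
struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* One row per measurement, copied from the quoted results. */
struct rebind_result {
	const char *name;
	double base_ms;  /* v7.1-rc3 baseline */
	double full_ms;  /* full series applied */
};

static const struct rebind_result results[] = {
	{ "nd_pmem steady-state",  273.52, 111.08 },
	{ "dax_pmem steady-state", 313.45, 110.24 },
	{ "nd_pmem first bind",    1486.0, 1389.0 },
	{ "dax_pmem first bind",   1515.0, 1294.0 },
};

/* Relative latency reduction in percent: (base - new) / base * 100. */
static double reduction_pct(double base_ms, double new_ms)
{
	assert(base_ms > 0.0);
	return (base_ms - new_ms) / base_ms * 100.0;
}

/* Print one line per measurement with its percentage reduction. */
static void print_reductions(void)
{
	size_t i;

	for (i = 0; i < sizeof(results) / sizeof(results[0]); i++)
		printf("%-22s %6.1f%%\n", results[i].name,
		       reduction_pct(results[i].base_ms, results[i].full_ms));
}
```
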
> >   mm: factor zone-device page init helpers out of
> >     __init_zone_device_page
> >   mm: add a set_page_section_from_pfn() helper
> >   mm: add a template-based fast path for zone-device page init
> >   mm: extend the template fast path to zone-device compound tails
> >   string: introduce memcpy_streaming() helpers
> >   x86/string: extend memcpy_flushcache() fixed-size fastpaths
> >   mm: use memcpy_streaming() in zone-device template copies
> >
> >  arch/x86/include/asm/string_64.h | 100 +++++++++++++---
> >  include/linux/mm.h               |  19 ++-
> >  include/linux/string.h           |  18 +++
> >  mm/mm_init.c                     | 198 +++++++++++++++++++++++++++----
> >  4 files changed, 294 insertions(+), 41 deletions(-)
>
> I won't take any action at this stage - let's await reviewer input. If
> none is forthcoming then please remind me and I'll figure out what to
> do.
>
> The ever-present reviewer called "Sashiko" has thoughts to offer:
>
> > https://sashiko.dev/#/patchset/[email protected]
>
> Please take a look, decide if there's useful material in there.

There is useful material there, mainly around patches 5 and 6. The
memcpy_streaming() x86 backend should be narrower, and the expanded
memcpy_flushcache() small-copy fastpath should keep naturally aligned
cases only and preserve forward movnti store order.

I'll address those points in the next revision and rerun the benchmarks.

Thanks,
Zhe

