On Thu, 21 May 2026 12:01:17 +0800 "Li Zhe" <[email protected]> wrote:

> memmap_init_zone_device() can spend a substantial amount of time
> initializing large ZONE_DEVICE ranges because it repeats nearly
> identical struct page setup for every PFN.
> 
> This series reduces that overhead in seven steps.

Cool, thanks, we all love speedups.

> The first patch factors the reusable pieces out of
> __init_zone_device_page() so later patches can share the same logic
> without changing the existing slow path.
> 
> The second patch adds set_page_section_from_pfn(), so generic callers
> can update section bits from a PFN without open-coding
> SECTION_IN_PAGE_FLAGS checks.
> 
> The third patch adds a template-based fast path for ZONE_DEVICE head
> pages. Instead of rebuilding the same struct page state for every PFN,
> it prepares a reusable page template once and copies it to each
> destination page.
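
For anyone skimming, here is a minimal userspace sketch of the template idea
described above. It is not code from the series: struct fake_page,
init_slow() and init_template() are hypothetical names, and the four-field
layout is an assumption chosen only to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct page; NOT the kernel layout. */
struct fake_page {
	uint64_t flags;    /* same for every page in the range */
	uint64_t pgmap;    /* same for every page in the range */
	uint64_t pfn_bits; /* the only PFN-dependent field here */
	uint64_t refcount; /* same for every page in the range */
};

/* Slow path: rebuild every field from scratch for each PFN. */
static void init_slow(struct fake_page *pages, uint64_t start_pfn,
		      size_t nr, uint64_t flags, uint64_t pgmap)
{
	assert(pages != NULL);
	for (size_t i = 0; i < nr; i++) {
		memset(&pages[i], 0, sizeof(pages[i]));
		pages[i].flags = flags;
		pages[i].pgmap = pgmap;
		pages[i].pfn_bits = start_pfn + i;
		pages[i].refcount = 1;
	}
}

/* Fast path: build the invariant state once, then copy it per PFN. */
static void init_template(struct fake_page *pages, uint64_t start_pfn,
			  size_t nr, uint64_t flags, uint64_t pgmap)
{
	struct fake_page tmpl;

	assert(pages != NULL);
	memset(&tmpl, 0, sizeof(tmpl));
	tmpl.flags = flags;
	tmpl.pgmap = pgmap;
	tmpl.refcount = 1;

	for (size_t i = 0; i < nr; i++) {
		tmpl.pfn_bits = start_pfn + i; /* refresh PFN-dependent part */
		memcpy(&pages[i], &tmpl, sizeof(tmpl));
	}
}
```

Both functions should leave the array in the same state; the fast path just
replaces the per-field stores with one fixed-size copy per page.
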
> 
> The fourth patch extends the same template-based approach to compound
> tails, so pfns_per_compound > 1 can also benefit from the fast path.
> 
> The fifth patch introduces memcpy_streaming() and
> memcpy_streaming_drain() as a generic interface for write-once
> streaming copies, with a memcpy() fallback for architectures that do
> not provide a specialized backend.
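
A rough userspace illustration of a streaming copy with a memcpy() fallback
follows. The cover letter does not give the kernel signatures, so
sketch_memcpy_streaming() and sketch_memcpy_streaming_drain() are
hypothetical names and the shapes are assumptions. On x86-64 it uses the
SSE2 non-temporal store intrinsic _mm_stream_si64(); everywhere else it
degrades to plain memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && defined(__SSE2__)
#include <emmintrin.h>
#define SKETCH_HAVE_STREAMING 1
#else
#define SKETCH_HAVE_STREAMING 0
#endif

/* Write-once copy: use non-temporal stores where available. */
static void sketch_memcpy_streaming(void *dst, const void *src, size_t len)
{
	assert(dst != NULL && src != NULL);
#if SKETCH_HAVE_STREAMING
	unsigned char *d = dst;
	const unsigned char *s = src;

	/* Copy whole 64-bit words with non-temporal stores. */
	while (len >= sizeof(long long)) {
		long long word;

		memcpy(&word, s, sizeof(word));
		_mm_stream_si64((long long *)d, word);
		d += sizeof(word);
		s += sizeof(word);
		len -= sizeof(word);
	}
	/* Any tail shorter than a word goes through regular stores. */
	if (len)
		memcpy(d, s, len);
#else
	/* No specialized backend: an ordinary copy is the fallback. */
	memcpy(dst, src, len);
#endif
}

/* Order earlier streaming stores before any later normal stores. */
static void sketch_memcpy_streaming_drain(void)
{
#if SKETCH_HAVE_STREAMING
	_mm_sfence();
#endif
}
```

The caller is expected to invoke the drain helper once after a batch of
copies and before touching the destination with ordinary stores.
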
> 
> The sixth patch extends x86 memcpy_flushcache() small fixed-size
> fastpaths so struct-page-sized streaming copies can stay on the inline
> path.
> 
> The last patch switches the zone-device template-copy path over to
> memcpy_streaming(). It refreshes PFN-dependent fields in the reusable
> template before each copy, keeps pageblock-aligned PFNs on regular
> memcpy(), and drains streaming stores before later normal stores update
> overlapping or dependent metadata.
> 
> The optimized path is disabled when the page_ref_set tracepoint is
> enabled, sanitized builds remain on the slow path so their
> instrumented stores are preserved, and the fast path falls back to the
> existing slow path if sizeof(struct page) is not an integral number of
> u64 words.
> 
> Testing
> =======
> 
> Tests were run in a VM on an Intel Ice Lake server.
> 
> Two PMEM configurations were used:
>   - a 100 GB fsdax namespace configured with map=dev, which exercises
>     the nd_pmem rebind path (pfns_per_compound == 1)
>   - a 100 GB devdax namespace configured with align=2097152, which
>     exercises the dax_pmem rebind path (pfns_per_compound > 1)
> 
> For each configuration, the corresponding driver was unbound and
> rebound 30 times. Memmap initialization latency was collected from the
> pr_debug() output of memmap_init_zone_device().
> 
> The first bind is reported separately, and the average of subsequent
> rebinds is used as the steady-state result.

How closely does this workload resemble any real-world user workload?

> Performance
> ===========
> 
> nd_pmem rebind, 100 GB fsdax namespace, map=dev
>   Base(v7.1-rc3):
>     First binding: 1486 ms
>     Average of subsequent rebinds: 273.52 ms
>   With patches 1-3 applied:
>     First binding: 1422 ms
>     Average of subsequent rebinds: 245.73 ms
>   Full series:
>     First binding: 1389 ms
>     Average of subsequent rebinds: 111.08 ms
> 
> dax_pmem rebind, 100 GB devdax namespace, align=2097152
>   Base(v7.1-rc3):
>     First binding: 1515 ms
>     Average of subsequent rebinds: 313.45 ms
>   With patches 1-4 applied:
>     First binding: 1422 ms
>     Average of subsequent rebinds: 256.56 ms
>   Full series:
>     First binding: 1294 ms
>     Average of subsequent rebinds: 110.24 ms
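
To put the figures above on one scale, a small helper like the following
(hypothetical name percent_reduction(), not part of the series) turns a
before/after latency pair into a percentage reduction. Applied to the
steady-state rebind averages quoted above, it gives roughly 59% for nd_pmem
and roughly 65% for dax_pmem.

```c
#include <assert.h>

/* Percentage by which "after" is lower than "before". */
static double percent_reduction(double before_ms, double after_ms)
{
	assert(before_ms > 0.0);
	return (before_ms - after_ms) / before_ms * 100.0;
}
```

For example, percent_reduction(273.52, 111.08) covers the nd_pmem rebind
case and percent_reduction(313.45, 110.24) the dax_pmem one.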

The improvements appear to range between "modest" and "large", but what
I'd like to understand is how frequently real-world users are using
these operations in real-world workloads.

IOW, (and this is always the bottom line), how valuable is this
patchset to our users?

>   mm: factor zone-device page init helpers out of
>     __init_zone_device_page
>   mm: add a set_page_section_from_pfn() helper
>   mm: add a template-based fast path for zone-device page init
>   mm: extend the template fast path to zone-device compound tails
>   string: introduce memcpy_streaming() helpers
>   x86/string: extend memcpy_flushcache() fixed-size fastpaths
>   mm: use memcpy_streaming() in zone-device template copies
> 
>  arch/x86/include/asm/string_64.h | 100 +++++++++++++---
>  include/linux/mm.h               |  19 ++-
>  include/linux/string.h           |  18 +++
>  mm/mm_init.c                     | 198 +++++++++++++++++++++++++++----
>  4 files changed, 294 insertions(+), 41 deletions(-)

I won't take any action at this stage - let's await reviewer input.  If
none is forthcoming then please remind me and I'll figure out what to
do.

The ever-present reviewer called "Sashiko" has thoughts to offer:

        https://sashiko.dev/#/patchset/[email protected]

Please take a look and decide whether there's useful material in there.

