Re: [PATCH v2 0/7] mm: speed up ZONE_DEVICE memmap initialization

Li Zhe Fri, 22 May 2026 00:51:29 -0700

On Thu, 21 May 2026 15:20:29 -0700, [email protected] wrote:

> On Thu, 21 May 2026 12:01:17 +0800 "Li Zhe" <[email protected]> wrote:
> 
> > memmap_init_zone_device() can spend a substantial amount of time
> > initializing large ZONE_DEVICE ranges because it repeats nearly
> > identical struct page setup for every PFN.
> >
> > This series reduces that overhead in seven steps.
> 
> Cool, thanks, we all love speedups.
> 
> > The first patch factors the reusable pieces out of
> > __init_zone_device_page() so later patches can share the same logic
> > without changing the existing slow path.
> >
> > The second patch adds set_page_section_from_pfn(), so generic callers
> > can update section bits from a PFN without open-coding
> > SECTION_IN_PAGE_FLAGS checks.
> >
> > The third patch adds a template-based fast path for ZONE_DEVICE head
> > pages. Instead of rebuilding the same struct page state for every PFN,
> > it prepares a reusable page template once and copies it to each
> > destination page.
> >
> > The fourth patch extends the same template-based approach to compound
> > tails, so pfns_per_compound > 1 can also benefit from the fast path.
> >
> > The fifth patch introduces memcpy_streaming() and
> > memcpy_streaming_drain() as a generic interface for write-once
> > streaming copies, with a memcpy() fallback for architectures that do
> > not provide a specialized backend.
> >
> > The sixth patch extends x86 memcpy_flushcache() small fixed-size
> > fastpaths so struct-page-sized streaming copies can stay on the inline
> > path.
> >
> > The last patch switches the zone-device template-copy path over to
> > memcpy_streaming(). It refreshes PFN-dependent fields in the reusable
> > template before each copy, keeps pageblock-aligned PFNs on regular
> > memcpy(), and drains streaming stores before later normal stores update
> > overlapping or dependent metadata.
> >
> > The optimized path is disabled when the page_ref_set tracepoint is
> > enabled, sanitized builds remain on the slow path so their
> > instrumented stores are preserved, and the fast path falls back to the
> > existing slow path if sizeof(struct page) is not an integral number of
> > u64 words.
> >
> > Testing
> > =======
> >
> > Tests were run in a VM on an Intel Ice Lake server.
> >
> > Two PMEM configurations were used:
> >   - a 100 GB fsdax namespace configured with map=dev, which exercises
> >     the nd_pmem rebind path (pfns_per_compound == 1)
> >   - a 100 GB devdax namespace configured with align=2097152, which
> >     exercises the dax_pmem rebind path (pfns_per_compound > 1)
> >
> > For each configuration, the corresponding driver was unbound and
> > rebound 30 times. Memmap initialization latency was collected from the
> > pr_debug() output of memmap_init_zone_device().
> >
> > The first bind is reported separately, and the average of subsequent
> > rebinds is used as the steady-state result.
> 
> How closely does this workload resemble any real-world user workload?


Not directly. The unbind/rebind loop is mainly a controlled, repeatable
way to exercise the memmap_init_zone_device() path with minimal
unrelated noise.
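
As a rough illustration of how the steady-state figure is derived from
the per-bind latencies (first bind reported separately, remaining
rebinds averaged), here is a minimal sketch. The helper name
steady_state_avg_ms() and the array layout are assumptions made for this
example only; they are not part of the series or of any test script.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: average the latencies of every bind except the
 * first one. samples_ms[0] is the first bind, samples_ms[1..n-1] are
 * the subsequent rebinds.
 */
static double steady_state_avg_ms(const double *samples_ms, size_t n)
{
	double sum = 0.0;
	size_t i;

	assert(n > 1);	/* need at least one rebind after the first bind */
	for (i = 1; i < n; i++)
		sum += samples_ms[i];
	return sum / (double)(n - 1);
}
```

Feeding it the latencies parsed from the pr_debug() output, in bind
order, yields the "average of subsequent rebinds" value.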

> > Performance
> > ===========
> >
> > nd_pmem rebind, 100 GB fsdax namespace, map=dev
> >   Base(v7.1-rc3):
> >     First binding: 1486 ms
> >     Average of subsequent rebinds: 273.52 ms
> >   With patches 1-3 applied:
> >     First binding: 1422 ms
> >     Average of subsequent rebinds: 245.73 ms
> >   Full series:
> >     First binding: 1389 ms
> >     Average of subsequent rebinds: 111.08 ms
> >
> > dax_pmem rebind, 100 GB devdax namespace, align=2097152
> >   Base(v7.1-rc3):
> >     First binding: 1515 ms
> >     Average of subsequent rebinds: 313.45 ms
> >   With patches 1-4 applied:
> >     First binding: 1422 ms
> >     Average of subsequent rebinds: 256.56 ms
> >   Full series:
> >     First binding: 1294 ms
> >     Average of subsequent rebinds: 110.24 ms
> 
> The improvements appear to range between "modest" and "large", but what
> I'd like to understand is how frequently real-world users are using
> these operations in real-world workloads.
> 
> IOW, (and this is always the bottom line), how valuable is this
> patchset to our users?

This is not a steady-state data-path optimization. Its value is in pmem
bring-up paths. In our deployment we do have scenarios where multiple
pmem devices are hotplugged, so reducing this latency is useful in
practice for us.
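
To put the quoted rebind numbers in relative terms: going from 273.52 ms
to 111.08 ms (nd_pmem) is a reduction of roughly 59%, and going from
313.45 ms to 110.24 ms (dax_pmem) is roughly 65%. The arithmetic is
sketched below. reduction_pct() is a name invented for this
illustration.

```c
#include <assert.h>

/*
 * Hypothetical helper: percentage by which new_ms is lower than
 * base_ms. A positive result means the new value is faster.
 */
static double reduction_pct(double base_ms, double new_ms)
{
	assert(base_ms > 0.0);
	return (base_ms - new_ms) / base_ms * 100.0;
}
```

For example, reduction_pct(273.52, 111.08) covers the nd_pmem
steady-state case and reduction_pct(313.45, 110.24) the dax_pmem one.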

> >   mm: factor zone-device page init helpers out of
> >     __init_zone_device_page
> >   mm: add a set_page_section_from_pfn() helper
> >   mm: add a template-based fast path for zone-device page init
> >   mm: extend the template fast path to zone-device compound tails
> >   string: introduce memcpy_streaming() helpers
> >   x86/string: extend memcpy_flushcache() fixed-size fastpaths
> >   mm: use memcpy_streaming() in zone-device template copies
> >
> >  arch/x86/include/asm/string_64.h | 100 +++++++++++++---
> >  include/linux/mm.h               |  19 ++-
> >  include/linux/string.h           |  18 +++
> >  mm/mm_init.c                     | 198 +++++++++++++++++++++++++++----
> >  4 files changed, 294 insertions(+), 41 deletions(-)
> 
> I won't take any action at this stage - let's await reviewer input.  If
> none is forthcoming then please remind me and I'll figure out what to
> do.
> 
> The ever-present reviewer called "Sashiko" has thoughts to offer:
> 
> https://sashiko.dev/#/patchset/[email protected]
> 
> Please take a look, decide if there's useful material in there.

There is useful material there, mainly around patches 5 and 6.

The x86 backend for memcpy_streaming() should be narrower, and the
expanded memcpy_flushcache() small-copy fastpath should handle only
naturally aligned cases and preserve forward movnti store order.
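
To make the two constraints concrete, here is a userspace sketch of the
shape in question, not the kernel code: take the word-by-word path only
when the destination is naturally aligned and the length is a whole
number of 64-bit words, issue the stores in ascending address order, and
fall back to memcpy() otherwise. copy_words_forward() is a hypothetical
name, and the plain C stores merely stand in for the movnti stores a
real x86 backend would emit from inline asm; a compiler is free to
reorder or merge plain stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical sketch: copy len bytes as 64-bit words, lowest address
 * first, but only for an 8-byte-aligned dst and a len that is a
 * multiple of 8. Returns 1 if the word path was taken, 0 if the copy
 * fell back to plain memcpy().
 */
static int copy_words_forward(void *dst, const void *src, size_t len)
{
	uint64_t *d = dst;
	size_t i, words = len / sizeof(uint64_t);

	assert(dst && src);
	if (((uintptr_t)dst % sizeof(uint64_t)) != 0 ||
	    (len % sizeof(uint64_t)) != 0) {
		memcpy(dst, src, len);
		return 0;
	}

	for (i = 0; i < words; i++) {
		uint64_t w;

		/* Load via memcpy() so src alignment is not assumed. */
		memcpy(&w, (const char *)src + i * sizeof(w), sizeof(w));
		/* Stand-in for a movnti store to d[i]. */
		d[i] = w;
	}
	return 1;
}
```

The return value is only there so a caller or test can see which path
was taken.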

I'll address those points in the next revision and rerun the benchmarks.

Thanks,
Zhe
