This series reduces latency in paths that create or recreate
ZONE_DEVICE memmaps, with pmem hotplug as a concrete example.
memmap_init_zone_device() can spend a substantial amount of time
initializing large ZONE_DEVICE ranges because it repeats nearly
identical struct page setup for every PFN.
The series addresses that overhead in eight patches.
The first patch updates a stale comment in __init_zone_device_page() so
it matches the current ZONE_DEVICE refcount rules.
The second patch factors the reusable pieces out of
__init_zone_device_page() so later patches can share the same logic
without changing the existing slow path.
The third patch adds set_page_section_from_pfn(), so generic callers
can update section bits from a PFN without open-coding
SECTION_IN_PAGE_FLAGS handling.
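As a rough illustration of what such a helper encapsulates, the
userspace sketch below derives a section number from a PFN and rewrites
only the section field of a flags word. The sketch_* names, the 15-bit
PFN section shift and the field position at bit 48 are assumptions made
for the example; they are not the kernel's layout or the code in the
patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout for illustration only; real values are per-architecture. */
#define SKETCH_SECTION_IN_PAGE_FLAGS	1
#define SKETCH_PFN_SECTION_SHIFT	15		/* 2^15 PFNs per section */
#define SKETCH_SECTIONS_PGSHIFT		48		/* field starts at bit 48 */
#define SKETCH_SECTIONS_MASK		0xffffULL	/* 16-bit section field */

/* Hypothetical stand-in for struct page. */
struct sketch_page {
	uint64_t flags;
};

/* Replace the section field in flags, leaving all other bits intact. */
static void sketch_set_page_section(struct sketch_page *page, uint64_t section)
{
	page->flags &= ~(SKETCH_SECTIONS_MASK << SKETCH_SECTIONS_PGSHIFT);
	page->flags |= (section & SKETCH_SECTIONS_MASK) << SKETCH_SECTIONS_PGSHIFT;
}

/* Callers pass a PFN; the layout-specific handling stays in one place. */
static void sketch_set_page_section_from_pfn(struct sketch_page *page,
					     uint64_t pfn)
{
	assert(page);
#if SKETCH_SECTION_IN_PAGE_FLAGS
	sketch_set_page_section(page, pfn >> SKETCH_PFN_SECTION_SHIFT);
#else
	(void)pfn;	/* section is not stored in flags: nothing to update */
#endif
}
```

With this shape, a caller that only knows the PFN never needs its own
#ifdef around the section update.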
The fourth patch adds a template-based fast path for ZONE_DEVICE head
pages. Instead of rebuilding the same struct page state for every PFN,
it prepares one reusable head-page template through the existing slow
path, refreshes the PFN-dependent fields before each copy, and then
copies that template into the destination page.
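The template idea can be sketched in userspace as follows. This is a
minimal model, assuming that the only PFN-dependent state is a section
field in flags; struct sketch_page, the sketch_* helpers and the bit
positions are hypothetical and do not reflect the real struct page or
the helpers added by the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct page; not the kernel layout. */
struct sketch_page {
	uint64_t flags;		/* assumed to hold the PFN-dependent bits */
	uint64_t pgmap;		/* identical for every page in the range */
	uint64_t zone_data;	/* identical for every page in the range */
	uint64_t refcount;	/* identical for every page in the range */
};

#define SKETCH_PFN_SECTION_SHIFT	15
#define SKETCH_SECTIONS_PGSHIFT		48
#define SKETCH_SECTIONS_FIELD		(0xffffULL << SKETCH_SECTIONS_PGSHIFT)

static uint64_t sketch_section_bits(uint64_t pfn)
{
	return ((pfn >> SKETCH_PFN_SECTION_SHIFT) << SKETCH_SECTIONS_PGSHIFT) &
	       SKETCH_SECTIONS_FIELD;
}

/* Slow path: rebuild every field of one page from scratch. */
static void sketch_slow_init(struct sketch_page *page, uint64_t pfn,
			     uint64_t pgmap)
{
	memset(page, 0, sizeof(*page));
	page->flags = sketch_section_bits(pfn);
	page->pgmap = pgmap;
	page->refcount = 1;
}

/* Fast path: build one template, then refresh and copy it per PFN. */
static void sketch_fast_init_range(struct sketch_page *pages,
				   uint64_t start_pfn, size_t nr,
				   uint64_t pgmap)
{
	struct sketch_page tmpl;

	assert(pages || !nr);
	if (!nr)
		return;

	/* The template itself comes from the unchanged slow path. */
	sketch_slow_init(&tmpl, start_pfn, pgmap);

	for (size_t i = 0; i < nr; i++) {
		/* Refresh only the PFN-dependent bits before each copy. */
		tmpl.flags &= ~SKETCH_SECTIONS_FIELD;
		tmpl.flags |= sketch_section_bits(start_pfn + i);
		memcpy(&pages[i], &tmpl, sizeof(tmpl));
	}
}
```

Because the template is updated in place before the copy, the
destination page is written exactly once per PFN, which is what later
allows the copy primitive to be swapped without a post-copy fixup.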
The fifth patch extends the same template-based approach to compound
tails, so pfns_per_compound > 1 can also benefit from the fast path.
The sixth patch introduces memcpy_streaming() and
memcpy_streaming_drain() as a generic interface for write-once copies,
with a memcpy() fallback for architectures that do not provide a
specialized backend, or for transfers which an architecture-specific
backend cannot safely handle with non-temporal stores.
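A userspace approximation of that interface shape is shown below. It is
only a sketch: the sketch_* names are hypothetical, the x86-64 branch
uses the SSE2 intrinsics _mm_stream_si64() and _mm_sfence() rather than
the kernel's inline assembly, and the eligibility rule (non-zero length,
8-byte aligned destination, length a multiple of 8) is an assumption
chosen to mirror the "fall back to memcpy()" behaviour described above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && defined(__SSE2__)
#include <emmintrin.h>
#define SKETCH_HAVE_NT_STORES 1
#endif

/*
 * Write-once copy: use non-temporal stores when the whole transfer can
 * stay on that path, otherwise fall back to a regular memcpy().
 */
static void sketch_memcpy_streaming(void *dst, const void *src, size_t len)
{
#ifdef SKETCH_HAVE_NT_STORES
	if (len && !((uintptr_t)dst & 7) && !(len & 7)) {
		long long *d = dst;
		const unsigned char *s = src;

		/* Forward order: lowest address is stored first. */
		for (size_t i = 0; i < len / 8; i++) {
			long long v;

			memcpy(&v, s + i * 8, sizeof(v));
			_mm_stream_si64(&d[i], v);
		}
		return;
	}
#endif
	memcpy(dst, src, len);
}

/* Order earlier streaming stores before later normal stores. */
static void sketch_memcpy_streaming_drain(void)
{
#ifdef SKETCH_HAVE_NT_STORES
	_mm_sfence();
#endif
}
```

On builds without the non-temporal branch, both helpers reduce to
memcpy() and a no-op, so callers can use the pair unconditionally.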
The seventh patch extends the small fixed-size fast paths of x86
memcpy_flushcache(), limited to naturally aligned copies, so that
common struct-page-sized streaming copies stay on the inline movnti
path while preserving forward store order.
The last patch switches the zone-device template-copy path over to
memcpy_streaming(). It keeps pageblock-aligned PFNs on regular memcpy()
so pageblock setup can immediately read page metadata back through the
usual helpers, and it drains streaming stores before later normal
stores update overlapping or dependent compound-page metadata.
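The copy-selection rule can be modelled roughly as below. This is a
sketch under stated assumptions: a pageblock is taken to be 512 pages,
the streaming primitives are stubbed with memcpy() and a no-op, and a
single drain at the end of the range stands in for "before later normal
stores"; the names and the statistics structure are hypothetical and
exist only to make the control flow observable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGEBLOCK_NR_PAGES 512ULL	/* assumed pageblock size */

struct sketch_page { uint64_t words[8]; };	/* stand-in for struct page */

struct sketch_stats {
	size_t regular;		/* copies done with plain memcpy() */
	size_t streaming;	/* copies done with the streaming helper */
	size_t drains;		/* drain calls issued */
};

/* Stubs: a real backend would use non-temporal stores and a fence. */
static void sketch_streaming_copy(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
}

static void sketch_streaming_drain(void)
{
}

static void sketch_copy_range(struct sketch_page *pages, uint64_t start_pfn,
			      size_t nr, const struct sketch_page *tmpl,
			      struct sketch_stats *st)
{
	bool pending = false;

	assert(pages && tmpl && st);
	for (size_t i = 0; i < nr; i++) {
		uint64_t pfn = start_pfn + i;

		if (!(pfn & (SKETCH_PAGEBLOCK_NR_PAGES - 1))) {
			/* Read back right away by pageblock setup. */
			memcpy(&pages[i], tmpl, sizeof(*tmpl));
			st->regular++;
		} else {
			sketch_streaming_copy(&pages[i], tmpl, sizeof(*tmpl));
			st->streaming++;
			pending = true;
		}
	}

	/* Drain before any later normal store touches these pages. */
	if (pending) {
		sketch_streaming_drain();
		st->drains++;
	}
}
```

Only one page per pageblock takes the regular path in this model, so
nearly all copies remain eligible for streaming stores.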
The optimized path is disabled when the page_ref_set tracepoint is
enabled, and sanitized builds remain on the slow path so their
instrumented stores are preserved.
Testing
=======
Tests were run in a VM on an Intel Ice Lake server.
Two PMEM configurations were used:
- a 100 GB fsdax namespace configured with map=dev, which exercises
  the nd_pmem rebind path (pfns_per_compound == 1)
- a 100 GB devdax namespace configured with align=2097152, which
  exercises the dax_pmem rebind path (pfns_per_compound > 1)
For each configuration, the corresponding driver was unbound and
rebound 30 times. Memmap initialization latency was collected from the
pr_debug() output of memmap_init_zone_device().
The first bind is reported separately, and the average of subsequent
rebinds is used as the repeat-run result.
Performance
===========
nd_pmem rebind, 100 GB fsdax namespace, map=dev
  Base (v7.1-rc3):
    First binding: 1486 ms
    Average of subsequent rebinds: 273.52 ms
  With patches 1-4 applied:
    First binding: 1422 ms
    Average of subsequent rebinds: 246.65 ms
  Full series:
    First binding: 1285 ms
    Average of subsequent rebinds: 114.31 ms

dax_pmem rebind, 100 GB devdax namespace, align=2097152
  Base (v7.1-rc3):
    First binding: 1515 ms
    Average of subsequent rebinds: 313.45 ms
  With patches 1-5 applied:
    First binding: 1422 ms
    Average of subsequent rebinds: 240.42 ms
  Full series:
    First binding: 1331 ms
    Average of subsequent rebinds: 99.37 ms
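For readers who want the relative change, the small helper below
computes the percentage reduction between a baseline and a patched
latency. The function name is hypothetical and the helper is only a
convenience for interpreting the figures listed above.

```c
#include <assert.h>

/* Percentage reduction in latency going from base_ms to new_ms. */
static double sketch_reduction_pct(double base_ms, double new_ms)
{
	assert(base_ms > 0.0);
	return (base_ms - new_ms) / base_ms * 100.0;
}
```

Applied to the averages of subsequent rebinds, the full series works
out to roughly a 58.2% reduction for nd_pmem (273.52 ms to 114.31 ms)
and roughly a 68.3% reduction for dax_pmem (313.45 ms to 99.37 ms).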
Li Zhe (8):
  mm: fix stale ZONE_DEVICE refcount comment
  mm: factor zone-device page init helpers out of
    __init_zone_device_page
  mm: add a set_page_section_from_pfn() helper
  mm: add a template-based fast path for zone-device page init
  mm: extend the template fast path to zone-device compound tails
  string: introduce memcpy_streaming() helpers
  x86/string: extend memcpy_flushcache() fixed-size fastpaths
  mm: use memcpy_streaming() in zone-device template copies

 arch/x86/include/asm/string_64.h | 157 ++++++++++++++++++++++---
 include/linux/mm.h               |  19 ++-
 include/linux/string.h           |  20 ++++
 mm/mm_init.c                     | 195 +++++++++++++++++++++++++++----
 4 files changed, 352 insertions(+), 39 deletions(-)
---
v2: https://lore.kernel.org/all/[email protected]/
v1: https://lore.kernel.org/all/[email protected]/
Changelogs:
v2->v3:
- Add a leading comment-only cleanup patch to fix the stale ZONE_DEVICE
  refcount comment, so later refactoring patches no longer carry that
  stale wording forward.
- Rename the refcount-policy helper to pagemap_resets_refcount() and
  make it a bool predicate. Suggested by Mike Rapoport.
- Update the reusable template in place from the first template
  fast-path patch onward, so later patches only change the copy
  primitive instead of introducing an intermediate post-copy fixup
  step. Suggested by Mike Rapoport.
- Clean up helper indentation and use early continue in the
  compound-tail loop. Suggested by Mike Rapoport.
- Narrow the x86 memcpy_streaming() backend to transfers that can stay
  entirely on the non-temporal store path, keeping zero-length and
  unaligned cases on memcpy(). Suggested by Andrew Morton.
- Restrict the x86 memcpy_flushcache() small-copy fastpath to naturally
  aligned cases, add volatile/memory clobbers, and preserve forward
  movnti store order. Suggested by Andrew Morton.
- Drop the obsolete struct page size/alignment eligibility check from
  the template fast path now that the intermediate patches use
  memcpy() and the streaming backend has a safe fallback.
- Refresh the benchmark results.
v1->v2:
- Move the pageblock-helper split into patch 1, and add a dedicated
  set_page_section_from_pfn() helper so generic callers no longer
  open-code SECTION_IN_PAGE_FLAGS handling. Suggested by Mike Rapoport.
- Drop the v1 32-bit gating and document instead that CONFIG_ZONE_DEVICE
  is 64-bit only because it depends on MEMORY_HOTPLUG. Suggested by
  Mike Rapoport.
- Replace the v1 BUILD_BUG_ON() struct page layout checks with a runtime
  fast-path eligibility check that falls back to the slow path when the
  layout is unsuitable. Suggested by Mike Rapoport.
- Rename the template fast-path helpers to zone_device_* names for
  clarity. Suggested by Mike Rapoport.
- Replace the v1 open-coded arch_optimize_store_u64()/drain() approach
  with a generic memcpy_streaming()/memcpy_streaming_drain() interface,
  and move the x86 optimization under memcpy_flushcache(). Suggested by
  Alistair Popple.
- Split the old v1 streaming-copy patch into three patches: introduce
  the generic helper, extend the x86 backend, and then switch mm over
  to the new interface. This is part of the memcpy-based rework
  suggested by Alistair Popple.
- Refresh the performance section and report whole-series results as
  series-level numbers.
- Fix a missing memcpy_streaming_drain() in the compound-page
  initialization path, following review feedback from Andrew Morton.
--
2.20.1