Changes since v5[9]:
* Keep @dev on the previous line to improve readability on
patch 5 (Christoph Hellwig)
* Document is_static() function to clarify what static and
dynamic dax regions are in patch 7 (Christoph Hellwig)
* Deduce @f_mapping and @pgmap from vmf->vma->vm_file to reduce
the number of arguments of set_{page,compound}_mapping() in last
patch (Christoph Hellwig)
* Factor out @mapping initialization to a separate helper ([new] patch 8)
and rename set_page_mapping() to dax_set_mapping() in the process.
* Remove set_compound_mapping() and instead adjust dax_set_mapping()
to handle @vmemmap_shift case on the last patch. This greatly
simplifies the last patch, and addresses a similar comment by Christoph
on having an earlier return. There is no functional change to
dax_set_mapping() compared to its earlier version, so I retained
Dan's Rb on the last patch.
* Initialize the mapping prior to inserting the PTE/PMD/PUD as opposed
to after the fact. ([new] patch 9, Jason Gunthorpe)
Patches 8 and 9 are new (small cleanups) in v6.
Patches 6 - 9 are the ones missing Rb tags.
---
This series converts device-dax to use compound pages, and moves away from the
'struct page per basepage on PMD/PUD' that is done today. Doing so 1) unlocks
a few noticeable improvements on unpin_user_pages() and makes the
device-dax+altmap case 4x faster in pinning (numbers below and in the last
patch), and 2) as mentioned in various other threads, is one important step
towards cleaning up ZONE_DEVICE refcounting.
I've split the compound pages on devmap part from the rest based on recent
discussions on devmap pending and future work planned[5][6]. There is consensus
that device-dax should be using compound pages to represent its PMD/PUDs just
like HugeTLB and THP, and that leads to less specialization of the dax parts.
I will pursue the rest of the work in parallel once this part is merged,
in particular the GUP-{slow,fast} improvements [7] and the tail struct page
deduplication memory savings part[8].
To summarize what the series does:
Patch 1: Prepare hwpoisoning to work with dax compound pages.
Patches 2-3: Split the current utility function of prep_compound_page()
into head and tail and use those two helpers where appropriate to take
advantage of caches being warm after __init_single_page(). This is used
when initializing zone device when we bring up device-dax namespaces.
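To illustrate the head/tail split described for patches 2-3, here is a minimal
userspace sketch. It is not the kernel code: struct toy_page and the toy_*
helpers are hypothetical stand-ins, and the refcount handling is simplified.
It only shows the idea of preparing each tail right after the page itself has
been initialized (while it is still cache warm), and finalizing the head once.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for the kernel's struct page. */
struct toy_page {
	struct toy_page *head;	/* NULL for a head (or base) page */
	unsigned int order;	/* only meaningful on the head page */
	int refcount;
};

/* Per-tail work: link the tail back to its head and zero its refcount. */
static void toy_prep_compound_tail(struct toy_page *head, unsigned long idx)
{
	struct toy_page *tail = head + idx;

	tail->head = head;
	tail->refcount = 0;
}

/* Head work: done once per compound page, records the page order. */
static void toy_prep_compound_head(struct toy_page *head, unsigned int order)
{
	head->head = NULL;
	head->order = order;
}

/*
 * Initialize 2^order pages as one compound page. Each tail is prepared
 * immediately after its own initialization (the assignment below stands
 * in for __init_single_page()), so the page is touched only once.
 */
static void toy_init_compound(struct toy_page *pages, unsigned int order)
{
	unsigned long i, nr = 1UL << order;

	for (i = 0; i < nr; i++) {
		pages[i] = (struct toy_page){ .refcount = 1 };
		if (i)
			toy_prep_compound_tail(pages, i);
	}
	toy_prep_compound_head(pages, order);
}
```

With this shape, a caller that already walks every page (as the zone device
init path does) can fold the tail setup into that same walk instead of doing a
second pass over the whole range.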
Patches 4-10: Add devmap support for compound pages in device-dax.
memmap_init_zone_device() initializes its metadata as compound pages, and a
new devmap property known as vmemmap_shift is introduced, which
outlines how the vmemmap is structured (defaults to base pages as done today).
The property essentially describes the page order of the metadata.
While at it do a few cleanups in device-dax in patches 5-9.
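As a rough illustration of what "page order of the metadata" means, the
sketch below uses a hypothetical, trimmed-down struct toy_pagemap (not the
real struct dev_pagemap) and hypothetical helper names. It only assumes that
@vmemmap_shift is an order, so one compound page spans 2^vmemmap_shift base
pages and a value of 0 means plain base pages.

```c
/* Hypothetical, trimmed-down stand-in for struct dev_pagemap. */
struct toy_pagemap {
	unsigned long vmemmap_shift;	/* page order; 0 means base pages */
};

/* Number of base pages covered by one compound page of this pagemap. */
static unsigned long toy_vmemmap_nr(const struct toy_pagemap *pgmap)
{
	return 1UL << pgmap->vmemmap_shift;
}

/* Number of compound pages needed to cover nr_pfns base pages. */
static unsigned long toy_nr_compound_pages(const struct toy_pagemap *pgmap,
					   unsigned long nr_pfns)
{
	return nr_pfns >> pgmap->vmemmap_shift;
}
```

For example, with an order of 9 each compound page covers 512 base pages, so
a range of 262144 base pages is described by 512 head pages.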
Finally, enable device-dax to set devmap @vmemmap_shift to a value
based on its own @align property. @vmemmap_shift is 0 by default (which
is today's case of base pages in devmap, like fsdax or the others) and the
usage of compound devmap is optional. Starting with device-dax (*not* fsdax) we
enable it by default. There are a few pinning improvements, particularly in the
unpinning case and with altmap, as well as unpin_user_page_range_dirty_lock()
being just as effective as with THP/hugetlb[0] pages.
$ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S -a -n 512 -w
(pin_user_pages_fast 2M pages) put:~71 ms -> put:~22 ms
[altmap]
(pin_user_pages_fast 2M pages) get:~524 ms put:~525 ms -> get:~127 ms put:~71 ms

$ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S -a -n 512 -w
(pin_user_pages_fast 2M pages) put:~513 ms -> put:~188 ms
[altmap with -m 127004]
(pin_user_pages_fast 2M pages) get:~4.1 secs put:~4.12 secs -> get:~1 sec put:~563 ms
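The relationship between @align and @vmemmap_shift described above can be
sketched as follows. This is a userspace approximation under stated
assumptions, not the patch itself: the toy_* names are hypothetical, 4K base
pages are assumed, and @align is assumed to be a power-of-two size in bytes.

```c
#define TOY_PAGE_SHIFT	12			/* assumes 4K base pages */
#define TOY_PAGE_SIZE	(1UL << TOY_PAGE_SHIFT)

/* Smallest order such that (1 << order) >= n, for n >= 1. */
static unsigned int toy_order_base_2(unsigned long n)
{
	unsigned int order = 0;

	while ((1UL << order) < n)
		order++;
	return order;
}

/*
 * Map a device alignment in bytes to a compound page order.
 * An alignment of one base page (or less) keeps the default of 0,
 * i.e. no compound pages.
 */
static unsigned int toy_align_to_vmemmap_shift(unsigned long align)
{
	if (align <= TOY_PAGE_SIZE)
		return 0;
	return toy_order_base_2(align >> TOY_PAGE_SHIFT);
}
```

Under these assumptions a 2M alignment yields an order of 9 and a 1G
alignment an order of 18, while a 4K alignment stays at 0.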
Tested on x86 with 1TB+ of pmem (including registering it with RDMA, with and
without altmap), alongside gup_test selftests with dynamic dax regions and
static dax regions, coupled with ndctl unit tests for dynamic dax devices
that exercise all of this. Note that for dynamic dax regions I had to revert
commit 8aa83e6395 ("x86/setup: Call early_reserve_memory() earlier"); it
is a known issue that this commit broke efi_fake_mem=.
Patches apply on top of linux-next tag next-20211124 (commit 4b74e088fef6).
Thanks for all the review so far.
As always, comments and suggestions are very much appreciated!
Older Changelog,
v4[4] -> v5[9]:
* Remove patches 8-14 as they will go in 2 separate (parallel) series;
* Rename @geometry to @vmemmap_shift (Christoph Hellwig)
* Make @vmemmap_shift an order rather than nr of pages (Christoph Hellwig)
* Consequently remove helper pgmap_geometry_order() as it's no longer
needed, in favour of accessing the structure member directly [Patch 4 and 8]
* Rename pgmap_geometry() to pgmap_vmemmap_nr() in patches 4 and 8;
* Remove usage of pgmap_geometry() in favour of testing
@vmemmap_shift for non-zero directly in patch 8;
* Patch 5 is new for using `struct_size()` (Dan Williams)
* Add a 'static_dev_dax()' helper for testing pgmap == NULL handling
for dynamic dax devices.
* Expand patch 6 to be explicitly on those !pgmap cases, and replace
those with static_dev_dax().
* Add gup/pin_user_pages() performance numbers with this series
to patch 8.
* Massage commit description to remove mentions of @geometry.
* Add Dan's Reviewed-by on patch 8 (Dan Williams)
v3[3] -> v4[4]:
* Collect Dan's Reviewed-by on patches 1-5,8,9,11
* Collect Muchun's Reviewed-by on patches 1,2,11
* Reorder patches to first introduce compound pages in ZONE_DEVICE with
device-dax (for pmem) as first user (patches 1-8) followed by implementing
the sparse-vmemmap changes to minimize struct page overhead for devmap
(patches 9-14)
* Eliminate remnant @align references to use @geometry (Dan)
* Convert mentions of 'compound pagemap' to 'compound devmap' throughout
the series to avoid confusion about this work conflicting with or referring
to anything Folio or pagemap related.
* Delete pgmap_pfn_geometry() on patch 4
and rework other patches to use pgmap_geometry() instead (Dan)
* Convert @geometry to be a number of pages rather than page size in patch 4
(Dan)
* Make pgmap_geometry() more readable (Christoph)
* Simplify pgmap refcount pfn computation in memremap_pages() (Christoph)
* Rework memmap_init_compound() in patch 4 to use the same style as
memmap_init_zone_device i.e. iterating over PFNs, rather than struct pages
(Dan)
* Add comment on devmap prep_compound_head callsite explaining why it needs
to be used after first+second tail pages have been initialized (Dan, Jane)
* Initialize tail page refcount to zero in patch 4
* Make sure pfn_next() iterates over compound pages (rather than base pages)
in patch 4 to tackle the zone_device elevated page refcount.
[ Note these last two bullet points above are unneeded once this patch is
merged:
https://lore.kernel.org/linux-mm/[email protected]/
]
* Remove usage of ternary operator when computing @end in gup_device_huge() in
patch 8 (Dan)
* Remove pinned_head variable in patch 8
* Remove the need for put_dev_pagemap() in the compound case in patch 8, as
that is now fixed for the general case
* Switch to PageHead() instead of PageCompound() as we only work with either
base pages or head pages in patch 8 (Matthew)
* Fix kdoc of @altmap and improve kdoc for @pgmap in patch 9 (Dan)
* Fix up missing return in vmemmap_populate_address() in patch 10
* Change error handling style in all patches (Dan)
* Change title of vmemmap_dedup.rst to be more representative of the purpose
in patch 12 (Dan)
* Move some of the section and subsection tail page reuse code into helpers
reuse_compound_section() and compound_section_tail_page() for readability in
patch 12 (Dan)
* Commit description fixes for clarity in various patches (Dan)
* Add pgmap_geometry_order() helper and
drop unneeded geometry_size, order variables in patch 12
* Drop unneeded byte based computation to be PFN in patch 12
* Handle the dynamic dax region properly when ensuring a stable dev_dax->pgmap
in patch 6.
* Add a compound_nr_pages() helper and use it in memmap_init_zone_device to
calculate the number of unique struct pages to initialize depending on
@altmap existence in patch 13 (Dan)
* Add compound_section_tail_huge_page() for the tail page PMD reuse in patch
14 (Dan)
* Reword cover letter.
v2 -> v3[3]:
* Collect Mike's Ack on patch 2 (Mike)
* Collect Naoya's Reviewed-by on patch 1 (Naoya)
* Rename compound_pagemaps.rst doc page (and its mentions) to
vmemmap_dedup.rst (Mike, Muchun)
* Rebased to next-20210714
v1[1] -> v2[2]:
(New patches 7, 10, 11)
* Remove occurrences of 'we' in the commit descriptions (now for real) [Dan]
* Add comment on top of compound_head() for fsdax (Patch 1) [Dan]
* Massage commit descriptions of cleanup/refactor patches to reflect
that it's in preparation for bigger infra in sparse-vmemmap. (Patch 2,3,5)
[Dan]
* Greatly improve all commit messages in terms of grammar/wording and
clarity. [Dan]
* Rename variable/helpers from dev_pagemap::align to @geometry, reflecting
that it's not the same thing as dev_dax->align, Patch 4 [Dan]
* Move compound page init logic into separate memmap_init_compound() helper,
Patch 4 [Dan]
* Simplify patch 9 as a result of having compound initialization differently
[Dan]
* Rename @pfn_align variable in memmap_init_zone_device to @pfns_per_compound
[Dan]
* Rename Subject of patch 6 [Dan]
* Move hugetlb_vmemmap.c comment block to Documentation/vm Patch 7 [Dan]
* Add some type-safety to @block and use 'struct page *' rather than
void, Patch 8 [Dan]
* Add some comments to less obvious parts on 1G compound page case, Patch 8
[Dan]
* Remove vmemmap lookup function in favour of
pmd_off_k() + pte_offset_kernel() given some guarantees on section onlining
serialization, Patch 8
* Add a comment to get_page() mentioning where/how it is freed, Patch 8 [Dan]
* Add docs about device-dax usage of tail dedup technique in newly added
compound_pagemaps.rst doc entry.
* Add cleanup patch for device-dax for ensuring dev_dax::pgmap is always set
[Dan]
* Add cleanup patch for device-dax for using ALIGN() [Dan]
* Store pinned head in separate @pinned_head variable and fix error case,
patch 13 [Dan]
* Add comment on difference of @next value for PageCompound(), patch 13 [Dan]
* Move PUD compound page to be last patch [Dan]
* Add vmemmap layout for PUD compound geometry in compound_pagemaps.rst doc,
patch 14 [Dan]
* Rebased to next-20210617
RFC[0] -> v1:
(New patches 1-3, 5-8 but the diffstat isn't that different)
* Fix hwpoisoning of devmap pages reported by Jane (Patch 1 is new in v1)
* Fix/Massage commit messages to be more clear and remove the 'we' occurrences
(Dan, John, Matthew)
* Use pfn_align to be clear it's nr of pages for @align value (John, Dan)
* Add two helpers pgmap_align() and pgmap_pfn_align() as accessors of
pgmap->align;
* Remove the gup_device_compound_huge special path and have the same code
work both ways while special casing when devmap page is compound (Jason,
John)
* Avoid usage of vmemmap_populate_basepages() and introduce a first class
loop that doesn't care about passing an altmap for memmap reuse. (Dan)
* Completely rework vmemmap_populate_compound() to avoid the
sparse_add_section hack of passing @block across sparse_add_section() calls.
It's a lot easier to follow and more explicit in what it does.
* Replace the vmemmap refactoring with adding a @pgmap argument and moving
parts of the vmemmap_populate_base_pages(). (Patch 5 and 6 are new as a
result)
* Add PMD tail page vmemmap area reuse for 1GB pages. (Patch 8 is new)
* Improve memmap_init_zone_device() to initialize compound pages when
struct pages are cache warm. That led to an even further speedup over the
RFC series, from 190ms -> 80-120ms. Patches 2 and 3 are the new ones
as a result (Dan)
* Remove PGMAP_COMPOUND and use @align as the property to detect whether
or not to reuse vmemmap areas (Dan)
[0] https://lore.kernel.org/linux-mm/[email protected]/
[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/[email protected]/
[3] https://lore.kernel.org/linux-mm/[email protected]/
[4] https://lore.kernel.org/linux-mm/[email protected]/
[5] https://lore.kernel.org/linux-mm/[email protected]/
[6] https://lore.kernel.org/linux-mm/[email protected]/
[7] https://lore.kernel.org/linux-mm/[email protected]/
[8] https://lore.kernel.org/linux-mm/[email protected]/
[9] https://lore.kernel.org/linux-mm/[email protected]/
Joao Martins (10):
memory-failure: fetch compound_head after pgmap_pfn_valid()
mm/page_alloc: split prep_compound_page into head and tail subparts
mm/page_alloc: refactor memmap_init_zone_device() page init
mm/memremap: add ZONE_DEVICE support for compound pages
device-dax: use ALIGN() for determining pgoff
device-dax: use struct_size()
device-dax: ensure dev_dax->pgmap is valid for dynamic devices
device-dax: factor out page mapping initialization
device-dax: set mapping prior to vmf_insert_pfn{,_pmd,pud}()
device-dax: compound devmap support
drivers/dax/bus.c | 32 +++++++++
drivers/dax/bus.h | 1 +
drivers/dax/device.c | 92 +++++++++++++++++---------
include/linux/memremap.h | 11 ++++
mm/memory-failure.c | 6 ++
mm/memremap.c | 12 ++--
mm/page_alloc.c | 138 +++++++++++++++++++++++++++------------
7 files changed, 212 insertions(+), 80 deletions(-)
--
2.17.2