Re: [PATCH v1 04/10] vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes
On 08.11.19 08:14, David Hildenbrand wrote:
On 08.11.19 06:09, Dan Williams wrote:
On Thu, Nov 7, 2019 at 2:07 PM David Hildenbrand wrote:
On 07.11.19 19:22, David Hildenbrand wrote:
Am 07.11.2019 um 16:40 schrieb Dan Williams:
On Thu, Oct 24, 2019 at 5:12 AM David Hildenbrand wrote:

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change that.

KVM has this weird use case that you can map anything from /dev/mem into the guest. pfn_valid() is not a reliable check whether the memmap was initialized and can be touched. pfn_to_online_page() makes sure that we have an initialized memmap (and don't have ZONE_DEVICE memory).

Rewrite is_invalid_reserved_pfn() similar to kvm_is_reserved_pfn() to make sure the function produces the same result once we stop setting ZONE_DEVICE pages PG_reserved.

Cc: Alex Williamson
Cc: Cornelia Huck
Signed-off-by: David Hildenbrand
---
 drivers/vfio/vfio_iommu_type1.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2ada8e6cdb88..f8ce8c408ba8 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -299,9 +299,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
  */
 static bool is_invalid_reserved_pfn(unsigned long pfn)
 {
-	if (pfn_valid(pfn))
-		return PageReserved(pfn_to_page(pfn));
+	struct page *page = pfn_to_online_page(pfn);

Ugh, I just realized this is not a safe conversion until pfn_to_online_page() is moved over to subsection granularity. As it stands it will return true for any ZONE_DEVICE pages that share a section with boot memory.

That should not happen right now, and I commented back when you introduced subsection support that I don't want to have ZONE_DEVICE mixed with online pages in a section. Having memory block devices that partially span ZONE_DEVICE would be ... really weird.

With something like pfn_active() - as discussed - we could at least make this check work - but I am not sure if we really want to go down that path. In the worst case, some MB of RAM are lost ... I guess this needs more thought.

I just realized the "boot memory" part. Is that a real thing? IOW, can we have ZONE_DEVICE falling into a memory block (with holes)? I somewhat have doubts that this would work ...

One of the real world failure cases that started the subsection effort is that Persistent Memory collides with System RAM on a 64MB boundary on shipping platforms. System RAM ends on a 64MB boundary, and due to a lack of memory controller resources PMEM is mapped contiguously at the end of that boundary. Some more details in the subsection cover letter / changelogs [1] [2].

It's not sufficient to just lose some memory; that is the broken implementation that led to the subsection work, because the lost memory may change from one boot to the next and software can't reliably inject a padding that conforms to the x86 128MB section constraint.

Thanks, I thought it was mostly for weird alignment where other parts of the section are basically "holes" and not memory.

Yes, it is a real bug that ZONE_DEVICE pages fall into sections that are marked SECTION_IS_ONLINE. Suffice to say I think we need your pfn_active() to get subsection granularity pfn_to_online_page() before PageReserved() can be removed.

I agree that we have to fix this. I don't like ZONE_DEVICE pages falling into memory device blocks (e.g., cannot get offlined), but I guess that train is gone :) As long as it's not for memory hotplug, I can most probably live with this.

Also, I'd like to get Michal's opinion on this and the pfn_active() approach, but I can understand he's busy. This patch set can wait; I won't be working next week besides reading/writing mails either way. Is anybody looking into the pfn_active() thingy?

I wonder if we should do something like this right now to fix this (exclude the false positive ZONE_DEVICE pages we could have within an online section, which was not possible before subsection hotplug):

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 384ffb3d69ab..490a9e9358b3 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -30,6 +30,8 @@ struct vmem_altmap;
 	if (___nr < NR_MEM_SECTIONS && online_section_nr(___nr) && \
 	    pfn_valid_within(___pfn)) \
 		___page = pfn_to_page(___pfn); \
+	if (unlikely(___page && is_zone_device_page(___page))) \
+		___page = NULL; \
 	___page; \
 })

Yeah, it's another is_zone_device_page(), but it should not be racy here, as we want to exclude, not include ZONE_DEVICE. I don't have time to look into this right now.
[1]: https://lore.kernel.org/linux-mm/156092349300.979959.17603710711957735135.st...@dwillia2-desk3.amr.corp.intel.com/
[2]: https://lore.kernel.org/linux-mm/156092354368.979959.6232443923440952359.st...@dwillia2-desk3.amr.corp.intel.com/

--
Thanks,

David / dhildenb

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes
On 06.11.19 01:08, Dan Williams wrote: On Tue, Nov 5, 2019 at 4:03 PM Sean Christopherson wrote: On Tue, Nov 05, 2019 at 03:43:29PM -0800, Dan Williams wrote: On Tue, Nov 5, 2019 at 3:30 PM Dan Williams wrote: On Tue, Nov 5, 2019 at 3:13 PM Sean Christopherson wrote: On Tue, Nov 05, 2019 at 03:02:40PM -0800, Dan Williams wrote: On Tue, Nov 5, 2019 at 12:31 PM David Hildenbrand wrote: The scarier code (for me) is transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte(), as I don't at all understand the interaction between THP and _PAGE_DEVMAP. The x86 KVM MMU code is one of the ugliest code I know (sorry, but it had to be said :/ ). Luckily, this should be independent of the PG_reserved thingy AFAIKs. Both transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte() are honoring kvm_is_reserved_pfn(), so again I'm missing where the page count gets mismanaged and leads to the reported hang. When mapping pages into the guest, KVM gets the page via gup(), which increments the page count for ZONE_DEVICE pages. But KVM puts the page using kvm_release_pfn_clean(), which skips put_page() if PageReserved() and so never puts its reference to ZONE_DEVICE pages. Oh, yeah, that's busted. Ugh, it's extra busted because every other gup user in the kernel tracks the pages resulting from gup and puts them (put_page()) when they are done. KVM wants to forget about whether it did a gup to get the page and optionally trigger put_page() based purely on the pfn. Outside of VFIO device assignment that needs pages pinned for DMA, why does KVM itself need to pin pages? If pages are pinned over a return to userspace that needs to be a FOLL_LONGTERM gup. Short answer, KVM pins the page to ensure correctness with respect to the primary MMU invalidating the associated host virtual address, e.g. when the page is being migrated or unmapped from host userspace. The main use of gup() is to handle guest page faults and map pages into the guest, i.e. into KVM's secondary MMU. 
KVM uses gup() to both get the PFN and to temporarily pin the page. The pin is held just long enough to guaranteed that any invalidation via the mmu_notifier will be stalled until after KVM finishes installing the page into the secondary MMU, i.e. the pin is short-term and not held across a return to userspace or entry into the guest. When a subsequent mmu_notifier invalidation occurs, KVM pulls the PFN from the secondary MMU and uses that to update accessed and dirty bits in the host. There are a few other KVM flows that eventually call into gup(), but those are "traditional" short-term pins and use put_page() directly. Ok, I was misinterpreting the effect of the bug with what KVM is using the reference to do. To your other point: But David's proposed fix for the above refcount bug is to omit the patch so that KVM no longer treats ZONE_DEVICE pages as reserved. That seems like the right thing to do, including for thp_adjust(), e.g. it would naturally let KVM use 2mb pages for the guest when a ZONE_DEVICE page is mapped with a huge page (2mb or above) in the host. The only hiccup is figuring out how to correctly transfer the reference. That might not be the only hiccup. There's currently no such thing as huge pages for ZONE_DEVICE, there are huge *mappings* (pmd and pud), but the result of pfn_to_page() on such a mapping does not yield a huge 'struct page'. It seems there are other paths in KVM that assume that more typical page machinery is active like SetPageDirty() based on kvm_is_reserved_pfn(). 
While I told David that I did not want to see more usage of is_zone_device_page(), this patch below (untested) seems a cleaner path with less surprises: diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 4df0aa6b8e5c..fbea17c1810c 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1831,7 +1831,8 @@ EXPORT_SYMBOL_GPL(kvm_release_page_clean); void kvm_release_pfn_clean(kvm_pfn_t pfn) { - if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) + if ((!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) || + (pfn_valid(pfn) && is_zone_device_page(pfn_to_page(pfn put_page(pfn_to_page(pfn)); } EXPORT_SYMBOL_GPL(kvm_release_pfn_clean); I had the same thought, but I do wonder about the kvm_get_pfn() users, e.g.,: hva_to_pfn_remapped(): r = follow_pfn(vma, addr, ); ... kvm_get_pfn(pfn); ... We would not take a reference for ZONE_DEVICE, but later drop one reference via kvm_release_pfn_clean(). IOW, kvm_get_pfn() gets *really* dangerous to use. I can't tell if this can happen right now. We do have 3 users of kvm_get_pfn() that we have to audit before this change. Also, we should add a comment to kvm_get_pfn() that it should never be used with possible ZONE_DEVICE pages. Also, we should add a comment to kvm_release_pfn_clean(), describing why we treat ZONE_DEVIC
Re: [PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes
I think I know what's going wrong: Pages that are pinned via gfn_to_pfn() and friends take a reference, however they are often released via kvm_release_pfn_clean()/kvm_release_pfn_dirty()/kvm_release_page_clean()...

E.g., in arch/x86/kvm/x86.c:reexecute_instruction():

	...
	pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
	...
	kvm_release_pfn_clean(pfn);

void kvm_release_pfn_clean(kvm_pfn_t pfn)
{
	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
		put_page(pfn_to_page(pfn));
}

This function makes perfect sense as the counterpart for kvm_get_pfn():

void kvm_get_pfn(kvm_pfn_t pfn)
{
	if (!kvm_is_reserved_pfn(pfn))
		get_page(pfn_to_page(pfn));
}

As all ZONE_DEVICE pages are currently reserved, pages pinned via gfn_to_pfn() and friends will often not see a put_page() AFAIKS.

Assuming gup() takes a reference for ZONE_DEVICE pages, yes, this is a KVM bug.

Yes, it does take a reference AFAIKs. E.g., mm/gup.c:gup_pte_range():

	if (pte_devmap(pte)) {
		if (unlikely(flags & FOLL_LONGTERM))
			goto pte_unmap;

		pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
		if (unlikely(!pgmap)) {
			undo_dev_pagemap(nr, nr_start, pages);
			goto pte_unmap;
		}
	} else if (pte_special(pte))
		goto pte_unmap;

	VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
	page = pte_page(pte);

	head = try_get_compound_head(page, 1);

try_get_compound_head() will increment the reference count.

Now, my patch does not change that; the result of kvm_is_reserved_pfn(pfn) will be unchanged. A proper fix for that would probably be

a) To drop the reference to ZONE_DEVICE pages in gfn_to_pfn() and friends, after you successfully pinned the pages. (not sure if that's the right thing to do but you're the expert)

b) To not use kvm_release_pfn_clean() and friends on pages that were definitely pinned.

This is already KVM's intent, i.e. the purpose of the PageReserved() check is simply to avoid putting a non-existent reference. The problem is that KVM assumes pages with PG_reserved set are never pinned, which AFAICT was true when the code was first added.

(talking to myself, sorry)

Thinking again, dropping this patch from this series could effectively also fix that issue. E.g., kvm_release_pfn_clean() and friends would always do a put_page() if "pfn_valid() and !PageReserved()", so after patch 9 also on ZONE_DEVICE pages.

Yeah, this appears to be the correct fix.

But it would have side effects that might not be desired. E.g.:

1. kvm_pfn_to_page() would also return ZONE_DEVICE pages (might even be the right thing to do).

This should be ok, at least on x86. There are only three users of kvm_pfn_to_page(). Two of those are on allocations that are controlled by KVM and are guaranteed to be vanilla MAP_ANONYMOUS. The third is on guest memory when running a nested guest, and in that case supporting ZONE_DEVICE memory is desirable, i.e. KVM should play nice with a guest that is backed by ZONE_DEVICE memory.

2. kvm_set_pfn_dirty() would also set ZONE_DEVICE pages dirty (might be okay).

This is ok from a KVM perspective.

What about

void kvm_get_pfn(kvm_pfn_t pfn)
{
	if (!kvm_is_reserved_pfn(pfn))
		get_page(pfn_to_page(pfn));
}

Is a pure get_page() sufficient in case of ZONE_DEVICE? (asking because of the live references obtained via get_dev_pagemap(pte_pfn(pte), pgmap) in mm/gup.c:gup_pte_range() somewhat confuse me :) )

The scarier code (for me) is transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte(), as I don't at all understand the interaction between THP and _PAGE_DEVMAP.

The x86 KVM MMU is some of the ugliest code I know (sorry, but it had to be said :/ ). Luckily, this should be independent of the PG_reserved thingy AFAIKs.

--
Thanks,

David / dhildenb
Re: [PATCH v1 02/10] KVM: x86/mmu: Prepare kvm_is_mmio_pfn() for PG_reserved changes
On 05.11.19 02:37, Dan Williams wrote:
On Thu, Oct 24, 2019 at 5:10 AM David Hildenbrand wrote:

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change that.

KVM has this weird use case that you can map anything from /dev/mem into the guest. pfn_valid() is not a reliable check whether the memmap was initialized and can be touched. pfn_to_online_page() makes sure that we have an initialized memmap (and don't have ZONE_DEVICE memory).

Rewrite kvm_is_mmio_pfn() to make sure the function produces the same result once we stop setting ZONE_DEVICE pages PG_reserved.

Cc: Paolo Bonzini
Cc: "Radim Krčmář"
Cc: Sean Christopherson
Cc: Vitaly Kuznetsov
Cc: Wanpeng Li
Cc: Jim Mattson
Cc: Joerg Roedel
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Borislav Petkov
Cc: "H. Peter Anvin"
Cc: KarimAllah Ahmed
Cc: Michal Hocko
Cc: Dan Williams
Signed-off-by: David Hildenbrand
---
 arch/x86/kvm/mmu.c | 29 +++++++++++++++++------------
 1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 24c23c66b226..f03089a336de 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2962,20 +2962,25 @@ static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
 
 static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
 {
+	struct page *page = pfn_to_online_page(pfn);
+
+	/*
+	 * ZONE_DEVICE pages are never online. Online pages that are reserved
+	 * either indicate the zero page or MMIO pages.
+	 */
+	if (page)
+		return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
+
+	/*
+	 * Anything with a valid (but not online) memmap could be ZONE_DEVICE.
+	 * Treat only UC/UC-/WC pages as MMIO.

You might clarify that ZONE_DEVICE pages that belong to WB cacheable pmem have the correct memtype established by devm_memremap_pages():

	/*
	 * Anything with a valid (but not online) memmap could be ZONE_DEVICE.
	 * Treat only UC/UC-/WC pages as MMIO. devm_memremap_pages() established
	 * the correct memtype for WB cacheable ZONE_DEVICE pages.
	 */

Thanks!

Other than that, feel free to add:

Reviewed-by: Dan Williams

--
Thanks,

David / dhildenb
Re: [PATCH v1 09/10] mm/memory_hotplug: Don't mark pages PG_reserved when initializing the memmap
On 04.11.19 23:44, Boris Ostrovsky wrote:
On 10/24/19 8:09 AM, David Hildenbrand wrote:

diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 4f2e78a5e4db..af69f057913a 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -374,6 +374,13 @@ static void xen_online_page(struct page *page, unsigned int order)
 	mutex_lock(&balloon_mutex);
 	for (i = 0; i < size; i++) {
 		p = pfn_to_page(start_pfn + i);
+		/*
+		 * TODO: The core used to mark the pages reserved. Most probably
+		 * we can stop doing that now. However, especially
+		 * alloc_xenballooned_pages() left PG_reserved set
+		 * on pages that can get mapped to user space.
+		 */
+		__SetPageReserved(p);

I suspect this is not needed. Pages can get into the balloon either from here or from the non-hotplug path (e.g. decrease_reservation()), and so when we get a page from the balloon we would get a random page that may or may not have the Reserved bit set.

Yeah, I also think it is fine. If you agree, I'll drop this hunk and add details to the patch description.

--
Thanks,

David / dhildenb
Re: [PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes
On 05.11.19 10:49, David Hildenbrand wrote: On 05.11.19 10:17, David Hildenbrand wrote: On 05.11.19 05:38, Dan Williams wrote: On Thu, Oct 24, 2019 at 5:11 AM David Hildenbrand wrote: Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change that. KVM has this weird use case that you can map anything from /dev/mem into the guest. pfn_valid() is not a reliable check whether the memmap was initialized and can be touched. pfn_to_online_page() makes sure that we have an initialized memmap (and don't have ZONE_DEVICE memory). Rewrite kvm_is_reserved_pfn() to make sure the function produces the same result once we stop setting ZONE_DEVICE pages PG_reserved. Cc: Paolo Bonzini Cc: "Radim Krčmář" Cc: Michal Hocko Cc: Dan Williams Cc: KarimAllah Ahmed Signed-off-by: David Hildenbrand --- virt/kvm/kvm_main.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index e9eb666eb6e8..9d18cc67d124 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -151,9 +151,15 @@ __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm, bool kvm_is_reserved_pfn(kvm_pfn_t pfn) { - if (pfn_valid(pfn)) - return PageReserved(pfn_to_page(pfn)); + struct page *page = pfn_to_online_page(pfn); + /* +* We treat any pages that are not online (not managed by the buddy) +* as reserved - this includes ZONE_DEVICE pages and pages without +* a memmap (e.g., mapped via /dev/mem). +*/ + if (page) + return PageReserved(page); return true; } So after this all the pfn_valid() usage in kvm_main.c is replaced with pfn_to_online_page()? Looks correct to me. However, I'm worried that kvm is taking reference on ZONE_DEVICE pages through some other path resulting in this: https://lore.kernel.org/linux-nvdimm/20190919154708.ga24...@angband.pl/ I'll see if this patch set modulates or maintains that failure mode. I'd assume that the behavior is unchanged. 
I think we get a reference to these ZONE_DEVICE pages via __get_user_pages_fast() and friends in hva_to_pfn_fast() and friends in virt/kvm/kvm_main.c I think I know what's going wrong: Pages that are pinned via gfn_to_pfn() and friends take a reference, however they are often released via kvm_release_pfn_clean()/kvm_release_pfn_dirty()/kvm_release_page_clean()... E.g., in arch/x86/kvm/x86.c:reexecute_instruction() ... pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa)); ... kvm_release_pfn_clean(pfn); void kvm_release_pfn_clean(kvm_pfn_t pfn) { if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) put_page(pfn_to_page(pfn)); } This function makes perfect sense as the counterpart for kvm_get_pfn(): void kvm_get_pfn(kvm_pfn_t pfn) { if (!kvm_is_reserved_pfn(pfn)) get_page(pfn_to_page(pfn)); } As all ZONE_DEVICE pages are currently reserved, pages pinned via gfn_to_pfn() and friends will often not see a put_page() AFAIKS. Now, my patch does not change that, the result of kvm_is_reserved_pfn(pfn) will be unchanged. A proper fix for that would probably be a) To drop the reference to ZONE_DEVICE pages in gfn_to_pfn() and friends, after you successfully pinned the pages. (not sure if that's the right thing to do but you're the expert) b) To not use kvm_release_pfn_clean() and friends on pages that were definitely pinned. (talking to myself, sorry) Thinking again, dropping this patch from this series could effectively also fix that issue. E.g., kvm_release_pfn_clean() and friends would always do a put_page() if "pfn_valid() and !PageReserved()", so after patch 9 also on ZONE_DEVICE pages. But it would have side effects that might not be desired. E.g.,: 1. kvm_pfn_to_page() would also return ZONE_DEVICE pages (might even be the right thing to do). 2. kvm_set_pfn_dirty() would also set ZONE_DEVICE pages dirty (might be okay) -- Thanks, David / dhildenb ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
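The refcount imbalance described above can be illustrated with a small userspace toy model (every `_sim` name is a hypothetical stand-in, not the kernel implementation): gfn_to_pfn() always takes a reference via __get_user_pages_fast(), but kvm_release_pfn_clean() skips the put_page() whenever kvm_is_reserved_pfn() returns true, which today includes every ZONE_DEVICE page:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model of the imbalance -- hypothetical, not kernel code. */
struct page_sim { int refcount; bool reserved; };

/* Today, kvm_is_reserved_pfn() reports true for ZONE_DEVICE pages
 * (they are PG_reserved) and for pfns without a memmap. */
static bool kvm_is_reserved_pfn_sim(const struct page_sim *page)
{
	if (!page)
		return true;	/* no memmap, e.g. raw /dev/mem */
	return page->reserved;
}

/* gfn_to_pfn() pins through __get_user_pages_fast(): the reference is
 * taken unconditionally, reserved or not. */
static void gfn_to_pfn_sim(struct page_sim *page)
{
	page->refcount++;
}

/* kvm_release_pfn_clean(): the put_page() is skipped for "reserved"
 * pages -- which today includes every ZONE_DEVICE page. */
static void kvm_release_pfn_clean_sim(struct page_sim *page)
{
	if (!kvm_is_reserved_pfn_sim(page))
		page->refcount--;
}

/* Pin and release one page; a non-zero result means a leaked reference. */
static int refcount_delta_sim(bool page_is_reserved)
{
	struct page_sim page = { .refcount = 1, .reserved = page_is_reserved };

	gfn_to_pfn_sim(&page);
	kvm_release_pfn_clean_sim(&page);
	return page.refcount - 1;
}
```

With `reserved = true` (a ZONE_DEVICE page today) the pin is never dropped, which matches the leak suspected in the linked report; for an ordinary online page the pair balances.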
Re: [PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes
On 05.11.19 10:17, David Hildenbrand wrote: On 05.11.19 05:38, Dan Williams wrote: On Thu, Oct 24, 2019 at 5:11 AM David Hildenbrand wrote: Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change that. KVM has this weird use case that you can map anything from /dev/mem into the guest. pfn_valid() is not a reliable check whether the memmap was initialized and can be touched. pfn_to_online_page() makes sure that we have an initialized memmap (and don't have ZONE_DEVICE memory). Rewrite kvm_is_reserved_pfn() to make sure the function produces the same result once we stop setting ZONE_DEVICE pages PG_reserved. Cc: Paolo Bonzini Cc: "Radim Krčmář" Cc: Michal Hocko Cc: Dan Williams Cc: KarimAllah Ahmed Signed-off-by: David Hildenbrand --- virt/kvm/kvm_main.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index e9eb666eb6e8..9d18cc67d124 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -151,9 +151,15 @@ __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm, bool kvm_is_reserved_pfn(kvm_pfn_t pfn) { - if (pfn_valid(pfn)) - return PageReserved(pfn_to_page(pfn)); + struct page *page = pfn_to_online_page(pfn); + /* +* We treat any pages that are not online (not managed by the buddy) +* as reserved - this includes ZONE_DEVICE pages and pages without +* a memmap (e.g., mapped via /dev/mem). +*/ + if (page) + return PageReserved(page); return true; } So after this all the pfn_valid() usage in kvm_main.c is replaced with pfn_to_online_page()? Looks correct to me. However, I'm worried that kvm is taking reference on ZONE_DEVICE pages through some other path resulting in this: https://lore.kernel.org/linux-nvdimm/20190919154708.ga24...@angband.pl/ I'll see if this patch set modulates or maintains that failure mode. I'd assume that the behavior is unchanged. 
I think we get a reference to these ZONE_DEVICE pages via __get_user_pages_fast() and friends in hva_to_pfn_fast() and friends in virt/kvm/kvm_main.c I think I know what's going wrong: Pages that are pinned via gfn_to_pfn() and friends take a reference, however they are often released via kvm_release_pfn_clean()/kvm_release_pfn_dirty()/kvm_release_page_clean()... E.g., in arch/x86/kvm/x86.c:reexecute_instruction() ... pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa)); ... kvm_release_pfn_clean(pfn); void kvm_release_pfn_clean(kvm_pfn_t pfn) { if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) put_page(pfn_to_page(pfn)); } This function makes perfect sense as the counterpart for kvm_get_pfn(): void kvm_get_pfn(kvm_pfn_t pfn) { if (!kvm_is_reserved_pfn(pfn)) get_page(pfn_to_page(pfn)); } As all ZONE_DEVICE pages are currently reserved, pages pinned via gfn_to_pfn() and friends will often not see a put_page() AFAIKS. Now, my patch does not change that, the result of kvm_is_reserved_pfn(pfn) will be unchanged. A proper fix for that would probably be a) To drop the reference to ZONE_DEVICE pages in gfn_to_pfn() and friends, after you successfully pinned the pages. (not sure if that's the right thing to do but you're the expert) b) To not use kvm_release_pfn_clean() and friends on pages that were definitely pinned. -- Thanks, David / dhildenb
Re: [PATCH v1 01/10] mm/memory_hotplug: Don't allow to online/offline memory blocks with holes
On 05.11.19 02:30, Dan Williams wrote: On Thu, Oct 24, 2019 at 5:10 AM David Hildenbrand wrote: Our onlining/offlining code is unnecessarily complicated. Only memory blocks added during boot can have holes (a range that is not IORESOURCE_SYSTEM_RAM). Hotplugged memory never has holes (e.g., see add_memory_resource()). All boot memory is alread online. s/alread/already/ Thanks. ...also perhaps clarify "already online" by what point in time and why that is relevant. For example a description of the difference between the SetPageReserved() in the bootmem path and the one in the hotplug path. Will add. Therefore, when we stop allowing to offline memory blocks with holes, we implicitly no longer have to deal with onlining memory blocks with holes. Maybe an explicit reference of the code areas that deal with holes would help to back up that assertion. Certainly it would have saved me some time for the review. I can add a reference to the onlining code that will only online pages that don't fall into memory holes. This allows to simplify the code. For example, we no longer have to worry about marking pages that fall into memory holes PG_reserved when onlining memory. We can stop setting pages PG_reserved. ...but not for bootmem, right? Yes, bootmem is not changed. (especially, early allocations and memory holes are marked PG_reserved - basically everything not given to the buddy after boot) Offlining memory blocks added during boot is usually not guranteed to work s/guranteed/guaranteed/ Thanks. either way (unmovable data might have easily ended up on that memory during boot). So stopping to do that should not really hurt (+ people are not even aware of a setup where that used to work Maybe put a "Link: https://lkml.kernel.org/r/$msg_id" to that discussion? and that the existing code still works correctly with memory holes). For the use case of offlining memory to unplug DIMMs, we should see no change. (holes on DIMMs would be weird). 
However, less memory can be offlined than was theoretically allowed previously, so I don't understand the "we should see no change" comment. I still agree that's a price worth paying to get the code cleanups and if someone screams we can look at adding it back, but the fact that it was already fragile seems decent enough protection. Hotplugged DIMMs (add_memory()) have no holes and will therefore see no change. Please note that hardware errors (PG_hwpoison) are not memory holes and not affected by this change when offlining. Cc: Andrew Morton Cc: Michal Hocko Cc: Oscar Salvador Cc: Pavel Tatashin Cc: Dan Williams Cc: Anshuman Khandual Signed-off-by: David Hildenbrand --- mm/memory_hotplug.c | 26 -- 1 file changed, 24 insertions(+), 2 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 561371ead39a..8d81730cf036 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1447,10 +1447,19 @@ static void node_states_clear_node(int node, struct memory_notify *arg) node_clear_state(node, N_MEMORY); } +static int count_system_ram_pages_cb(unsigned long start_pfn, +unsigned long nr_pages, void *data) +{ + unsigned long *nr_system_ram_pages = data; + + *nr_system_ram_pages += nr_pages; + return 0; +} + static int __ref __offline_pages(unsigned long start_pfn, unsigned long end_pfn) { - unsigned long pfn, nr_pages; + unsigned long pfn, nr_pages = 0; unsigned long offlined_pages = 0; int ret, node, nr_isolate_pageblock; unsigned long flags; @@ -1461,6 +1470,20 @@ static int __ref __offline_pages(unsigned long start_pfn, mem_hotplug_begin(); + /* +* Don't allow to offline memory blocks that contain holes. +* Consecuently, memory blocks with holes can never get onlined s/Consecuently/Consequently/ Thanks. +* (hotplugged memory has no holes and all boot memory is online). +* This allows to simplify the onlining/offlining code quite a lot. 
+*/ The last sentence of this comment makes sense in the context of this patch, but I don't think it stands by itself in the code base after the fact. The person reading the comment can't see the simplifications because the code is already gone. I'd clarify it to talk about why it is safe to not mess around with PG_Reserved in the hotplug path because of this check. I'll think of something. It's not just the PG_reserved handling but the whole pfn_valid()/walk_system_ram_range() handling that can be simplified. After those clarifications you can add: Reviewed-by: Dan Williams Thanks! -- Thanks, David / dhildenb
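The hole check being discussed can be sketched in plain userspace C (the `_sim` names below are hypothetical stand-ins for walk_system_ram_range() and count_system_ram_pages_cb(), not the kernel implementation): the callback sums the SYSTEM_RAM pages intersecting the range, and offlining is refused unless every page in the block is covered:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of an IORESOURCE_SYSTEM_RAM resource entry. */
struct ram_range_sim { unsigned long start_pfn, nr_pages; };

/* Stands in for walk_system_ram_range() + count_system_ram_pages_cb():
 * count how many pages of [start_pfn, start_pfn + nr_pages) are RAM. */
static unsigned long count_system_ram_pages_sim(unsigned long start_pfn,
						unsigned long nr_pages,
						const struct ram_range_sim *res,
						int nr_res)
{
	unsigned long count = 0;
	int i;

	for (i = 0; i < nr_res; i++) {
		unsigned long lo = res[i].start_pfn;
		unsigned long hi = lo + res[i].nr_pages;
		unsigned long s = start_pfn > lo ? start_pfn : lo;
		unsigned long e = start_pfn + nr_pages < hi ?
				  start_pfn + nr_pages : hi;

		if (e > s)
			count += e - s;
	}
	return count;
}

/* The patch's rule: a block may only be offlined (and, implicitly, only
 * ever have been onlined) if no page in it falls into a memory hole. */
static bool may_offline_sim(unsigned long start_pfn, unsigned long nr_pages,
			    const struct ram_range_sim *res, int nr_res)
{
	return count_system_ram_pages_sim(start_pfn, nr_pages, res, nr_res) ==
	       nr_pages;
}
```

For example, with RAM at pfns 0-99 and 200-299, offlining [0, 100) succeeds while offlining [50, 150) is rejected because pfns 100-149 are a hole.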
Re: [PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes
On 05.11.19 05:38, Dan Williams wrote: On Thu, Oct 24, 2019 at 5:11 AM David Hildenbrand wrote: Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change that. KVM has this weird use case that you can map anything from /dev/mem into the guest. pfn_valid() is not a reliable check whether the memmap was initialized and can be touched. pfn_to_online_page() makes sure that we have an initialized memmap (and don't have ZONE_DEVICE memory). Rewrite kvm_is_reserved_pfn() to make sure the function produces the same result once we stop setting ZONE_DEVICE pages PG_reserved. Cc: Paolo Bonzini Cc: "Radim Krčmář" Cc: Michal Hocko Cc: Dan Williams Cc: KarimAllah Ahmed Signed-off-by: David Hildenbrand --- virt/kvm/kvm_main.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index e9eb666eb6e8..9d18cc67d124 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -151,9 +151,15 @@ __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm, bool kvm_is_reserved_pfn(kvm_pfn_t pfn) { - if (pfn_valid(pfn)) - return PageReserved(pfn_to_page(pfn)); + struct page *page = pfn_to_online_page(pfn); + /* +* We treat any pages that are not online (not managed by the buddy) +* as reserved - this includes ZONE_DEVICE pages and pages without +* a memmap (e.g., mapped via /dev/mem). +*/ + if (page) + return PageReserved(page); return true; } So after this all the pfn_valid() usage in kvm_main.c is replaced with pfn_to_online_page()? Looks correct to me. However, I'm worried that kvm is taking reference on ZONE_DEVICE pages through some other path resulting in this: https://lore.kernel.org/linux-nvdimm/20190919154708.ga24...@angband.pl/ I'll see if this patch set modulates or maintains that failure mode. I'd assume that the behavior is unchanged. 
I think we get a reference to these ZONE_DEVICE pages via __get_user_pages_fast() and friends in hva_to_pfn_fast() and friends in virt/kvm/kvm_main.c -- Thanks, David / dhildenb
Re: [PATCH v1 00/10] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
On 24.10.19 14:09, David Hildenbrand wrote: This is the result of a recent discussion with Michal ([1], [2]). Right now we set all pages PG_reserved when initializing hotplugged memmaps. This includes ZONE_DEVICE memory. In case of system memory, PG_reserved is cleared again when onlining the memory, in case of ZONE_DEVICE memory never. In ancient times, we needed PG_reserved, because there was no way to tell whether the memmap was already properly initialized. We now have SECTION_IS_ONLINE for that in the case of !ZONE_DEVICE memory. ZONE_DEVICE memory is already initialized deferred, and there shouldn't be a visible change in that regard. One of the biggest fears was side effects. I went ahead and audited all users of PageReserved(). The details can be found in "mm/memory_hotplug: Don't mark pages PG_reserved when initializing the memmap". This patch set adapts all relevant users of PageReserved() to keep the existing behavior in respect to ZONE_DEVICE pages. The biggest part that needs changes is KVM, to keep the existing behavior (that's all I care about in this series). Note that this series is able to rely completely on pfn_to_online_page(). No new is_zone_device_page() calls are introduced (as requested by Dan). We are currently discussing a way to mark also ZONE_DEVICE memmaps as active/initialized - pfn_active() - and lightweight locking to make sure memmaps remain active (e.g., using RCU). We might later be able to convert some users of pfn_to_online_page() to pfn_active(). Details can be found in [3], however, this represents yet another cleanup/fix we'll perform on top of this cleanup. I only gave it a quick test with DIMMs on x86-64, but didn't test the ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Also, I didn't test the KVM parts (especially with ZONE_DEVICE pages or no memmap at all). Compile-tested on x86-64 and PPC. Jeff Moyer ran some NVDIMM test cases for me (thanks!!!), including xfstests, pmdk, and ndctl. 
No regressions found. I will run some KVM tests, especially NVDIMM passthrough, but will have to set up a test environment first. I would appreciate some review in the meantime. :) -- Thanks, David / dhildenb
Re: [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
On 23.10.19 09:26, David Hildenbrand wrote: On 22.10.19 23:54, Dan Williams wrote: Hi David, Thanks for tackling this! Thanks for having a look :) [...] I am probably a little bit too careful (but I don't want to break things). In most places (besides KVM and vfio that are nuts), the pfn_to_online_page() check could most probably be avoided by a is_zone_device_page() check. However, I usually get suspicious when I see a pfn_valid() check (especially after I learned that people mmap parts of /dev/mem into user space, including memory without memmaps. Also, people could memmap offline memory blocks this way :/). As long as this does not hurt performance, I think we should rather do it the clean way. I'm concerned about using is_zone_device_page() in places that are not known to already have a reference to the page. Here's an audit of current usages, and the ones I think need to cleaned up. The "unsafe" ones do not appear to have any protections against the device page being removed (get_dev_pagemap()). Yes, some of these were added by me. The "unsafe? HMM" ones need HMM eyes because HMM leaks device pages into anonymous memory paths and I'm not up to speed on how it guarantees 'struct page' validity vs device shutdown without using get_dev_pagemap(). smaps_pmd_entry(): unsafe put_devmap_managed_page(): safe, page reference is held is_device_private_page(): safe? gpu driver manages private page lifetime is_pci_p2pdma_page(): safe, page reference is held uncharge_page(): unsafe? HMM add_to_kill(): safe, protected by get_dev_pagemap() and dax_lock_page() soft_offline_page(): unsafe remove_migration_pte(): unsafe? HMM move_to_new_page(): unsafe? HMM migrate_vma_pages() and helpers: unsafe? HMM try_to_unmap_one(): unsafe? HMM __put_page(): safe release_pages(): safe I'm hoping all the HMM ones can be converted to is_device_private_page() directlly and have that routine grow a nice comment about how it knows it can always safely de-reference its @page argument. 
For the rest I'd like to propose that we add a facility to determine ZONE_DEVICE by pfn rather than page. The most straightforward way I can think of would be to just add another bitmap to mem_section_usage to indicate if a subsection is ZONE_DEVICE or not. (it's a somewhat unrelated bigger discussion, but we can start discussing it in this thread) I dislike this for three reasons: a) It does not protect against any races, really, it does not improve things. b) We do have the exact same problem with pfn_to_online_page(). As long as we don't hold the memory hotplug lock, memory can get offlined and removed at any time. Racy. c) We mix in ZONE specific stuff into the core. It should be "just another zone" What I propose instead (already discussed in https://lkml.org/lkml/2019/10/10/87) 1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE 2. Convert SECTION_IS_ACTIVE to a subsection bitmap 3. Introduce pfn_active() that checks against the subsection bitmap 4. Once the memmap was initialized / prepared, set the subsection active (similar to SECTION_IS_ONLINE in the buddy right now) 5. Before the memmap gets invalidated, set the subsection inactive (similar to SECTION_IS_ONLINE in the buddy right now) 6. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE 7. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE Dan, I am suspecting that you want a pfn_to_zone() that will not touch the memmap, because it could potentially (altmap) lie on slow memory, right? A modification might make this possible (but I am not yet sure if we want a less generic MM implementation just to fine tune slow memmap access here) 1. Keep SECTION_IS_ONLINE as it is with the same semantics 2. Introduce a subsection bitmap to record active ("initialized memmap") PFNs. E.g., also set it when setting sections online. 3. Introduce pfn_active() that checks against the subsection bitmap 4. 
Once the memmap was initialized / prepared, set the subsection active (similar to SECTION_IS_ONLINE in the buddy right now) 5. Before the memmap gets invalidated, set the subsection inactive (similar to SECTION_IS_ONLINE in the buddy right now) 6. pfn_to_online_page() = pfn_active() && section == SECTION_IS_ONLINE (or keep it as is, depends on the RCU locking we eventually implement) 7. pfn_to_device_page() = pfn_active() && section != SECTION_IS_ONLINE 8. use pfn_active() whenever we don't care about the zone. Again, not really a friend of that, it hardcodes ZONE_DEVICE vs. !ZONE_DEVICE. When we do a random "pfn_to_page()" (e.g., a pfn walker) we really want to touch the memmap right away either way. So we can also directly read the zone from it. I really do prefer a more generic implementation right now. -- Thanks, David / dhildenb
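For illustration, the proposed subsection-active tracking could look roughly like the userspace sketch below. The sizes mirror x86-64 sparsemem (2 MiB subsections, 64 per 128 MiB section) and every `_sim` name is an assumption for the sake of the example, not existing kernel code; the real implementation would live in the sparsemem core and need the RCU protection discussed in the thread:

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed geometry: 512 pages per subsection, 64 subsections/section. */
#define PAGES_PER_SUBSECTION_SIM	(1UL << 9)
#define SUBSECTIONS_PER_SECTION_SIM	64

struct mem_section_usage_sim {
	unsigned long subsection_active_bitmap;	/* one bit per subsection */
};

static unsigned long pfn_to_subsection_sim(unsigned long pfn)
{
	return (pfn / PAGES_PER_SUBSECTION_SIM) % SUBSECTIONS_PER_SECTION_SIM;
}

/* Steps 4/5 of the proposal: set the bit once the memmap is initialized,
 * clear it again before the memmap gets invalidated. */
static void subsection_set_active_sim(struct mem_section_usage_sim *usage,
				      unsigned long pfn)
{
	usage->subsection_active_bitmap |= 1UL << pfn_to_subsection_sim(pfn);
}

/* Step 3: pfn_active() consults only the bitmap, never the memmap, so it
 * could stay cheap even when the memmap (altmap) lives on slow memory. */
static bool pfn_active_sim(const struct mem_section_usage_sim *usage,
			   unsigned long pfn)
{
	return usage->subsection_active_bitmap &
	       (1UL << pfn_to_subsection_sim(pfn));
}
```

pfn_to_online_page()/pfn_to_device_page() would then combine this check with the section's online state (or the zone), as in the numbered steps above.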
[PATCH v1 09/10] mm/memory_hotplug: Don't mark pages PG_reserved when initializing the memmap
Everything should be prepared to stop setting pages PG_reserved when initializing the memmap on memory hotplug. Most importantly, we stop marking ZONE_DEVICE pages PG_reserved. a) We made sure that any code that relied on PG_reserved to detect ZONE_DEVICE memory will no longer rely on PG_reserved (especially, by relying on pfn_to_online_page() for now). Details can be found below. b) We made sure that memory blocks with holes cannot be offlined and therefore also not onlined. We have quite some code that relies on memory holes being marked PG_reserved. This is now not an issue anymore. generic_online_page() still calls __free_pages_core(), which performs __ClearPageReserved(p). AFAIKS, this should not hurt. It is worth noting that the users of online_page_callback_t might see a change. E.g., until now, pages not freed to the buddy by the HyperV balloon were set PG_reserved until freed via generic_online_page(). Now, they would look like ordinarily allocated pages (refcount == 1). This callback is used by the XEN balloon and the HyperV balloon. To not introduce any silent errors, keep marking the pages PG_reserved. We can most probably stop doing that, but have to double check if there are issues (e.g., offlining code aborts right away in has_unmovable_pages() when it runs into a PageReserved(page)) Update the documentation at various places in the MM core. There are three PageReserved() users that might be affected by this change. 
- drivers/staging/gasket/gasket_page_table.c:gasket_release_page() -> We might (unlikely) set SetPageDirty() on a ZONE_DEVICE page -> I assume "we don't care" - drivers/staging/kpc2000/kpc_dma/fileops.c:transfer_complete_cb() -> We might (unlikely) set SetPageDirty() on a ZONE_DEVICE page -> I assume "we don't care" - mm/usercopy.c: check_page_span() -> According to Dan, non-HMM ZONE_DEVICE usage excluded this code since commit 52f476a323f9 ("libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead") -> It is unclear whether we really cared about ZONE_DEVICE here (HMM) or simply about "PG_reserved". The worst thing that could happen is a false negative with CONFIG_HARDENED_USERCOPY we should be able to identify easily. -> There is a discussion to rip out that code completely -> I assume "not relevant" / "we don't care" I audited the other PageReserved() users. They don't affect ZONE_DEVICE: - mm/page_owner.c:pagetypeinfo_showmixedcount_print() -> Never called for ZONE_DEVICE, (+ pfn_to_online_page(pfn)) - mm/page_owner.c:init_pages_in_zone() -> Never called for ZONE_DEVICE (!populated_zone(zone)) - mm/page_ext.c:free_page_ext() -> Only a BUG_ON(PageReserved(page)), not relevant - mm/page_ext.c:has_unmovable_pages() -> Not relevant for ZONE_DEVICE - mm/page_ext.c:pfn_range_valid_contig() -> pfn_to_online_page() already guards us - mm/mempolicy.c:queue_pages_pte_range() -> vm_normal_page() checks against pte_devmap() - mm/memory-failure.c:hwpoison_user_mappings() -> Not reached via memory_failure() due to pfn_to_online_page() -> Also not reached indirectly via memory_failure_hugetlb() - mm/hugetlb.c:gather_bootmem_prealloc() -> Only a WARN_ON(PageReserved(page)), not relevant - kernel/power/snapshot.c:saveable_highmem_page() -> pfn_to_online_page() already guards us - kernel/power/snapshot.c:saveable_page() -> pfn_to_online_page() already guards us - fs/proc/task_mmu.c:can_gather_numa_stats() -> vm_normal_page() checks against pte_devmap() - 
fs/proc/task_mmu.c:can_gather_numa_stats_pmd -> vm_normal_page_pmd() checks against pte_devmap() - fs/proc/page.c:stable_page_flags() -> The reserved bit is simply copied, irrelevant - drivers/firmware/memmap.c:release_firmware_map_entry() -> really only a check to detect bootmem. Not relevant for ZONE_DEVICE - arch/ia64/kernel/mca_drv.c - arch/mips/mm/init.c - arch/mips/mm/ioremap.c - arch/nios2/mm/ioremap.c - arch/parisc/mm/ioremap.c - arch/sparc/mm/tlb.c - arch/xtensa/mm/cache.c -> No ZONE_DEVICE support - arch/powerpc/mm/init_64.c:vmemmap_free() -> Special-cases memmap on altmap -> Only a check for bootmem - arch/x86/kernel/alternative.c:__text_poke() -> Only a WARN_ON(!PageReserved(pages[0])) to verify it is bootmem - arch/x86/mm/init_64.c -> Only a check for bootmem Cc: "K. Y. Srinivasan" Cc: Haiyang Zhang Cc: Stephen Hemminger Cc: Sasha Levin Cc: Boris Ostrovsky Cc: Juergen Gross Cc: Stefano Stabellini Cc: Andrew Morton Cc: Alexander Duyck Cc: Pavel Tatashin Cc: Vlastimil Babka Cc: Johannes Weiner Cc: Anthony Yznaga Cc: Michal Hocko Cc: Oscar Salvador Cc: Dan Williams Cc: Mel Gorman Cc: Mike Rapoport Cc: Anshuman Khandual Cc: Matt Sickler Cc: Kees Cook Suggested-by: Michal Hocko Signed-off
[PATCH v1 08/10] x86/mm: Prepare __ioremap_check_ram() for PG_reserved changes
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change that. Rewrite __ioremap_check_ram() to make sure the function produces the same result once we stop setting ZONE_DEVICE pages PG_reserved. Cc: Dave Hansen Cc: Andy Lutomirski Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: "H. Peter Anvin" Signed-off-by: David Hildenbrand --- arch/x86/mm/ioremap.c | 13 ++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c index a39dcdb5ae34..db6913b48edf 100644 --- a/arch/x86/mm/ioremap.c +++ b/arch/x86/mm/ioremap.c @@ -77,10 +77,17 @@ static unsigned int __ioremap_check_ram(struct resource *res) start_pfn = (res->start + PAGE_SIZE - 1) >> PAGE_SHIFT; stop_pfn = (res->end + 1) >> PAGE_SHIFT; if (stop_pfn > start_pfn) { - for (i = 0; i < (stop_pfn - start_pfn); ++i) - if (pfn_valid(start_pfn + i) && - !PageReserved(pfn_to_page(start_pfn + i))) + for (i = 0; i < (stop_pfn - start_pfn); ++i) { + struct page *page; +/* + * We treat any pages that are not online (not managed + * by the buddy) as not being RAM. This includes + * ZONE_DEVICE pages. + */ + page = pfn_to_online_page(start_pfn + i); + if (page && !PageReserved(page)) return IORES_MAP_SYSTEM_RAM; + } } return 0; -- 2.21.0 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH v1 07/10] powerpc/mm: Prepare maybe_pte_to_page() for PG_reserved changes
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change that. Rewrite maybe_pte_to_page() to make sure the function produces the same result once we stop setting ZONE_DEVICE pages PG_reserved. Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Michael Ellerman Cc: Christophe Leroy Cc: "Aneesh Kumar K.V" Cc: Allison Randal Cc: Nicholas Piggin Cc: Thomas Gleixner Signed-off-by: David Hildenbrand --- arch/powerpc/mm/pgtable.c | 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c index e3759b69f81b..613c98fa7dc0 100644 --- a/arch/powerpc/mm/pgtable.c +++ b/arch/powerpc/mm/pgtable.c @@ -55,10 +55,12 @@ static struct page *maybe_pte_to_page(pte_t pte) unsigned long pfn = pte_pfn(pte); struct page *page; - if (unlikely(!pfn_valid(pfn))) - return NULL; - page = pfn_to_page(pfn); - if (PageReserved(page)) + /* +* We reject any pages that are not online (not managed by the buddy). +* This includes ZONE_DEVICE pages. +*/ + page = pfn_to_online_page(pfn); + if (unlikely(!page || PageReserved(page))) return NULL; return page; } -- 2.21.0 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH v1 10/10] mm/usercopy.c: Update comment in check_page_span() regarding ZONE_DEVICE
ZONE_DEVICE (a.k.a. device memory) is no longer marked PG_reserved. Update the comment. While at it, make it match what the code is actually doing (reject vs. accept). Cc: Kees Cook Cc: Andrew Morton Cc: "Isaac J. Manjarres" Cc: "Matthew Wilcox (Oracle)" Cc: Qian Cai Cc: Thomas Gleixner Signed-off-by: David Hildenbrand --- mm/usercopy.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/mm/usercopy.c b/mm/usercopy.c index 660717a1ea5c..80f254024c97 100644 --- a/mm/usercopy.c +++ b/mm/usercopy.c @@ -199,9 +199,9 @@ static inline void check_page_span(const void *ptr, unsigned long n, return; /* -* Reject if range is entirely either Reserved (i.e. special or -* device memory), or CMA. Otherwise, reject since the object spans -* several independently allocated pages. +* Accept if the range is entirely either Reserved ("special") or +* CMA. Otherwise, reject since the object spans several independently +* allocated pages. */ is_reserved = PageReserved(page); is_cma = is_migrate_cma_page(page); -- 2.21.0
[PATCH v1 06/10] powerpc/64s: Prepare hash_page_do_lazy_icache() for PG_reserved changes
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change that. Rewrite hash_page_do_lazy_icache() to make sure the function produces the same result once we stop setting ZONE_DEVICE pages PG_reserved. Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Michael Ellerman Cc: "Aneesh Kumar K.V" Cc: Christophe Leroy Cc: Nicholas Piggin Cc: Andrew Morton Cc: Mike Rapoport Cc: YueHaibing Signed-off-by: David Hildenbrand --- arch/powerpc/mm/book3s64/hash_utils.c | 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c index 6c123760164e..a1566039e747 100644 --- a/arch/powerpc/mm/book3s64/hash_utils.c +++ b/arch/powerpc/mm/book3s64/hash_utils.c @@ -1084,13 +1084,15 @@ void hash__early_init_mmu_secondary(void) */ unsigned int hash_page_do_lazy_icache(unsigned int pp, pte_t pte, int trap) { - struct page *page; + struct page *page = pfn_to_online_page(pte_pfn(pte)); - if (!pfn_valid(pte_pfn(pte))) + /* +* We ignore any pages that are not online (not managed by the buddy). +* This includes ZONE_DEVICE pages. +*/ + if (!page) return pp; - page = pte_page(pte); - /* page is dirty */ if (!test_bit(PG_arch_1, >flags) && !PageReserved(page)) { if (trap == 0x400) { -- 2.21.0 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH v1 04/10] vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change that. KVM has this weird use case that you can map anything from /dev/mem into the guest. pfn_valid() is not a reliable check whether the memmap was initialized and can be touched. pfn_to_online_page() makes sure that we have an initialized memmap (and don't have ZONE_DEVICE memory). Rewrite is_invalid_reserved_pfn() similar to kvm_is_reserved_pfn() to make sure the function produces the same result once we stop setting ZONE_DEVICE pages PG_reserved. Cc: Alex Williamson Cc: Cornelia Huck Signed-off-by: David Hildenbrand --- drivers/vfio/vfio_iommu_type1.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index 2ada8e6cdb88..f8ce8c408ba8 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -299,9 +299,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async) */ static bool is_invalid_reserved_pfn(unsigned long pfn) { - if (pfn_valid(pfn)) - return PageReserved(pfn_to_page(pfn)); + struct page *page = pfn_to_online_page(pfn); + /* +* We treat any pages that are not online (not managed by the buddy) +* as reserved - this includes ZONE_DEVICE pages and pages without +* a memmap (e.g., mapped via /dev/mem). +*/ + if (page) + return PageReserved(page); return true; } -- 2.21.0 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH v1 05/10] powerpc/book3s: Prepare kvmppc_book3s_instantiate_page() for PG_reserved changes
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change that. KVM has this weird use case that you can map anything from /dev/mem into the guest. pfn_valid() is not a reliable check whether the memmap was initialized and can be touched. pfn_to_online_page() makes sure that we have an initialized memmap (and don't have ZONE_DEVICE memory). Rewrite kvmppc_book3s_instantiate_page() similar to kvm_is_reserved_pfn() to make sure the function produces the same result once we stop setting ZONE_DEVICE pages PG_reserved. Cc: Paul Mackerras Cc: Benjamin Herrenschmidt Cc: Michael Ellerman Signed-off-by: David Hildenbrand --- arch/powerpc/kvm/book3s_64_mmu_radix.c | 14 -- 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c index 2d415c36a61d..05397c0561fc 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c @@ -801,12 +801,14 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu, writing, upgrade_p); if (is_error_noslot_pfn(pfn)) return -EFAULT; - page = NULL; - if (pfn_valid(pfn)) { - page = pfn_to_page(pfn); - if (PageReserved(page)) - page = NULL; - } + /* +* We treat any pages that are not online (not managed by the +* buddy) as reserved - this includes ZONE_DEVICE pages and +* pages without a memmap (e.g., mapped via /dev/mem). +*/ + page = pfn_to_online_page(pfn); + if (page && PageReserved(page)) + page = NULL; } /* -- 2.21.0 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change that.

KVM has this weird use case that you can map anything from /dev/mem into the guest. pfn_valid() is not a reliable check whether the memmap was initialized and can be touched. pfn_to_online_page() makes sure that we have an initialized memmap (and don't have ZONE_DEVICE memory).

Rewrite kvm_is_reserved_pfn() to make sure the function produces the same result once we stop setting ZONE_DEVICE pages PG_reserved.

Cc: Paolo Bonzini
Cc: "Radim Krčmář"
Cc: Michal Hocko
Cc: Dan Williams
Cc: KarimAllah Ahmed
Signed-off-by: David Hildenbrand
---
 virt/kvm/kvm_main.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e9eb666eb6e8..9d18cc67d124 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -151,9 +151,15 @@ __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,

 bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
 {
-	if (pfn_valid(pfn))
-		return PageReserved(pfn_to_page(pfn));
+	struct page *page = pfn_to_online_page(pfn);

+	/*
+	 * We treat any pages that are not online (not managed by the buddy)
+	 * as reserved - this includes ZONE_DEVICE pages and pages without
+	 * a memmap (e.g., mapped via /dev/mem).
+	 */
+	if (page)
+		return PageReserved(page);
 	return true;
 }

-- 
2.21.0
[PATCH v1 00/10] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
This is the result of a recent discussion with Michal ([1], [2]). Right now we set all pages PG_reserved when initializing hotplugged memmaps. This includes ZONE_DEVICE memory. In case of system memory, PG_reserved is cleared again when onlining the memory; in case of ZONE_DEVICE memory, never. In ancient times, we needed PG_reserved because there was no way to tell whether the memmap was already properly initialized. We now have SECTION_IS_ONLINE for that in the case of !ZONE_DEVICE memory. ZONE_DEVICE memory is already initialized deferred, and there shouldn't be a visible change in that regard.

One of the biggest fears was side effects. I went ahead and audited all users of PageReserved(). The details can be found in "mm/memory_hotplug: Don't mark pages PG_reserved when initializing the memmap". This patch set adapts all relevant users of PageReserved() to keep the existing behavior with respect to ZONE_DEVICE pages. The biggest part that needs changes is KVM, to keep the existing behavior (that's all I care about in this series).

Note that this series is able to rely completely on pfn_to_online_page(). No new is_zone_device_page() calls are introduced (as requested by Dan). We are currently discussing a way to mark also ZONE_DEVICE memmaps as active/initialized - pfn_active() - and lightweight locking to make sure memmaps remain active (e.g., using RCU). We might later be able to convert some users of pfn_to_online_page() to pfn_active(). Details can be found in [3]; however, this represents yet another cleanup/fix we'll perform on top of this cleanup.

I only gave it a quick test with DIMMs on x86-64, but didn't test the ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Also, I didn't test the KVM parts (especially with ZONE_DEVICE pages or no memmap at all). Compile-tested on x86-64 and PPC. Based on next/master.
The current version (kept updated) can be found at:
 https://github.com/davidhildenbrand/linux.git online_reserved_cleanup

RFC -> v1:
- Dropped "staging/gasket: Prepare gasket_release_page() for PG_reserved changes"
- Dropped "staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes"
- Converted "mm/usercopy.c: Prepare check_page_span() for PG_reserved changes" to "mm/usercopy.c: Update comment in check_page_span() regarding ZONE_DEVICE"
- No new users of is_zone_device_page() are introduced.
- Rephrased comments and patch descriptions.

[1] https://lkml.org/lkml/2019/10/21/736
[2] https://lkml.org/lkml/2019/10/21/1034
[3] https://www.spinics.net/lists/linux-mm/msg194112.html

Cc: Michal Hocko
Cc: Dan Williams
Cc: kvm-...@vger.kernel.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: k...@vger.kernel.org
Cc: linux-hyp...@vger.kernel.org
Cc: de...@driverdev.osuosl.org
Cc: xen-de...@lists.xenproject.org
Cc: x...@kernel.org
Cc: Alexander Duyck

David Hildenbrand (10):
  mm/memory_hotplug: Don't allow to online/offline memory blocks with holes
  KVM: x86/mmu: Prepare kvm_is_mmio_pfn() for PG_reserved changes
  KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes
  vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes
  powerpc/book3s: Prepare kvmppc_book3s_instantiate_page() for PG_reserved changes
  powerpc/64s: Prepare hash_page_do_lazy_icache() for PG_reserved changes
  powerpc/mm: Prepare maybe_pte_to_page() for PG_reserved changes
  x86/mm: Prepare __ioremap_check_ram() for PG_reserved changes
  mm/memory_hotplug: Don't mark pages PG_reserved when initializing the memmap
  mm/usercopy.c: Update comment in check_page_span() regarding ZONE_DEVICE

 arch/powerpc/kvm/book3s_64_mmu_radix.c | 14 +
 arch/powerpc/mm/book3s64/hash_utils.c  | 10 +++---
 arch/powerpc/mm/pgtable.c              | 10 +++---
 arch/x86/kvm/mmu.c                     | 29 ++---
 arch/x86/mm/ioremap.c                  | 13 ++--
 drivers/hv/hv_balloon.c                |  6
 drivers/vfio/vfio_iommu_type1.c        | 10 --
 drivers/xen/balloon.c                  |  7 +
 include/linux/page-flags.h             |  8 +
 mm/memory_hotplug.c                    | 43 +++---
 mm/page_alloc.c                        | 11 ---
 mm/usercopy.c                          |  6 ++--
 virt/kvm/kvm_main.c                    | 10 --
 13 files changed, 111 insertions(+), 66 deletions(-)

-- 
2.21.0
[PATCH v1 02/10] KVM: x86/mmu: Prepare kvm_is_mmio_pfn() for PG_reserved changes
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change that.

KVM has this weird use case that you can map anything from /dev/mem into the guest. pfn_valid() is not a reliable check whether the memmap was initialized and can be touched. pfn_to_online_page() makes sure that we have an initialized memmap (and don't have ZONE_DEVICE memory).

Rewrite kvm_is_mmio_pfn() to make sure the function produces the same result once we stop setting ZONE_DEVICE pages PG_reserved.

Cc: Paolo Bonzini
Cc: "Radim Krčmář"
Cc: Sean Christopherson
Cc: Vitaly Kuznetsov
Cc: Wanpeng Li
Cc: Jim Mattson
Cc: Joerg Roedel
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Borislav Petkov
Cc: "H. Peter Anvin"
Cc: KarimAllah Ahmed
Cc: Michal Hocko
Cc: Dan Williams
Signed-off-by: David Hildenbrand
---
 arch/x86/kvm/mmu.c | 29 +++++++++++++++++------------
 1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 24c23c66b226..f03089a336de 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2962,20 +2962,25 @@ static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,

 static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
 {
+	struct page *page = pfn_to_online_page(pfn);
+
+	/*
+	 * ZONE_DEVICE pages are never online. Online pages that are reserved
+	 * either indicate the zero page or MMIO pages.
+	 */
+	if (page)
+		return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
+
+	/*
+	 * Anything with a valid (but not online) memmap could be ZONE_DEVICE.
+	 * Treat only UC/UC-/WC pages as MMIO.
+	 */
 	if (pfn_valid(pfn))
-		return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn)) &&
-			/*
-			 * Some reserved pages, such as those from NVDIMM
-			 * DAX devices, are not for MMIO, and can be mapped
-			 * with cached memory type for better performance.
-			 * However, the above check misconceives those pages
-			 * as MMIO, and results in KVM mapping them with UC
-			 * memory type, which would hurt the performance.
-			 * Therefore, we check the host memory type in addition
-			 * and only treat UC/UC-/WC pages as MMIO.
-			 */
-			(!pat_enabled() || pat_pfn_immune_to_uc_mtrr(pfn));
+		return !pat_enabled() || pat_pfn_immune_to_uc_mtrr(pfn);

+	/*
+	 * Any RAM that has no memmap (e.g., mapped via /dev/mem) is not MMIO.
+	 */
 	return !e820__mapped_raw_any(pfn_to_hpa(pfn),
 				     pfn_to_hpa(pfn + 1) - 1,
 				     E820_TYPE_RAM);
-- 
2.21.0
[PATCH v1 01/10] mm/memory_hotplug: Don't allow to online/offline memory blocks with holes
Our onlining/offlining code is unnecessarily complicated. Only memory blocks added during boot can have holes (a range that is not IORESOURCE_SYSTEM_RAM). Hotplugged memory never has holes (e.g., see add_memory_resource()). All boot memory is already online.

Therefore, when we stop allowing to offline memory blocks with holes, we implicitly no longer have to deal with onlining memory blocks with holes. This allows to simplify the code. For example, we no longer have to worry about marking pages that fall into memory holes PG_reserved when onlining memory. We can stop setting pages PG_reserved.

Offlining memory blocks added during boot is usually not guaranteed to work either way (unmovable data might have easily ended up on that memory during boot). So stopping to do that should not really hurt (+ people are not even aware of a setup where that used to work, and the existing code is not known to still work correctly with memory holes). For the use case of offlining memory to unplug DIMMs, we should see no change (holes on DIMMs would be weird).

Please note that hardware errors (PG_hwpoison) are not memory holes and are not affected by this change when offlining.
Cc: Andrew Morton
Cc: Michal Hocko
Cc: Oscar Salvador
Cc: Pavel Tatashin
Cc: Dan Williams
Cc: Anshuman Khandual
Signed-off-by: David Hildenbrand
---
 mm/memory_hotplug.c | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 561371ead39a..8d81730cf036 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1447,10 +1447,19 @@ static void node_states_clear_node(int node, struct memory_notify *arg)
 		node_clear_state(node, N_MEMORY);
 }

+static int count_system_ram_pages_cb(unsigned long start_pfn,
+				     unsigned long nr_pages, void *data)
+{
+	unsigned long *nr_system_ram_pages = data;
+
+	*nr_system_ram_pages += nr_pages;
+	return 0;
+}
+
 static int __ref __offline_pages(unsigned long start_pfn,
 				 unsigned long end_pfn)
 {
-	unsigned long pfn, nr_pages;
+	unsigned long pfn, nr_pages = 0;
 	unsigned long offlined_pages = 0;
 	int ret, node, nr_isolate_pageblock;
 	unsigned long flags;
@@ -1461,6 +1470,20 @@ static int __ref __offline_pages(unsigned long start_pfn,

 	mem_hotplug_begin();

+	/*
+	 * Don't allow to offline memory blocks that contain holes.
+	 * Consequently, memory blocks with holes can never get onlined
+	 * (hotplugged memory has no holes and all boot memory is online).
+	 * This allows to simplify the onlining/offlining code quite a lot.
+	 */
+	walk_system_ram_range(start_pfn, end_pfn - start_pfn, &nr_pages,
+			      count_system_ram_pages_cb);
+	if (nr_pages != end_pfn - start_pfn) {
+		ret = -EINVAL;
+		reason = "memory holes";
+		goto failed_removal;
+	}
+
 	/* This makes hotplug much easier...and readable.
 	   we assume this for now. .*/
 	if (!test_pages_in_a_zone(start_pfn, end_pfn, &valid_start,
@@ -1472,7 +1495,6 @@ static int __ref __offline_pages(unsigned long start_pfn,
 	zone = page_zone(pfn_to_page(valid_start));
 	node = zone_to_nid(zone);
-	nr_pages = end_pfn - start_pfn;

 	/* set above range as isolated */
 	ret = start_isolate_page_range(start_pfn, end_pfn,
-- 
2.21.0
Re: [PATCH RFC v1 01/12] mm/memory_hotplug: Don't allow to online/offline memory blocks with holes
On 24.10.19 05:53, Anshuman Khandual wrote:
> On 10/22/2019 10:42 PM, David Hildenbrand wrote:
>> Our onlining/offlining code is unnecessarily complicated. Only memory
>> blocks added during boot can have holes. Hotplugged memory never has
>> holes. That memory is already online.
>
> Why hot plugged memory at runtime cannot have holes (e.g a semi bad DIMM).

Important: HWPoison != memory hole

A memory hole is memory that is not "IORESOURCE_SYSRAM". These pages are currently marked PG_reserved. Such holes are sometimes used for mapping something into kernel space. Some archs use the PG_reserved to detect the memory hole ("not ram") and ignore the memmap.

Poisoned pages are marked PG_hwpoison.

> Currently, do we just abort adding that memory block if there are holes ?

There is no interface to do that. E.g., have a look at add_memory() / add_memory_resource(). You can only pass one memory resource (that is all IORESOURCE_SYSRAM | IORESOURCE_BUSY). Hotplugging memory with holes is not supported (nor can I imagine a use case for that).

>> When we stop allowing to offline memory blocks with holes, we implicitly
>> stop to online memory blocks with holes.
>
> Reducing hotplug support for memory blocks with holes just to simplify
> the code. Is it worth it?

Michal and I are not aware of any users, not even aware of a use case. Keeping code around that nobody really needs, and that limits cleanups - no thanks. Similar to us not supporting to offline memory blocks that span multiple nodes/zones.

E.g., have a look at the isolation code. It is full of code that jumps over memory holes (start_isolate_page_range() -> __first_valid_page()). That made sense for our complicated memory offlining code, but it is actually harmful when dealing with alloc_contig_range(). Allocation never wants to jump over memory holes. After this patch, we can just fail hard on any memory hole we detect, instead of ignoring it (or special-casing it). This allows to simplify the code.
>> For example, we no longer have to worry about marking pages that fall
>> into memory holes PG_reserved when onlining memory. We can stop setting
>> pages PG_reserved.
>
> Could not there be any other way of tracking these holes if not the page
> reserved bit. In the memory section itself and corresponding struct pages
> just remained poisoned ? Just wondering, might be all wrong here.

Of course there could be ways (e.g., using PG_offline eventually), but it boils down to us having to deal with it in onlining/offlining code. And that is some handling nobody really seems to need.

>> Offlining memory blocks added during boot is usually not guaranteed to
>> work either way. So stopping to do that (if anybody really used and tested
>
> That guarantee does not exist right now because how boot memory could have
> been used after boot, not from a limitation of the memory hot remove itself.

Yep. However, Michal and I are not even aware of a setup that would make this work and guarantee that the existing code actually still is able to deal with holes. Are you?

>> this over the years) should not really hurt. For the use case of
>> offlining memory to unplug DIMMs, we should see no change. (holes on
>> DIMMs would be weird)
>
> Holes on DIMM could be due to HW errors affecting only parts of it. By not

Again, HW errors != holes. We have PG_hwpoison for that.

> allowing such DIMM's hot add and remove, we are definitely reducing the
> scope of overall hotplug functionality. Is code simplification in itself
> worth this reduction in functionality ?

What you describe is not affected. Thanks!

-- 
Thanks,
David / dhildenb
Re: [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
On 23.10.19 21:39, Dan Williams wrote: > On Wed, Oct 23, 2019 at 10:28 AM David Hildenbrand wrote: >> >>>> I dislike this for three reasons >>>> >>>> a) It does not protect against any races, really, it does not improve >>>> things. >>>> b) We do have the exact same problem with pfn_to_online_page(). As long as >>>> we >>>>don't hold the memory hotplug lock, memory can get offlined and remove >>>> any time. Racy. >>> >>> True, we need to solve that problem too. That seems to want something >>> lighter weight than the hotplug lock that can be held over pfn lookups >>> + use rather than requiring a page lookup in paths where it's not >>> clear that a page reference would prevent unplug. >>> >>>> c) We mix in ZONE specific stuff into the core. It should be "just another >>>> zone" >>> >>> Not sure I grok this when the RFC is sprinkling zone-specific >>> is_zone_device_page() throughout the core? >> >> Most users should not care about the zone. pfn_active() would be enough >> in most situations, especially most PFN walkers - "this memmap is valid >> and e.g., contains a valid zone ...". > > Oh, I see, you're saying convert most users to pfn_active() (and some > TBD rcu locking), but only pfn_to_online_page() users would need the > zone lookup? I can get on board with that. I guess my answer to that is simple: If we only care about "is this memmap safe to touch", use pfn_active() (well, with pfn_valid_within() similar as done in pfn_to_online_page() due to memory holes, but these are details - e.g., pfn_active() can check against pfn_valid_within() right away internally). (+locking TBD to make sure it remains active) However, if we want to special case in addition on zones (!ZONE_DEVICE (a.k.a., onlined via memory blocks, managed by the buddy), ZONE_DEVICE, whatever might come in the future, ...), also access the zone stored in the memmap. E.g., by using pfn_to_online_page(). 
> >> >>> >>>> >>>> What I propose instead (already discussed in >>>> https://lkml.org/lkml/2019/10/10/87) >>> >>> Sorry I missed this earlier... >>> >>>> >>>> 1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE >>>> 2. Convert SECTION_IS_ACTIVE to a subsection bitmap >>>> 3. Introduce pfn_active() that checks against the subsection bitmap >>>> 4. Once the memmap was initialized / prepared, set the subsection active >>>>(similar to SECTION_IS_ONLINE in the buddy right now) >>>> 5. Before the memmap gets invalidated, set the subsection inactive >>>>(similar to SECTION_IS_ONLINE in the buddy right now) >>>> 5. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE >>>> 6. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE >>> >>> This does not seem to reduce any complexity because it still requires >>> a pfn to zone lookup at the end of the process. >>> >>> I.e. converting pfn_to_online_page() to use a new pfn_active() >>> subsection map plus looking up the zone from pfn_to_page() is more >>> steps than just doing a direct pfn to zone lookup. What am I missing? >> >> That a real "pfn to zone" lookup without going via the struct page will >> require to have more than just a single bitmap. IMHO, keeping the >> information at a single place (memmap) is the clean thing to do (not >> replicating it somewhere else). Going via the memmap might not be as >> fast as a direct lookup, but do we actually care? We are already looking >> at "random PFNs we are not sure if there is a valid memmap". > > True, we only care about the validity of the check, and as you pointed > out moving the check to the pfn level does not solve the validity > race. It needs a lock. Let's call pfn_active() "a pfn that is active in the system and has an initialized memmap, which contains sane values" (valid memmap sounds like pfn_valid(), which is actually "there is a memmap which might contain garbage"). Yes we need some sort of lightweight locking as discussed. [...] 
>>>> However, I think we also have to clarify if we need the change in 3 at all. >>>> It comes from >>>> >>>> commit f5509cc18daa7f82bcc553be70df2117c8eedc16 >>>> Author: Kees Cook >>>> Date: Tue Jun 7 11:05:33 2016 -0700 >>>> >>>> mm: Hardened usercopy >>>> >>>> This is the start of porting P
Re: [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
>> I dislike this for three reasons >> >> a) It does not protect against any races, really, it does not improve things. >> b) We do have the exact same problem with pfn_to_online_page(). As long as we >>don't hold the memory hotplug lock, memory can get offlined and remove >> any time. Racy. > > True, we need to solve that problem too. That seems to want something > lighter weight than the hotplug lock that can be held over pfn lookups > + use rather than requiring a page lookup in paths where it's not > clear that a page reference would prevent unplug. > >> c) We mix in ZONE specific stuff into the core. It should be "just another >> zone" > > Not sure I grok this when the RFC is sprinkling zone-specific > is_zone_device_page() throughout the core? Most users should not care about the zone. pfn_active() would be enough in most situations, especially most PFN walkers - "this memmap is valid and e.g., contains a valid zone ...". > >> >> What I propose instead (already discussed in >> https://lkml.org/lkml/2019/10/10/87) > > Sorry I missed this earlier... > >> >> 1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE >> 2. Convert SECTION_IS_ACTIVE to a subsection bitmap >> 3. Introduce pfn_active() that checks against the subsection bitmap >> 4. Once the memmap was initialized / prepared, set the subsection active >>(similar to SECTION_IS_ONLINE in the buddy right now) >> 5. Before the memmap gets invalidated, set the subsection inactive >>(similar to SECTION_IS_ONLINE in the buddy right now) >> 5. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE >> 6. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE > > This does not seem to reduce any complexity because it still requires > a pfn to zone lookup at the end of the process. > > I.e. converting pfn_to_online_page() to use a new pfn_active() > subsection map plus looking up the zone from pfn_to_page() is more > steps than just doing a direct pfn to zone lookup. What am I missing? 
That a real "pfn to zone" lookup without going via the struct page will require to have more than just a single bitmap. IMHO, keeping the information at a single place (memmap) is the clean thing to do (not replicating it somewhere else). Going via the memmap might not be as fast as a direct lookup, but do we actually care? We are already looking at "random PFNs we are not sure if there is a valid memmap".

>>
>> Especially, driver-reserved device memory will not get set active in
>> the subsection bitmap.
>>
>> Now to the race. Taking the memory hotplug lock at random places is ugly. I do
>> wonder if we can use RCU:
>
> Ah, yes, exactly what I was thinking above.
>
>>
>> The user of pfn_active()/pfn_to_online_page()/pfn_to_device_page():
>>
>> /* the memmap is guaranteed to remain active under RCU */
>> rcu_read_lock();
>> if (pfn_active(random_pfn)) {
>>     page = pfn_to_page(random_pfn);
>>     ... use the page, stays valid
>> }
>> rcu_read_unlock();
>>
>> Memory offlining/memremap code:
>>
>> set_subsections_inactive(pfn, nr_pages); /* clears the bit atomically */
>> synchronize_rcu();
>> /* all users saw the bitmap update, we can invalidate the memmap */
>> remove_pfn_range_from_zone(zone, pfn, nr_pages);
>
> Looks good to me.
>
>>
>>> I only gave it a quick test with DIMMs on x86-64, but didn't test the
>>> ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Compile-tested
>>> on x86-64 and PPC.
>>
>> I'll give it a spin, but I don't think the kernel wants to grow more
>> is_zone_device_page() users.
>
> Let's recap. In this RFC, I introduce a total of 4 (!) users only.
> The other parts can rely on pfn_to_online_page() only.
>
> 1. "staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes"
> - Basically never used with ZONE_DEVICE.
> - We hold a reference!
> - All it protects is a SetPageDirty(page);
>
> 2. "staging/gasket: Prepare gasket_release_page() for PG_reserved changes"
> - Same as 1.
>
> 3.
"mm/usercopy.c: Prepare check_page_span() for PG_reserved changes"
> - We come via virt_to_head_page() / virt_to_head_page(), not sure about
>   references (I assume this should be fine as we don't come via random
>   PFNs)
> - We check that we don't mix Reserved (including device memory) and CMA
>   pages when crossing compound pages.
>
> I think we can drop 1. and 2., resulting in a total of 2 new users in
> the same context. I think that is totally tolerable to finally clean
> this up.

...but more is_zone_device_page() doesn't "finally clean this up". Like we discussed above, it's the missing locking that's the real cleanup; the pfn_to_online_page() internals are secondary.

It's a different cleanup IMHO. We can't do everything in one shot. But maybe I can drop the is_zone_device_page() parts from this patch and completely rely on pfn_to_online_page(). Yes, that
Re: [PATCH RFC v1 02/12] mm/usercopy.c: Prepare check_page_span() for PG_reserved changes
On 23.10.19 18:25, Kees Cook wrote: > On Wed, Oct 23, 2019 at 10:20:14AM +0200, David Hildenbrand wrote: >> On 22.10.19 19:12, David Hildenbrand wrote: >>> Right now, ZONE_DEVICE memory is always set PG_reserved. We want to >>> change that. >>> >>> Let's make sure that the logic in the function won't change. Once we no >>> longer set these pages to reserved, we can rework this function to >>> perform separate checks for ZONE_DEVICE (split from PG_reserved checks). >>> >>> Cc: Kees Cook >>> Cc: Andrew Morton >>> Cc: Kate Stewart >>> Cc: Allison Randal >>> Cc: "Isaac J. Manjarres" >>> Cc: Qian Cai >>> Cc: Thomas Gleixner >>> Signed-off-by: David Hildenbrand >>> --- >>> mm/usercopy.c | 5 +++-- >>> 1 file changed, 3 insertions(+), 2 deletions(-) >>> >>> diff --git a/mm/usercopy.c b/mm/usercopy.c >>> index 660717a1ea5c..a3ac4be35cde 100644 >>> --- a/mm/usercopy.c >>> +++ b/mm/usercopy.c >>> @@ -203,14 +203,15 @@ static inline void check_page_span(const void *ptr, >>> unsigned long n, >>> * device memory), or CMA. Otherwise, reject since the object spans >>> * several independently allocated pages. >>> */ >>> - is_reserved = PageReserved(page); >>> + is_reserved = PageReserved(page) || is_zone_device_page(page); >>> is_cma = is_migrate_cma_page(page); >>> if (!is_reserved && !is_cma) >>> usercopy_abort("spans multiple pages", NULL, to_user, 0, n); >>> for (ptr += PAGE_SIZE; ptr <= end; ptr += PAGE_SIZE) { >>> page = virt_to_head_page(ptr); >>> - if (is_reserved && !PageReserved(page)) >>> + if (is_reserved && !(PageReserved(page) || >>> +is_zone_device_page(page))) >>> usercopy_abort("spans Reserved and non-Reserved pages", >>>NULL, to_user, 0, n); >>> if (is_cma && !is_migrate_cma_page(page)) >>> >> >> @Kees, would it be okay to stop checking against ZONE_DEVICE pages here or >> is there a good rationale behind this? 
>>
>> (I would turn this patch into a simple update of the comment if we agree
>> that we don't care)
>
> There has been work to actually remove the page span checks entirely,
> but there wasn't consensus on what the right way forward was. I continue
> to lean toward just dropping it entirely, but Matthew Wilcox has some
> alternative ideas that could use some further thought/testing.

Thanks for your reply!

So, the worst thing that could happen right now, when dropping this patch, is that we would reject some ranges when hardening is on, correct? (sounds like that can easily be found by testing if it is actually relevant)

Do you remember if there were real ZONE_DEVICE usecases that required this filter to be in place for PG_reserved pages?

-- 
Thanks,
David / dhildenb
Re: [PATCH RFC v1 02/12] mm/usercopy.c: Prepare check_page_span() for PG_reserved changes
On 22.10.19 19:12, David Hildenbrand wrote:

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change that.

Let's make sure that the logic in the function won't change. Once we no longer set these pages to reserved, we can rework this function to perform separate checks for ZONE_DEVICE (split from PG_reserved checks).

Cc: Kees Cook
Cc: Andrew Morton
Cc: Kate Stewart
Cc: Allison Randal
Cc: "Isaac J. Manjarres"
Cc: Qian Cai
Cc: Thomas Gleixner
Signed-off-by: David Hildenbrand
---
 mm/usercopy.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/usercopy.c b/mm/usercopy.c
index 660717a1ea5c..a3ac4be35cde 100644
--- a/mm/usercopy.c
+++ b/mm/usercopy.c
@@ -203,14 +203,15 @@ static inline void check_page_span(const void *ptr, unsigned long n,
 	 * device memory), or CMA. Otherwise, reject since the object spans
 	 * several independently allocated pages.
 	 */
-	is_reserved = PageReserved(page);
+	is_reserved = PageReserved(page) || is_zone_device_page(page);
 	is_cma = is_migrate_cma_page(page);
 	if (!is_reserved && !is_cma)
 		usercopy_abort("spans multiple pages", NULL, to_user, 0, n);

 	for (ptr += PAGE_SIZE; ptr <= end; ptr += PAGE_SIZE) {
 		page = virt_to_head_page(ptr);
-		if (is_reserved && !PageReserved(page))
+		if (is_reserved && !(PageReserved(page) ||
+				     is_zone_device_page(page)))
 			usercopy_abort("spans Reserved and non-Reserved pages",
 				       NULL, to_user, 0, n);
 		if (is_cma && !is_migrate_cma_page(page))

@Kees, would it be okay to stop checking against ZONE_DEVICE pages here or is there a good rationale behind this?

(I would turn this patch into a simple update of the comment if we agree that we don't care)

-- 
Thanks,
David / dhildenb
Re: [PATCH RFC v1 06/12] staging/gasket: Prepare gasket_release_page() for PG_reserved changes
On 22.10.19 19:12, David Hildenbrand wrote:

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change that.

The pages are obtained via get_user_pages_fast(). I assume these could be ZONE_DEVICE pages. Let's just exclude them as well explicitly.

Cc: Rob Springer
Cc: Todd Poynor
Cc: Ben Chan
Cc: Greg Kroah-Hartman
Signed-off-by: David Hildenbrand
---
 drivers/staging/gasket/gasket_page_table.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/staging/gasket/gasket_page_table.c b/drivers/staging/gasket/gasket_page_table.c
index f6d715787da8..d43fed58bf65 100644
--- a/drivers/staging/gasket/gasket_page_table.c
+++ b/drivers/staging/gasket/gasket_page_table.c
@@ -447,7 +447,7 @@ static bool gasket_release_page(struct page *page)
 	if (!page)
 		return false;

-	if (!PageReserved(page))
+	if (!PageReserved(page) && !is_zone_device_page(page))
 		SetPageDirty(page);

 	put_page(page);

@Dan, is SetPageDirty() on ZONE_DEVICE pages bad or do we simply not care? I think that ending up with ZONE_DEVICE pages here is very unlikely. I'd like to drop this (and the next) patch and document why it is okay to do so.

-- 
Thanks,
David / dhildenb
Re: [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
On 22.10.19 23:54, Dan Williams wrote: > Hi David, > > Thanks for tackling this! Thanks for having a look :) [...] >> I am probably a little bit too careful (but I don't want to break things). >> In most places (besides KVM and vfio that are nuts), the >> pfn_to_online_page() check could most probably be avoided by a >> is_zone_device_page() check. However, I usually get suspicious when I see >> a pfn_valid() check (especially after I learned that people mmap parts of >> /dev/mem into user space, including memory without memmaps. Also, people >> could memmap offline memory blocks this way :/). As long as this does not >> hurt performance, I think we should rather do it the clean way. > > I'm concerned about using is_zone_device_page() in places that are not > known to already have a reference to the page. Here's an audit of > current usages, and the ones I think need to cleaned up. The "unsafe" > ones do not appear to have any protections against the device page > being removed (get_dev_pagemap()). Yes, some of these were added by > me. The "unsafe? HMM" ones need HMM eyes because HMM leaks device > pages into anonymous memory paths and I'm not up to speed on how it > guarantees 'struct page' validity vs device shutdown without using > get_dev_pagemap(). > > smaps_pmd_entry(): unsafe > > put_devmap_managed_page(): safe, page reference is held > > is_device_private_page(): safe? gpu driver manages private page lifetime > > is_pci_p2pdma_page(): safe, page reference is held > > uncharge_page(): unsafe? HMM > > add_to_kill(): safe, protected by get_dev_pagemap() and dax_lock_page() > > soft_offline_page(): unsafe > > remove_migration_pte(): unsafe? HMM > > move_to_new_page(): unsafe? HMM > > migrate_vma_pages() and helpers: unsafe? HMM > > try_to_unmap_one(): unsafe? 
HMM > > __put_page(): safe > > release_pages(): safe > > I'm hoping all the HMM ones can be converted to > is_device_private_page() directly and have that routine grow a nice > comment about how it knows it can always safely de-reference its @page > argument. > > For the rest I'd like to propose that we add a facility to determine > ZONE_DEVICE by pfn rather than page. The most straightforward way I > can think of would be to just add another bitmap to mem_section_usage > to indicate if a subsection is ZONE_DEVICE or not. (it's a somewhat unrelated bigger discussion, but we can start discussing it in this thread)

I dislike this for three reasons:
a) It does not protect against any races, really, it does not improve things.
b) We do have the exact same problem with pfn_to_online_page(). As long as we don't hold the memory hotplug lock, memory can get offlined and removed at any time. Racy.
c) We mix in ZONE specific stuff into the core. It should be "just another zone".

What I propose instead (already discussed in https://lkml.org/lkml/2019/10/10/87):

1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE
2. Convert SECTION_IS_ACTIVE to a subsection bitmap
3. Introduce pfn_active() that checks against the subsection bitmap
4. Once the memmap was initialized / prepared, set the subsection active
   (similar to SECTION_IS_ONLINE in the buddy right now)
5. Before the memmap gets invalidated, set the subsection inactive
   (similar to SECTION_IS_ONLINE in the buddy right now)
6. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE
7. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE

Especially, driver-reserved device memory will not get set active in the subsection bitmap. Now to the race. Taking the memory hotplug lock at random places is ugly.
I do wonder if we can use RCU:

The user of pfn_active()/pfn_to_online_page()/pfn_to_device_page():

	/* the memmap is guaranteed to remain active under RCU */
	rcu_read_lock();
	if (pfn_active(random_pfn)) {
		page = pfn_to_page(random_pfn);
		... use the page, stays valid
	}
	rcu_read_unlock();

Memory offlining/memremap code:

	set_subsections_inactive(pfn, nr_pages); /* clears the bit atomically */
	synchronize_rcu();
	/* all users saw the bitmap update, we can invalidate the memmap */
	remove_pfn_range_from_zone(zone, pfn, nr_pages);

> >> I only gave it a quick test with DIMMs on x86-64, but didn't test the >> ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Compile-tested >> on x86-64 and PPC. > > I'll give it a spin, but I don't think the kernel wants to grow more > is_zone_device_page() users.

Let's recap. In this RFC, I introduce a total of 4 (!) users only. The other parts can rely on pfn_to_online_page() only.

1. "staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes"
   - Basically never used with ZONE_DEVICE.
   - We hold a reference!
   - All it protects is a SetPageDirty(page);
2. "staging/gasket: Prepare gasket_release_page() for PG_reserved changes"
   - Same as 1.
3. "mm/usercopy.c: Prepare check_page_span() for PG_reserved changes"
   - We
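To make the proposed bookkeeping concrete, here is a hypothetical userspace model of the subsection-active bitmap (steps 2-5 of the proposal above): pfn_active() tests a per-subsection bit, and offlining clears it before the memmap goes away. All names and sizes are illustrative assumptions, this is not kernel code, and it deliberately ignores the RCU side and the real mem_section_usage layout:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Assume 512 pages (2 MiB with 4 KiB pages) per subsection. */
#define SUBSECTION_SHIFT 9
#define MAX_SUBSECTIONS  1024
#define BITS_PER_WORD    (sizeof(unsigned long) * CHAR_BIT)

/* One bit per subsection: set = memmap initialized and usable. */
static unsigned long active_bitmap[MAX_SUBSECTIONS / (sizeof(unsigned long) * CHAR_BIT)];

/* Model of the proposed pfn_active(): test the subsection's bit. */
static bool pfn_active(unsigned long pfn)
{
	unsigned long ss = pfn >> SUBSECTION_SHIFT;

	return (active_bitmap[ss / BITS_PER_WORD] >> (ss % BITS_PER_WORD)) & 1;
}

/* Combined helper modeling "set active after memmap init" and the
 * set_subsections_inactive() step before the memmap is invalidated. */
static void set_subsections_active(unsigned long pfn, unsigned long nr_pages,
				   bool active)
{
	unsigned long first = pfn >> SUBSECTION_SHIFT;
	unsigned long last = (pfn + nr_pages - 1) >> SUBSECTION_SHIFT;

	for (unsigned long ss = first; ss <= last; ss++) {
		unsigned long bit = 1UL << (ss % BITS_PER_WORD);

		if (active)
			active_bitmap[ss / BITS_PER_WORD] |= bit;
		else
			active_bitmap[ss / BITS_PER_WORD] &= ~bit;
	}
}
```

In the real proposal, pfn_to_online_page() would additionally compare the zone (pfn_active() && zone != ZONE_DEVICE), and the bit would be cleared before synchronize_rcu() as sketched above.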
Re: [PATCH RFC v1 07/12] staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes
On 22.10.19 19:55, Matt Sickler wrote: Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change that. The pages are obtained via get_user_pages_fast(). I assume, these could be ZONE_DEVICE pages. Let's just exclude them as well explicitly. I'm not sure what ZONE_DEVICE pages are, but these pages are normal system RAM, typically HugePages (but not always). ZONE_DEVICE, a.k.a. devmem, are pages that bypass the pagecache (e.g., DAX) completely and will therefore never get swapped. These pages are not managed by any page allocator (especially not the buddy); they are rather "directly mapped device memory". E.g., an NVDIMM. It is mapped into the physical address space similar to ordinary RAM (a DIMM). Any write to such a PFN will directly end up on the target device. In contrast to a DIMM, the memory is persistent across reboots. Now, if you mmap such an NVDIMM into a user space process, you will end up with ZONE_DEVICE pages as part of the user space mapping (VMA). get_user_pages_fast() on this memory will result in "struct pages" that belong to ZONE_DEVICE. This is where this patch comes into play. This patch makes sure that there is absolutely no change once we stop setting these ZONE_DEVICE pages PG_reserved. E.g., AFAIK, setting a ZONE_DEVICE page dirty does not make too much sense (never swapped). Yes, it might not be a likely setup, however, it is possible. In this series I collect all places that *could* be affected. Whether that change is really needed has to be decided. I can see that the two staging drivers I have patches for might be able to just live with the change - but then we talked about it and are aware of the change. Thanks! -- Thanks, David / dhildenb
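To illustrate why the explicit is_zone_device_page() check becomes necessary once ZONE_DEVICE pages stop being PG_reserved, here is a hypothetical userspace model of the release-path decision (plain booleans stand in for the struct page flags; none of this is kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: today ZONE_DEVICE pages have PG_reserved set;
 * after this series they no longer do. */
struct fake_page {
	bool reserved;     /* PG_reserved */
	bool zone_device;  /* part of ZONE_DEVICE (e.g., DAX/NVDIMM) */
};

/* Old check in gasket_release_page(): relies on PG_reserved alone. */
static bool old_would_dirty(const struct fake_page *p)
{
	return !p->reserved;
}

/* New check: also excludes ZONE_DEVICE pages explicitly, so the
 * behavior stays identical once PG_reserved is no longer set on them
 * (dirtying makes little sense for pages that are never swapped). */
static bool new_would_dirty(const struct fake_page *p)
{
	return !p->reserved && !p->zone_device;
}
```

The interesting case is a ZONE_DEVICE page after the series: the old check would start dirtying it, the new one keeps excluding it.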
[PATCH RFC v1 10/12] powerpc/mm: Prepare maybe_pte_to_page() for PG_reserved changes
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change that. We could explicitly check for is_zone_device_page(page). But looking at the pfn_valid() check, it seems safer to just use pfn_to_online_page() here, that will skip all ZONE_DEVICE pages right away. Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Michael Ellerman Cc: Christophe Leroy Cc: "Aneesh Kumar K.V" Cc: Allison Randal Cc: Nicholas Piggin Cc: Thomas Gleixner Signed-off-by: David Hildenbrand
---
 arch/powerpc/mm/pgtable.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index e3759b69f81b..613c98fa7dc0 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -55,10 +55,12 @@ static struct page *maybe_pte_to_page(pte_t pte)
 	unsigned long pfn = pte_pfn(pte);
 	struct page *page;
 
-	if (unlikely(!pfn_valid(pfn)))
-		return NULL;
-	page = pfn_to_page(pfn);
-	if (PageReserved(page))
+	/*
+	 * We reject any pages that are not online (not managed by the buddy).
+	 * This includes ZONE_DEVICE pages.
+	 */
+	page = pfn_to_online_page(pfn);
+	if (unlikely(!page || PageReserved(page)))
 		return NULL;
 	return page;
 }
-- 
2.21.0
[PATCH RFC v1 09/12] powerpc/64s: Prepare hash_page_do_lazy_icache() for PG_reserved changes
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change that. We could explicitly check for is_zone_device_page(page). But looking at the pfn_valid() check, it seems safer to just use pfn_to_online_page() here, that will skip all ZONE_DEVICE pages right away. Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Michael Ellerman Cc: "Aneesh Kumar K.V" Cc: Christophe Leroy Cc: Nicholas Piggin Cc: Andrew Morton Cc: Mike Rapoport Cc: YueHaibing Signed-off-by: David Hildenbrand
---
 arch/powerpc/mm/book3s64/hash_utils.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
index 6c123760164e..a1566039e747 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -1084,13 +1084,15 @@ void hash__early_init_mmu_secondary(void)
  */
 unsigned int hash_page_do_lazy_icache(unsigned int pp, pte_t pte, int trap)
 {
-	struct page *page;
+	struct page *page = pfn_to_online_page(pte_pfn(pte));
 
-	if (!pfn_valid(pte_pfn(pte)))
+	/*
+	 * We ignore any pages that are not online (not managed by the buddy).
+	 * This includes ZONE_DEVICE pages.
+	 */
+	if (!page)
 		return pp;
 
-	page = pte_page(pte);
-
 	/* page is dirty */
 	if (!test_bit(PG_arch_1, &page->flags) && !PageReserved(page)) {
 		if (trap == 0x400) {
-- 
2.21.0
[PATCH RFC v1 11/12] x86/mm: Prepare __ioremap_check_ram() for PG_reserved changes
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change that. We could explicitly check for is_zone_device_page(page). But looking at the pfn_valid() check, it seems safer to just use pfn_to_online_page() here, that will skip all ZONE_DEVICE pages right away. Cc: Dave Hansen Cc: Andy Lutomirski Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: "H. Peter Anvin" Signed-off-by: David Hildenbrand
---
 arch/x86/mm/ioremap.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index a39dcdb5ae34..db6913b48edf 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -77,10 +77,17 @@ static unsigned int __ioremap_check_ram(struct resource *res)
 	start_pfn = (res->start + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	stop_pfn = (res->end + 1) >> PAGE_SHIFT;
 	if (stop_pfn > start_pfn) {
-		for (i = 0; i < (stop_pfn - start_pfn); ++i)
-			if (pfn_valid(start_pfn + i) &&
-			    !PageReserved(pfn_to_page(start_pfn + i)))
+		for (i = 0; i < (stop_pfn - start_pfn); ++i) {
+			struct page *page;
+
+			/*
+			 * We treat any pages that are not online (not managed
+			 * by the buddy) as not being RAM. This includes
+			 * ZONE_DEVICE pages.
+			 */
+			page = pfn_to_online_page(start_pfn + i);
+			if (page && !PageReserved(page))
 				return IORES_MAP_SYSTEM_RAM;
+		}
 	}
 
 	return 0;
-- 
2.21.0
[PATCH RFC v1 01/12] mm/memory_hotplug: Don't allow to online/offline memory blocks with holes
Our onlining/offlining code is unnecessarily complicated. Only memory blocks added during boot can have holes. Hotplugged memory never has holes. That memory is already online. When we stop allowing to offline memory blocks with holes, we implicitly stop to online memory blocks with holes. This allows to simplify the code. For example, we no longer have to worry about marking pages that fall into memory holes PG_reserved when onlining memory. We can stop setting pages PG_reserved. Offlining memory blocks added during boot is usually not guaranteed to work either way. So stopping to do that (if anybody really used and tested this over the years) should not really hurt. For the use case of offlining memory to unplug DIMMs, we should see no change. (holes on DIMMs would be weird) Cc: Andrew Morton Cc: Michal Hocko Cc: Oscar Salvador Cc: Pavel Tatashin Cc: Dan Williams Signed-off-by: David Hildenbrand
---
 mm/memory_hotplug.c | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 561371ead39a..7210f4375279 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1447,10 +1447,19 @@ static void node_states_clear_node(int node, struct memory_notify *arg)
 		node_clear_state(node, N_MEMORY);
 }
 
+static int count_system_ram_pages_cb(unsigned long start_pfn,
+				     unsigned long nr_pages, void *data)
+{
+	unsigned long *nr_system_ram_pages = data;
+
+	*nr_system_ram_pages += nr_pages;
+	return 0;
+}
+
 static int __ref __offline_pages(unsigned long start_pfn,
 				 unsigned long end_pfn)
 {
-	unsigned long pfn, nr_pages;
+	unsigned long pfn, nr_pages = 0;
 	unsigned long offlined_pages = 0;
 	int ret, node, nr_isolate_pageblock;
 	unsigned long flags;
@@ -1461,6 +1470,20 @@ static int __ref __offline_pages(unsigned long start_pfn,
 
 	mem_hotplug_begin();
 
+	/*
+	 * We don't allow to offline memory blocks that contain holes
+	 * and consequently don't allow to online memory blocks that contain
+	 * holes. This allows to simplify the code quite a lot and we don't
+	 * have to mess with PG_reserved pages for memory holes.
+	 */
+	walk_system_ram_range(start_pfn, end_pfn - start_pfn, &nr_pages,
+			      count_system_ram_pages_cb);
+	if (nr_pages != end_pfn - start_pfn) {
+		ret = -EINVAL;
+		reason = "memory holes";
+		goto failed_removal;
+	}
+
 	/* This makes hotplug much easier...and readable.
 	   we assume this for now. .*/
 	if (!test_pages_in_a_zone(start_pfn, end_pfn, &valid_start,
@@ -1472,7 +1495,6 @@ static int __ref __offline_pages(unsigned long start_pfn,
 	zone = page_zone(pfn_to_page(valid_start));
 	node = zone_to_nid(zone);
-	nr_pages = end_pfn - start_pfn;
 
 	/* set above range as isolated */
 	ret = start_isolate_page_range(start_pfn, end_pfn,
-- 
2.21.0
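The hole check added above can be modeled in userspace like this (a hypothetical, simplified sketch: walk_system_ram_range() is replaced by an explicit list of "System RAM" ranges, all names are illustrative):

```c
#include <assert.h>
#include <stdbool.h>

struct ram_range {
	unsigned long start_pfn;
	unsigned long end_pfn; /* exclusive */
};

/* Count the pages of [start_pfn, end_pfn) covered by System RAM ranges
 * (mirrors count_system_ram_pages_cb()) and compare against the full
 * span; any mismatch means the block contains holes and offlining
 * would be rejected with -EINVAL, reason "memory holes". */
static bool range_has_holes(unsigned long start_pfn, unsigned long end_pfn,
			    const struct ram_range *ram, int nr_ranges)
{
	unsigned long nr_pages = 0;

	for (int i = 0; i < nr_ranges; i++) {
		unsigned long s = ram[i].start_pfn > start_pfn ?
				  ram[i].start_pfn : start_pfn;
		unsigned long e = ram[i].end_pfn < end_pfn ?
				  ram[i].end_pfn : end_pfn;

		if (s < e)
			nr_pages += e - s; /* pages the callback would count */
	}
	return nr_pages != end_pfn - start_pfn;
}
```

A contiguous block counts exactly end_pfn - start_pfn pages; any gap makes the count fall short.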
[PATCH RFC v1 06/12] staging/gasket: Prepare gasket_release_page() for PG_reserved changes
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change that. The pages are obtained via get_user_pages_fast(). I assume, these could be ZONE_DEVICE pages. Let's just exclude them as well explicitly. Cc: Rob Springer Cc: Todd Poynor Cc: Ben Chan Cc: Greg Kroah-Hartman Signed-off-by: David Hildenbrand
---
 drivers/staging/gasket/gasket_page_table.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/staging/gasket/gasket_page_table.c b/drivers/staging/gasket/gasket_page_table.c
index f6d715787da8..d43fed58bf65 100644
--- a/drivers/staging/gasket/gasket_page_table.c
+++ b/drivers/staging/gasket/gasket_page_table.c
@@ -447,7 +447,7 @@ static bool gasket_release_page(struct page *page)
 	if (!page)
 		return false;
 
-	if (!PageReserved(page))
+	if (!PageReserved(page) && !is_zone_device_page(page))
 		SetPageDirty(page);
 	put_page(page);
-- 
2.21.0
[PATCH RFC v1 08/12] powerpc/book3s: Prepare kvmppc_book3s_instantiate_page() for PG_reserved changes
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change that. KVM has this weird use case that you can map anything from /dev/mem into the guest. pfn_valid() is not a reliable check whether the memmap was initialized and can be touched. pfn_to_online_page() makes sure that we have an initialized memmap. Note that ZONE_DEVICE memory is never online (IOW, managed by the buddy). Switching to pfn_to_online_page() keeps the existing behavior for PFNs without a memmap and for ZONE_DEVICE memory. Cc: Paul Mackerras Cc: Benjamin Herrenschmidt Cc: Michael Ellerman Signed-off-by: David Hildenbrand
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 2d415c36a61d..05397c0561fc 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -801,12 +801,14 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
 					   writing, upgrade_p);
 		if (is_error_noslot_pfn(pfn))
 			return -EFAULT;
-		page = NULL;
-		if (pfn_valid(pfn)) {
-			page = pfn_to_page(pfn);
-			if (PageReserved(page))
-				page = NULL;
-		}
+		/*
+		 * We treat any pages that are not online (not managed by the
+		 * buddy) as reserved - this includes ZONE_DEVICE pages and
+		 * pages without a memmap (e.g., mapped via /dev/mem).
+		 */
+		page = pfn_to_online_page(pfn);
+		if (page && PageReserved(page))
+			page = NULL;
 	}
 
 	/*
-- 
2.21.0
[PATCH RFC v1 12/12] mm/memory_hotplug: Don't mark pages PG_reserved when initializing the memmap
Everything should be prepared to stop setting pages PG_reserved when initializing the memmap on memory hotplug. Most importantly, we stop marking ZONE_DEVICE pages PG_reserved.

a) We made sure that any code that relied on PG_reserved to detect ZONE_DEVICE memory will no longer rely on PG_reserved - either by using pfn_to_online_page() to exclude them right away or by checking against is_zone_device_page().
b) We made sure that memory blocks with holes cannot be offlined and therefore also not onlined. We have quite some code that relies on memory holes being marked PG_reserved. This is now not an issue anymore.

generic_online_page() still calls __free_pages_core(), which performs __ClearPageReserved(p). AFAIKS, this should not hurt. It is worth noting that the users of online_page_callback_t might see a change. E.g., until now, pages not freed to the buddy by the HyperV balloon were set PG_reserved until freed via generic_online_page(). Now, they would look like ordinarily allocated pages (refcount == 1). This callback is used by the XEN balloon and the HyperV balloon. To not introduce any silent errors, keep marking the pages PG_reserved. We can most probably stop doing that, but have to double check if there are issues (e.g., offlining code aborts right away in has_unmovable_pages() when it runs into a PageReserved(page)). Update the documentation at various places. Cc: "K. Y.
Srinivasan" Cc: Haiyang Zhang Cc: Stephen Hemminger Cc: Sasha Levin Cc: Boris Ostrovsky Cc: Juergen Gross Cc: Stefano Stabellini Cc: Andrew Morton Cc: Alexander Duyck Cc: Pavel Tatashin Cc: Vlastimil Babka Cc: Johannes Weiner Cc: Anthony Yznaga Cc: Michal Hocko Cc: Oscar Salvador Cc: Dan Williams Cc: Mel Gorman Cc: Mike Rapoport Cc: Anshuman Khandual Suggested-by: Michal Hocko Signed-off-by: David Hildenbrand
---
 drivers/hv/hv_balloon.c    |  6 ++++++
 drivers/xen/balloon.c      |  7 +++++++
 include/linux/page-flags.h |  8 +---------
 mm/memory_hotplug.c        | 17 +++++++--------
 mm/page_alloc.c            | 11 -----------
 5 files changed, 21 insertions(+), 28 deletions(-)

diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index c722079d3c24..3214b0ef5247 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -670,6 +670,12 @@ static struct notifier_block hv_memory_nb = {
 /* Check if the particular page is backed and can be onlined and online it. */
 static void hv_page_online_one(struct hv_hotadd_state *has, struct page *pg)
 {
+	/*
+	 * TODO: The core used to mark the pages reserved. Most probably
+	 * we can stop doing that now.
+	 */
+	__SetPageReserved(pg);
+
 	if (!has_pfn_is_backed(has, page_to_pfn(pg))) {
 		if (!PageOffline(pg))
 			__SetPageOffline(pg);
diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 4f2e78a5e4db..af69f057913a 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -374,6 +374,13 @@ static void xen_online_page(struct page *page, unsigned int order)
 	mutex_lock(&balloon_mutex);
 	for (i = 0; i < size; i++) {
 		p = pfn_to_page(start_pfn + i);
+		/*
+		 * TODO: The core used to mark the pages reserved. Most probably
+		 * we can stop doing that now. However, especially
+		 * alloc_xenballooned_pages() left PG_reserved set
+		 * on pages that can get mapped to user space.
+		 */
+		__SetPageReserved(p);
 		balloon_append(p);
 	}
 	mutex_unlock(&balloon_mutex);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f91cb8898ff0..d4f85d866b71 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -30,24 +30,18 @@
  * - Pages falling into physical memory gaps - not IORESOURCE_SYSRAM. Trying
  *   to read/write these pages might end badly. Don't touch!
  * - The zero page(s)
- * - Pages not added to the page allocator when onlining a section because
- *   they were excluded via the online_page_callback() or because they are
- *   PG_hwpoison.
 * - Pages allocated in the context of kexec/kdump (loaded kernel image,
 *   control pages, vmcoreinfo)
 * - MMIO/DMA pages. Some architectures don't allow to ioremap pages that are
 *   not marked PG_reserved (as they might be in use by somebody else who does
 *   not respect the caching strategy).
- * - Pages part of an offline section (struct pages of offline sections should
- *   not be trusted as they will be initialized when first onlined).
 * - MCA pages on ia64
 * - Pages holding CPU notes for POWER Firmware Assisted Dump
- * - Device memory (e.g. PMEM, DAX, HMM)
 * Some PG_reserved pages will be excluded from the hibernation image.
 * PG_reserved does in general not hinder anybody from dumping or swapping
 * and is no longer required for remap_pfn_range(). ioremap might require it.
 * Consequently
[PATCH RFC v1 07/12] staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change that. The pages are obtained via get_user_pages_fast(). I assume, these could be ZONE_DEVICE pages. Let's just exclude them as well explicitly. Cc: Greg Kroah-Hartman Cc: Vandana BN Cc: "Simon Sandström" Cc: Dan Carpenter Cc: Nishka Dasgupta Cc: Madhumitha Prabakaran Cc: Fabio Estevam Cc: Matt Sickler Cc: Jeremy Sowden Signed-off-by: David Hildenbrand
---
 drivers/staging/kpc2000/kpc_dma/fileops.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/staging/kpc2000/kpc_dma/fileops.c b/drivers/staging/kpc2000/kpc_dma/fileops.c
index cb52bd9a6d2f..457adcc81fe6 100644
--- a/drivers/staging/kpc2000/kpc_dma/fileops.c
+++ b/drivers/staging/kpc2000/kpc_dma/fileops.c
@@ -212,7 +212,8 @@ void transfer_complete_cb(struct aio_cb_data *acd, size_t xfr_count, u32 flags)
 	BUG_ON(acd->ldev->pldev == NULL);
 
 	for (i = 0 ; i < acd->page_count ; i++) {
-		if (!PageReserved(acd->user_pages[i])) {
+		if (!PageReserved(acd->user_pages[i]) &&
+		    !is_zone_device_page(acd->user_pages[i])) {
 			set_page_dirty(acd->user_pages[i]);
 		}
 	}
-- 
2.21.0
[PATCH RFC v1 05/12] vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change that. KVM has this weird use case that you can map anything from /dev/mem into the guest. pfn_valid() is not a reliable check whether the memmap was initialized and can be touched. pfn_to_online_page() makes sure that we have an initialized memmap. Note that ZONE_DEVICE memory is never online (IOW, managed by the buddy). Switching to pfn_to_online_page() keeps the existing behavior for PFNs without a memmap and for ZONE_DEVICE memory. They are treated as reserved and the page is not touched (e.g., to set it dirty or accessed). Cc: Alex Williamson Cc: Cornelia Huck Signed-off-by: David Hildenbrand
---
 drivers/vfio/vfio_iommu_type1.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2ada8e6cdb88..f8ce8c408ba8 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -299,9 +299,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
  */
 static bool is_invalid_reserved_pfn(unsigned long pfn)
 {
-	if (pfn_valid(pfn))
-		return PageReserved(pfn_to_page(pfn));
+	struct page *page = pfn_to_online_page(pfn);
 
+	/*
+	 * We treat any pages that are not online (not managed by the buddy)
+	 * as reserved - this includes ZONE_DEVICE pages and pages without
+	 * a memmap (e.g., mapped via /dev/mem).
+	 */
+	if (page)
+		return PageReserved(page);
 	return true;
 }
-- 
2.21.0
[PATCH RFC v1 04/12] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change that. KVM has this weird use case that you can map anything from /dev/mem into the guest. pfn_valid() is not a reliable check whether the memmap was initialized and can be touched. pfn_to_online_page() makes sure that we have an initialized memmap. Note that ZONE_DEVICE memory is never online (IOW, managed by the buddy). Switching to pfn_to_online_page() keeps the existing behavior for PFNs without a memmap and for ZONE_DEVICE memory. They are treated as reserved and the page is not touched (e.g., to set it dirty or accessed). Cc: Paolo Bonzini Cc: "Radim Krčmář" Cc: Michal Hocko Cc: Dan Williams Cc: KarimAllah Ahmed Signed-off-by: David Hildenbrand
---
 virt/kvm/kvm_main.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 66a977472a1c..b233d4129014 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -151,9 +151,15 @@ __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
 
 bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
 {
-	if (pfn_valid(pfn))
-		return PageReserved(pfn_to_page(pfn));
+	struct page *page = pfn_to_online_page(pfn);
 
+	/*
+	 * We treat any pages that are not online (not managed by the buddy)
+	 * as reserved - this includes ZONE_DEVICE pages and pages without
+	 * a memmap (e.g., mapped via /dev/mem).
+	 */
+	if (page)
+		return PageReserved(page);
 	return true;
 }
-- 
2.21.0
[PATCH RFC v1 03/12] KVM: x86/mmu: Prepare kvm_is_mmio_pfn() for PG_reserved changes
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change that. KVM has this weird use case that you can map anything from /dev/mem into the guest. pfn_valid() is not a reliable check whether the memmap was initialized and can be touched. pfn_to_online_page() makes sure that we have an initialized memmap - however, there is no reliable and fast check to detect memmaps that were initialized and are ZONE_DEVICE. Let's rewrite kvm_is_mmio_pfn() so we really only touch initialized memmaps that are guaranteed to not contain garbage. Make sure that RAM without a memmap is still not detected as MMIO and that ZONE_DEVICE that is not UC/UC-/WC is not detected as MMIO. Cc: Paolo Bonzini Cc: "Radim Krčmář" Cc: Sean Christopherson Cc: Vitaly Kuznetsov Cc: Wanpeng Li Cc: Jim Mattson Cc: Joerg Roedel Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: KarimAllah Ahmed Cc: Michal Hocko Cc: Dan Williams Signed-off-by: David Hildenbrand
---
 arch/x86/kvm/mmu.c | 30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 24c23c66b226..795869ffd4bb 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2962,20 +2962,26 @@ static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
 
 static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
 {
+	struct page *page = pfn_to_online_page(pfn);
+
+	/*
+	 * Online pages consist of pages managed by the buddy. Especially,
+	 * ZONE_DEVICE pages are never online. Online pages that are reserved
+	 * indicate the zero page and MMIO pages.
+	 */
+	if (page)
+		return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
+
+	/*
+	 * Anything with a valid memmap could be ZONE_DEVICE - or the
+	 * memmap could be uninitialized. Treat only UC/UC-/WC pages as MMIO.
+	 */
 	if (pfn_valid(pfn))
-		return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn)) &&
-		       /*
-			* Some reserved pages, such as those from NVDIMM
-			* DAX devices, are not for MMIO, and can be mapped
-			* with cached memory type for better performance.
-			* However, the above check misconceives those pages
-			* as MMIO, and results in KVM mapping them with UC
-			* memory type, which would hurt the performance.
-			* Therefore, we check the host memory type in addition
-			* and only treat UC/UC-/WC pages as MMIO.
-			*/
-		       (!pat_enabled() || pat_pfn_immune_to_uc_mtrr(pfn));
+		return !pat_enabled() || pat_pfn_immune_to_uc_mtrr(pfn);
 
+	/*
+	 * Any RAM that has no memmap (e.g., mapped via /dev/mem) is not MMIO.
+	 */
 	return !e820__mapped_raw_any(pfn_to_hpa(pfn), pfn_to_hpa(pfn + 1) - 1,
 				     E820_TYPE_RAM);
-- 
2.21.0
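The three-way decision order in the rewritten kvm_is_mmio_pfn() above can be modeled in userspace as follows (a hypothetical sketch: the booleans stand in for pfn_to_online_page(), pfn_valid(), the PAT check and e820; they are not real kernel queries):

```c
#include <assert.h>
#include <stdbool.h>

struct fake_pfn {
	bool online;    /* pfn_to_online_page() != NULL */
	bool reserved;  /* PageReserved() on the online page */
	bool zero_page; /* is_zero_pfn() */
	bool valid;     /* pfn_valid() - memmap exists (maybe ZONE_DEVICE) */
	bool uc_wc;     /* !pat_enabled() || pat_pfn_immune_to_uc_mtrr() */
	bool e820_ram;  /* e820__mapped_raw_any(..., E820_TYPE_RAM) */
};

static bool is_mmio(const struct fake_pfn *p)
{
	/* 1. Online pages: only reserved, non-zero pages are MMIO. */
	if (p->online)
		return !p->zero_page && p->reserved;
	/* 2. Valid but not online (ZONE_DEVICE or uninitialized memmap):
	 *    don't touch the memmap, trust only the UC/UC-/WC memory type. */
	if (p->valid)
		return p->uc_wc;
	/* 3. No memmap at all: RAM mapped via /dev/mem is not MMIO. */
	return !p->e820_ram;
}
```

The key property the patch argues for: a cached (non-UC/UC-/WC) ZONE_DEVICE page and memmap-less /dev/mem RAM both come out as "not MMIO", without ever dereferencing a possibly-uninitialized struct page.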
[PATCH RFC v1 02/12] mm/usercopy.c: Prepare check_page_span() for PG_reserved changes
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change that. Let's make sure that the logic in the function won't change. Once we no longer set these pages to reserved, we can rework this function to perform separate checks for ZONE_DEVICE (split from PG_reserved checks). Cc: Kees Cook Cc: Andrew Morton Cc: Kate Stewart Cc: Allison Randal Cc: "Isaac J. Manjarres" Cc: Qian Cai Cc: Thomas Gleixner Signed-off-by: David Hildenbrand
---
 mm/usercopy.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/usercopy.c b/mm/usercopy.c
index 660717a1ea5c..a3ac4be35cde 100644
--- a/mm/usercopy.c
+++ b/mm/usercopy.c
@@ -203,14 +203,15 @@ static inline void check_page_span(const void *ptr, unsigned long n,
 	 * device memory), or CMA. Otherwise, reject since the object spans
 	 * several independently allocated pages.
 	 */
-	is_reserved = PageReserved(page);
+	is_reserved = PageReserved(page) || is_zone_device_page(page);
 	is_cma = is_migrate_cma_page(page);
 	if (!is_reserved && !is_cma)
 		usercopy_abort("spans multiple pages", NULL, to_user, 0, n);
 
 	for (ptr += PAGE_SIZE; ptr <= end; ptr += PAGE_SIZE) {
 		page = virt_to_head_page(ptr);
-		if (is_reserved && !PageReserved(page))
+		if (is_reserved && !(PageReserved(page) ||
+				     is_zone_device_page(page)))
 			usercopy_abort("spans Reserved and non-Reserved pages",
 				       NULL, to_user, 0, n);
 		if (is_cma && !is_migrate_cma_page(page))
-- 
2.21.0
[PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
This series is based on [2], which should pop up in linux/next soon: https://lkml.org/lkml/2019/10/21/1034 This is the result of a recent discussion with Michal ([1], [2]). Right now we set all pages PG_reserved when initializing hotplugged memmaps. This includes ZONE_DEVICE memory. In case of system memory, PG_reserved is cleared again when onlining the memory, in case of ZONE_DEVICE memory never. In ancient times, we needed PG_reserved, because there was no way to tell whether the memmap was already properly initialized. We now have SECTION_IS_ONLINE for that in the case of !ZONE_DEVICE memory. ZONE_DEVICE memory is already initialized deferred, and there shouldn't be a visible change in that regard. I remember that some time ago, we already talked about stopping to set ZONE_DEVICE pages PG_reserved on the list, but I never saw any patches. Also, I forgot who was part of the discussion :) One of the biggest fears was side effects. I went ahead and audited all users of PageReserved(). The ones that don't need any care (patches) can be found below. I will double check and hope I am not missing something important. I am probably a little bit too careful (but I don't want to break things). In most places (besides KVM and vfio that are nuts), the pfn_to_online_page() check could most probably be avoided by a is_zone_device_page() check. However, I usually get suspicious when I see a pfn_valid() check (especially after I learned that people mmap parts of /dev/mem into user space, including memory without memmaps. Also, people could memmap offline memory blocks this way :/). As long as this does not hurt performance, I think we should rather do it the clean way. I only gave it a quick test with DIMMs on x86-64, but didn't test the ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Compile-tested on x86-64 and PPC.
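(As an aside on the QEMU question above: one possible setup for getting ZONE_DEVICE memory into a guest is an emulated NVDIMM backed by a file. This is an illustrative sketch only - all paths, image names and sizes are placeholders to adjust for your environment.)

```shell
# Back the emulated NVDIMM with a sparse file (placeholder path/size).
truncate -s 4G /tmp/nvdimm.img

qemu-system-x86_64 \
    -machine pc,nvdimm=on \
    -m 4G,slots=4,maxmem=32G \
    -smp 2 \
    -object memory-backend-file,id=mem1,share=on,mem-path=/tmp/nvdimm.img,size=4G \
    -device nvdimm,id=nvdimm1,memdev=mem1 \
    -drive file=guest.img,format=qcow2   # placeholder guest image

# Inside the guest, a fsdax namespace gives "struct page"-backed
# (ZONE_DEVICE) memory that can be mmap'ed and handed to
# get_user_pages_fast():
#   ndctl create-namespace --mode=fsdax
```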
Other users of PageReserved() that should be fine:
- mm/page_owner.c:pagetypeinfo_showmixedcount_print()
  -> Never called for ZONE_DEVICE (+ pfn_to_online_page(pfn))
- mm/page_owner.c:init_pages_in_zone()
  -> Never called for ZONE_DEVICE (!populated_zone(zone))
- mm/page_ext.c:free_page_ext()
  -> Only a BUG_ON(PageReserved(page)), not relevant
- mm/page_ext.c:has_unmovable_pages()
  -> Not relevant for ZONE_DEVICE
- mm/page_ext.c:pfn_range_valid_contig()
  -> pfn_to_online_page() already guards us
- mm/mempolicy.c:queue_pages_pte_range()
  -> vm_normal_page() checks against pte_devmap()
- mm/memory-failure.c:hwpoison_user_mappings()
  -> Not reached via memory_failure() due to pfn_to_online_page()
  -> Also not reached indirectly via memory_failure_hugetlb()
- mm/hugetlb.c:gather_bootmem_prealloc()
  -> Only a WARN_ON(PageReserved(page)), not relevant
- kernel/power/snapshot.c:saveable_highmem_page()
  -> pfn_to_online_page() already guards us
- kernel/power/snapshot.c:saveable_page()
  -> pfn_to_online_page() already guards us
- fs/proc/task_mmu.c:can_gather_numa_stats()
  -> vm_normal_page() checks against pte_devmap()
- fs/proc/task_mmu.c:can_gather_numa_stats_pmd()
  -> vm_normal_page_pmd() checks against pte_devmap()
- fs/proc/page.c:stable_page_flags()
  -> The reserved bit is simply copied, irrelevant
- drivers/firmware/memmap.c:release_firmware_map_entry()
  -> really only a check to detect bootmem.
Not relevant for ZONE_DEVICE
- arch/ia64/kernel/mca_drv.c
- arch/mips/mm/init.c
- arch/mips/mm/ioremap.c
- arch/nios2/mm/ioremap.c
- arch/parisc/mm/ioremap.c
- arch/sparc/mm/tlb.c
- arch/xtensa/mm/cache.c
  -> No ZONE_DEVICE support
- arch/powerpc/mm/init_64.c:vmemmap_free()
  -> Special-cases memmap on altmap
  -> Only a check for bootmem
- arch/x86/kernel/alternative.c:__text_poke()
  -> Only a WARN_ON(!PageReserved(pages[0])) to verify it is bootmem
- arch/x86/mm/init_64.c
  -> Only a check for bootmem

[1] https://lkml.org/lkml/2019/10/21/736
[2] https://lkml.org/lkml/2019/10/21/1034

Cc: Michal Hocko Cc: Dan Williams Cc: kvm-...@vger.kernel.org Cc: linuxppc-...@lists.ozlabs.org Cc: k...@vger.kernel.org Cc: linux-hyp...@vger.kernel.org Cc: de...@driverdev.osuosl.org Cc: xen-de...@lists.xenproject.org Cc: x...@kernel.org Cc: Alexander Duyck

David Hildenbrand (12):
  mm/memory_hotplug: Don't allow to online/offline memory blocks with holes
  mm/usercopy.c: Prepare check_page_span() for PG_reserved changes
  KVM: x86/mmu: Prepare kvm_is_mmio_pfn() for PG_reserved changes
  KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes
  vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes
  staging/gasket: Prepare gasket_release_page() for PG_reserved changes
  staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes
  powerpc/book3s: Prepare kvmppc_book3s_instantiate_page() for PG_reserved changes
  powerpc/64s: Prepare hash_page_do_lazy_icache() for PG_reserved changes
  powerpc/mm: Prepare maybe_pte_to_page() for PG_reserved changes
  x86/mm: Prepare __ior
Re: [PATCH RFCv2 0/4] mm/memory_hotplug: Introduce memory block types
On 20.12.18 14:08, Michal Hocko wrote: > On Thu 20-12-18 13:58:16, David Hildenbrand wrote: >> On 30.11.18 18:59, David Hildenbrand wrote: >>> This is the second approach, introducing more meaningful memory block >>> types and not changing online behavior in the kernel. It is based on >>> latest linux-next. >>> >>> As we found out during discussion, user space should always handle onlining >>> of memory, in any case. However, in order to make smart decisions in user >>> space about if and how to online memory, we have to export more information >>> about memory blocks. This way, we can formulate rules in user space. >>> >>> One such piece of information is the type of memory block we are talking about. >>> This helps to answer some questions like: >>> - Does this memory block belong to a DIMM? >>> - Can this DIMM theoretically ever be unplugged again? >>> - Was this memory added by a balloon driver that will rely on balloon >>> inflation to remove chunks of that memory again? Which zone is advised? >>> - Is this special standby memory on s390x that is usually not automatically >>> onlined? >>> >>> And in short it helps to answer to some extent (excluding zone imbalances) >>> - Should I online this memory block? >>> - To which zone should I online this memory block? >>> ... of course special use cases will result in different answers. But that's >>> why user space has control of onlining memory. >>> >>> More details can be found in Patch 1 and Patch 3. >>> Tested on x86 with hotplugged DIMMs. Cross-compiled for PPC and s390x. 
>>> >>> >>> Example: >>> $ udevadm info -q all -a /sys/devices/system/memory/memory0 >>> KERNEL=="memory0" >>> SUBSYSTEM=="memory" >>> DRIVER=="" >>> ATTR{online}=="1" >>> ATTR{phys_device}=="0" >>> ATTR{phys_index}=="" >>> ATTR{removable}=="0" >>> ATTR{state}=="online" >>> ATTR{type}=="boot" >>> ATTR{valid_zones}=="none" >>> $ udevadm info -q all -a /sys/devices/system/memory/memory90 >>> KERNEL=="memory90" >>> SUBSYSTEM=="memory" >>> DRIVER=="" >>> ATTR{online}=="1" >>> ATTR{phys_device}=="0" >>> ATTR{phys_index}=="005a" >>> ATTR{removable}=="1" >>> ATTR{state}=="online" >>> ATTR{type}=="dimm" >>> ATTR{valid_zones}=="Normal" >>> >>> >>> RFC -> RFCv2: >>> - Now also taking care of PPC (somehow missed it :/ ) >>> - Split the series up to some degree (some ideas on how to split up patch 3 >>> would be very welcome) >>> - Introduce more memory block types. Turns out abstracting too much was >>> rather confusing and not helpful. Properly document them. >>> >>> Notes: >>> - I wanted to convert the enum of types into a named enum but this >>> provoked all kinds of different errors. For now, I am doing it just like >>> the other types (e.g. online_type) we are using in that context. >>> - The "removable" property should never have been named like that. It >>> should have been "offlinable". Can we still rename that? E.g. boot memory >>> is sometimes marked as removable ... >>> >> >> >> Any feedback regarding the suggested block types would be very much >> appreciated! > > I still do not like this much to be honest. I just didn't get to think > through this properly. My fear is that this is conflating an actual API > with the current implementation and as such will cause problems in > future. But I haven't really looked into your patches closely so I might > be wrong. Anyway I won't be able to look into it by the end of year. 
> So I started to think about this again, and I guess somehow exposing an identification of the device driver that added the memory section could be sufficient. E.g. "hyperv", "xen", "acpi", "sclp", "virtio-mem" ... Via separate device driver interfaces, other information about the memory could be exposed. (e.g. for ACPI: which memory devices belong to one physical device). So stuff would not have to be centered around /sys/devices/system/memory/ , uglifying it for special cases. We would have to write udev rules to deal with these values, should be easy. If no DRIVER is given, it is simply memory detected during boot. ACPI changing the DRIVER might be tricky (from no DRIVER -> ACPI), but I guess it could be done. Now, the question would be how to get the DRIVER value in there. Adding a bunch of fake device drivers would work, however this might get a little messy ... and then there is unbinding and rebinding which can be triggered by userspace. Things to care about? Most probably not. -- Thanks, David / dhildenb ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
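The udev side of this idea could look roughly like the following rules file. This is purely hypothetical: memory block devices carry no DRIVER today, and the driver names and file name are assumptions taken from the mail above; it only illustrates the policy split being argued for.

```
# /etc/udev/rules.d/40-memory-online.rules (hypothetical sketch)

# Memory added by the Hyper-V balloon driver: online it, since it will
# be removed via balloon inflation rather than by offlining the block.
SUBSYSTEM=="memory", ACTION=="add", DRIVERS=="hyperv", ATTR{state}=="offline", ATTR{state}="online"

# s390x standby memory (sclp): leave offline, an admin onlines it
# explicitly.
SUBSYSTEM=="memory", ACTION=="add", DRIVERS=="sclp", GOTO="memory_end"

# No DRIVER: memory detected during boot, nothing to do.
LABEL="memory_end"
```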
Re: [PATCH v2 3/8] kexec: export PG_offline to VMCOREINFO
On 11.03.19 10:04, Dave Young wrote: > Hi David, > On 11/22/18 at 11:06am, David Hildenbrand wrote: >> Right now, pages inflated as part of a balloon driver will be dumped >> by dump tools like makedumpfile. While XEN is able to check in the >> crash kernel whether a certain pfn is actually backed by memory in the >> hypervisor (see xen_oldmem_pfn_is_ram) and optimize this case, dumps of >> other balloon inflated memory will essentially result in zero pages getting >> allocated by the hypervisor and the dump getting filled with this data. >> >> The allocation and reading of zero pages can directly be avoided if a >> dumping tool could know which pages only contain stale information not to >> be dumped. >> >> We now have PG_offline which can be (and already is by virtio-balloon) >> used for marking pages as logically offline. Follow up patches will >> make use of this flag also in other balloon implementations. >> >> Let's export PG_offline via PAGE_OFFLINE_MAPCOUNT_VALUE, so >> makedumpfile can directly skip pages that are logically offline and the >> content therefore stale. (we export it as a macro to match how it is >> done for PG_buddy. This way it is clearer that this is not actually a flag >> but only a very specific mapcount value to represent page types). >> >> Please note that this is also helpful for a problem we were seeing under >> Hyper-V: Dumping logically offline memory (pages kept fake offline while >> onlining a section via online_page_callback) would under some conditions >> result in a kernel panic when dumping them. >> >> Cc: Andrew Morton >> Cc: Dave Young >> Cc: "Kirill A. Shutemov" >> Cc: Baoquan He >> Cc: Omar Sandoval >> Cc: Arnd Bergmann >> Cc: Matthew Wilcox >> Cc: Michal Hocko >> Cc: "Michael S. Tsirkin" >> Cc: Lianbo Jiang >> Cc: Borislav Petkov >> Cc: Kazuhito Hagio >> Acked-by: Michael S. 
Tsirkin >> Acked-by: Dave Young >> Signed-off-by: David Hildenbrand >> --- >> kernel/crash_core.c | 2 ++ >> 1 file changed, 2 insertions(+) >> >> diff --git a/kernel/crash_core.c b/kernel/crash_core.c >> index 933cb3e45b98..093c9f917ed0 100644 >> --- a/kernel/crash_core.c >> +++ b/kernel/crash_core.c >> @@ -464,6 +464,8 @@ static int __init crash_save_vmcoreinfo_init(void) >> VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE); >> #ifdef CONFIG_HUGETLB_PAGE >> VMCOREINFO_NUMBER(HUGETLB_PAGE_DTOR); >> +#define PAGE_OFFLINE_MAPCOUNT_VALUE (~PG_offline) >> +VMCOREINFO_NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE); >> #endif >> >> arch_crash_save_vmcoreinfo(); > > The patch has been merged, would you mind sending a documentation patch > for the vmcoreinfo, which is added recently in > Documentation/kdump/vmcoreinfo.txt > > A brief description about how this vmcoreinfo field is used is good to > have. > Turns out, it was already documented: PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|PG_hwpoison |PG_head_mask|PAGE_BUDDY_MAPCOUNT_VALUE(~PG_buddy) |PAGE_OFFLINE_MAPCOUNT_VALUE(~PG_offline) - Page attributes. These flags are used to filter out various pages that are unnecessary to dump. Thanks! > Thanks > Dave > -- Thanks, David / dhildenb
Re: [PATCH v2] makedumpfile: exclude pages that are logically offline
On 27.11.18 17:32, Kazuhito Hagio wrote: >> Linux marks pages that are logically offline via a page flag (map count). >> Such pages e.g. include pages inflated as part of a balloon driver or >> pages that were not actually onlined when onlining the whole section. >> >> While the hypervisor usually allows reading such inflated memory, we >> basically read and dump data that is completely irrelevant. Also, this >> might result in quite some overhead in the hypervisor. In addition, >> we saw some problems under Hyper-V, whereby we can crash the kernel by >> dumping, when reading memory of a partially onlined memory segment >> (for memory added by the Hyper-V balloon driver). >> >> Therefore, don't read and dump pages that are marked as being logically >> offline. >> >> Signed-off-by: David Hildenbrand > > Thanks for the v2 update. > I'm going to merge this patch after the kernel patches are merged > and it tests fine with the kernel. > > Kazu Hi Kazu, the patches are now upstream. Thanks! -- Thanks, David / dhildenb
Re: [PATCH] hv_balloon: avoid touching uninitialized struct page during tail onlining
On 07.01.19 14:44, Vitaly Kuznetsov wrote: > David Hildenbrand writes: > >> On 04.01.19 15:19, Vitaly Kuznetsov wrote: >>> Hyper-V memory hotplug protocol has 2M granularity and in Linux x86 we use >>> 128M. To deal with it we implement partial section onlining by registering >>> custom page onlining callback (hv_online_page()). Later, when more memory >>> arrives we try to online the 'tail' (see hv_bring_pgs_online()). >>> >>> It was found that in some cases this 'tail' onlining causes issues: >>> >>> BUG: Bad page state in process kworker/0:2 pfn:109e3a >>> page:e08344278e80 count:0 mapcount:1 mapping: index:0x0 >>> flags: 0xf8000() >>> raw: 000f8000 dead0100 dead0200 >>> raw: >>> page dumped because: nonzero mapcount >>> ... >>> Workqueue: events hot_add_req [hv_balloon] >>> Call Trace: >>> dump_stack+0x5c/0x80 >>> bad_page.cold.112+0x7f/0xb2 >>> free_pcppages_bulk+0x4b8/0x690 >>> free_unref_page+0x54/0x70 >>> hv_page_online_one+0x5c/0x80 [hv_balloon] >>> hot_add_req.cold.24+0x182/0x835 [hv_balloon] >>> ... >>> >>> Turns out that we now have deferred struct page initialization for memory >>> hotplug so e.g. memory_block_action() in drivers/base/memory.c does >>> pages_correctly_probed() check and in that check it avoids inspecting >>> struct pages and checks sections instead. But in Hyper-V balloon driver we >>> do PageReserved(pfn_to_page()) check and this is now wrong. >>> >>> Switch to checking online_section_nr() instead. 
>>> >>> Signed-off-by: Vitaly Kuznetsov >>> --- >>> drivers/hv/hv_balloon.c | 10 ++ >>> 1 file changed, 6 insertions(+), 4 deletions(-) >>> >>> diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c >>> index 5301fef16c31..7c6349a50ef1 100644 >>> --- a/drivers/hv/hv_balloon.c >>> +++ b/drivers/hv/hv_balloon.c >>> @@ -888,12 +888,14 @@ static unsigned long handle_pg_range(unsigned long >>> pg_start, >>> pfn_cnt -= pgs_ol; >>> /* >>> * Check if the corresponding memory block is already >>> -* online by checking its last previously backed page. >>> -* In case it is we need to bring rest (which was not >>> -* backed previously) online too. >>> +* online. It is possible to observe struct pages still >>> +* being uninitialized here so check section instead. >>> +* In case the section is online we need to bring the >>> +* rest of pfns (which were not backed previously) >>> +* online too. >>> */ >>> if (start_pfn > has->start_pfn && >>> - !PageReserved(pfn_to_page(start_pfn - 1))) >>> + online_section_nr(pfn_to_section_nr(start_pfn))) >>> hv_bring_pgs_online(has, start_pfn, pgs_ol); >>> >>> } >>> >> >> I wonder if you should use pfn_to_online_page() and check for PageOffline(). >> >> (I guess online_section_nr() should also do the trick) > > I'm worried a bit about racing with mm code here as we're not doing > mem_hotplug_begin()/done() so I'd slightly prefer keeping > online_section_nr() (pfn_to_online_page() also uses it but then it gets > to the particular struct page). Moreover, with pfn_to_online_page() we > will be looking at some other pfn - because the start_pfn is definitelly > offline (pre-patch we were looking at start_pfn-1). Just looking at the > whole section seems cleaner. Fine with me. I guess the section can never be offlined as it still contains reserved pages if not fully "fake-onlined" by HV code already. But we could still consider mem_hotplug_begin()/done() as we could have a online section although online_pages() has not completed yet. 
So we could actually touch some "semi-onlined section". Acked-by: David Hildenbrand > > P.S. I still think about bringing mem_hotplug_begin()/done() to > hv_balloon but that's going to be a separate discussion, here I want to > have a small fix backportable to stable. > > Thanks, > -- Thanks, David / dhildenb
Re: [PATCH] hv_balloon: avoid touching uninitialized struct page during tail onlining
On 04.01.19 15:19, Vitaly Kuznetsov wrote: > Hyper-V memory hotplug protocol has 2M granularity and in Linux x86 we use > 128M. To deal with it we implement partial section onlining by registering > custom page onlining callback (hv_online_page()). Later, when more memory > arrives we try to online the 'tail' (see hv_bring_pgs_online()). > > It was found that in some cases this 'tail' onlining causes issues: > > BUG: Bad page state in process kworker/0:2 pfn:109e3a > page:e08344278e80 count:0 mapcount:1 mapping: index:0x0 > flags: 0xf8000() > raw: 000f8000 dead0100 dead0200 > raw: > page dumped because: nonzero mapcount > ... > Workqueue: events hot_add_req [hv_balloon] > Call Trace: > dump_stack+0x5c/0x80 > bad_page.cold.112+0x7f/0xb2 > free_pcppages_bulk+0x4b8/0x690 > free_unref_page+0x54/0x70 > hv_page_online_one+0x5c/0x80 [hv_balloon] > hot_add_req.cold.24+0x182/0x835 [hv_balloon] > ... > > Turns out that we now have deferred struct page initialization for memory > hotplug so e.g. memory_block_action() in drivers/base/memory.c does > pages_correctly_probed() check and in that check it avoids inspecting > struct pages and checks sections instead. But in Hyper-V balloon driver we > do PageReserved(pfn_to_page()) check and this is now wrong. > > Switch to checking online_section_nr() instead. > > Signed-off-by: Vitaly Kuznetsov > --- > drivers/hv/hv_balloon.c | 10 ++ > 1 file changed, 6 insertions(+), 4 deletions(-) > > diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c > index 5301fef16c31..7c6349a50ef1 100644 > --- a/drivers/hv/hv_balloon.c > +++ b/drivers/hv/hv_balloon.c > @@ -888,12 +888,14 @@ static unsigned long handle_pg_range(unsigned long > pg_start, > pfn_cnt -= pgs_ol; > /* >* Check if the corresponding memory block is already > - * online by checking its last previously backed page. > - * In case it is we need to bring rest (which was not > - * backed previously) online too. > + * online. 
It is possible to observe struct pages still > + * being uninitialized here so check section instead. > + * In case the section is online we need to bring the > + * rest of pfns (which were not backed previously) > + * online too. >*/ > if (start_pfn > has->start_pfn && > - !PageReserved(pfn_to_page(start_pfn - 1))) > + online_section_nr(pfn_to_section_nr(start_pfn))) > hv_bring_pgs_online(has, start_pfn, pgs_ol); > > } > I wonder if you should use pfn_to_online_page() and check for PageOffline(). (I guess online_section_nr() should also do the trick) -- Thanks, David / dhildenb
Re: [PATCH RFCv2 0/4] mm/memory_hotplug: Introduce memory block types
On 20.12.18 14:08, Michal Hocko wrote: > On Thu 20-12-18 13:58:16, David Hildenbrand wrote: >> On 30.11.18 18:59, David Hildenbrand wrote: >>> This is the second approach, introducing more meaningful memory block >>> types and not changing online behavior in the kernel. It is based on >>> latest linux-next. >>> >>> As we found out during discussion, user space should always handle onlining >>> of memory, in any case. However, in order to make smart decisions in user >>> space about if and how to online memory, we have to export more information >>> about memory blocks. This way, we can formulate rules in user space. >>> >>> One such piece of information is the type of memory block we are talking about. >>> This helps to answer some questions like: >>> - Does this memory block belong to a DIMM? >>> - Can this DIMM theoretically ever be unplugged again? >>> - Was this memory added by a balloon driver that will rely on balloon >>> inflation to remove chunks of that memory again? Which zone is advised? >>> - Is this special standby memory on s390x that is usually not automatically >>> onlined? >>> >>> And in short it helps to answer to some extent (excluding zone imbalances) >>> - Should I online this memory block? >>> - To which zone should I online this memory block? >>> ... of course special use cases will result in different answers. But that's >>> why user space has control of onlining memory. >>> >>> More details can be found in Patch 1 and Patch 3. >>> Tested on x86 with hotplugged DIMMs. Cross-compiled for PPC and s390x. 
>>> >>> >>> Example: >>> $ udevadm info -q all -a /sys/devices/system/memory/memory0 >>> KERNEL=="memory0" >>> SUBSYSTEM=="memory" >>> DRIVER=="" >>> ATTR{online}=="1" >>> ATTR{phys_device}=="0" >>> ATTR{phys_index}=="" >>> ATTR{removable}=="0" >>> ATTR{state}=="online" >>> ATTR{type}=="boot" >>> ATTR{valid_zones}=="none" >>> $ udevadm info -q all -a /sys/devices/system/memory/memory90 >>> KERNEL=="memory90" >>> SUBSYSTEM=="memory" >>> DRIVER=="" >>> ATTR{online}=="1" >>> ATTR{phys_device}=="0" >>> ATTR{phys_index}=="005a" >>> ATTR{removable}=="1" >>> ATTR{state}=="online" >>> ATTR{type}=="dimm" >>> ATTR{valid_zones}=="Normal" >>> >>> >>> RFC -> RFCv2: >>> - Now also taking care of PPC (somehow missed it :/ ) >>> - Split the series up to some degree (some ideas on how to split up patch 3 >>> would be very welcome) >>> - Introduce more memory block types. Turns out abstracting too much was >>> rather confusing and not helpful. Properly document them. >>> >>> Notes: >>> - I wanted to convert the enum of types into a named enum but this >>> provoked all kinds of different errors. For now, I am doing it just like >>> the other types (e.g. online_type) we are using in that context. >>> - The "removable" property should never have been named like that. It >>> should have been "offlinable". Can we still rename that? E.g. boot memory >>> is sometimes marked as removable ... >>> >> >> >> Any feedback regarding the suggested block types would be very much >> appreciated! > > I still do not like this much to be honest. I just didn't get to think > through this properly. My fear is that this is conflating an actual API > with the current implementation and as such will cause problems in > future. But I haven't really looked into your patches closely so I might > be wrong. Anyway I won't be able to look into it by the end of year. > I guess as long as we have memory block devices and we expect user space to make a decision we will have this API and the involved problems. 
I am open for alternatives, and as I said, any feedback on how to sort this out will be highly appreciated. I'll be on vacation for the next two weeks, so this can wait. Just wanted to note that I am still interested in feedback :) -- Thanks, David / dhildenb
Re: [PATCH RFCv2 0/4] mm/memory_hotplug: Introduce memory block types
On 30.11.18 18:59, David Hildenbrand wrote: > This is the second approach, introducing more meaningful memory block > types and not changing online behavior in the kernel. It is based on > latest linux-next. > > As we found out during discussion, user space should always handle onlining > of memory, in any case. However, in order to make smart decisions in user > space about if and how to online memory, we have to export more information > about memory blocks. This way, we can formulate rules in user space. > > One such piece of information is the type of memory block we are talking about. > This helps to answer some questions like: > - Does this memory block belong to a DIMM? > - Can this DIMM theoretically ever be unplugged again? > - Was this memory added by a balloon driver that will rely on balloon > inflation to remove chunks of that memory again? Which zone is advised? > - Is this special standby memory on s390x that is usually not automatically > onlined? > > And in short it helps to answer to some extent (excluding zone imbalances) > - Should I online this memory block? > - To which zone should I online this memory block? > ... of course special use cases will result in different answers. But that's > why user space has control of onlining memory. > > More details can be found in Patch 1 and Patch 3. > Tested on x86 with hotplugged DIMMs. Cross-compiled for PPC and s390x. 
> > > Example: > $ udevadm info -q all -a /sys/devices/system/memory/memory0 > KERNEL=="memory0" > SUBSYSTEM=="memory" > DRIVER=="" > ATTR{online}=="1" > ATTR{phys_device}=="0" > ATTR{phys_index}=="" > ATTR{removable}=="0" > ATTR{state}=="online" > ATTR{type}=="boot" > ATTR{valid_zones}=="none" > $ udevadm info -q all -a /sys/devices/system/memory/memory90 > KERNEL=="memory90" > SUBSYSTEM=="memory" > DRIVER=="" > ATTR{online}=="1" > ATTR{phys_device}=="0" > ATTR{phys_index}=="005a" > ATTR{removable}=="1" > ATTR{state}=="online" > ATTR{type}=="dimm" > ATTR{valid_zones}=="Normal" > > > RFC -> RFCv2: > - Now also taking care of PPC (somehow missed it :/ ) > - Split the series up to some degree (some ideas on how to split up patch 3 > would be very welcome) > - Introduce more memory block types. Turns out abstracting too much was > rather confusing and not helpful. Properly document them. > > Notes: > - I wanted to convert the enum of types into a named enum but this > provoked all kinds of different errors. For now, I am doing it just like > the other types (e.g. online_type) we are using in that context. > - The "removable" property should never have been named like that. It > should have been "offlinable". Can we still rename that? E.g. boot memory > is sometimes marked as removable ... > Any feedback regarding the suggested block types would be very much appreciated! -- Thanks, David / dhildenb
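With the proposed type attribute, the distribution policy this cover letter argues for could be sketched as udev rules along these lines. This is hypothetical: only the "boot" and "dimm" strings appear in the example output above; any other value is an assumption, and the rules are untested.

```
# Hypothetical policy on top of the RFC's proposed ATTR{type}:

# DIMMs: online to ZONE_NORMAL so the whole block can later be offlined
# and unplugged again.
SUBSYSTEM=="memory", ACTION=="add", ATTR{type}=="dimm", ATTR{state}=="offline", ATTR{state}="online"

# Boot memory is already online; nothing to do for ATTR{type}=="boot".
# Standby memory (s390x) would be matched by its own type string and
# deliberately left offline until an admin requests it.
```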
Re: [PATCH RFCv2 3/4] mm/memory_hotplug: Introduce and use more memory types
On 04.12.18 10:44, Michal Suchánek wrote: > On Fri, 30 Nov 2018 18:59:21 +0100 > David Hildenbrand wrote: > >> Let's introduce new types for different kinds of memory blocks and use >> them in existing code. As I don't see an easy way to split this up, >> do it in one hunk for now. >> >> acpi: >> Use DIMM or DIMM_UNREMOVABLE depending on hotremove support in the kernel. >> Properly change the type when trying to add memory that was already >> detected and used during boot (so this memory will correctly end up as >> "acpi" in user space). >> >> pseries: >> Use DIMM or DIMM_UNREMOVABLE depending on hotremove support in the kernel. >> As far as I see, handling like in the acpi case for existing blocks is >> not required. >> >> probed memory from user space: >> Use DIMM_UNREMOVABLE as there is no interface to get rid of this code >> again. >> >> hv_balloon,xen/balloon: >> Use BALLOON. As simple as that :) >> >> s390x/sclp: >> Use a dedicated type S390X_STANDBY as this type of memory and its >> semantics are very s390x specific. >> >> powernv/memtrace: >> Only allow to use BOOT memory for memtrace. I consider this code in >> general dangerous, but we have to keep it working ... most probably just >> a debug feature. > > I don't think it should be arbitrarily restricted like that. > Well, code that "randomly" offlines/onlines/removes/adds memory blocks that it does not own (hint: nobody else in the kernel does that) should be restricted to types we can guarantee to work. > Thanks > > Michal > -- Thanks, David / dhildenb
Re: [PATCH RFCv2 2/4] mm/memory_hotplug: Replace "bool want_memblock" by "int type"
On 01.12.18 02:50, Wei Yang wrote: > On Fri, Nov 30, 2018 at 06:59:20PM +0100, David Hildenbrand wrote: >> Let's pass a memory block type instead. Pass "MEMORY_BLOCK_NONE" for device >> memory and for now "MEMORY_BLOCK_UNSPECIFIED" for anything else. No >> functional change. > > I would suggest putting more words into this. Sure, makes sense, I'll add more details. Thanks! > > " > Function arch_add_memory()'s last parameter *want_memblock* is used to > determine whether it is necessary to create a corresponding memory block > device. After introducing the memory block type, this patch replaces the > bool type *want_memblock* with a memory block type, with the following rules > for now: > > * Pass "MEMORY_BLOCK_NONE" for device memory > * Pass "MEMORY_BLOCK_UNSPECIFIED" for anything else > > Since this parameter is passed deep to __add_section(), all its > descendants are affected. Below lists those descendants. > > arch_add_memory() > add_pages() > __add_pages() > __add_section() > > " [...] -- Thanks, David / dhildenb
Re: [PATCH RFCv2 1/4] mm/memory_hotplug: Introduce memory block types
On 01.12.18 02:25, Wei Yang wrote: > On Fri, Nov 30, 2018 at 06:59:19PM +0100, David Hildenbrand wrote: >> Memory onlining should always be handled by user space, because only user >> space knows which use cases it wants to satisfy. E.g. memory might be >> onlined to the MOVABLE zone even if it can never be removed from the >> system, e.g. to make usage of huge pages more reliable. >> >> However to implement such rules (especially default rules in distributions) >> we need more information about the memory that was added in user space. >> >> E.g. on x86 we want to online memory provided by balloon devices (e.g. >> XEN, Hyper-V) differently (-> will not be unplugged by offlining the whole >> block) than ordinary DIMMs (-> might eventually be unplugged by offlining >> the whole block). This might also become relevant for other architectures. >> >> Also, udev rules right now check if running on s390x and treat all added >> memory blocks as standby memory (-> don't online automatically). As soon as >> we support other memory hotplug mechanisms (e.g. virtio-mem) checks would >> have to get more involved (e.g. also check if under KVM) but eventually >> also be wrong (e.g. if KVM ever supports standby memory we are doomed). >> >> I decided to allow specifying the type of memory that is getting added >> to the system. Let's start with two types, BOOT and UNSPECIFIED to get the >> basic infrastructure running. We'll introduce and use further types in >> follow-up patches. For now we classify any hotplugged memory temporarily >> as UNSPECIFIED (which will eventually be dropped later on). >> >> Cc: Greg Kroah-Hartman >> Cc: "Rafael J. 
Wysocki" >> Cc: Andrew Morton >> Cc: Ingo Molnar >> Cc: Pavel Tatashin >> Cc: Stephen Rothwell >> Cc: Andrew Banman >> Cc: "mike.tra...@hpe.com" >> Cc: Oscar Salvador >> Cc: Dave Hansen >> Cc: Michal Hocko >> Cc: Michal Suchánek >> Cc: Vitaly Kuznetsov >> Cc: Dan Williams >> Cc: Pavel Tatashin >> Cc: Martin Schwidefsky >> Cc: Heiko Carstens >> Signed-off-by: David Hildenbrand >> --- >> drivers/base/memory.c | 38 +++--- >> include/linux/memory.h | 27 +++ >> 2 files changed, 62 insertions(+), 3 deletions(-) >> >> diff --git a/drivers/base/memory.c b/drivers/base/memory.c >> index 0c290f86ab20..17f2985c07c5 100644 >> --- a/drivers/base/memory.c >> +++ b/drivers/base/memory.c >> @@ -381,6 +381,29 @@ static ssize_t show_phys_device(struct device *dev, >> return sprintf(buf, "%d\n", mem->phys_device); >> } >> >> +static ssize_t type_show(struct device *dev, struct device_attribute *attr, >> + char *buf) >> +{ >> +struct memory_block *mem = to_memory_block(dev); >> +ssize_t len = 0; >> + >> +switch (mem->type) { >> +case MEMORY_BLOCK_UNSPECIFIED: >> +len = sprintf(buf, "unspecified\n"); >> +break; >> +case MEMORY_BLOCK_BOOT: >> +len = sprintf(buf, "boot\n"); >> +break; >> +default: >> +len = sprintf(buf, "ERROR-UNKNOWN-%ld\n", >> +mem->state); >> +WARN_ON(1); >> +break; >> +} >> + >> +return len; >> +} >> + >> #ifdef CONFIG_MEMORY_HOTREMOVE >> static void print_allowed_zone(char *buf, int nid, unsigned long start_pfn, >> unsigned long nr_pages, int online_type, >> @@ -442,6 +465,7 @@ static DEVICE_ATTR(phys_index, 0444, >> show_mem_start_phys_index, NULL); >> static DEVICE_ATTR(state, 0644, show_mem_state, store_mem_state); >> static DEVICE_ATTR(phys_device, 0444, show_phys_device, NULL); >> static DEVICE_ATTR(removable, 0444, show_mem_removable, NULL); >> +static DEVICE_ATTR_RO(type); > > This is correct, but it looks inconsistent with the other attributes. 
> > Not that beautiful :-) I might change the other ones first, too (or keep this one consistent to the existing ones). Thanks! > >> >> /* >> * Block size attribute stuff >> @@ -620,6 +644,7 @@ static struct attribute *memory_memblk_attrs[] = { >> _attr_state.attr, >> _attr_phys_device.attr, >> _attr_removable.attr, >> +_attr_type.attr, >> #ifdef CONFIG_MEMORY_HOTREMOVE >> _attr_valid_zones.attr, >> #endif >> @@ -657,13 +682,17 @@ int register_memory(struct memory_blo
[PATCH RFCv2 3/4] mm/memory_hotplug: Introduce and use more memory types
Let's introduce new types for different kinds of memory blocks and use them in existing code. As I don't see an easy way to split this up, do it in one hunk for now. acpi: Use DIMM or DIMM_UNREMOVABLE depending on hotremove support in the kernel. Properly change the type when trying to add memory that was already detected and used during boot (so this memory will correctly end up as "acpi" in user space). pseries: Use DIMM or DIMM_UNREMOVABLE depending on hotremove support in the kernel. As far as I see, handling like in the acpi case for existing blocks is not required. probed memory from user space: Use DIMM_UNREMOVABLE as there is no interface to get rid of this code again. hv_balloon,xen/balloon: Use BALLOON. As simple as that :) s390x/sclp: Use a dedicated type S390X_STANDBY as this type of memory and its semantics are very s390x specific. powernv/memtrace: Only allow to use BOOT memory for memtrace. I consider this code in general dangerous, but we have to keep it working ... most probably just a debug feature. Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Michael Ellerman Cc: "Rafael J. Wysocki" Cc: Len Brown Cc: Greg Kroah-Hartman Cc: "K. Y. Srinivasan" Cc: Haiyang Zhang Cc: Stephen Hemminger Cc: Martin Schwidefsky Cc: Heiko Carstens Cc: Boris Ostrovsky Cc: Juergen Gross Cc: Stefano Stabellini Cc: Rashmica Gupta Cc: Andrew Morton Cc: Pavel Tatashin Cc: Balbir Singh Cc: Michael Neuling Cc: Nathan Fontenot Cc: YueHaibing Cc: Vasily Gorbik Cc: Ingo Molnar Cc: Stephen Rothwell Cc: "mike.tra...@hpe.com" Cc: Oscar Salvador Cc: Joonsoo Kim Cc: Mathieu Malaterre Cc: Michal Hocko Cc: Arun KS Cc: Andrew Banman Cc: Dave Hansen Cc: Michal Suchánek Cc: Vitaly Kuznetsov Cc: Dan Williams Signed-off-by: David Hildenbrand --- At first I tried to abstract the types quite a lot, but I think there are subtle differences that are worth differentiating. More details about the types can be found in the extensive documentation. 
It is worth noting that BALLOON_MOVABLE has no user yet, but I have something in mind that might want to make use of that (virtio-mem). Just included it to discuss the general approach. I can drop it from this patch. --- arch/powerpc/platforms/powernv/memtrace.c | 9 ++-- .../platforms/pseries/hotplug-memory.c| 7 ++- drivers/acpi/acpi_memhotplug.c| 16 ++- drivers/base/memory.c | 18 ++- drivers/hv/hv_balloon.c | 3 +- drivers/s390/char/sclp_cmd.c | 3 +- drivers/xen/balloon.c | 2 +- include/linux/memory.h| 47 ++- include/linux/memory_hotplug.h| 6 +-- mm/memory_hotplug.c | 15 +++--- 10 files changed, 104 insertions(+), 22 deletions(-) diff --git a/arch/powerpc/platforms/powernv/memtrace.c b/arch/powerpc/platforms/powernv/memtrace.c index 248a38ad25c7..5d08db87091e 100644 --- a/arch/powerpc/platforms/powernv/memtrace.c +++ b/arch/powerpc/platforms/powernv/memtrace.c @@ -54,9 +54,9 @@ static const struct file_operations memtrace_fops = { .open = simple_open, }; -static int check_memblock_online(struct memory_block *mem, void *arg) +static int check_memblock_boot_and_online(struct memory_block *mem, void *arg) { - if (mem->state != MEM_ONLINE) + if (mem->type != MEM_BLOCK_BOOT || mem->state != MEM_ONLINE) return -1; return 0; @@ -77,7 +77,7 @@ static bool memtrace_offline_pages(u32 nid, u64 start_pfn, u64 nr_pages) u64 end_pfn = start_pfn + nr_pages - 1; if (walk_memory_range(start_pfn, end_pfn, NULL, - check_memblock_online)) + check_memblock_boot_and_online)) return false; walk_memory_range(start_pfn, end_pfn, (void *)MEM_GOING_OFFLINE, @@ -233,7 +233,8 @@ static int memtrace_online(void) ent->mem = 0; } - if (add_memory(ent->nid, ent->start, ent->size)) { + if (add_memory(ent->nid, ent->start, ent->size, + MEMORY_BLOCK_BOOT)) { pr_err("Failed to add trace memory to node %d\n", ent->nid); ret += 1; diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c index 2a983b5a52e1..5f91359c7993 100644 ---
a/arch/powerpc/platforms/pseries/hotplug-memory.c +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c @@ -651,7 +651,7 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index) static int dlpar_add_lmb(struct drmem_lmb *lmb) { unsigned long block_sz; - int nid, rc; + int nid, rc, type = MEMORY_BLOCK_DIMM; if (lmb->flags & DRCONF_MEM_ASSI
[PATCH RFCv2 4/4] mm/memory_hotplug: Drop MEMORY_TYPE_UNSPECIFIED
Now that we have proper types for all users, we can drop this one. Cc: Greg Kroah-Hartman Cc: "Rafael J. Wysocki" Cc: Andrew Morton Cc: Ingo Molnar Cc: Pavel Tatashin Cc: Stephen Rothwell Cc: Andrew Banman Cc: "mike.tra...@hpe.com" Cc: Oscar Salvador Cc: Dave Hansen Cc: Michal Hocko Cc: Michal Suchánek Cc: Vitaly Kuznetsov Cc: Dan Williams Cc: Pavel Tatashin Signed-off-by: David Hildenbrand --- drivers/base/memory.c | 3 --- include/linux/memory.h | 5 - 2 files changed, 8 deletions(-) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index c5fdca7a3009..a6e524f0ea38 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -388,9 +388,6 @@ static ssize_t type_show(struct device *dev, struct device_attribute *attr, ssize_t len = 0; switch (mem->type) { - case MEMORY_BLOCK_UNSPECIFIED: - len = sprintf(buf, "unspecified\n"); - break; case MEMORY_BLOCK_BOOT: len = sprintf(buf, "boot\n"); break; diff --git a/include/linux/memory.h b/include/linux/memory.h index a3a1e9764805..11679622f743 100644 --- a/include/linux/memory.h +++ b/include/linux/memory.h @@ -50,10 +50,6 @@ int set_memory_block_size_order(unsigned int order); * No memory block is to be created (e.g. device memory). Not exposed to * user space. * - * MEMORY_BLOCK_UNSPECIFIED: * The type of memory block was not further specified when adding the - * memory block. - * * MEMORY_BLOCK_BOOT: * This memory block was added during boot by the basic system. No * specific device driver takes care of this memory block. This memory @@ -103,7 +99,6 @@ int set_memory_block_size_order(unsigned int order); */ enum { MEMORY_BLOCK_NONE = 0, - MEMORY_BLOCK_UNSPECIFIED, MEMORY_BLOCK_BOOT, MEMORY_BLOCK_DIMM, MEMORY_BLOCK_DIMM_UNREMOVABLE, -- 2.17.2 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH RFCv2 1/4] mm/memory_hotplug: Introduce memory block types
Memory onlining should always be handled by user space, because only user space knows which use cases it wants to satisfy. E.g. memory might be onlined to the MOVABLE zone even if it can never be removed from the system, e.g. to make usage of huge pages more reliable.

However, to implement such rules (especially default rules in distributions) we need more information in user space about the memory that was added. E.g. on x86 we want to online memory provided by balloon devices (e.g. XEN, Hyper-V) differently (-> will not be unplugged by offlining the whole block) than ordinary DIMMs (-> might eventually be unplugged by offlining the whole block). This might also become relevant for other architectures.

Also, udev rules right now check if running on s390x and treat all added memory blocks as standby memory (-> don't online automatically). As soon as we support other memory hotplug mechanisms (e.g. virtio-mem), the checks would have to get more involved (e.g. also check if running under KVM) and would eventually also be wrong (e.g. if KVM ever supports standby memory we are doomed).

I decided to allow specifying the type of memory that is getting added to the system. Let's start with two types, BOOT and UNSPECIFIED, to get the basic infrastructure running. We'll introduce and use further types in follow-up patches. For now we classify any hotplugged memory temporarily as UNSPECIFIED (which will eventually be dropped later on). Cc: Greg Kroah-Hartman Cc: "Rafael J.
Wysocki" Cc: Andrew Morton Cc: Ingo Molnar Cc: Pavel Tatashin Cc: Stephen Rothwell Cc: Andrew Banman Cc: "mike.tra...@hpe.com" Cc: Oscar Salvador Cc: Dave Hansen Cc: Michal Hocko Cc: Michal Suchánek Cc: Vitaly Kuznetsov Cc: Dan Williams Cc: Pavel Tatashin Cc: Martin Schwidefsky Cc: Heiko Carstens Signed-off-by: David Hildenbrand --- drivers/base/memory.c | 38 +++--- include/linux/memory.h | 27 +++ 2 files changed, 62 insertions(+), 3 deletions(-) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index 0c290f86ab20..17f2985c07c5 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -381,6 +381,29 @@ static ssize_t show_phys_device(struct device *dev, return sprintf(buf, "%d\n", mem->phys_device); } +static ssize_t type_show(struct device *dev, struct device_attribute *attr, +char *buf) +{ + struct memory_block *mem = to_memory_block(dev); + ssize_t len = 0; + + switch (mem->type) { + case MEMORY_BLOCK_UNSPECIFIED: + len = sprintf(buf, "unspecified\n"); + break; + case MEMORY_BLOCK_BOOT: + len = sprintf(buf, "boot\n"); + break; + default: + len = sprintf(buf, "ERROR-UNKNOWN-%ld\n", + mem->state); + WARN_ON(1); + break; + } + + return len; +} + #ifdef CONFIG_MEMORY_HOTREMOVE static void print_allowed_zone(char *buf, int nid, unsigned long start_pfn, unsigned long nr_pages, int online_type, @@ -442,6 +465,7 @@ static DEVICE_ATTR(phys_index, 0444, show_mem_start_phys_index, NULL); static DEVICE_ATTR(state, 0644, show_mem_state, store_mem_state); static DEVICE_ATTR(phys_device, 0444, show_phys_device, NULL); static DEVICE_ATTR(removable, 0444, show_mem_removable, NULL); +static DEVICE_ATTR_RO(type); /* * Block size attribute stuff @@ -620,6 +644,7 @@ static struct attribute *memory_memblk_attrs[] = { _attr_state.attr, _attr_phys_device.attr, _attr_removable.attr, + _attr_type.attr, #ifdef CONFIG_MEMORY_HOTREMOVE _attr_valid_zones.attr, #endif @@ -657,13 +682,17 @@ int register_memory(struct memory_block *memory) } static int init_memory_block(struct 
memory_block **memory, -struct mem_section *section, unsigned long state) +struct mem_section *section, unsigned long state, +int type) { struct memory_block *mem; unsigned long start_pfn; int scn_nr; int ret = 0; + if (type == MEMORY_BLOCK_NONE) + return -EINVAL; + mem = kzalloc(sizeof(*mem), GFP_KERNEL); if (!mem) return -ENOMEM; @@ -675,6 +704,7 @@ static int init_memory_block(struct memory_block **memory, mem->state = state; start_pfn = section_nr_to_pfn(mem->start_section_nr); mem->phys_device = arch_get_memory_phys_device(start_pfn); + mem->type = type; ret = register_memory(mem); @@ -699,7 +729,8 @@ static int add_memory_block(int base_section_nr) if (section_count == 0) return 0; - ret = init_memory_block(, __nr_to_section(section_nr), MEM_ONLINE); + ret = init_memory_block(, __nr_to_section(se
[PATCH RFCv2 2/4] mm/memory_hotplug: Replace "bool want_memblock" by "int type"
Let's pass a memory block type instead. Pass "MEMORY_BLOCK_NONE" for device memory and for now "MEMORY_BLOCK_UNSPECIFIED" for anything else. No functional change. Cc: Tony Luck Cc: Fenghua Yu Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Michael Ellerman Cc: Martin Schwidefsky Cc: Heiko Carstens Cc: Yoshinori Sato Cc: Rich Felker Cc: Dave Hansen Cc: Andy Lutomirski Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: x...@kernel.org Cc: Greg Kroah-Hartman Cc: "Rafael J. Wysocki" Cc: Andrew Morton Cc: Mike Rapoport Cc: Michal Hocko Cc: Dan Williams Cc: "Kirill A. Shutemov" Cc: Oscar Salvador Cc: Nicholas Piggin Cc: Stephen Rothwell Cc: Christophe Leroy Cc: "Jonathan Neuschäfer" Cc: Mauricio Faria de Oliveira Cc: Vasily Gorbik Cc: Arun KS Cc: Rob Herring Cc: Pavel Tatashin Cc: "mike.tra...@hpe.com" Cc: Joonsoo Kim Cc: Wei Yang Cc: Logan Gunthorpe Cc: "Jérôme Glisse" Cc: "Jan H. Schönherr" Cc: Dave Jiang Cc: Matthew Wilcox Cc: Mathieu Malaterre Signed-off-by: David Hildenbrand --- arch/ia64/mm/init.c| 4 ++-- arch/powerpc/mm/mem.c | 4 ++-- arch/s390/mm/init.c| 4 ++-- arch/sh/mm/init.c | 4 ++-- arch/x86/mm/init_32.c | 4 ++-- arch/x86/mm/init_64.c | 8 drivers/base/memory.c | 11 +++ include/linux/memory.h | 2 +- include/linux/memory_hotplug.h | 12 ++-- kernel/memremap.c | 6 -- mm/memory_hotplug.c| 16 11 files changed, 40 insertions(+), 35 deletions(-) diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c index 904fe55e10fc..408635d2902f 100644 --- a/arch/ia64/mm/init.c +++ b/arch/ia64/mm/init.c @@ -646,13 +646,13 @@ mem_init (void) #ifdef CONFIG_MEMORY_HOTPLUG int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap, - bool want_memblock) + int type) { unsigned long start_pfn = start >> PAGE_SHIFT; unsigned long nr_pages = size >> PAGE_SHIFT; int ret; - ret = __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock); + ret = __add_pages(nid, start_pfn, nr_pages, altmap, type); if (ret) printk("%s: 
Problem encountered in __add_pages() as ret=%d\n", __func__, ret); diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c index b3c9ee5c4f78..e394637da270 100644 --- a/arch/powerpc/mm/mem.c +++ b/arch/powerpc/mm/mem.c @@ -118,7 +118,7 @@ int __weak remove_section_mapping(unsigned long start, unsigned long end) } int __meminit arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap, - bool want_memblock) + int type) { unsigned long start_pfn = start >> PAGE_SHIFT; unsigned long nr_pages = size >> PAGE_SHIFT; @@ -135,7 +135,7 @@ int __meminit arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap * } flush_inval_dcache_range(start, start + size); - return __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock); + return __add_pages(nid, start_pfn, nr_pages, altmap, type); } #ifdef CONFIG_MEMORY_HOTREMOVE diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c index 3e82f66d5c61..ba2c56328e6d 100644 --- a/arch/s390/mm/init.c +++ b/arch/s390/mm/init.c @@ -225,7 +225,7 @@ device_initcall(s390_cma_mem_init); #endif /* CONFIG_CMA */ int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap, - bool want_memblock) + int type) { unsigned long start_pfn = PFN_DOWN(start); unsigned long size_pages = PFN_DOWN(size); @@ -235,7 +235,7 @@ int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap, if (rc) return rc; - rc = __add_pages(nid, start_pfn, size_pages, altmap, want_memblock); + rc = __add_pages(nid, start_pfn, size_pages, altmap, type); if (rc) vmem_remove_mapping(start, size); return rc; diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c index 1a483a008872..5fbb8724e0f2 100644 --- a/arch/sh/mm/init.c +++ b/arch/sh/mm/init.c @@ -419,14 +419,14 @@ void free_initrd_mem(unsigned long start, unsigned long end) #ifdef CONFIG_MEMORY_HOTPLUG int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap, - bool want_memblock) + int type) { unsigned long start_pfn = PFN_DOWN(start); unsigned 
long nr_pages = size >> PAGE_SHIFT; int ret; /* We only have ZONE_NORMAL, so this is easy.. */ - ret = __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock); + ret = __add_pages(nid, start_pfn, nr_pages, altmap, typ
[PATCH RFCv2 0/4] mm/memory_hotplug: Introduce memory block types
This is the second approach, introducing more meaningful memory block types and not changing online behavior in the kernel. It is based on latest linux-next.

As we found out during discussion, user space should always handle onlining of memory, in any case. However, in order to make smart decisions in user space about if and how to online memory, we have to export more information about memory blocks. This way, we can formulate rules in user space. One such piece of information is the type of memory block we are talking about. This helps to answer some questions like:
- Does this memory block belong to a DIMM?
- Can this DIMM theoretically ever be unplugged again?
- Was this memory added by a balloon driver that will rely on balloon inflation to remove chunks of that memory again? Which zone is advised?
- Is this special standby memory on s390x that is usually not automatically onlined?

And in short it helps to answer, to some extent (excluding zone imbalances):
- Should I online this memory block?
- To which zone should I online this memory block?
... of course, special use cases will result in different answers. But that's why user space has control of onlining memory.

More details can be found in Patch 1 and Patch 3. Tested on x86 with hotplugged DIMMs. Cross-compiled for PPC and s390x.
Example: $ udevadm info -q all -a /sys/devices/system/memory/memory0 KERNEL=="memory0" SUBSYSTEM=="memory" DRIVER=="" ATTR{online}=="1" ATTR{phys_device}=="0" ATTR{phys_index}=="" ATTR{removable}=="0" ATTR{state}=="online" ATTR{type}=="boot" ATTR{valid_zones}=="none" $ udevadm info -q all -a /sys/devices/system/memory/memory90 KERNEL=="memory90" SUBSYSTEM=="memory" DRIVER=="" ATTR{online}=="1" ATTR{phys_device}=="0" ATTR{phys_index}=="005a" ATTR{removable}=="1" ATTR{state}=="online" ATTR{type}=="dimm" ATTR{valid_zones}=="Normal" RFC -> RFCv2: - Now also taking care of PPC (somehow missed it :/ ) - Split the series up to some degree (some ideas on how to split up patch 3 would be very welcome) - Introduce more memory block types. Turns out abstracting too much was rather confusing and not helpful. Properly document them. Notes: - I wanted to convert the enum of types into a named enum but this provoked all kinds of different errors. For now, I am doing it just like the other types (e.g. online_type) we are using in that context. - The "removable" property should never have been named like that. It should have been "offlinable". Can we still rename that? E.g. boot memory is sometimes marked as removable ... 
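To illustrate how user space could consume the new "type" attribute shown in the udevadm output above, here is a hypothetical policy helper. The type values "boot" and "dimm" come from the example output; the type-to-policy mapping itself is an assumption, not something this series mandates:

```shell
#!/bin/sh
# Hypothetical user-space policy helper: map a memory block's exposed
# "type" attribute to an onlining decision. The mapping below is an
# example policy only.
online_mode_for_type() {
	case "$1" in
	boot)
		# Boot memory is already online; nothing for user space to do.
		echo "skip" ;;
	dimm)
		# Hot(un)pluggable DIMM: prefer the MOVABLE zone so that
		# offlining it later remains reliable.
		echo "online_movable" ;;
	*)
		# Unknown type: fall back to the kernel's default zone choice.
		echo "online" ;;
	esac
}

# A udev rule or script could then do something along the lines of:
#   mode=$(online_mode_for_type "$(cat /sys/devices/system/memory/memoryN/type)")
#   [ "$mode" = skip ] || echo "$mode" > /sys/devices/system/memory/memoryN/state
online_mode_for_type dimm
```

Here "memoryN" is a placeholder; an actual rule would match the memory block device that triggered the uevent.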
David Hildenbrand (4): mm/memory_hotplug: Introduce memory block types mm/memory_hotplug: Replace "bool want_memblock" by "int type" mm/memory_hotplug: Introduce and use more memory types mm/memory_hotplug: Drop MEMORY_TYPE_UNSPECIFIED arch/ia64/mm/init.c | 4 +- arch/powerpc/mm/mem.c | 4 +- arch/powerpc/platforms/powernv/memtrace.c | 9 +-- .../platforms/pseries/hotplug-memory.c| 7 +- arch/s390/mm/init.c | 4 +- arch/sh/mm/init.c | 4 +- arch/x86/mm/init_32.c | 4 +- arch/x86/mm/init_64.c | 8 +-- drivers/acpi/acpi_memhotplug.c| 16 - drivers/base/memory.c | 60 ++-- drivers/hv/hv_balloon.c | 3 +- drivers/s390/char/sclp_cmd.c | 3 +- drivers/xen/balloon.c | 2 +- include/linux/memory.h| 69 ++- include/linux/memory_hotplug.h| 18 ++--- kernel/memremap.c | 6 +- mm/memory_hotplug.c | 29 17 files changed, 194 insertions(+), 56 deletions(-) -- 2.17.2 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
On 27.11.18 17:32, Michal Suchánek wrote: > On Mon, 26 Nov 2018 16:59:14 +0100 > David Hildenbrand wrote: > >> On 26.11.18 15:20, Michal Suchánek wrote: >>> On Mon, 26 Nov 2018 14:33:29 +0100 >>> David Hildenbrand wrote: >>> >>>> On 26.11.18 13:30, David Hildenbrand wrote: >>>>> On 23.11.18 19:06, Michal Suchánek wrote: >>> >>>>>> >>>>>> If we are going to fake the driver information we may as well add the >>>>>> type attribute and be done with it. >>>>>> >>>>>> I think the problem with the patch was more with the semantic than the >>>>>> attribute itself. >>>>>> >>>>>> What is normal, paravirtualized, and standby memory? >>>>>> >>>>>> I can understand DIMM device, baloon device, or whatever mechanism for >>>>>> adding memory you might have. >>>>>> >>>>>> I can understand "memory designated as standby by the cluster >>>>>> administrator". >>>>>> >>>>>> However, DIMM vs baloon is orthogonal to standby and should not be >>>>>> conflated into one property. >>>>>> >>>>>> paravirtualized means nothing at all in relationship to memory type and >>>>>> the desired online policy to me. >>>>> >>>>> Right, so with whatever we come up, it should allow to make a decision >>>>> in user space about >>>>> - if memory is to be onlined automatically >>>> >>>> And I will think about if we really should model standby memory. Maybe >>>> it is really better to have in user space something like (as Dan noted) >>> >>> If it is possible to designate the memory as standby or online in the >>> s390 admin interface and the kernel does have access to this >>> information it makes sense to forward it to userspace (as separate >>> s390-specific property). If not then you need to make some kind of >>> assumption like below and the user can tune the script according to >>> their usecase. >> >> Also true, standby memory really represents a distinct type of memory >> block (memory seems to be there but really isn't). 
Right now I am >> thinking about something like this (tried to formulate it on a very >> generic level because we can't predict which mechanism might want to >> make use of these types in the future). >> >> >> /* >> * Memory block types allow user space to formulate rules if and how to >> * online memory blocks. The types are exposed to user space as text >> * strings in sysfs. While the typical online strategies are described >> * along with the types, there are use cases where that can differ (e.g. >> * use MOVABLE zone for more reliable huge page usage, use NORMAL zone >> * due to zone imbalance or because memory unplug is not intended). >> * >> * MEMORY_BLOCK_NONE: >> * No memory block is to be created (e.g. device memory). Used internally >> * only. >> * >> * MEMORY_BLOCK_REMOVABLE: >> * This memory block type should be treated as if it can be >> * removed/unplugged from the system again. E.g. there is a hardware >> * interface to unplug such memory. This memory block type is usually >> * onlined to the MOVABLE zone, to e.g. make offlining of it more >> * reliable. Examples include ACPI and PPC DIMMs. >> * >> * MEMORY_BLOCK_UNREMOVABLE: >> * This memory block type should be treated as if it can not be >> * removed/unplugged again. E.g. there is no hardware interface to >> * unplug such memory. This memory block type is usually onlined to >> * the NORMAL zone, as offlining is not beneficial. Examples include boot >> * memory on most architectures and memory added via balloon devices. > > AFAIK baloon device can be inflated as well so this does not really > describe how this memory type works in any meaningful way. Also it > should not be possible to see this kind of memory from userspace. The > baloon driver just takes existing memory that is properly backed, > allocates it for itself, and allows the hypervisor to use it. Thus it > creates the equivalent to s390 standby memory which is not backed in > the VM. 
When memory is reclaimed from hypervisor the baloon driver > frees it making it available to the VM kernel again. However, the whole > time the memory appears present in the machine and no hotplug events > should be visible unl
Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
On 26.11.18 15:20, Michal Suchánek wrote: > On Mon, 26 Nov 2018 14:33:29 +0100 > David Hildenbrand wrote: > >> On 26.11.18 13:30, David Hildenbrand wrote: >>> On 23.11.18 19:06, Michal Suchánek wrote: > >>>> >>>> If we are going to fake the driver information we may as well add the >>>> type attribute and be done with it. >>>> >>>> I think the problem with the patch was more with the semantic than the >>>> attribute itself. >>>> >>>> What is normal, paravirtualized, and standby memory? >>>> >>>> I can understand DIMM device, baloon device, or whatever mechanism for >>>> adding memory you might have. >>>> >>>> I can understand "memory designated as standby by the cluster >>>> administrator". >>>> >>>> However, DIMM vs baloon is orthogonal to standby and should not be >>>> conflated into one property. >>>> >>>> paravirtualized means nothing at all in relationship to memory type and >>>> the desired online policy to me. >>> >>> Right, so with whatever we come up, it should allow to make a decision >>> in user space about >>> - if memory is to be onlined automatically >> >> And I will think about if we really should model standby memory. Maybe >> it is really better to have in user space something like (as Dan noted) > > If it is possible to designate the memory as standby or online in the > s390 admin interface and the kernel does have access to this > information it makes sense to forward it to userspace (as separate > s390-specific property). If not then you need to make some kind of > assumption like below and the user can tune the script according to > their usecase. Also true, standby memory really represents a distinct type of memory block (memory seems to be there but really isn't). Right now I am thinking about something like this (tried to formulate it on a very generic level because we can't predict which mechanism might want to make use of these types in the future). 
/* * Memory block types allow user space to formulate rules if and how to * online memory blocks. The types are exposed to user space as text * strings in sysfs. While the typical online strategies are described * along with the types, there are use cases where that can differ (e.g. * use MOVABLE zone for more reliable huge page usage, use NORMAL zone * due to zone imbalance or because memory unplug is not intended). * * MEMORY_BLOCK_NONE: * No memory block is to be created (e.g. device memory). Used internally * only. * * MEMORY_BLOCK_REMOVABLE: * This memory block type should be treated as if it can be * removed/unplugged from the system again. E.g. there is a hardware * interface to unplug such memory. This memory block type is usually * onlined to the MOVABLE zone, to e.g. make offlining of it more * reliable. Examples include ACPI and PPC DIMMs. * * MEMORY_BLOCK_UNREMOVABLE: * This memory block type should be treated as if it can not be * removed/unplugged again. E.g. there is no hardware interface to * unplug such memory. This memory block type is usually onlined to * the NORMAL zone, as offlining is not beneficial. Examples include boot * memory on most architectures and memory added via balloon devices. * * MEMORY_BLOCK_STANDBY: * The memory block type should be treated as if it can be * removed/unplugged again, however the actual memory hot(un)plug is * performed by onlining/offlining. In virtual environments, such memory * is usually added during boot and never removed. Onlining memory will * result in memory getting allocated to a VM. This memory type is usually * not onlined automatically but explicitly by the administrator. One * example is standby memory on s390x. */ > >> >> if (isS390x() && type == "dimm") { >> /* don't online, on s390x system DIMMs are standby memory */ >> } > > Thanks > > Michal > -- Thanks, David / dhildenb ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
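The "usual" onlining strategies described in the comment block above (REMOVABLE -> MOVABLE zone, UNREMOVABLE -> NORMAL zone, STANDBY -> leave offline until the admin asks) can be condensed into a small helper. This is only an illustration; the lower-case type strings are assumed sysfs spellings, not anything from the proposal itself:

```shell
#!/bin/sh
# Sketch of the per-type onlining strategy described in the proposed
# documentation. Type strings and the policy mapping are illustrative.
decide_online() {
	case "$1" in
	removable)
		echo "online_movable" ;;  # e.g. ACPI and PPC DIMMs
	unremovable)
		echo "online_kernel" ;;   # e.g. boot memory, balloon-added memory
	standby)
		echo "keep_offline" ;;    # e.g. s390x standby memory: admin decides
	*)
		echo "keep_offline" ;;    # be conservative about unknown types
	esac
}
decide_online removable
```

Note that this replaces the architecture check in the snippet above (isS390x() && type == "dimm") with a direct type check, which is exactly the point of exporting a distinct standby type.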
Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
On 26.11.18 13:30, David Hildenbrand wrote: > On 23.11.18 19:06, Michal Suchánek wrote: >> On Fri, 23 Nov 2018 12:13:58 +0100 >> David Hildenbrand wrote: >> >>> On 28.09.18 17:03, David Hildenbrand wrote: >>>> How to/when to online hotplugged memory is hard to manage for >>>> distributions because different memory types are to be treated differently. >>>> Right now, we need complicated udev rules that e.g. check if we are >>>> running on s390x, on a physical system or on a virtualized system. But >>>> there is also sometimes the demand to really online memory immediately >>>> while adding in the kernel and not to wait for user space to make a >>>> decision. And on virtualized systems there might be different >>>> requirements, depending on "how" the memory was added (and if it will >>>> eventually get unplugged again - DIMM vs. paravirtualized mechanisms). >>>> >>>> On the one hand, we have physical systems where we sometimes >>>> want to be able to unplug memory again - e.g. a DIMM - so we have to online >>>> it to the MOVABLE zone optionally. That decision is usually made in user >>>> space. >>>> >>>> On the other hand, we have memory that should never be onlined >>>> automatically, only when asked for by an administrator. Such memory only >>>> applies to virtualized environments like s390x, where the concept of >>>> "standby" memory exists. Memory is detected and added during boot, so it >>>> can be onlined when requested by the admininistrator or some tooling. >>>> Only when onlining, memory will be allocated in the hypervisor. >>>> >>>> But then, we also have paravirtualized devices (namely xen and hyper-v >>>> balloons), that hotplug memory that will never ever be removed from a >>>> system right now using offline_pages/remove_memory. If at all, this memory >>>> is logically unplugged and handed back to the hypervisor via ballooning. 
>>>> >>>> For paravirtualized devices it is relevant that memory is onlined as >>>> quickly as possible after adding - and that it is added to the NORMAL >>>> zone. Otherwise, it could happen that too much memory in a row is added >>>> (but not onlined), resulting in out-of-memory conditions due to the >>>> additional memory for "struct pages" and friends. MOVABLE zone as well >>>> as delays might be very problematic and lead to crashes (e.g. zone >>>> imbalance). >>>> >>>> Therefore, introduce memory block types and online memory depending on >>>> it when adding the memory. Expose the memory type to user space, so user >>>> space handlers can start to process only "normal" memory. Other memory >>>> block types can be ignored. One thing less to worry about in user space. >>>> >>> >>> So I was looking into alternatives. >>> >>> 1. Provide only "normal" and "standby" memory types to user space. This >>> way user space can make smarter decisions about how to online memory. >>> Not really sure if this is the right way to go. >>> >>> >>> 2. Use device driver information (as mentioned by Michal S.). >>> >>> The problem right now is that there are no drivers for memory block >>> devices. The "memory" subsystem has no drivers, so the KOBJ_ADD uevent >>> will not contain a "DRIVER" information and we ave no idea what kind of >>> memory block device we hold in our hands. >>> >>> $ udevadm info -q all -a /sys/devices/system/memory/memory0 >>> >>> looking at device '/devices/system/memory/memory0': >>> KERNEL=="memory0" >>> SUBSYSTEM=="memory" >>> DRIVER=="" >>> ATTR{online}=="1" >>> ATTR{phys_device}=="0" >>> ATTR{phys_index}=="" >>> ATTR{removable}=="0" >>> ATTR{state}=="online" >>> ATTR{valid_zones}=="none" >>> >>> >>> If we would provide "fake" drivers for the memory block devices we want >>> to treat in a special way in user space (e.g. standby memory on s390x), >>> user space could use that information to make smarter decisions. >>> >>> Adding such drivers might work. 
My suggestion would be to let ordinary >>> DIMMs be without a driver for n
Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
On 23.11.18 19:06, Michal Suchánek wrote: > On Fri, 23 Nov 2018 12:13:58 +0100 > David Hildenbrand wrote: > >> On 28.09.18 17:03, David Hildenbrand wrote: >>> How to/when to online hotplugged memory is hard to manage for >>> distributions because different memory types are to be treated differently. >>> Right now, we need complicated udev rules that e.g. check if we are >>> running on s390x, on a physical system or on a virtualized system. But >>> there is also sometimes the demand to really online memory immediately >>> while adding in the kernel and not to wait for user space to make a >>> decision. And on virtualized systems there might be different >>> requirements, depending on "how" the memory was added (and if it will >>> eventually get unplugged again - DIMM vs. paravirtualized mechanisms). >>> >>> On the one hand, we have physical systems where we sometimes >>> want to be able to unplug memory again - e.g. a DIMM - so we have to online >>> it to the MOVABLE zone optionally. That decision is usually made in user >>> space. >>> >>> On the other hand, we have memory that should never be onlined >>> automatically, only when asked for by an administrator. Such memory only >>> applies to virtualized environments like s390x, where the concept of >>> "standby" memory exists. Memory is detected and added during boot, so it >>> can be onlined when requested by the admininistrator or some tooling. >>> Only when onlining, memory will be allocated in the hypervisor. >>> >>> But then, we also have paravirtualized devices (namely xen and hyper-v >>> balloons), that hotplug memory that will never ever be removed from a >>> system right now using offline_pages/remove_memory. If at all, this memory >>> is logically unplugged and handed back to the hypervisor via ballooning. >>> >>> For paravirtualized devices it is relevant that memory is onlined as >>> quickly as possible after adding - and that it is added to the NORMAL >>> zone. 
Otherwise, it could happen that too much memory in a row is added >>> (but not onlined), resulting in out-of-memory conditions due to the >>> additional memory for "struct pages" and friends. MOVABLE zone as well >>> as delays might be very problematic and lead to crashes (e.g. zone >>> imbalance). >>> >>> Therefore, introduce memory block types and online memory depending on >>> it when adding the memory. Expose the memory type to user space, so user >>> space handlers can start to process only "normal" memory. Other memory >>> block types can be ignored. One thing less to worry about in user space. >>> >> >> So I was looking into alternatives. >> >> 1. Provide only "normal" and "standby" memory types to user space. This >> way user space can make smarter decisions about how to online memory. >> Not really sure if this is the right way to go. >> >> >> 2. Use device driver information (as mentioned by Michal S.). >> >> The problem right now is that there are no drivers for memory block >> devices. The "memory" subsystem has no drivers, so the KOBJ_ADD uevent >> will not contain a "DRIVER" information and we ave no idea what kind of >> memory block device we hold in our hands. >> >> $ udevadm info -q all -a /sys/devices/system/memory/memory0 >> >> looking at device '/devices/system/memory/memory0': >> KERNEL=="memory0" >> SUBSYSTEM=="memory" >> DRIVER=="" >> ATTR{online}=="1" >> ATTR{phys_device}=="0" >> ATTR{phys_index}=="" >> ATTR{removable}=="0" >> ATTR{state}=="online" >> ATTR{valid_zones}=="none" >> >> >> If we would provide "fake" drivers for the memory block devices we want >> to treat in a special way in user space (e.g. standby memory on s390x), >> user space could use that information to make smarter decisions. >> >> Adding such drivers might work. My suggestion would be to let ordinary >> DIMMs be without a driver for now and only special case standby memory >> and eventually paravirtualized memory devices (XEN and Hyper-V). >> >> Any thoughts? 
> > If we are going to fake the driver information we may as well add the > type attribute and be done with it. > > I think the problem with the patch was more with the semantic than the > att
Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
On 28.09.18 17:03, David Hildenbrand wrote: > How to/when to online hotplugged memory is hard to manage for > distributions because different memory types are to be treated differently. > Right now, we need complicated udev rules that e.g. check if we are > running on s390x, on a physical system or on a virtualized system. But > there is also sometimes the demand to really online memory immediately > while adding in the kernel and not to wait for user space to make a > decision. And on virtualized systems there might be different > requirements, depending on "how" the memory was added (and if it will > eventually get unplugged again - DIMM vs. paravirtualized mechanisms). > > On the one hand, we have physical systems where we sometimes > want to be able to unplug memory again - e.g. a DIMM - so we have to online > it to the MOVABLE zone optionally. That decision is usually made in user > space. > > On the other hand, we have memory that should never be onlined > automatically, only when asked for by an administrator. Such memory only > applies to virtualized environments like s390x, where the concept of > "standby" memory exists. Memory is detected and added during boot, so it > can be onlined when requested by the administrator or some tooling. > Only when onlining, memory will be allocated in the hypervisor. > > But then, we also have paravirtualized devices (namely xen and hyper-v > balloons), that hotplug memory that will never ever be removed from a > system right now using offline_pages/remove_memory. If at all, this memory > is logically unplugged and handed back to the hypervisor via ballooning. > > For paravirtualized devices it is relevant that memory is onlined as > quickly as possible after adding - and that it is added to the NORMAL > zone. Otherwise, it could happen that too much memory in a row is added > (but not onlined), resulting in out-of-memory conditions due to the > additional memory for "struct pages" and friends. 
MOVABLE zone as well > as delays might be very problematic and lead to crashes (e.g. zone > imbalance). > > Therefore, introduce memory block types and online memory depending on > it when adding the memory. Expose the memory type to user space, so user > space handlers can start to process only "normal" memory. Other memory > block types can be ignored. One thing less to worry about in user space. > So I was looking into alternatives. 1. Provide only "normal" and "standby" memory types to user space. This way user space can make smarter decisions about how to online memory. Not really sure if this is the right way to go. 2. Use device driver information (as mentioned by Michal S.). The problem right now is that there are no drivers for memory block devices. The "memory" subsystem has no drivers, so the KOBJ_ADD uevent will not contain "DRIVER" information and we have no idea what kind of memory block device we hold in our hands. $ udevadm info -q all -a /sys/devices/system/memory/memory0 looking at device '/devices/system/memory/memory0': KERNEL=="memory0" SUBSYSTEM=="memory" DRIVER=="" ATTR{online}=="1" ATTR{phys_device}=="0" ATTR{phys_index}=="" ATTR{removable}=="0" ATTR{state}=="online" ATTR{valid_zones}=="none" If we would provide "fake" drivers for the memory block devices we want to treat in a special way in user space (e.g. standby memory on s390x), user space could use that information to make smarter decisions. Adding such drivers might work. My suggestion would be to let ordinary DIMMs be without a driver for now and only special case standby memory and eventually paravirtualized memory devices (XEN and Hyper-V). Any thoughts? -- Thanks, David / dhildenb ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
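The proposed user-space side — handlers that auto-online "normal" memory and ignore other block types — could be as small as the following sketch. The type strings are hypothetical and mirror the proposal in this thread, not any existing sysfs ABI:

```c
#include <stdbool.h>
#include <string.h>

/*
 * Sketch of a user-space hotplug handler policy: only "normal" memory
 * (e.g. a DIMM) is auto-onlined by udev; "standby" memory (s390x) waits
 * for the administrator, and paravirtualized memory is onlined by the
 * kernel itself. The type names are hypothetical.
 */
bool should_auto_online(const char *type)
{
	if (strcmp(type, "normal") == 0)
		return true;	/* udev picks a zone and onlines it */
	/* "standby", "paravirt", ...: user space leaves these alone */
	return false;
}
```

The point of exposing a type (or a fake driver name) is exactly that this decision no longer needs arch- and hypervisor-specific udev rule contortions.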
[PATCH v2] makedumpfile: exclude pages that are logically offline
Linux marks pages that are logically offline via a page flag (map count). Such pages e.g. include pages inflated as part of a balloon driver or pages that were not actually onlined when onlining the whole section. While the hypervisor usually allows reading such inflated memory, we basically read and dump data that is completely irrelevant. Also, this might result in quite some overhead in the hypervisor. In addition, we saw some problems under Hyper-V, whereby we can crash the kernel by dumping, when reading memory of a partially onlined memory segment (for memory added by the Hyper-V balloon driver). Therefore, don't read and dump pages that are marked as being logically offline. Signed-off-by: David Hildenbrand --- v1 -> v2: - Fix PAGE_BUDDY_MAPCOUNT_VALUE vs. PAGE_OFFLINE_MAPCOUNT_VALUE makedumpfile.c | 34 ++ makedumpfile.h | 1 + 2 files changed, 31 insertions(+), 4 deletions(-) diff --git a/makedumpfile.c b/makedumpfile.c index 8923538..a5f2ea9 100644 --- a/makedumpfile.c +++ b/makedumpfile.c @@ -88,6 +88,7 @@ mdf_pfn_t pfn_cache_private; mdf_pfn_t pfn_user; mdf_pfn_t pfn_free; mdf_pfn_t pfn_hwpoison; +mdf_pfn_t pfn_offline; mdf_pfn_t num_dumped; @@ -249,6 +250,21 @@ isHugetlb(unsigned long dtor) && (SYMBOL(free_huge_page) == dtor)); } +static int +isOffline(unsigned long flags, unsigned int _mapcount) +{ + if (NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE) == NOT_FOUND_NUMBER) + return FALSE; + + if (flags & (1UL << NUMBER(PG_slab))) + return FALSE; + + if (_mapcount == (int)NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE)) + return TRUE; + + return FALSE; +} + static int is_cache_page(unsigned long flags) { @@ -2287,6 +2303,8 @@ write_vmcoreinfo_data(void) WRITE_NUMBER("PG_hwpoison", PG_hwpoison); WRITE_NUMBER("PAGE_BUDDY_MAPCOUNT_VALUE", PAGE_BUDDY_MAPCOUNT_VALUE); + WRITE_NUMBER("PAGE_OFFLINE_MAPCOUNT_VALUE", +PAGE_OFFLINE_MAPCOUNT_VALUE); WRITE_NUMBER("phys_base", phys_base); WRITE_NUMBER("HUGETLB_PAGE_DTOR", HUGETLB_PAGE_DTOR); @@ -2687,6 +2705,7 @@ read_vmcoreinfo(void) 
READ_SRCFILE("pud_t", pud_t); READ_NUMBER("PAGE_BUDDY_MAPCOUNT_VALUE", PAGE_BUDDY_MAPCOUNT_VALUE); + READ_NUMBER("PAGE_OFFLINE_MAPCOUNT_VALUE", PAGE_OFFLINE_MAPCOUNT_VALUE); READ_NUMBER("phys_base", phys_base); #ifdef __aarch64__ READ_NUMBER("VA_BITS", VA_BITS); @@ -6041,6 +6060,12 @@ __exclude_unnecessary_pages(unsigned long mem_map, else if (isHWPOISON(flags)) { pfn_counter = _hwpoison; } + /* +* Exclude pages that are logically offline. +*/ + else if (isOffline(flags, _mapcount)) { + pfn_counter = _offline; + } /* * Unexcludable page */ @@ -7522,7 +7547,7 @@ write_elf_pages_cyclic(struct cache_data *cd_header, struct cache_data *cd_page) */ if (info->flag_cyclic) { pfn_zero = pfn_cache = pfn_cache_private = 0; - pfn_user = pfn_free = pfn_hwpoison = 0; + pfn_user = pfn_free = pfn_hwpoison = pfn_offline = 0; pfn_memhole = info->max_mapnr; } @@ -8804,7 +8829,7 @@ write_kdump_pages_and_bitmap_cyclic(struct cache_data *cd_header, struct cache_d * Reset counter for debug message. */ pfn_zero = pfn_cache = pfn_cache_private = 0; - pfn_user = pfn_free = pfn_hwpoison = 0; + pfn_user = pfn_free = pfn_hwpoison = pfn_offline = 0; pfn_memhole = info->max_mapnr; /* @@ -9749,7 +9774,7 @@ print_report(void) pfn_original = info->max_mapnr - pfn_memhole; pfn_excluded = pfn_zero + pfn_cache + pfn_cache_private - + pfn_user + pfn_free + pfn_hwpoison; + + pfn_user + pfn_free + pfn_hwpoison + pfn_offline; shrinking = (pfn_original - pfn_excluded) * 100; shrinking = shrinking / pfn_original; @@ -9763,6 +9788,7 @@ print_report(void) REPORT_MSG("User process data pages : 0x%016llx\n", pfn_user); REPORT_MSG("Free pages : 0x%016llx\n", pfn_free); REPORT_MSG("Hwpoison pages : 0x%016llx\n", pfn_hwpoison); + REPORT_MSG("Offline pages : 0x%016llx\n", pfn_offline); REPORT_MSG(" Remaining pages : 0x%016llx\n", pfn_original - pfn_excluded); REPORT_MSG(" (The number of pages is reduced to %lld%%.)\n", @@ -9790,7 +9816,7 @@ print_mem_usage(void) pfn_original = info->max_map
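The new isOffline() check above can be read as a stand-alone predicate: a page is excluded when its raw _mapcount equals the offline marker value, unless the page is a slab page (slab overloads _mapcount for its own bookkeeping). The constants below are placeholders — makedumpfile reads the real PG_slab and PAGE_OFFLINE_MAPCOUNT_VALUE numbers from VMCOREINFO at run time and bails out if the latter was never exported:

```c
#include <stdbool.h>

/* Placeholder values; the real ones come from the crashed kernel's
 * VMCOREINFO, so treat these two defines as assumptions. */
#define FAKE_PG_SLAB			9
#define FAKE_PAGE_OFFLINE_MAPCOUNT	(-257)

/* Stand-alone model of makedumpfile's isOffline(). */
bool is_offline(unsigned long flags, int mapcount)
{
	/* slab pages reuse _mapcount, so the marker test is meaningless */
	if (flags & (1UL << FAKE_PG_SLAB))
		return false;
	return mapcount == FAKE_PAGE_OFFLINE_MAPCOUNT;
}
```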
[PATCH v2 6/8] vmw_balloon: mark inflated pages PG_offline
Mark inflated and never onlined pages PG_offline, to tell the world that the content is stale and should not be dumped. Cc: Xavier Deguillard Cc: Nadav Amit Cc: Arnd Bergmann Cc: Greg Kroah-Hartman Cc: Julien Freche Cc: Andrew Morton Cc: Matthew Wilcox Cc: Michal Hocko Cc: "Michael S. Tsirkin" Acked-by: Nadav Amit Signed-off-by: David Hildenbrand --- drivers/misc/vmw_balloon.c | 32 1 file changed, 32 insertions(+) diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c index e6126a4b95d3..877611b5659b 100644 --- a/drivers/misc/vmw_balloon.c +++ b/drivers/misc/vmw_balloon.c @@ -556,6 +556,36 @@ vmballoon_page_in_frames(enum vmballoon_page_size_type page_size) return 1 << vmballoon_page_order(page_size); } +/** + * vmballoon_mark_page_offline() - mark a page as offline + * @page: pointer for the page. + * @page_size: the size of the page. + */ +static void +vmballoon_mark_page_offline(struct page *page, + enum vmballoon_page_size_type page_size) +{ + int i; + + for (i = 0; i < vmballoon_page_in_frames(page_size); i++) + __SetPageOffline(page + i); +} + +/** + * vmballoon_mark_page_online() - mark a page as online + * @page: pointer for the page. + * @page_size: the size of the page. + */ +static void +vmballoon_mark_page_online(struct page *page, + enum vmballoon_page_size_type page_size) +{ + int i; + + for (i = 0; i < vmballoon_page_in_frames(page_size); i++) + __ClearPageOffline(page + i); +} + /** * vmballoon_send_get_target() - Retrieve desired balloon size from the host. * @@ -612,6 +642,7 @@ static int vmballoon_alloc_page_list(struct vmballoon *b, ctl->page_size); if (page) { + vmballoon_mark_page_offline(page, ctl->page_size); /* Success. Add the page to the list and continue. 
*/ list_add(>lru, >pages); continue; @@ -850,6 +881,7 @@ static void vmballoon_release_page_list(struct list_head *page_list, list_for_each_entry_safe(page, tmp, page_list, lru) { list_del(>lru); + vmballoon_mark_page_online(page, page_size); __free_pages(page, vmballoon_page_order(page_size)); } -- 2.17.2 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH v2 7/8] PM / Hibernate: use pfn_to_online_page()
Let's use pfn_to_online_page() instead of pfn_to_page() when checking for saveable pages to not save/restore offline memory sections. Cc: "Rafael J. Wysocki" Cc: Pavel Machek Cc: Len Brown Cc: Andrew Morton Cc: Matthew Wilcox Cc: Michal Hocko Cc: "Michael S. Tsirkin" Suggested-by: Michal Hocko Acked-by: Michal Hocko Acked-by: Pavel Machek Acked-by: Rafael J. Wysocki Signed-off-by: David Hildenbrand --- kernel/power/snapshot.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c index 640b2034edd6..87e6dd57819f 100644 --- a/kernel/power/snapshot.c +++ b/kernel/power/snapshot.c @@ -1215,8 +1215,8 @@ static struct page *saveable_highmem_page(struct zone *zone, unsigned long pfn) if (!pfn_valid(pfn)) return NULL; - page = pfn_to_page(pfn); - if (page_zone(page) != zone) + page = pfn_to_online_page(pfn); + if (!page || page_zone(page) != zone) return NULL; BUG_ON(!PageHighMem(page)); @@ -1277,8 +1277,8 @@ static struct page *saveable_page(struct zone *zone, unsigned long pfn) if (!pfn_valid(pfn)) return NULL; - page = pfn_to_page(pfn); - if (page_zone(page) != zone) + page = pfn_to_online_page(pfn); + if (!page || page_zone(page) != zone) return NULL; BUG_ON(PageHighMem(page)); -- 2.17.2 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
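The essence of the change: pfn_to_online_page() returns NULL where pfn_to_page() would hand back a possibly uninitialized memmap entry, so the NULL check must come before page_zone(). A stubbed user-space sketch of that flow — the linear online_limit cutoff and the one-field struct page are inventions for this example:

```c
#include <stddef.h>

struct page { int zone; };	/* toy stand-in for the kernel's struct page */

/* Stub for pfn_to_online_page(): unlike pfn_to_page(), it returns NULL
 * for pfns whose memmap is not online. The cutoff is invented here. */
struct page *stub_pfn_to_online_page(struct page *memmap, unsigned long pfn,
				     unsigned long online_limit)
{
	return pfn < online_limit ? &memmap[pfn] : NULL;
}

/* Mirrors the fixed saveable_page() flow: NULL check before the zone test. */
int pfn_is_saveable(struct page *memmap, unsigned long pfn,
		    unsigned long online_limit, int zone)
{
	struct page *page = stub_pfn_to_online_page(memmap, pfn, online_limit);

	if (!page || page->zone != zone)
		return 0;
	return 1;
}
```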
[PATCH v2 4/8] xen/balloon: mark inflated pages PG_offline
Mark inflated and never onlined pages PG_offline, to tell the world that the content is stale and should not be dumped. Cc: Boris Ostrovsky Cc: Juergen Gross Cc: Stefano Stabellini Cc: Andrew Morton Cc: Matthew Wilcox Cc: Michal Hocko Cc: "Michael S. Tsirkin" Signed-off-by: David Hildenbrand --- drivers/xen/balloon.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c index 12148289debd..14dd6b814db3 100644 --- a/drivers/xen/balloon.c +++ b/drivers/xen/balloon.c @@ -425,6 +425,7 @@ static int xen_bring_pgs_online(struct page *pg, unsigned int order) for (i = 0; i < size; i++) { p = pfn_to_page(start_pfn + i); __online_page_set_limits(p); + __SetPageOffline(p); __balloon_append(p); } mutex_unlock(_mutex); @@ -493,6 +494,7 @@ static enum bp_state increase_reservation(unsigned long nr_pages) xenmem_reservation_va_mapping_update(1, , _list[i]); /* Relinquish the page back to the allocator. */ + __ClearPageOffline(page); free_reserved_page(page); } @@ -519,6 +521,7 @@ static enum bp_state decrease_reservation(unsigned long nr_pages, gfp_t gfp) state = BP_EAGAIN; break; } + __SetPageOffline(page); adjust_managed_page_count(page, -1); xenmem_reservation_scrub_page(page); list_add(>lru, ); -- 2.17.2 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH v2 8/8] PM / Hibernate: exclude all PageOffline() pages
The content of pages that are marked PG_offline is not of interest (e.g. inflated by a balloon driver), let's skip these pages. In saveable_highmem_page(), move the PageReserved() check to a new check along with the PageOffline() check to separate it from the swsusp checks. Cc: "Rafael J. Wysocki" Cc: Pavel Machek Cc: Len Brown Cc: Andrew Morton Cc: Matthew Wilcox Cc: Michal Hocko Cc: "Michael S. Tsirkin" Acked-by: Pavel Machek Acked-by: Rafael J. Wysocki Signed-off-by: David Hildenbrand --- kernel/power/snapshot.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c index 87e6dd57819f..4802b039b89f 100644 --- a/kernel/power/snapshot.c +++ b/kernel/power/snapshot.c @@ -1221,8 +1221,10 @@ static struct page *saveable_highmem_page(struct zone *zone, unsigned long pfn) BUG_ON(!PageHighMem(page)); - if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page) || - PageReserved(page)) + if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page)) + return NULL; + + if (PageReserved(page) || PageOffline(page)) return NULL; if (page_is_guard(page)) @@ -1286,6 +1288,9 @@ static struct page *saveable_page(struct zone *zone, unsigned long pfn) if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page)) return NULL; + if (PageOffline(page)) + return NULL; + if (PageReserved(page) && (!kernel_page_present(page) || pfn_is_nosave(pfn))) return NULL; -- 2.17.2 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH v2 5/8] hv_balloon: mark inflated pages PG_offline
Mark inflated and never onlined pages PG_offline, to tell the world that the content is stale and should not be dumped. Cc: "K. Y. Srinivasan" Cc: Haiyang Zhang Cc: Stephen Hemminger Cc: Kairui Song Cc: Vitaly Kuznetsov Cc: Andrew Morton Cc: Matthew Wilcox Cc: Michal Hocko Cc: "Michael S. Tsirkin" Acked-by: Pankaj gupta Signed-off-by: David Hildenbrand --- drivers/hv/hv_balloon.c | 14 -- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c index 211f3fe3a038..47719862e57f 100644 --- a/drivers/hv/hv_balloon.c +++ b/drivers/hv/hv_balloon.c @@ -681,8 +681,13 @@ static struct notifier_block hv_memory_nb = { /* Check if the particular page is backed and can be onlined and online it. */ static void hv_page_online_one(struct hv_hotadd_state *has, struct page *pg) { - if (!has_pfn_is_backed(has, page_to_pfn(pg))) + if (!has_pfn_is_backed(has, page_to_pfn(pg))) { + if (!PageOffline(pg)) + __SetPageOffline(pg); return; + } + if (PageOffline(pg)) + __ClearPageOffline(pg); /* This frame is currently backed; online the page. 
*/ __online_page_set_limits(pg); @@ -1201,6 +1206,7 @@ static void free_balloon_pages(struct hv_dynmem_device *dm, for (i = 0; i < num_pages; i++) { pg = pfn_to_page(i + start_frame); + __ClearPageOffline(pg); __free_page(pg); dm->num_pages_ballooned--; } @@ -1213,7 +1219,7 @@ static unsigned int alloc_balloon_pages(struct hv_dynmem_device *dm, struct dm_balloon_response *bl_resp, int alloc_unit) { - unsigned int i = 0; + unsigned int i, j; struct page *pg; if (num_pages < alloc_unit) @@ -1245,6 +1251,10 @@ static unsigned int alloc_balloon_pages(struct hv_dynmem_device *dm, if (alloc_unit != 1) split_page(pg, get_order(alloc_unit << PAGE_SHIFT)); + /* mark all pages offline */ + for (j = 0; j < (1 << get_order(alloc_unit << PAGE_SHIFT)); j++) + __SetPageOffline(pg + j); + bl_resp->range_count++; bl_resp->range_array[i].finfo.start_page = page_to_pfn(pg); -- 2.17.2 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
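The new marking loop in alloc_balloon_pages() runs for `1 << get_order(alloc_unit << PAGE_SHIFT)` frames. A user-space re-implementation of the kernel's get_order() — the smallest order such that 2^order pages cover the given byte size — shows what that bound evaluates to; 4 KiB pages (PAGE_SHIFT = 12) are assumed:

```c
#define PAGE_SHIFT 12	/* assuming 4 KiB pages, as on x86 */

/* User-space model of the kernel's get_order(): round a byte size up
 * to the next power-of-two number of pages and return that exponent. */
int get_order(unsigned long size)
{
	int order = 0;
	unsigned long pages = (size - 1) >> PAGE_SHIFT;

	while (pages) {
		order++;
		pages >>= 1;
	}
	return order;
}
```

So for a 2 MiB alloc_unit the loop marks 512 frames offline, matching one split_page() result.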
[PATCH v2 3/8] kexec: export PG_offline to VMCOREINFO
Right now, pages inflated as part of a balloon driver will be dumped by dump tools like makedumpfile. While XEN is able to check in the crash kernel whether a certain pfn is actually backed by memory in the hypervisor (see xen_oldmem_pfn_is_ram) and optimize this case, dumps of other balloon inflated memory will essentially result in zero pages getting allocated by the hypervisor and the dump getting filled with this data. The allocation and reading of zero pages can directly be avoided if a dumping tool could know which pages only contain stale information not to be dumped. We now have PG_offline which can be (and already is by virtio-balloon) used for marking pages as logically offline. Follow up patches will make use of this flag also in other balloon implementations. Let's export PG_offline via PAGE_OFFLINE_MAPCOUNT_VALUE, so makedumpfile can directly skip pages that are logically offline and the content therefore stale. (we export it as a macro to match how it is done for PG_buddy. This way it is clearer that this is not actually a flag but only a very specific mapcount value to represent page types). Please note that this is also helpful for a problem we were seeing under Hyper-V: Dumping logically offline memory (pages kept fake offline while onlining a section via online_page_callback) would under some conditions result in a kernel panic when dumping them. Cc: Andrew Morton Cc: Dave Young Cc: "Kirill A. Shutemov" Cc: Baoquan He Cc: Omar Sandoval Cc: Arnd Bergmann Cc: Matthew Wilcox Cc: Michal Hocko Cc: "Michael S. Tsirkin" Cc: Lianbo Jiang Cc: Borislav Petkov Cc: Kazuhito Hagio Acked-by: Michael S. 
Tsirkin Acked-by: Dave Young Signed-off-by: David Hildenbrand --- kernel/crash_core.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/crash_core.c b/kernel/crash_core.c index 933cb3e45b98..093c9f917ed0 100644 --- a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ -464,6 +464,8 @@ static int __init crash_save_vmcoreinfo_init(void) VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE); #ifdef CONFIG_HUGETLB_PAGE VMCOREINFO_NUMBER(HUGETLB_PAGE_DTOR); +#define PAGE_OFFLINE_MAPCOUNT_VALUE (~PG_offline) + VMCOREINFO_NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE); #endif arch_crash_save_vmcoreinfo(); -- 2.17.2
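To make the ~PG_offline arithmetic concrete: PG_offline is a page-type bit stored inverted in page->_mapcount, which is -1 (all bits set) for a page with no mappings; clearing the inverted type bit yields a distinctive negative value that a dump tool can compare against. The model below uses PG_offline = 0x100, the value from the era of this series — treat the numbers as illustrative, since the kernel encoding can change:

```c
#include <stdint.h>

#define TOY_PG_OFFLINE 0x00000100u	/* page-type bit; value assumed */

/* A page with no mappings has _mapcount == -1 (all bits set);
 * marking it offline clears the inverted type bit. */
int32_t toy_set_page_offline(int32_t mapcount)
{
	return (int32_t)((uint32_t)mapcount & ~TOY_PG_OFFLINE);
}

/* The value exported to VMCOREINFO is ~PG_offline, i.e. -257 here. */
int toy_page_is_offline(int32_t mapcount)
{
	return mapcount == (int32_t)~TOY_PG_OFFLINE;
}
```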
[PATCH v2 1/8] mm: balloon: update comment about isolation/migration/compaction
Commit b1123ea6d3b3 ("mm: balloon: use general non-lru movable page feature") reworked balloon handling to make use of the general non-lru movable page feature. The big comment block in balloon_compaction.h contains quite some outdated information. Let's fix this. Cc: Andrew Morton Cc: Matthew Wilcox Cc: Michal Hocko Cc: "Michael S. Tsirkin" Acked-by: Michael S. Tsirkin Signed-off-by: David Hildenbrand --- include/linux/balloon_compaction.h | 26 +- 1 file changed, 9 insertions(+), 17 deletions(-) diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h index 53051f3d8f25..cbe50da5a59d 100644 --- a/include/linux/balloon_compaction.h +++ b/include/linux/balloon_compaction.h @@ -4,15 +4,18 @@ * * Common interface definitions for making balloon pages movable by compaction. * - * Despite being perfectly possible to perform ballooned pages migration, they - * make a special corner case to compaction scans because balloon pages are not - * enlisted at any LRU list like the other pages we do compact / migrate. + * Balloon page migration makes use of the general non-lru movable page + * feature. + * + * page->private is used to reference the responsible balloon device. + * page->mapping is used in context of non-lru page migration to reference + * the address space operations for page isolation/migration/compaction. * * As the page isolation scanning step a compaction thread does is a lockless * procedure (from a page standpoint), it might bring some racy situations while * performing balloon page compaction. In order to sort out these racy scenarios * and safely perform balloon's page compaction and migration we must, always, - * ensure following these three simple rules: + * ensure following these simple rules: * * i. when updating a balloon's page ->mapping element, strictly do it under * the following lock order, independently of the far superior @@ -21,19 +24,8 @@ * +--spin_lock_irq(_dev_info->pages_lock); * ... 
page->mapping updates here ... * - * ii. before isolating or dequeueing a balloon page from the balloon device - * pages list, the page reference counter must be raised by one and the - * extra refcount must be dropped when the page is enqueued back into - * the balloon device page list, thus a balloon page keeps its reference - * counter raised only while it is under our special handling; - * - * iii. after the lockless scan step have selected a potential balloon page for - * isolation, re-test the PageBalloon mark and the PagePrivate flag - * under the proper page lock, to ensure isolating a valid balloon page - * (not yet isolated, nor under release procedure) - * - * iv. isolation or dequeueing procedure must clear PagePrivate flag under - * page lock together with removing page from balloon device page list. + * ii. isolation or dequeueing procedure must remove the page from balloon + * device page list under b_dev_info->pages_lock. * * The functions provided by this interface are placed to help on coping with * the aforementioned balloon page corner case, as well as to ensure the simple -- 2.17.2 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH v2 0/8] mm/kdump: allow to exclude pages that are logically offline
Right now, pages inflated as part of a balloon driver will be dumped by dump tools like makedumpfile. While XEN is able to check in the crash kernel whether a certain pfn is actually backed by memory in the hypervisor (see xen_oldmem_pfn_is_ram) and optimize this case, dumps of virtio-balloon, hv-balloon and VMWare balloon inflated memory will essentially result in zero pages getting allocated by the hypervisor and the dump getting filled with this data. The allocation and reading of zero pages can directly be avoided if a dumping tool could know which pages only contain stale information not to be dumped. Also for XEN, calling into the kernel and asking the hypervisor if a pfn is backed can be avoided if the dumping tool would skip such pages right from the beginning. Dumping tools have no idea whether a given page is part of a balloon driver and shall not be dumped. Esp. PG_reserved cannot be used for that purpose as all memory allocated during early boot is also PG_reserved, see discussion at [1]. So some other way of indication is required and a new page flag is frowned upon. We have PG_balloon (MAPCOUNT value), which is essentially unused now. I suggest renaming it to something more generic (PG_offline) to mark pages as logically offline. This flag can then e.g. also be used by virtio-mem in the future to mark subsections as offline. Or by other code that wants to put pages logically offline (e.g. later maybe poisoned pages that shall no longer be used). This series converts PG_balloon to PG_offline, allows dumping tools to query the value to detect such pages and marks pages in the hv-balloon and XEN balloon properly as PG_offline. Note that virtio-balloon already set pages to PG_balloon (and now PG_offline). 
Please note that this is also helpful for a problem we were seeing under Hyper-V: Dumping logically offline memory (pages kept fake offline while onlining a section via online_page_callback) would under some conditions result in a kernel panic when dumping them. As I don't have access to XEN, Hyper-V, or VMWare installations, this was only tested with the virtio-balloon and pages were properly skipped when dumping. I'll also attach the makedumpfile patch to this series. [1] https://lkml.org/lkml/2018/7/20/566 v1 -> v2: - "kexec: export PG_offline to VMCOREINFO" -- Add description why it is exported as a macro - "vmw_balloon: mark inflated pages PG_offline" -- Use helper function + adapt comments - "PM / Hibernate: exclude all PageOffline() pages" -- Perform the check separate from swsusp checks. - Added RBs/ACKs David Hildenbrand (8): mm: balloon: update comment about isolation/migration/compaction mm: convert PG_balloon to PG_offline kexec: export PG_offline to VMCOREINFO xen/balloon: mark inflated pages PG_offline hv_balloon: mark inflated pages PG_offline vmw_balloon: mark inflated pages PG_offline PM / Hibernate: use pfn_to_online_page() PM / Hibernate: exclude all PageOffline() pages Documentation/admin-guide/mm/pagemap.rst | 9 --- drivers/hv/hv_balloon.c | 14 -- drivers/misc/vmw_balloon.c | 32 ++ drivers/xen/balloon.c| 3 +++ fs/proc/page.c | 4 +-- include/linux/balloon_compaction.h | 34 +--- include/linux/page-flags.h | 11 +--- include/uapi/linux/kernel-page-flags.h | 2 +- kernel/crash_core.c | 2 ++ kernel/power/snapshot.c | 17 +++- tools/vm/page-types.c| 2 +- 11 files changed, 90 insertions(+), 40 deletions(-) -- 2.17.2
[PATCH v2 2/8] mm: convert PG_balloon to PG_offline
PG_balloon was introduced to implement page migration/compaction for pages inflated in virtio-balloon. Nowadays, it is only a marker that a page is part of virtio-balloon and therefore logically offline. We also want to make use of this flag in other balloon drivers - for inflated pages or when onlining a section but keeping some pages offline (e.g. used right now by XEN and Hyper-V via set_online_page_callback()). We are going to expose this flag to dump tools like makedumpfile. But instead of exposing PG_balloon, let's generalize the concept of marking pages as logically offline, so it can be reused for other purposes later on. Rename PG_balloon to PG_offline. This is an indicator that the page is logically offline, the content stale and that it should not be touched (e.g. a hypervisor would have to allocate backing storage in order for the guest to dump an unused page). We can then e.g. exclude such pages from dumps. We replace and reuse KPF_BALLOON (23), as this shouldn't really harm (and for now the semantics stay the same). In following patches, we will make use of this bit also in other balloon drivers. While at it, document PGTABLE. Cc: Jonathan Corbet Cc: Alexey Dobriyan Cc: Mike Rapoport Cc: Andrew Morton Cc: Christian Hansen Cc: Vlastimil Babka Cc: "Kirill A. Shutemov" Cc: Stephen Rothwell Cc: Matthew Wilcox Cc: "Michael S. Tsirkin" Cc: Michal Hocko Cc: Pavel Tatashin Cc: Alexander Duyck Cc: Naoya Horiguchi Cc: Miles Chen Cc: David Rientjes Cc: Konstantin Khlebnikov Cc: Kazuhito Hagio Acked-by: Konstantin Khlebnikov Acked-by: Michael S. 
Tsirkin Acked-by: Pankaj gupta Signed-off-by: David Hildenbrand --- Documentation/admin-guide/mm/pagemap.rst | 9 ++--- fs/proc/page.c | 4 ++-- include/linux/balloon_compaction.h | 8 include/linux/page-flags.h | 11 +++ include/uapi/linux/kernel-page-flags.h | 2 +- tools/vm/page-types.c| 2 +- 6 files changed, 21 insertions(+), 15 deletions(-) diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst index 3f7bade2c231..340a5aee9b80 100644 --- a/Documentation/admin-guide/mm/pagemap.rst +++ b/Documentation/admin-guide/mm/pagemap.rst @@ -75,9 +75,10 @@ number of times a page is mapped. 20. NOPAGE 21. KSM 22. THP -23. BALLOON +23. OFFLINE 24. ZERO_PAGE 25. IDLE +26. PGTABLE * ``/proc/kpagecgroup``. This file contains a 64-bit inode number of the memory cgroup each page is charged to, indexed by PFN. Only available when @@ -118,8 +119,8 @@ Short descriptions to the page flags identical memory pages dynamically shared between one or more processes 22 - THP contiguous pages which construct transparent hugepages -23 - BALLOON -balloon compaction page +23 - OFFLINE +page is logically offline 24 - ZERO_PAGE zero page for pfn_zero or huge_zero page 25 - IDLE @@ -128,6 +129,8 @@ Short descriptions to the page flags Note that this flag may be stale in case the page was accessed via a PTE. To make sure the flag is up-to-date one has to read ``/sys/kernel/mm/page_idle/bitmap`` first. 
+26 - PGTABLE +page is in use as a page table IO related page flags - diff --git a/fs/proc/page.c b/fs/proc/page.c index 6c517b11acf8..378401af4d9d 100644 --- a/fs/proc/page.c +++ b/fs/proc/page.c @@ -152,8 +152,8 @@ u64 stable_page_flags(struct page *page) else if (page_count(page) == 0 && is_free_buddy_page(page)) u |= 1 << KPF_BUDDY; - if (PageBalloon(page)) - u |= 1 << KPF_BALLOON; + if (PageOffline(page)) + u |= 1 << KPF_OFFLINE; if (PageTable(page)) u |= 1 << KPF_PGTABLE; diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h index cbe50da5a59d..f111c780ef1d 100644 --- a/include/linux/balloon_compaction.h +++ b/include/linux/balloon_compaction.h @@ -95,7 +95,7 @@ extern int balloon_page_migrate(struct address_space *mapping, static inline void balloon_page_insert(struct balloon_dev_info *balloon, struct page *page) { - __SetPageBalloon(page); + __SetPageOffline(page); __SetPageMovable(page, balloon->inode->i_mapping); set_page_private(page, (unsigned long)balloon); list_add(>lru, >pages); @@ -111,7 +111,7 @@ static inline void balloon_page_insert(struct balloon_dev_info *balloon, */ static inline void balloon_page_delete(struct page *page) { - __ClearPageBalloon(page); + __ClearPageOffline(page); __ClearPageMovable(page); set_page_private(page, 0); /* @@ -141,13 +141,13 @@ static inline gfp_t balloon_mapping_gfp_mask(void) static inline void balloon_page_insert(struct balloon_dev_info *b
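Per the updated pagemap.rst hunk above, user space can now detect such pages through bit 23 of a /proc/kpageflags entry — OFFLINE, which reuses the former BALLOON bit. A minimal sketch of the bit test (actually reading /proc/kpageflags additionally requires root and one 8-byte read per pfn, which is omitted here):

```c
#include <stdint.h>

#define KPF_OFFLINE 23	/* was KPF_BALLOON; same bit, generalized meaning */

/* Check one 64-bit /proc/kpageflags entry for the OFFLINE flag. */
int kpf_is_offline(uint64_t kpageflags_entry)
{
	return (kpageflags_entry >> KPF_OFFLINE) & 1;
}
```

This is the same check tools/vm/page-types.c performs once its flag table is updated by this patch.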
Re: [PATCH v1] makedumpfile: exclude pages that are logically offline
On 21.11.18 15:58, Kazuhito Hagio wrote: > Hi David, > >> Linux marks pages that are logically offline via a page flag (map count). >> Such pages e.g. include pages inflated as part of a balloon driver or >> pages that were not actually onlined when onlining the whole section. >> >> While the hypervisor usually allows to read such inflated memory, we >> basically read and dump data that is completely irrelevant. Also, this >> might result in quite some overhead in the hypervisor. In addition, >> we saw some problems under Hyper-V, whereby we can crash the kernel by >> dumping, when reading memory of a partially onlined memory segment >> (for memory added by the Hyper-V balloon driver). >> >> Therefore, don't read and dump pages that are marked as being logically >> offline. >> >> Signed-off-by: David Hildenbrand >> --- >> makedumpfile.c | 34 ++ >> makedumpfile.h | 1 + >> 2 files changed, 31 insertions(+), 4 deletions(-) >> >> diff --git a/makedumpfile.c b/makedumpfile.c >> index 8923538..b8bfd4c 100644 >> --- a/makedumpfile.c >> +++ b/makedumpfile.c >> @@ -88,6 +88,7 @@ mdf_pfn_t pfn_cache_private; >> mdf_pfn_t pfn_user; >> mdf_pfn_t pfn_free; >> mdf_pfn_t pfn_hwpoison; >> +mdf_pfn_t pfn_offline; >> >> mdf_pfn_t num_dumped; >> >> @@ -249,6 +250,21 @@ isHugetlb(unsigned long dtor) >> && (SYMBOL(free_huge_page) == dtor)); >> } >> >> +static int >> +isOffline(unsigned long flags, unsigned int _mapcount) >> +{ >> +if (NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE) == NOT_FOUND_NUMBER) >> +return FALSE; > > This is NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE), isn't it? > If so, I will correct it when merging. > > Otherwise, looks good to me. > > Thanks! > Kazu Indeed, I will most probably resend either way along with a new mm series! Thanks a lot! -- Thanks, David / dhildenb
Re: [PATCH v1 8/8] PM / Hibernate: exclude all PageOffline() pages
On 21.11.18 12:35, William Kucharski wrote: > If you are adding PageOffline(page) to the condition list of the already > existing if in > saveable_highmem_page(), why explicitly add it as a separate statement in > saveable_page()? > > It would seem more consistent to make the second check: > > - if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page)) > + if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page) || > + PageOffline(page)) > > instead. > > It's admittedly a nit but it just seems cleaner to either do that or, if your > intention > was to separate the Page checks from the swsusp checks, to break the calls to > PageReserved() and PageOffline() into their own check in > saveable_highmem_page(). I'll split PageReserved() and PageOffline() off from the swsusp checks, thanks for your comment! > > Thanks! > -- Bill -- Thanks, David / dhildenb ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [PATCH v1 6/8] vmw_balloon: mark inflated pages PG_offline
On 21.11.18 04:22, Nadav Amit wrote: > Thanks for this patch! > >> On Nov 19, 2018, at 2:16 AM, David Hildenbrand wrote: >> >> Mark inflated and never onlined pages PG_offline, to tell the world that >> the content is stale and should not be dumped. >> >> Cc: Xavier Deguillard >> Cc: Nadav Amit >> Cc: Arnd Bergmann >> Cc: Greg Kroah-Hartman >> Cc: Julien Freche >> Cc: Andrew Morton >> Cc: Matthew Wilcox >> Cc: Michal Hocko >> Cc: "Michael S. Tsirkin" >> Signed-off-by: David Hildenbrand >> --- >> drivers/misc/vmw_balloon.c | 32 >> 1 file changed, 32 insertions(+) >> >> diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c >> index e6126a4b95d3..8cc8bd9a4e32 100644 >> --- a/drivers/misc/vmw_balloon.c >> +++ b/drivers/misc/vmw_balloon.c >> @@ -544,6 +544,36 @@ unsigned int vmballoon_page_order(enum >> vmballoon_page_size_type page_size) >> return page_size == VMW_BALLOON_2M_PAGE ? VMW_BALLOON_2M_ORDER : 0; >> } >> >> +/** >> + * vmballoon_mark_page_offline() - mark a page as offline >> + * @page: pointer for the page > > If possible, please add a period at the end of the sentence (yes, I know I > got it wrong in some places too). Sure :) > >> + * @page_size: the size of the page. >> + */ >> +static void >> +vmballoon_mark_page_offline(struct page *page, >> +enum vmballoon_page_size_type page_size) >> +{ >> +int i; >> + >> +for (i = 0; i < 1ULL << vmballoon_page_order(page_size); i++) > > Can you please do instead: > > unsigned int; > > for (i = 0; i < vmballoon_page_in_frames(page_size); i++) > Will do, will have to move both functions a little bit down in the file (exactly one function). > We would like to test it in the next few days, but in the meanwhile, after > you address these minor issues: > > Acked-by: Nadav Amit Thanks! > > Thanks again, > Nadav > -- Thanks, David / dhildenb ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [PATCH v1 3/8] kexec: export PG_offline to VMCOREINFO
On 21.11.18 07:04, Baoquan He wrote: > On 11/19/18 at 11:16am, David Hildenbrand wrote: >> diff --git a/kernel/crash_core.c b/kernel/crash_core.c >> index 933cb3e45b98..093c9f917ed0 100644 >> --- a/kernel/crash_core.c >> +++ b/kernel/crash_core.c >> @@ -464,6 +464,8 @@ static int __init crash_save_vmcoreinfo_init(void) >> VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE); >> #ifdef CONFIG_HUGETLB_PAGE >> VMCOREINFO_NUMBER(HUGETLB_PAGE_DTOR); >> +#define PAGE_OFFLINE_MAPCOUNT_VALUE (~PG_offline) >> +VMCOREINFO_NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE); >> #endif > > This solution looks good to me. One small concern is why we don't > export PG_offline to vmcoreinfo directly, then define > PAGE_OFFLINE_MAPCOUNT_VALUE in makedumpfile. We have been exporting > kernel data/MACRO directly, why this one is exceptional. > 1. We are much more similar to PG_buddy (in contrast to actual page flags), and for PG_buddy it is historically handled like this (and I think it makes sense to expose these as actual MAPCOUNT_VALUEs). 2. Right now only one page type per page is supported. Therefore only exactly one value in mapcount indicates e.g. PageBuddy()/PageOffline(). Now, if we ever decide to change this (e.g. treat them like real flags), it is much easier to switch to PG_offline/PG_buddy then. We can directly see in makedumpfile that .*_MAPCOUNT_VALUE is no longer available but instead e.g. PG_offline and PG_buddy. Otherwise we would not see a change in makedumpfile and would have to rely on other properties. If there are no strong opinions, I will leave it like this. Thanks! > Thanks > Baoquan > -- Thanks, David / dhildenb
Re: [PATCH v1 5/8] hv_balloon: mark inflated pages PG_offline
On 20.11.18 09:45, Pankaj Gupta wrote: > > Hi David, > >> >> Mark inflated and never onlined pages PG_offline, to tell the world that >> the content is stale and should not be dumped. >> >> Cc: "K. Y. Srinivasan" >> Cc: Haiyang Zhang >> Cc: Stephen Hemminger >> Cc: Kairui Song >> Cc: Vitaly Kuznetsov >> Cc: Andrew Morton >> Cc: Matthew Wilcox >> Cc: Michal Hocko >> Cc: "Michael S. Tsirkin" >> Signed-off-by: David Hildenbrand >> --- >> drivers/hv/hv_balloon.c | 14 -- >> 1 file changed, 12 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c >> index 211f3fe3a038..47719862e57f 100644 >> --- a/drivers/hv/hv_balloon.c >> +++ b/drivers/hv/hv_balloon.c >> @@ -681,8 +681,13 @@ static struct notifier_block hv_memory_nb = { >> /* Check if the particular page is backed and can be onlined and online it. >> */ >> static void hv_page_online_one(struct hv_hotadd_state *has, struct page *pg) >> { >> -if (!has_pfn_is_backed(has, page_to_pfn(pg))) >> +if (!has_pfn_is_backed(has, page_to_pfn(pg))) { >> +if (!PageOffline(pg)) >> +__SetPageOffline(pg); >> return; >> +} >> +if (PageOffline(pg)) >> +__ClearPageOffline(pg); >> >> /* This frame is currently backed; online the page. */ >> __online_page_set_limits(pg); >> @@ -1201,6 +1206,7 @@ static void free_balloon_pages(struct hv_dynmem_device >> *dm, >> >> for (i = 0; i < num_pages; i++) { >> pg = pfn_to_page(i + start_frame); >> +__ClearPageOffline(pg); > > Just thinking, do we need to care for clearing PageOffline flag before freeing > a balloon'd page? Yes we have to otherwise the code will crash when trying to set PageBuddy. (only one page type at a time may be set right now, and it makes sense. A page that is offline cannot e.g. be a buddy page) So PageOffline is completely managed by the page owner. > > Thanks, > Pankaj -- Thanks, David / dhildenb ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [PATCH v1 2/8] mm: convert PG_balloon to PG_offline
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > index 50ce1bddaf56..f91da3d0a67e 100644 > --- a/include/linux/page-flags.h > +++ b/include/linux/page-flags.h > @@ -670,7 +670,7 @@ PAGEFLAG_FALSE(DoubleMap) > #define PAGE_TYPE_BASE 0xf0000000 > /* Reserve 0x0000007f to catch underflows of page_mapcount */ > #define PG_buddy 0x00000080 > -#define PG_balloon 0x00000100 > +#define PG_offline 0x00000100 > #define PG_kmemcg 0x00000200 > #define PG_table 0x00000400 > > @@ -700,10 +700,13 @@ static __always_inline void __ClearPage##uname(struct > page *page) \ > PAGE_TYPE_OPS(Buddy, buddy) > > /* > - * PageBalloon() is true for pages that are on the balloon page list > - * (see mm/balloon_compaction.c). > + * PageOffline() indicates that the pages is logically offline although the s/pages/page/ :) -- Thanks, David / dhildenb
Re: [PATCH v1 4/8] xen/balloon: mark inflated pages PG_offline
On 19.11.18 13:22, Juergen Gross wrote: > On 19/11/2018 11:16, David Hildenbrand wrote: >> Mark inflated and never onlined pages PG_offline, to tell the world that >> the content is stale and should not be dumped. >> >> Cc: Boris Ostrovsky >> Cc: Juergen Gross >> Cc: Stefano Stabellini >> Cc: Andrew Morton >> Cc: Matthew Wilcox >> Cc: Michal Hocko >> Cc: "Michael S. Tsirkin" >> Signed-off-by: David Hildenbrand >> --- >> drivers/xen/balloon.c | 3 +++ >> 1 file changed, 3 insertions(+) >> >> diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c >> index 12148289debd..14dd6b814db3 100644 >> --- a/drivers/xen/balloon.c >> +++ b/drivers/xen/balloon.c >> @@ -425,6 +425,7 @@ static int xen_bring_pgs_online(struct page *pg, >> unsigned int order) >> for (i = 0; i < size; i++) { >> p = pfn_to_page(start_pfn + i); >> __online_page_set_limits(p); >> +__SetPageOffline(p); >> __balloon_append(p); >> } > > This seems not to be based on current master. Could you please tell > against which tree this should be reviewed? > Hi Juergen, this is based on linux-next/master. Thanks! > > Juergen > -- Thanks, David / dhildenb ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH v1] makedumpfile: exclude pages that are logically offline
Linux marks pages that are logically offline via a page flag (map count). Such pages e.g. include pages inflated as part of a balloon driver or pages that were not actually onlined when onlining the whole section. While the hypervisor usually allows reading such inflated memory, we basically read and dump data that is completely irrelevant. Also, this might result in quite some overhead in the hypervisor. In addition, we saw some problems under Hyper-V, whereby we can crash the kernel by dumping, when reading memory of a partially onlined memory segment (for memory added by the Hyper-V balloon driver). Therefore, don't read and dump pages that are marked as being logically offline. Signed-off-by: David Hildenbrand --- makedumpfile.c | 34 ++ makedumpfile.h | 1 + 2 files changed, 31 insertions(+), 4 deletions(-) diff --git a/makedumpfile.c b/makedumpfile.c index 8923538..b8bfd4c 100644 --- a/makedumpfile.c +++ b/makedumpfile.c @@ -88,6 +88,7 @@ mdf_pfn_t pfn_cache_private; mdf_pfn_t pfn_user; mdf_pfn_t pfn_free; mdf_pfn_t pfn_hwpoison; +mdf_pfn_t pfn_offline; mdf_pfn_t num_dumped; @@ -249,6 +250,21 @@ isHugetlb(unsigned long dtor) && (SYMBOL(free_huge_page) == dtor)); } +static int +isOffline(unsigned long flags, unsigned int _mapcount) +{ + if (NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE) == NOT_FOUND_NUMBER) + return FALSE; + + if (flags & (1UL << NUMBER(PG_slab))) + return FALSE; + + if (_mapcount == (int)NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE)) + return TRUE; + + return FALSE; +} + static int is_cache_page(unsigned long flags) { @@ -2287,6 +2303,8 @@ write_vmcoreinfo_data(void) WRITE_NUMBER("PG_hwpoison", PG_hwpoison); WRITE_NUMBER("PAGE_BUDDY_MAPCOUNT_VALUE", PAGE_BUDDY_MAPCOUNT_VALUE); + WRITE_NUMBER("PAGE_OFFLINE_MAPCOUNT_VALUE", +PAGE_OFFLINE_MAPCOUNT_VALUE); WRITE_NUMBER("phys_base", phys_base); WRITE_NUMBER("HUGETLB_PAGE_DTOR", HUGETLB_PAGE_DTOR); @@ -2687,6 +2705,7 @@ read_vmcoreinfo(void) READ_SRCFILE("pud_t", pud_t); READ_NUMBER("PAGE_BUDDY_MAPCOUNT_VALUE",
PAGE_BUDDY_MAPCOUNT_VALUE); + READ_NUMBER("PAGE_OFFLINE_MAPCOUNT_VALUE", PAGE_OFFLINE_MAPCOUNT_VALUE); READ_NUMBER("phys_base", phys_base); #ifdef __aarch64__ READ_NUMBER("VA_BITS", VA_BITS); @@ -6041,6 +6060,12 @@ __exclude_unnecessary_pages(unsigned long mem_map, else if (isHWPOISON(flags)) { pfn_counter = _hwpoison; } + /* +* Exclude pages that are logically offline. +*/ + else if (isOffline(flags, _mapcount)) { + pfn_counter = _offline; + } /* * Unexcludable page */ @@ -7522,7 +7547,7 @@ write_elf_pages_cyclic(struct cache_data *cd_header, struct cache_data *cd_page) */ if (info->flag_cyclic) { pfn_zero = pfn_cache = pfn_cache_private = 0; - pfn_user = pfn_free = pfn_hwpoison = 0; + pfn_user = pfn_free = pfn_hwpoison = pfn_offline = 0; pfn_memhole = info->max_mapnr; } @@ -8804,7 +8829,7 @@ write_kdump_pages_and_bitmap_cyclic(struct cache_data *cd_header, struct cache_d * Reset counter for debug message. */ pfn_zero = pfn_cache = pfn_cache_private = 0; - pfn_user = pfn_free = pfn_hwpoison = 0; + pfn_user = pfn_free = pfn_hwpoison = pfn_offline = 0; pfn_memhole = info->max_mapnr; /* @@ -9749,7 +9774,7 @@ print_report(void) pfn_original = info->max_mapnr - pfn_memhole; pfn_excluded = pfn_zero + pfn_cache + pfn_cache_private - + pfn_user + pfn_free + pfn_hwpoison; + + pfn_user + pfn_free + pfn_hwpoison + pfn_offline; shrinking = (pfn_original - pfn_excluded) * 100; shrinking = shrinking / pfn_original; @@ -9763,6 +9788,7 @@ print_report(void) REPORT_MSG("User process data pages : 0x%016llx\n", pfn_user); REPORT_MSG("Free pages : 0x%016llx\n", pfn_free); REPORT_MSG("Hwpoison pages : 0x%016llx\n", pfn_hwpoison); + REPORT_MSG("Offline pages : 0x%016llx\n", pfn_offline); REPORT_MSG(" Remaining pages : 0x%016llx\n", pfn_original - pfn_excluded); REPORT_MSG(" (The number of pages is reduced to %lld%%.)\n", @@ -9790,7 +9816,7 @@ print_mem_usage(void) pfn_original = info->max_mapnr - pfn_memhole; pfn_excluded = pfn_zero + pfn_cache + pfn_cache_private -
[PATCH v1 8/8] PM / Hibernate: exclude all PageOffline() pages
The content of pages that are marked PG_offline is not of interest (e.g. inflated by a balloon driver), let's skip these pages. Cc: "Rafael J. Wysocki" Cc: Pavel Machek Cc: Len Brown Cc: Andrew Morton Cc: Matthew Wilcox Cc: Michal Hocko Cc: "Michael S. Tsirkin" Acked-by: Pavel Machek Signed-off-by: David Hildenbrand --- kernel/power/snapshot.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c index 87e6dd57819f..8d7b4d458842 100644 --- a/kernel/power/snapshot.c +++ b/kernel/power/snapshot.c @@ -1222,7 +1222,7 @@ static struct page *saveable_highmem_page(struct zone *zone, unsigned long pfn) BUG_ON(!PageHighMem(page)); if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page) || - PageReserved(page)) + PageReserved(page) || PageOffline(page)) return NULL; if (page_is_guard(page)) @@ -1286,6 +1286,9 @@ static struct page *saveable_page(struct zone *zone, unsigned long pfn) if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page)) return NULL; + if (PageOffline(page)) + return NULL; + if (PageReserved(page) && (!kernel_page_present(page) || pfn_is_nosave(pfn))) return NULL; -- 2.17.2 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH v1 7/8] PM / Hibernate: use pfn_to_online_page()
Let's use pfn_to_online_page() instead of pfn_to_page() when checking for saveable pages to not save/restore offline memory sections. Cc: "Rafael J. Wysocki" Cc: Pavel Machek Cc: Len Brown Cc: Andrew Morton Cc: Matthew Wilcox Cc: Michal Hocko Cc: "Michael S. Tsirkin" Suggested-by: Michal Hocko Signed-off-by: David Hildenbrand --- kernel/power/snapshot.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c index 640b2034edd6..87e6dd57819f 100644 --- a/kernel/power/snapshot.c +++ b/kernel/power/snapshot.c @@ -1215,8 +1215,8 @@ static struct page *saveable_highmem_page(struct zone *zone, unsigned long pfn) if (!pfn_valid(pfn)) return NULL; - page = pfn_to_page(pfn); - if (page_zone(page) != zone) + page = pfn_to_online_page(pfn); + if (!page || page_zone(page) != zone) return NULL; BUG_ON(!PageHighMem(page)); @@ -1277,8 +1277,8 @@ static struct page *saveable_page(struct zone *zone, unsigned long pfn) if (!pfn_valid(pfn)) return NULL; - page = pfn_to_page(pfn); - if (page_zone(page) != zone) + page = pfn_to_online_page(pfn); + if (!page || page_zone(page) != zone) return NULL; BUG_ON(PageHighMem(page)); -- 2.17.2 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH v1 6/8] vmw_balloon: mark inflated pages PG_offline
Mark inflated and never onlined pages PG_offline, to tell the world that the content is stale and should not be dumped. Cc: Xavier Deguillard Cc: Nadav Amit Cc: Arnd Bergmann Cc: Greg Kroah-Hartman Cc: Julien Freche Cc: Andrew Morton Cc: Matthew Wilcox Cc: Michal Hocko Cc: "Michael S. Tsirkin" Signed-off-by: David Hildenbrand --- drivers/misc/vmw_balloon.c | 32 1 file changed, 32 insertions(+) diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c index e6126a4b95d3..8cc8bd9a4e32 100644 --- a/drivers/misc/vmw_balloon.c +++ b/drivers/misc/vmw_balloon.c @@ -544,6 +544,36 @@ unsigned int vmballoon_page_order(enum vmballoon_page_size_type page_size) return page_size == VMW_BALLOON_2M_PAGE ? VMW_BALLOON_2M_ORDER : 0; } +/** + * vmballoon_mark_page_offline() - mark a page as offline + * @page: pointer for the page + * @page_size: the size of the page. + */ +static void +vmballoon_mark_page_offline(struct page *page, + enum vmballoon_page_size_type page_size) +{ + int i; + + for (i = 0; i < 1ULL << vmballoon_page_order(page_size); i++) + __SetPageOffline(page + i); +} + +/** + * vmballoon_mark_page_online() - mark a page as online + * @page: pointer for the page + * @page_size: the size of the page. + */ +static void +vmballoon_mark_page_online(struct page *page, + enum vmballoon_page_size_type page_size) +{ + int i; + + for (i = 0; i < 1ULL << vmballoon_page_order(page_size); i++) + __ClearPageOffline(page + i); +} + /** * vmballoon_page_in_frames() - returns the number of frames in a page. * @page_size: the size of the page. @@ -612,6 +642,7 @@ static int vmballoon_alloc_page_list(struct vmballoon *b, ctl->page_size); if (page) { + vmballoon_mark_page_offline(page, ctl->page_size); /* Success. Add the page to the list and continue. 
*/ list_add(>lru, >pages); continue; @@ -850,6 +881,7 @@ static void vmballoon_release_page_list(struct list_head *page_list, list_for_each_entry_safe(page, tmp, page_list, lru) { list_del(>lru); + vmballoon_mark_page_online(page, page_size); __free_pages(page, vmballoon_page_order(page_size)); } -- 2.17.2 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH v1 4/8] xen/balloon: mark inflated pages PG_offline
Mark inflated and never onlined pages PG_offline, to tell the world that the content is stale and should not be dumped. Cc: Boris Ostrovsky Cc: Juergen Gross Cc: Stefano Stabellini Cc: Andrew Morton Cc: Matthew Wilcox Cc: Michal Hocko Cc: "Michael S. Tsirkin" Signed-off-by: David Hildenbrand --- drivers/xen/balloon.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c index 12148289debd..14dd6b814db3 100644 --- a/drivers/xen/balloon.c +++ b/drivers/xen/balloon.c @@ -425,6 +425,7 @@ static int xen_bring_pgs_online(struct page *pg, unsigned int order) for (i = 0; i < size; i++) { p = pfn_to_page(start_pfn + i); __online_page_set_limits(p); + __SetPageOffline(p); __balloon_append(p); } mutex_unlock(_mutex); @@ -493,6 +494,7 @@ static enum bp_state increase_reservation(unsigned long nr_pages) xenmem_reservation_va_mapping_update(1, , _list[i]); /* Relinquish the page back to the allocator. */ + __ClearPageOffline(page); free_reserved_page(page); } @@ -519,6 +521,7 @@ static enum bp_state decrease_reservation(unsigned long nr_pages, gfp_t gfp) state = BP_EAGAIN; break; } + __SetPageOffline(page); adjust_managed_page_count(page, -1); xenmem_reservation_scrub_page(page); list_add(>lru, ); -- 2.17.2 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH v1 3/8] kexec: export PG_offline to VMCOREINFO
Right now, pages inflated as part of a balloon driver will be dumped by dump tools like makedumpfile. While XEN is able to check in the crash kernel whether a certain pfn is actually backed by memory in the hypervisor (see xen_oldmem_pfn_is_ram) and optimize this case, dumps of other balloon inflated memory will essentially result in zero pages getting allocated by the hypervisor and the dump getting filled with this data. The allocation and reading of zero pages can directly be avoided if a dumping tool could know which pages only contain stale information not to be dumped. We now have PG_offline which can be (and already is by virtio-balloon) used for marking pages as logically offline. Follow up patches will make use of this flag also in other balloon implementations. Let's export PG_offline via PAGE_OFFLINE_MAPCOUNT_VALUE, so makedumpfile can directly skip pages that are logically offline and the content therefore stale. Please note that this is also helpful for a problem we were seeing under Hyper-V: Dumping logically offline memory (pages kept fake offline while onlining a section via online_page_callback) would under some conditions result in a kernel panic when dumping them. Cc: Andrew Morton Cc: Dave Young Cc: "Kirill A. Shutemov" Cc: Baoquan He Cc: Omar Sandoval Cc: Arnd Bergmann Cc: Matthew Wilcox Cc: Michal Hocko Cc: "Michael S. Tsirkin" Cc: Lianbo Jiang Cc: Borislav Petkov Cc: Kazuhito Hagio Signed-off-by: David Hildenbrand --- kernel/crash_core.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/crash_core.c b/kernel/crash_core.c index 933cb3e45b98..093c9f917ed0 100644 --- a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ -464,6 +464,8 @@ static int __init crash_save_vmcoreinfo_init(void) VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE); #ifdef CONFIG_HUGETLB_PAGE VMCOREINFO_NUMBER(HUGETLB_PAGE_DTOR); +#define PAGE_OFFLINE_MAPCOUNT_VALUE (~PG_offline) + VMCOREINFO_NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE); #endif arch_crash_save_vmcoreinfo(); -- 2.17.2
[PATCH v1 1/8] mm: balloon: update comment about isolation/migration/compaction
Commit b1123ea6d3b3 ("mm: balloon: use general non-lru movable page feature") reworked balloon handling to make use of the general non-lru movable page feature. The big comment block in balloon_compaction.h contains quite some outdated information. Let's fix this. Cc: Andrew Morton Cc: Matthew Wilcox Cc: Michal Hocko Cc: "Michael S. Tsirkin" Signed-off-by: David Hildenbrand --- include/linux/balloon_compaction.h | 26 +- 1 file changed, 9 insertions(+), 17 deletions(-) diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h index 53051f3d8f25..cbe50da5a59d 100644 --- a/include/linux/balloon_compaction.h +++ b/include/linux/balloon_compaction.h @@ -4,15 +4,18 @@ * * Common interface definitions for making balloon pages movable by compaction. * - * Despite being perfectly possible to perform ballooned pages migration, they - * make a special corner case to compaction scans because balloon pages are not - * enlisted at any LRU list like the other pages we do compact / migrate. + * Balloon page migration makes use of the general non-lru movable page + * feature. + * + * page->private is used to reference the responsible balloon device. + * page->mapping is used in context of non-lru page migration to reference + * the address space operations for page isolation/migration/compaction. * * As the page isolation scanning step a compaction thread does is a lockless * procedure (from a page standpoint), it might bring some racy situations while * performing balloon page compaction. In order to sort out these racy scenarios * and safely perform balloon's page compaction and migration we must, always, - * ensure following these three simple rules: + * ensure following these simple rules: * * i. when updating a balloon's page ->mapping element, strictly do it under * the following lock order, independently of the far superior @@ -21,19 +24,8 @@ * +--spin_lock_irq(_dev_info->pages_lock); * ... page->mapping updates here ... * - * ii. 
before isolating or dequeueing a balloon page from the balloon device - * pages list, the page reference counter must be raised by one and the - * extra refcount must be dropped when the page is enqueued back into - * the balloon device page list, thus a balloon page keeps its reference - * counter raised only while it is under our special handling; - * - * iii. after the lockless scan step have selected a potential balloon page for - * isolation, re-test the PageBalloon mark and the PagePrivate flag - * under the proper page lock, to ensure isolating a valid balloon page - * (not yet isolated, nor under release procedure) - * - * iv. isolation or dequeueing procedure must clear PagePrivate flag under - * page lock together with removing page from balloon device page list. + * ii. isolation or dequeueing procedure must remove the page from balloon + * device page list under b_dev_info->pages_lock. * * The functions provided by this interface are placed to help on coping with * the aforementioned balloon page corner case, as well as to ensure the simple -- 2.17.2 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH v1 2/8] mm: convert PG_balloon to PG_offline
PG_balloon was introduced to implement page migration/compaction for pages inflated in virtio-balloon. Nowadays, it is only a marker that a page is part of virtio-balloon and therefore logically offline. We also want to make use of this flag in other balloon drivers - for inflated pages or when onlining a section but keeping some pages offline (e.g. used right now by XEN and Hyper-V via set_online_page_callback()). We are going to expose this flag to dump tools like makedumpfile. But instead of exposing PG_balloon, let's generalize the concept of marking pages as logically offline, so it can be reused for other purposes later on. Rename PG_balloon to PG_offline. This is an indicator that the page is logically offline, the content stale and that it should not be touched (e.g. a hypervisor would have to allocate backing storage in order for the guest to dump an unused page). We can then e.g. exclude such pages from dumps. We replace and reuse KPF_BALLOON (23), as this shouldn't really harm (and for now the semantics stay the same). In following patches, we will make use of this bit also in other balloon drivers. While at it, document PGTABLE. Cc: Jonathan Corbet Cc: Alexey Dobriyan Cc: Mike Rapoport Cc: Andrew Morton Cc: Christian Hansen Cc: Vlastimil Babka Cc: "Kirill A. Shutemov" Cc: Stephen Rothwell Cc: Matthew Wilcox Cc: "Michael S. 
Tsirkin" Cc: Michal Hocko Cc: Pavel Tatashin Cc: Alexander Duyck Cc: Naoya Horiguchi Cc: Miles Chen Cc: David Rientjes Cc: Konstantin Khlebnikov Cc: Kazuhito Hagio Signed-off-by: David Hildenbrand --- Documentation/admin-guide/mm/pagemap.rst | 9 ++--- fs/proc/page.c | 4 ++-- include/linux/balloon_compaction.h | 8 include/linux/page-flags.h | 11 +++ include/uapi/linux/kernel-page-flags.h | 2 +- tools/vm/page-types.c| 2 +- 6 files changed, 21 insertions(+), 15 deletions(-) diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst index 3f7bade2c231..340a5aee9b80 100644 --- a/Documentation/admin-guide/mm/pagemap.rst +++ b/Documentation/admin-guide/mm/pagemap.rst @@ -75,9 +75,10 @@ number of times a page is mapped. 20. NOPAGE 21. KSM 22. THP -23. BALLOON +23. OFFLINE 24. ZERO_PAGE 25. IDLE +26. PGTABLE * ``/proc/kpagecgroup``. This file contains a 64-bit inode number of the memory cgroup each page is charged to, indexed by PFN. Only available when @@ -118,8 +119,8 @@ Short descriptions to the page flags identical memory pages dynamically shared between one or more processes 22 - THP contiguous pages which construct transparent hugepages -23 - BALLOON -balloon compaction page +23 - OFFLINE +page is logically offline 24 - ZERO_PAGE zero page for pfn_zero or huge_zero page 25 - IDLE @@ -128,6 +129,8 @@ Short descriptions to the page flags Note that this flag may be stale in case the page was accessed via a PTE. To make sure the flag is up-to-date one has to read ``/sys/kernel/mm/page_idle/bitmap`` first. 
+26 - PGTABLE +page is in use as a page table IO related page flags - diff --git a/fs/proc/page.c b/fs/proc/page.c index 6c517b11acf8..378401af4d9d 100644 --- a/fs/proc/page.c +++ b/fs/proc/page.c @@ -152,8 +152,8 @@ u64 stable_page_flags(struct page *page) else if (page_count(page) == 0 && is_free_buddy_page(page)) u |= 1 << KPF_BUDDY; - if (PageBalloon(page)) - u |= 1 << KPF_BALLOON; + if (PageOffline(page)) + u |= 1 << KPF_OFFLINE; if (PageTable(page)) u |= 1 << KPF_PGTABLE; diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h index cbe50da5a59d..f111c780ef1d 100644 --- a/include/linux/balloon_compaction.h +++ b/include/linux/balloon_compaction.h @@ -95,7 +95,7 @@ extern int balloon_page_migrate(struct address_space *mapping, static inline void balloon_page_insert(struct balloon_dev_info *balloon, struct page *page) { - __SetPageBalloon(page); + __SetPageOffline(page); __SetPageMovable(page, balloon->inode->i_mapping); set_page_private(page, (unsigned long)balloon); list_add(>lru, >pages); @@ -111,7 +111,7 @@ static inline void balloon_page_insert(struct balloon_dev_info *balloon, */ static inline void balloon_page_delete(struct page *page) { - __ClearPageBalloon(page); + __ClearPageOffline(page); __ClearPageMovable(page); set_page_private(page, 0); /* @@ -141,13 +141,13 @@ static inline gfp_t balloon_mapping_gfp_mask(void) static inline void balloon_page_insert(struct balloon_dev_info *balloon, struct page *page) { - __SetPageBalloon(page); + __Se
[PATCH v1 5/8] hv_balloon: mark inflated pages PG_offline
Mark inflated and never onlined pages PG_offline, to tell the world that the content is stale and should not be dumped. Cc: "K. Y. Srinivasan" Cc: Haiyang Zhang Cc: Stephen Hemminger Cc: Kairui Song Cc: Vitaly Kuznetsov Cc: Andrew Morton Cc: Matthew Wilcox Cc: Michal Hocko Cc: "Michael S. Tsirkin" Signed-off-by: David Hildenbrand --- drivers/hv/hv_balloon.c | 14 -- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c index 211f3fe3a038..47719862e57f 100644 --- a/drivers/hv/hv_balloon.c +++ b/drivers/hv/hv_balloon.c @@ -681,8 +681,13 @@ static struct notifier_block hv_memory_nb = { /* Check if the particular page is backed and can be onlined and online it. */ static void hv_page_online_one(struct hv_hotadd_state *has, struct page *pg) { - if (!has_pfn_is_backed(has, page_to_pfn(pg))) + if (!has_pfn_is_backed(has, page_to_pfn(pg))) { + if (!PageOffline(pg)) + __SetPageOffline(pg); return; + } + if (PageOffline(pg)) + __ClearPageOffline(pg); /* This frame is currently backed; online the page. 
*/ __online_page_set_limits(pg); @@ -1201,6 +1206,7 @@ static void free_balloon_pages(struct hv_dynmem_device *dm, for (i = 0; i < num_pages; i++) { pg = pfn_to_page(i + start_frame); + __ClearPageOffline(pg); __free_page(pg); dm->num_pages_ballooned--; } @@ -1213,7 +1219,7 @@ static unsigned int alloc_balloon_pages(struct hv_dynmem_device *dm, struct dm_balloon_response *bl_resp, int alloc_unit) { - unsigned int i = 0; + unsigned int i, j; struct page *pg; if (num_pages < alloc_unit) @@ -1245,6 +1251,10 @@ static unsigned int alloc_balloon_pages(struct hv_dynmem_device *dm, if (alloc_unit != 1) split_page(pg, get_order(alloc_unit << PAGE_SHIFT)); + /* mark all pages offline */ + for (j = 0; j < (1 << get_order(alloc_unit << PAGE_SHIFT)); j++) + __SetPageOffline(pg + j); + bl_resp->range_count++; bl_resp->range_array[i].finfo.start_page = page_to_pfn(pg); -- 2.17.2 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH v1 0/8] mm/kdump: allow to exclude pages that are logically offline
Right now, pages inflated as part of a balloon driver will be dumped by dump tools like makedumpfile. While XEN is able to check in the crash kernel whether a certain pfn is actually backed by memory in the hypervisor (see xen_oldmem_pfn_is_ram) and optimize this case, dumps of virtio-balloon, hv-balloon and VMWare balloon inflated memory will essentially result in zero pages getting allocated by the hypervisor and the dump getting filled with this data. The allocation and reading of zero pages can directly be avoided if a dumping tool could know which pages only contain stale information not to be dumped. Also for XEN, calling into the kernel and asking the hypervisor if a pfn is backed can be avoided if the dumping tool would skip such pages right from the beginning. Dumping tools have no idea whether a given page is part of a balloon driver and shall not be dumped. Esp. PG_reserved cannot be used for that purpose as all memory allocated during early boot is also PG_reserved, see discussion at [1]. So some other way of indication is required and a new page flag is frowned upon. We have PG_balloon (MAPCOUNT value), which is essentially unused now. I suggest renaming it to something more generic (PG_offline) to mark pages as logically offline. This flag can then e.g. also be used by virtio-mem in the future to mark subsections as offline. Or by other code that wants to put pages logically offline (e.g. later maybe poisoned pages that shall no longer be used). This series converts PG_balloon to PG_offline, allows dumping tools to query the value to detect such pages and marks pages in the hv-balloon and XEN balloon properly as PG_offline. Note that virtio-balloon already set pages to PG_balloon (and now PG_offline).
Please note that this is also helpful for a problem we were seeing under Hyper-V: Dumping logically offline memory (pages kept fake offline while onlining a section via online_page_callback) would under some conditions result in a kernel panic when dumping them.

As I don't have access to XEN, Hyper-V, or VMWare installations, this was only tested with virtio-balloon, and pages were properly skipped when dumping. I'll also attach the makedumpfile patch to this series.

[1] https://lkml.org/lkml/2018/7/20/566

RFC -> v1:
- Add "PM / Hibernate: use pfn_to_online_page()"
- Add "vmw_balloon: mark inflated pages PG_offline"
- "mm: convert PG_balloon to PG_offline" -- After discussions, also rename the UAPI bit name (KPF_BALLOON -> KPF_OFFLINE)

David Hildenbrand (8):
  mm: balloon: update comment about isolation/migration/compaction
  mm: convert PG_balloon to PG_offline
  kexec: export PG_offline to VMCOREINFO
  xen/balloon: mark inflated pages PG_offline
  hv_balloon: mark inflated pages PG_offline
  vmw_balloon: mark inflated pages PG_offline
  PM / Hibernate: use pfn_to_online_page()
  PM / Hibernate: exclude all PageOffline() pages

 Documentation/admin-guide/mm/pagemap.rst |  9 ---
 drivers/hv/hv_balloon.c                  | 14 --
 drivers/misc/vmw_balloon.c               | 32 ++
 drivers/xen/balloon.c                    |  3 +++
 fs/proc/page.c                           |  4 +--
 include/linux/balloon_compaction.h       | 34 +---
 include/linux/page-flags.h               | 11 +---
 include/uapi/linux/kernel-page-flags.h   |  2 +-
 kernel/crash_core.c                      |  2 ++
 kernel/power/snapshot.c                  | 13 +
 tools/vm/page-types.c                    |  2 +-
 11 files changed, 87 insertions(+), 39 deletions(-)

-- 
2.17.2
Re: [PATCH RFC 0/6] mm/kdump: allow to exclude pages that are logically offline
On 14.11.18 22:16, David Hildenbrand wrote:
> Right now, pages inflated as part of a balloon driver will be dumped
> by dump tools like makedumpfile. While XEN is able to check in the
> crash kernel whether a certain pfn is actually backed by memory in the
> hypervisor (see xen_oldmem_pfn_is_ram) and optimize this case, dumps of
> virtio-balloon and hv-balloon inflated memory will essentially result in
> zero pages getting allocated by the hypervisor and the dump getting
> filled with this data.
>
> The allocation and reading of zero pages can directly be avoided if a
> dumping tool could know which pages only contain stale information not to
> be dumped.
>
> Also for XEN, calling into the kernel and asking the hypervisor if a
> pfn is backed can be avoided if the dumping tool would skip such pages
> right from the beginning.
>
> Dumping tools have no idea whether a given page is part of a balloon driver
> and shall not be dumped. Esp. PG_reserved cannot be used for that purpose
> as all memory allocated during early boot is also PG_reserved, see
> discussion at [1]. So some other way of indication is required and a new
> page flag is frowned upon.
>
> We have PG_balloon (MAPCOUNT value), which is essentially unused now. I
> suggest renaming it to something more generic (PG_offline) to mark pages as
> logically offline. This flag can then e.g. also be used by virtio-mem in
> the future to mark subsections as offline. Or by other code that wants to
> put pages logically offline (e.g. later maybe poisoned pages that shall
> no longer be used).
>
> This series converts PG_balloon to PG_offline, allows dumping tools to
> query the value to detect such pages and marks pages in the hv-balloon
> and XEN balloon properly as PG_offline. Note that virtio-balloon already
> sets pages to PG_balloon (and now PG_offline).
>
> Please note that this is also helpful for a problem we were seeing under
> Hyper-V: Dumping logically offline memory (pages kept fake offline while
> onlining a section via online_page_callback) would under some conditions
> result in a kernel panic when dumping them.
>
> As I don't have access to either a XEN or a Hyper-V installation, this was
> not tested yet (and a makedumpfile change will be required to skip
> dumping these pages).
>
> [1] https://lkml.org/lkml/2018/7/20/566
>
> David Hildenbrand (6):
>   mm: balloon: update comment about isolation/migration/compaction
>   mm: convert PG_balloon to PG_offline
>   kexec: export PG_offline to VMCOREINFO
>   xen/balloon: mark inflated pages PG_offline
>   hv_balloon: mark inflated pages PG_offline
>   PM / Hibernate: exclude all PageOffline() pages
>
>  Documentation/admin-guide/mm/pagemap.rst | 6 +
>  drivers/hv/hv_balloon.c                  | 14 --
>  drivers/xen/balloon.c                    | 3 +++
>  fs/proc/page.c                           | 4 +--
>  include/linux/balloon_compaction.h       | 34 +---
>  include/linux/page-flags.h               | 11 +---
>  include/uapi/linux/kernel-page-flags.h   | 1 +
>  kernel/crash_core.c                      | 2 ++
>  kernel/power/snapshot.c                  | 5 +++-
>  tools/vm/page-types.c                    | 1 +
>  10 files changed, 51 insertions(+), 30 deletions(-)
>

I just did a test with virtio-balloon (and a very simple makedumpfile patch which I can supply on demand).

1. Guest with 8GB. Inflate balloon to 4GB via
   sudo virsh setmem f29 --size 4096M --live
2. Trigger a kernel panic in the guest
   echo 1 > /proc/sys/kernel/sysrq
   echo c > /proc/sysrq-trigger

Original pages          : 0x001e1da8
Excluded pages          : 0x001c9221
  Pages filled with zero        : 0x50b0
  Non-private cache pages       : 0x00046547
  Private cache pages           : 0x2165
  User process data pages       : 0x48cf
  Free pages                    : 0x000771f6
  Hwpoison pages                : 0x
  Offline pages                 : 0x0010
Remaining pages         : 0x00018b87
(The number of pages is reduced to 5%.)
Memory Hole             : 0x0009e258
--
Total pages             : 0x0028

(Offline pages match the 4GB.)

-- 
Thanks,

David / dhildenb
Re: [PATCH RFC 6/6] PM / Hibernate: exclude all PageOffline() pages
On 15.11.18 13:23, Michal Hocko wrote:
> On Wed 14-11-18 22:17:04, David Hildenbrand wrote:
> [...]
>> diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
>> index b0308a2c6000..01db1d13481a 100644
>> --- a/kernel/power/snapshot.c
>> +++ b/kernel/power/snapshot.c
>> @@ -1222,7 +1222,7 @@ static struct page *saveable_highmem_page(struct zone *zone, unsigned long pfn)
>>  	BUG_ON(!PageHighMem(page));
>>
>>  	if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page) ||
>> -	    PageReserved(page))
>> +	    PageReserved(page) || PageOffline(page))
>>  		return NULL;
>>
>>  	if (page_is_guard(page))
>> @@ -1286,6 +1286,9 @@ static struct page *saveable_page(struct zone *zone, unsigned long pfn)
>>  	if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page))
>>  		return NULL;
>>
>> +	if (PageOffline(page))
>> +		return NULL;
>> +
>>  	if (PageReserved(page)
>>  	    && (!kernel_page_present(page) || pfn_is_nosave(pfn)))
>>  		return NULL;
>
> Btw. now that you are touching this file could you also make
> s@pfn_to_page@pfn_to_online_page@ please? We really do not want to touch
> offline pfn ranges in general. A separate patch for that of course.
>
> Thanks!

Sure thing, will look into that! Thanks!

-- 
Thanks,

David / dhildenb
Re: [PATCH RFC 3/6] kexec: export PG_offline to VMCOREINFO
On 15.11.18 12:52, Borislav Petkov wrote:
> On Thu, Nov 15, 2018 at 12:20:40PM +0100, David Hildenbrand wrote:
>> Sorry to say, but that is the current practice without which
>> makedumpfile would not be able to work at all. (exclude user pages,
>> exclude page cache, exclude buddy pages). Let's not reinvent the wheel
>> here. This is how dumping has worked forever.
>
> Sorry, but "we've always done this in the past" doesn't make it better.

Just saying that "I'm not the first to do it, don't hit me with a stick" :)

>
>> I don't see how there should be "set of pages which do not have
>> PG_offline".
>
> It doesn't have to be a set of pages. Think a (mmconfig perhaps) region
> which the kdump kernel should completely skip because poking in it in
> the kdump kernel causes all kinds of havoc like machine checks, etc.
> We've had and still have one issue like that.

Indeed. And we would still have that without makedumpfile. I think you are aware of this, but I'll explain it just for consistency: PG_hwpoison.

At some point we detect a HW error and mark a page as PG_hwpoison. makedumpfile knows how to treat that flag and can exclude it from the dump (== not access it). No crash.

kdump itself has no clue about old "struct pages". Especially:
a) Where they are located in memory (e.g. SPARSE)
b) What their format is ("where are the flags")
c) What the meaning of flags is ("what does bit X mean")

In order to know such information, we would have to do parsing of quite some information inside the kernel in kdump. Basically what makedumpfile does right now. Is this feasible? I don't think so.

So we would need another approach to communicate such information, as you said. I can't think of any, but if anybody reading this has an idea, please speak up. I am interested.

The *only* way right now we would have to handle such scenarios:
1. While dumping memory and we get a machine check, fake reading a zero page instead of crashing.
2.
While dumping memory and we get a fault, fake reading a zero page instead of crashing.

> But let me clarify my note: I don't want to be discussing with you the
> design of makedumpfile and how it should or should not work - that ship
> has already sailed. Apparently there are valid reasons to do it this
> way.

Indeed, and the basic design is to export these flags. (Let's say "unfortunately" - being able to handle such stuff in kdump directly would be the dream.)

> I was *simply* stating that it feels wrong to export mm flags like that.
>
> But as I said already, that is mm guys' call and looking at how we're
> already exporting a bunch of stuff in the vmcoreinfo - including other
> mm flags - I guess one more flag doesn't matter anymore.

Fair enough, noted. If you have an idea how to handle this in kdump, please let me know.

-- 
Thanks,

David / dhildenb
Re: [PATCH RFC 3/6] kexec: export PG_offline to VMCOREINFO
On 15.11.18 12:10, Borislav Petkov wrote:
> On Thu, Nov 15, 2018 at 02:19:23PM +0800, Dave Young wrote:
>> It would be good to copy some background info from the cover letter to the
>> patch description so that we can get a better understanding of why this is
>> needed now.
>>
>> BTW, Lianbo is working on a documentation of the vmcoreinfo exported
>> fields. Ccing him so that he is aware of this.
>>
>> Also cc Boris, although I do not want the doc changes to block this,
>> he might have a different opinion :)
>
> Yeah, my initial reaction is that exporting an mm-internal flag to
> userspace is a no-no.

Sorry to say, but that is the current practice without which makedumpfile would not be able to work at all (exclude user pages, exclude page cache, exclude buddy pages). Let's not reinvent the wheel here. This is how dumping has worked forever. Also see how hwpoisoned pages are handled. (Please note that most distributions only support dumping via makedumpfile, for said reasons.)

> What would be better, IMHO, is having a general method of telling the
> kdump kernel - maybe ranges of physical addresses - which to skip.

And that has to be updated whenever we change a page flag. But then the dump kernel would have to be aware of the "struct page" location and format of some old kernel to be dumped. Let's just not even discuss going down that path.

> Because the moment there's a set of pages which do not have PG_offline
> set but kdump would still like to skip, this breaks.

I don't understand your concern. PG_offline is only an optimization for sections that are online. Offline sections can already be completely ignored when dumping. I don't see how there should be a "set of pages which do not have PG_offline". I mean, if they don't have PG_offline they will simply be dumped like before. Which has worked forever. Sorry, I don't get your point.

-- 
Thanks,

David / dhildenb
Re: [PATCH RFC 3/6] kexec: export PG_offline to VMCOREINFO
On 15.11.18 07:19, Dave Young wrote:
> Hi David,
>
> On 11/14/18 at 10:17pm, David Hildenbrand wrote:
>> Let's export PG_offline via PAGE_OFFLINE_MAPCOUNT_VALUE, so
>> makedumpfile can directly skip pages that are logically offline and the
>> content therefore stale.
>
> It would be good to copy some background info from the cover letter to the
> patch description so that we can get a better understanding of why this is
> needed now.

Yes, will add more detail!

> BTW, Lianbo is working on a documentation of the vmcoreinfo exported
> fields. Ccing him so that he is aware of this.
>
> Also cc Boris, although I do not want the doc changes to block this,
> he might have a different opinion :)

I'll be happy to help updating the documentation (or update it myself if the doc updates go in first).

-- 
Thanks,

David / dhildenb
Re: [PATCH RFC 2/6] mm: convert PG_balloon to PG_offline
On 15.11.18 03:07, Mike Rapoport wrote:
> On Wed, Nov 14, 2018 at 11:49:15PM +0100, David Hildenbrand wrote:
>> On 14.11.18 23:23, Matthew Wilcox wrote:
>>> On Wed, Nov 14, 2018 at 10:17:00PM +0100, David Hildenbrand wrote:
>>>> Rename PG_balloon to PG_offline. This is an indicator that the page is
>>>> logically offline, the content stale and that it should not be touched
>>>> (e.g. a hypervisor would have to allocate backing storage in order for the
>>>> guest to dump an unused page). We can then e.g. exclude such pages from
>>>> dumps.
>>>>
>>>> In following patches, we will make use of this bit also in other balloon
>>>> drivers. While at it, document PGTABLE.
>>>
>>> Thank you for documenting PGTABLE. I didn't realise I also had this
>>> document to update when I added PGTABLE.
>>
>> Thank you for looking into this :)
>>
>>>> +++ b/Documentation/admin-guide/mm/pagemap.rst
>>>> @@ -78,6 +78,8 @@ number of times a page is mapped.
>>>>  23. BALLOON
>>>>  24. ZERO_PAGE
>>>>  25. IDLE
>>>> +26. PGTABLE
>>>> +27. OFFLINE
>>>
>>> So the offline *user* bit is new ... even though the *kernel* bit
>>> just renames the balloon bit. I'm not sure how I feel about this.
>>> I'm going to think about it some more. Could you share your decision
>>> process with us?
>>
>> BALLOON was/is documented as
>>
>> "23 - BALLOON
>>    balloon compaction page
>> "
>>
>> and only includes all virtio-balloon pages after the non-lru migration
>> feature has been implemented for ballooned pages. Since then, this flag
>> basically no longer stands for what it was actually supposed to do.
>
> Perhaps I'm missing something, but how should the user interpret "23" when he
> reads /proc/kpageflags?
Looking at the history in more detail:

commit 09316c09dde33aae14f34489d9e3d243ec0d5938
Author: Konstantin Khlebnikov
Date: Thu Oct 9 15:29:32 2014 -0700

    mm/balloon_compaction: add vmstat counters and kpageflags bit

    Always mark pages with PageBalloon even if balloon compaction is disabled
    and expose this mark in /proc/kpageflags as KPF_BALLOON.

So KPF_BALLOON was exposed when virtio-balloon pages were always marked with PG_balloon. So the documentation is actually wrong ("balloon page" vs. "balloon compaction page"). I have no idea who actually used that information. I suspect this was just some debugging aid.

>
>> To not break uapi I decided to not rename it but instead to add a new flag.
>
> I've got a feeling that uapi was anyway changed for the BALLOON flag
> meaning.

Yes. If we *replace* KPF_BALLOON by KPF_OFFLINE:
a) Some applications might no longer compile (I guess that's ok)
b) Some old applications will treat KPF_OFFLINE like KPF_BALLOON (which should, at least for virtio-balloon usage until now, be fine - it is just more generic)

So I guess it's up to the maintainers/Matthew to decide :)

-- 
Thanks,

David / dhildenb