Re: [PATCH v1 04/10] vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes

2019-11-08 Thread David Hildenbrand

On 08.11.19 19:29, Dan Williams wrote:

On Fri, Nov 8, 2019 at 2:22 AM David Hildenbrand  wrote:


On 08.11.19 08:14, David Hildenbrand wrote:

On 08.11.19 06:09, Dan Williams wrote:

On Thu, Nov 7, 2019 at 2:07 PM David Hildenbrand  wrote:


On 07.11.19 19:22, David Hildenbrand wrote:




Am 07.11.2019 um 16:40 schrieb Dan Williams :

On Thu, Oct 24, 2019 at 5:12 AM David Hildenbrand  wrote:


Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has this weird use case that you can map anything from /dev/mem
into the guest. pfn_valid() is not a reliable check whether the memmap
was initialized and can be touched. pfn_to_online_page() makes sure
that we have an initialized memmap (and don't have ZONE_DEVICE memory).

Rewrite is_invalid_reserved_pfn() similar to kvm_is_reserved_pfn() to make
sure the function produces the same result once we stop setting ZONE_DEVICE
pages PG_reserved.

Cc: Alex Williamson 
Cc: Cornelia Huck 
Signed-off-by: David Hildenbrand 
---
drivers/vfio/vfio_iommu_type1.c | 10 --
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2ada8e6cdb88..f8ce8c408ba8 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -299,9 +299,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long 
npage, bool async)
 */
static bool is_invalid_reserved_pfn(unsigned long pfn)
{
-   if (pfn_valid(pfn))
-   return PageReserved(pfn_to_page(pfn));
+   struct page *page = pfn_to_online_page(pfn);
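The quoted hunk is cut off here; judging from the matching kvm_is_reserved_pfn() rewrite later in this thread, the complete helper presumably ends up roughly like the following (a reconstruction for context, not the literal posted patch):

static bool is_invalid_reserved_pfn(unsigned long pfn)
{
        struct page *page = pfn_to_online_page(pfn);

        /*
         * Anything without an initialized, online memmap (ZONE_DEVICE, or
         * /dev/mem ranges without a memmap at all) is treated as reserved.
         */
        if (page)
                return PageReserved(page);
        return true;
}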


Ugh, I just realized this is not a safe conversion until
pfn_to_online_page() is moved over to subsection granularity. As it
stands it will return true for any ZONE_DEVICE pages that share a
section with boot memory.


That should not happen right now and I commented back when you introduced 
subsection support that I don’t want to have ZONE_DEVICE mixed with online 
pages in a section. Having memory block devices that partially span ZONE_DEVICE 
would be ... really weird. With something like pfn_active() - as discussed - we 
could at least make this check work - but I am not sure if we really want to go 
down that path. In the worst case, some MB of RAM are lost ... I guess this 
needs more thought.



I just realized the "boot memory" part. Is that a real thing? IOW, can
we have ZONE_DEVICE falling into a memory block (with holes)? I somewhat
have doubts that this would work ...


One of the real world failure cases that started the subsection effort
is that Persistent Memory collides with System RAM on a 64MB boundary
on shipping platforms. System RAM ends on a 64MB boundary and due to a
lack of memory controller resources PMEM is mapped contiguously at the
end of that boundary. Some more details in the subsection cover letter
/ changelogs [1] [2]. It's not sufficient to just lose some memory,
that's the broken implementation that led to the subsection work
because the lost memory may change from one boot to the next and
software can't reliably inject a padding that conforms to the x86
128MB section constraint.


Thanks, I thought it was mostly for weird alignment where other parts of
the section are basically "holes" and not memory.

Yes, it is a real bug that ZONE_DEVICE pages fall into sections that are
marked SECTION_IS_ONLINE.



Suffice to say I think we need your pfn_active() to get subsection
granularity pfn_to_online_page() before PageReserved() can be removed.


I agree that we have to fix this. I don't like ZONE_DEVICE pages falling
into memory block devices (e.g., they cannot get offlined), but I guess that
train is gone :) As long as it's not for memory hotplug, I can most
probably live with this.

Also, I'd like to get Michal's opinion on this and the pfn_active()
approach, but I can understand he's busy.

This patch set can wait, I won't be working next week besides
reading/writing mails either way.

Is anybody looking into the pfn_active() thingy?



I wonder if we should do something like this right now to fix this
(exclude the false positive ZONE_DEVICE pages we could have within an
online section, which was not possible before subsection hotplug):

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 384ffb3d69ab..490a9e9358b3 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -30,6 +30,8 @@ struct vmem_altmap;
  if (___nr < NR_MEM_SECTIONS && online_section_nr(___nr) && \
  pfn_valid_within(___pfn))  \
  ___page = pfn_to_page(___pfn); \
+   if (unlikely(___page && is_zone_device_page(___page))) \
+   ___page = NULL;\
  ___page;   \
   })


Yeah, it's another is_zone_device_page(), but it should not be racy
here, as we want to exclude, not include ZONE_DEVICE.

Re: [PATCH v1 04/10] vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes

2019-11-08 Thread David Hildenbrand

On 08.11.19 08:14, David Hildenbrand wrote:

On 08.11.19 06:09, Dan Williams wrote:

On Thu, Nov 7, 2019 at 2:07 PM David Hildenbrand  wrote:


On 07.11.19 19:22, David Hildenbrand wrote:




Am 07.11.2019 um 16:40 schrieb Dan Williams :

On Thu, Oct 24, 2019 at 5:12 AM David Hildenbrand  wrote:


Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has this weird use case that you can map anything from /dev/mem
into the guest. pfn_valid() is not a reliable check whether the memmap
was initialized and can be touched. pfn_to_online_page() makes sure
that we have an initialized memmap (and don't have ZONE_DEVICE memory).

Rewrite is_invalid_reserved_pfn() similar to kvm_is_reserved_pfn() to make
sure the function produces the same result once we stop setting ZONE_DEVICE
pages PG_reserved.

Cc: Alex Williamson 
Cc: Cornelia Huck 
Signed-off-by: David Hildenbrand 
---
drivers/vfio/vfio_iommu_type1.c | 10 --
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2ada8e6cdb88..f8ce8c408ba8 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -299,9 +299,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long 
npage, bool async)
*/
static bool is_invalid_reserved_pfn(unsigned long pfn)
{
-   if (pfn_valid(pfn))
-   return PageReserved(pfn_to_page(pfn));
+   struct page *page = pfn_to_online_page(pfn);


Ugh, I just realized this is not a safe conversion until
pfn_to_online_page() is moved over to subsection granularity. As it
stands it will return true for any ZONE_DEVICE pages that share a
section with boot memory.


That should not happen right now and I commented back when you introduced 
subsection support that I don’t want to have ZONE_DEVICE mixed with online 
pages in a section. Having memory block devices that partially span ZONE_DEVICE 
would be ... really weird. With something like pfn_active() - as discussed - we 
could at least make this check work - but I am not sure if we really want to go 
down that path. In the worst case, some MB of RAM are lost ... I guess this 
needs more thought.



I just realized the "boot memory" part. Is that a real thing? IOW, can
we have ZONE_DEVICE falling into a memory block (with holes)? I somewhat
have doubts that this would work ...


One of the real world failure cases that started the subsection effort
is that Persistent Memory collides with System RAM on a 64MB boundary
on shipping platforms. System RAM ends on a 64MB boundary and due to a
lack of memory controller resources PMEM is mapped contiguously at the
end of that boundary. Some more details in the subsection cover letter
/ changelogs [1] [2]. It's not sufficient to just lose some memory,
that's the broken implementation that led to the subsection work
because the lost memory may change from one boot to the next and
software can't reliably inject a padding that conforms to the x86
128MB section constraint.


Thanks, I thought it was mostly for weird alignment where other parts of
the section are basically "holes" and not memory.

Yes, it is a real bug that ZONE_DEVICE pages fall into sections that are
marked SECTION_IS_ONLINE.



Suffice to say I think we need your pfn_active() to get subsection
granularity pfn_to_online_page() before PageReserved() can be removed.


I agree that we have to fix this. I don't like ZONE_DEVICE pages falling
into memory block devices (e.g., they cannot get offlined), but I guess that
train is gone :) As long as it's not for memory hotplug, I can most
probably live with this.

Also, I'd like to get Michal's opinion on this and the pfn_active()
approach, but I can understand he's busy.

This patch set can wait, I won't be working next week besides
reading/writing mails either way.

Is anybody looking into the pfn_active() thingy?



I wonder if we should do something like this right now to fix this 
(exclude the false positive ZONE_DEVICE pages we could have within an 
online section, which was not possible before subsection hotplug):


diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 384ffb3d69ab..490a9e9358b3 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -30,6 +30,8 @@ struct vmem_altmap;
if (___nr < NR_MEM_SECTIONS && online_section_nr(___nr) && \
pfn_valid_within(___pfn))  \
___page = pfn_to_page(___pfn); \
+   if (unlikely(___page && is_zone_device_page(___page))) \
+   ___page = NULL;\
___page;   \
 })
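For context, with this hunk applied the whole pfn_to_online_page() helper would read roughly as below (reconstructed from the mainline definition of that time; whitespace and line continuations are approximate):

#define pfn_to_online_page(pfn)                                    \
({                                                                 \
        struct page *___page = NULL;                               \
        unsigned long ___pfn = pfn;                                \
        unsigned long ___nr = pfn_to_section_nr(___pfn);           \
                                                                   \
        if (___nr < NR_MEM_SECTIONS && online_section_nr(___nr) && \
            pfn_valid_within(___pfn))                              \
                ___page = pfn_to_page(___pfn);                     \
        if (unlikely(___page && is_zone_device_page(___page)))     \
                ___page = NULL;                                    \
        ___page;                                                   \
})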


Yeah, it's another is_zone_device_page(), but it should not be racy 
here, as we want to exclude, not include ZONE_DEVICE.


I don't have time to look into this right now

Re: [PATCH v1 04/10] vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes

2019-11-07 Thread David Hildenbrand

On 08.11.19 06:09, Dan Williams wrote:

On Thu, Nov 7, 2019 at 2:07 PM David Hildenbrand  wrote:


On 07.11.19 19:22, David Hildenbrand wrote:




Am 07.11.2019 um 16:40 schrieb Dan Williams :

On Thu, Oct 24, 2019 at 5:12 AM David Hildenbrand  wrote:


Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has this weird use case that you can map anything from /dev/mem
into the guest. pfn_valid() is not a reliable check whether the memmap
was initialized and can be touched. pfn_to_online_page() makes sure
that we have an initialized memmap (and don't have ZONE_DEVICE memory).

Rewrite is_invalid_reserved_pfn() similar to kvm_is_reserved_pfn() to make
sure the function produces the same result once we stop setting ZONE_DEVICE
pages PG_reserved.

Cc: Alex Williamson 
Cc: Cornelia Huck 
Signed-off-by: David Hildenbrand 
---
drivers/vfio/vfio_iommu_type1.c | 10 --
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2ada8e6cdb88..f8ce8c408ba8 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -299,9 +299,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long 
npage, bool async)
   */
static bool is_invalid_reserved_pfn(unsigned long pfn)
{
-   if (pfn_valid(pfn))
-   return PageReserved(pfn_to_page(pfn));
+   struct page *page = pfn_to_online_page(pfn);


Ugh, I just realized this is not a safe conversion until
pfn_to_online_page() is moved over to subsection granularity. As it
stands it will return true for any ZONE_DEVICE pages that share a
section with boot memory.


That should not happen right now and I commented back when you introduced 
subsection support that I don’t want to have ZONE_DEVICE mixed with online 
pages in a section. Having memory block devices that partially span ZONE_DEVICE 
would be ... really weird. With something like pfn_active() - as discussed - we 
could at least make this check work - but I am not sure if we really want to go 
down that path. In the worst case, some MB of RAM are lost ... I guess this 
needs more thought.



I just realized the "boot memory" part. Is that a real thing? IOW, can
we have ZONE_DEVICE falling into a memory block (with holes)? I somewhat
have doubts that this would work ...


One of the real world failure cases that started the subsection effort
is that Persistent Memory collides with System RAM on a 64MB boundary
on shipping platforms. System RAM ends on a 64MB boundary and due to a
lack of memory controller resources PMEM is mapped contiguously at the
end of that boundary. Some more details in the subsection cover letter
/ changelogs [1] [2]. It's not sufficient to just lose some memory,
that's the broken implementation that led to the subsection work
because the lost memory may change from one boot to the next and
software can't reliably inject a padding that conforms to the x86
128MB section constraint.


Thanks, I thought it was mostly for weird alignment where other parts of 
the section are basically "holes" and not memory.


Yes, it is a real bug that ZONE_DEVICE pages fall into sections that are 
marked SECTION_IS_ONLINE.




Suffice to say I think we need your pfn_active() to get subsection
granularity pfn_to_online_page() before PageReserved() can be removed.


I agree that we have to fix this. I don't like ZONE_DEVICE pages falling 
into memory block devices (e.g., they cannot get offlined), but I guess that
train is gone :) As long as it's not for memory hotplug, I can most 
probably live with this.


Also, I'd like to get Michal's opinion on this and the pfn_active()
approach, but I can understand he's busy.


This patch set can wait, I won't be working next week besides 
reading/writing mails either way.


Is anybody looking into the pfn_active() thingy?



[1]: 
https://lore.kernel.org/linux-mm/156092349300.979959.17603710711957735135.st...@dwillia2-desk3.amr.corp.intel.com/
[2]: 
https://lore.kernel.org/linux-mm/156092354368.979959.6232443923440952359.st...@dwillia2-desk3.amr.corp.intel.com/




--

Thanks,

David / dhildenb



Re: [PATCH v1 04/10] vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes

2019-11-07 Thread David Hildenbrand

On 07.11.19 19:22, David Hildenbrand wrote:




Am 07.11.2019 um 16:40 schrieb Dan Williams :

On Thu, Oct 24, 2019 at 5:12 AM David Hildenbrand  wrote:


Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has this weird use case that you can map anything from /dev/mem
into the guest. pfn_valid() is not a reliable check whether the memmap
was initialized and can be touched. pfn_to_online_page() makes sure
that we have an initialized memmap (and don't have ZONE_DEVICE memory).

Rewrite is_invalid_reserved_pfn() similar to kvm_is_reserved_pfn() to make
sure the function produces the same result once we stop setting ZONE_DEVICE
pages PG_reserved.

Cc: Alex Williamson 
Cc: Cornelia Huck 
Signed-off-by: David Hildenbrand 
---
drivers/vfio/vfio_iommu_type1.c | 10 --
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2ada8e6cdb88..f8ce8c408ba8 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -299,9 +299,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long 
npage, bool async)
  */
static bool is_invalid_reserved_pfn(unsigned long pfn)
{
-   if (pfn_valid(pfn))
-   return PageReserved(pfn_to_page(pfn));
+   struct page *page = pfn_to_online_page(pfn);


Ugh, I just realized this is not a safe conversion until
pfn_to_online_page() is moved over to subsection granularity. As it
stands it will return true for any ZONE_DEVICE pages that share a
section with boot memory.


That should not happen right now and I commented back when you introduced 
subsection support that I don’t want to have ZONE_DEVICE mixed with online 
pages in a section. Having memory block devices that partially span ZONE_DEVICE 
would be ... really weird. With something like pfn_active() - as discussed - we 
could at least make this check work - but I am not sure if we really want to go 
down that path. In the worst case, some MB of RAM are lost ... I guess this 
needs more thought.



I just realized the "boot memory" part. Is that a real thing? IOW, can 
we have ZONE_DEVICE falling into a memory block (with holes)? I somewhat 
have doubts that this would work ...


--

Thanks,

David / dhildenb



Re: [PATCH v1 04/10] vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes

2019-11-07 Thread David Hildenbrand


> Am 07.11.2019 um 16:40 schrieb Dan Williams :
> 
> On Thu, Oct 24, 2019 at 5:12 AM David Hildenbrand  wrote:
>> 
>> Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
>> change that.
>> 
>> KVM has this weird use case that you can map anything from /dev/mem
>> into the guest. pfn_valid() is not a reliable check whether the memmap
>> was initialized and can be touched. pfn_to_online_page() makes sure
>> that we have an initialized memmap (and don't have ZONE_DEVICE memory).
>> 
>> Rewrite is_invalid_reserved_pfn() similar to kvm_is_reserved_pfn() to make
>> sure the function produces the same result once we stop setting ZONE_DEVICE
>> pages PG_reserved.
>> 
>> Cc: Alex Williamson 
>> Cc: Cornelia Huck 
>> Signed-off-by: David Hildenbrand 
>> ---
>> drivers/vfio/vfio_iommu_type1.c | 10 --
>> 1 file changed, 8 insertions(+), 2 deletions(-)
>> 
>> diff --git a/drivers/vfio/vfio_iommu_type1.c 
>> b/drivers/vfio/vfio_iommu_type1.c
>> index 2ada8e6cdb88..f8ce8c408ba8 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -299,9 +299,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long 
>> npage, bool async)
>>  */
>> static bool is_invalid_reserved_pfn(unsigned long pfn)
>> {
>> -   if (pfn_valid(pfn))
>> -   return PageReserved(pfn_to_page(pfn));
>> +   struct page *page = pfn_to_online_page(pfn);
> 
> Ugh, I just realized this is not a safe conversion until
> pfn_to_online_page() is moved over to subsection granularity. As it
> stands it will return true for any ZONE_DEVICE pages that share a
> section with boot memory.

That should not happen right now and I commented back when you introduced 
subsection support that I don’t want to have ZONE_DEVICE mixed with online 
pages in a section. Having memory block devices that partially span ZONE_DEVICE 
would be ... really weird. With something like pfn_active() - as discussed - we 
could at least make this check work - but I am not sure if we really want to go 
down that path. In the worst case, some MB of RAM are lost ... I guess this 
needs more thought.


Re: [PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes

2019-11-05 Thread David Hildenbrand

On 06.11.19 01:08, Dan Williams wrote:

On Tue, Nov 5, 2019 at 4:03 PM Sean Christopherson
 wrote:


On Tue, Nov 05, 2019 at 03:43:29PM -0800, Dan Williams wrote:

On Tue, Nov 5, 2019 at 3:30 PM Dan Williams  wrote:


On Tue, Nov 5, 2019 at 3:13 PM Sean Christopherson
 wrote:


On Tue, Nov 05, 2019 at 03:02:40PM -0800, Dan Williams wrote:

On Tue, Nov 5, 2019 at 12:31 PM David Hildenbrand  wrote:

The scarier code (for me) is transparent_hugepage_adjust() and
kvm_mmu_zap_collapsible_spte(), as I don't at all understand the
interaction between THP and _PAGE_DEVMAP.


The x86 KVM MMU code is some of the ugliest code I know (sorry, but it
had to be said :/ ). Luckily, this should be independent of the
PG_reserved thingy AFAIKs.


Both transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte()
are honoring kvm_is_reserved_pfn(), so again I'm missing where the
page count gets mismanaged and leads to the reported hang.


When mapping pages into the guest, KVM gets the page via gup(), which
increments the page count for ZONE_DEVICE pages.  But KVM puts the page
using kvm_release_pfn_clean(), which skips put_page() if PageReserved()
and so never puts its reference to ZONE_DEVICE pages.


Oh, yeah, that's busted.


Ugh, it's extra busted because every other gup user in the kernel
tracks the pages resulting from gup and puts them (put_page()) when
they are done. KVM wants to forget about whether it did a gup to get
the page and optionally trigger put_page() based purely on the pfn.
Outside of VFIO device assignment that needs pages pinned for DMA, why
does KVM itself need to pin pages? If pages are pinned over a return
to userspace that needs to be a FOLL_LONGTERM gup.


Short answer, KVM pins the page to ensure correctness with respect to the
primary MMU invalidating the associated host virtual address, e.g. when
the page is being migrated or unmapped from host userspace.

The main use of gup() is to handle guest page faults and map pages into
the guest, i.e. into KVM's secondary MMU.  KVM uses gup() to both get the
PFN and to temporarily pin the page.  The pin is held just long enough to
guaranteed that any invalidation via the mmu_notifier will be stalled
until after KVM finishes installing the page into the secondary MMU, i.e.
the pin is short-term and not held across a return to userspace or entry
into the guest.  When a subsequent mmu_notifier invalidation occurs, KVM
pulls the PFN from the secondary MMU and uses that to update accessed
and dirty bits in the host.

There are a few other KVM flows that eventually call into gup(), but those
are "traditional" short-term pins and use put_page() directly.
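Put together, the page-fault path pattern being described is roughly the following (heavily simplified, not the literal KVM code; the names match the x86 MMU code of that time as far as I recall):

        mmu_seq = kvm->mmu_notifier_seq;
        smp_rmb();

        pfn = gfn_to_pfn(kvm, gfn);             /* gup(): short-term reference */

        spin_lock(&kvm->mmu_lock);
        if (!mmu_notifier_retry(kvm, mmu_seq)) {
                /* no invalidation raced in: install pfn into the
                 * secondary MMU (SPTE) */
        }
        spin_unlock(&kvm->mmu_lock);

        kvm_release_pfn_clean(pfn);             /* drop the short-term pin */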


Ok, I was misinterpreting the effect of the bug with what KVM is using
the reference to do.

To your other point:


But David's proposed fix for the above refcount bug is to omit the patch
so that KVM no longer treats ZONE_DEVICE pages as reserved.  That seems
like the right thing to do, including for thp_adjust(), e.g. it would
naturally let KVM use 2mb pages for the guest when a ZONE_DEVICE page is
mapped with a huge page (2mb or above) in the host.  The only hiccup is
figuring out how to correctly transfer the reference.


That might not be the only hiccup. There's currently no such thing as
huge pages for ZONE_DEVICE, there are huge *mappings* (pmd and pud),
but the result of pfn_to_page() on such a mapping does not yield a
huge 'struct page'. It seems there are other paths in KVM that assume
that more typical page machinery is active like SetPageDirty() based
on kvm_is_reserved_pfn(). While I told David that I did not want to
see more usage of is_zone_device_page(), this patch below (untested)
seems a cleaner path with less surprises:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4df0aa6b8e5c..fbea17c1810c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1831,7 +1831,8 @@ EXPORT_SYMBOL_GPL(kvm_release_page_clean);

  void kvm_release_pfn_clean(kvm_pfn_t pfn)
  {
-   if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
+   if ((!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) ||
+   (pfn_valid(pfn) && is_zone_device_page(pfn_to_page(pfn))))
 put_page(pfn_to_page(pfn));
  }
  EXPORT_SYMBOL_GPL(kvm_release_pfn_clean);


I had the same thought, but I do wonder about the kvm_get_pfn() users, 
e.g.,:


hva_to_pfn_remapped():
r = follow_pfn(vma, addr, &pfn);
...
kvm_get_pfn(pfn);
...

We would not take a reference for ZONE_DEVICE, but later drop one 
reference via kvm_release_pfn_clean(). IOW, kvm_get_pfn() gets *really* 
dangerous to use. I can't tell if this can happen right now.


We do have 3 users of kvm_get_pfn() that we have to audit before this 
change. Also, we should add a comment to kvm_get_pfn() that it should 
never be used with possible ZONE_DEVICE pages.
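A sketch of what such a comment could look like (wording hypothetical):

void kvm_get_pfn(kvm_pfn_t pfn)
{
        /*
         * Must not be used on potential ZONE_DEVICE pages: a bare get_page()
         * might not be sufficient for them (see get_dev_pagemap() in
         * mm/gup.c:gup_pte_range()), and once they are no longer PG_reserved,
         * kvm_is_reserved_pfn() will not filter them out here.
         */
        if (!kvm_is_reserved_pfn(pfn))
                get_page(pfn_to_page(pfn));
}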


Also, we should add a comment to kvm_release_pfn_clean(), describing why 
we treat ZONE_DEVICE pages specially.

Re: [PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes

2019-11-05 Thread David Hildenbrand

I think I know what's going wrong:

Pages that are pinned via gfn_to_pfn() and friends take a reference,
however are often released via
kvm_release_pfn_clean()/kvm_release_pfn_dirty()/kvm_release_page_clean()...


E.g., in arch/x86/kvm/x86.c:reexecute_instruction()

...
pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
...
kvm_release_pfn_clean(pfn);



void kvm_release_pfn_clean(kvm_pfn_t pfn)
{
if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
put_page(pfn_to_page(pfn));
}

This function makes perfect sense as the counterpart for kvm_get_pfn():

void kvm_get_pfn(kvm_pfn_t pfn)
{
if (!kvm_is_reserved_pfn(pfn))
get_page(pfn_to_page(pfn));
}


As all ZONE_DEVICE pages are currently reserved, pages pinned via
gfn_to_pfn() and friends will often not see a put_page() AFAIKS.
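Put differently, the life cycle for a ZONE_DEVICE pfn currently looks roughly like this (illustrative refcount values only):

        pfn = gfn_to_pfn(vcpu->kvm, gfn);   /* gup(): page refcount e.g. 1 -> 2 */
        /* ... pfn gets installed in the secondary MMU ... */
        kvm_release_pfn_clean(pfn);         /* PageReserved() is true for
                                             * ZONE_DEVICE, so put_page() is
                                             * skipped; the refcount never
                                             * drops back */
        /* device teardown later waits for the refcount to drop -> hang */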


Assuming gup() takes a reference for ZONE_DEVICE pages, yes, this is a
KVM bug.


Yes, it does take a reference AFAIKs. E.g.,

mm/gup.c:gup_pte_range():
...
if (pte_devmap(pte)) {
if (unlikely(flags & FOLL_LONGTERM))
goto pte_unmap;

pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
if (unlikely(!pgmap)) {
undo_dev_pagemap(nr, nr_start, pages);
goto pte_unmap;
}
} else if (pte_special(pte))
goto pte_unmap;

VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
page = pte_page(pte);

head = try_get_compound_head(page, 1);

try_get_compound_head() will increment the reference count.
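For reference, try_get_compound_head() at that point is roughly the following (quoted from memory, so treat it as approximate):

static inline struct page *try_get_compound_head(struct page *page, int refs)
{
        struct page *head = compound_head(page);

        if (WARN_ON_ONCE(page_ref_count(head) < 0))
                return NULL;
        /* takes 'refs' references on the (head) page */
        if (unlikely(!page_cache_add_speculative(head, refs)))
                return NULL;
        return head;
}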




Now, my patch does not change that, the result of
kvm_is_reserved_pfn(pfn) will be unchanged. A proper fix for that would
probably be

a) To drop the reference to ZONE_DEVICE pages in gfn_to_pfn() and
friends, after you successfully pinned the pages. (not sure if that's
the right thing to do but you're the expert)

b) To not use kvm_release_pfn_clean() and friends on pages that were
definitely pinned.


This is already KVM's intent, i.e. the purpose of the PageReserved() check
is simply to avoid putting a non-existent reference.  The problem is that
KVM assumes pages with PG_reserved set are never pinned, which AFAICT was
true when the code was first added.


(talking to myself, sorry)

Thinking again, dropping this patch from this series could effectively also
fix that issue. E.g., kvm_release_pfn_clean() and friends would always do a
put_page() if "pfn_valid() and !PageReserved()", so after patch 9 also on
ZONE_DEVICE pages.


Yeah, this appears to be the correct fix.


But it would have side effects that might not be desired. E.g.,:

1. kvm_pfn_to_page() would also return ZONE_DEVICE pages (might even be the
right thing to do).


This should be ok, at least on x86.  There are only three users of
kvm_pfn_to_page().  Two of those are on allocations that are controlled by
KVM and are guaranteed to be vanilla MAP_ANONYMOUS.  The third is on guest
memory when running a nested guest, and in that case supporting ZONE_DEVICE
memory is desirable, i.e. KVM should play nice with a guest that is backed
by ZONE_DEVICE memory.


2. kvm_set_pfn_dirty() would also set ZONE_DEVICE pages dirty (might be
okay)


This is ok from a KVM perspective.


What about

void kvm_get_pfn(kvm_pfn_t pfn)
{
if (!kvm_is_reserved_pfn(pfn))
get_page(pfn_to_page(pfn));
}

Is a pure get_page() sufficient in case of ZONE_DEVICE?
(asking because of the live references obtained via 
get_dev_pagemap(pte_pfn(pte), pgmap) in mm/gup.c:gup_pte_range() 
somewhat confuse me :) )





The scarier code (for me) is transparent_hugepage_adjust() and
kvm_mmu_zap_collapsible_spte(), as I don't at all understand the
interaction between THP and _PAGE_DEVMAP.


The x86 KVM MMU code is some of the ugliest code I know (sorry, but it
had to be said :/ ). Luckily, this should be independent of the 
PG_reserved thingy AFAIKs.


--

Thanks,

David / dhildenb



Re: [PATCH v1 02/10] KVM: x86/mmu: Prepare kvm_is_mmio_pfn() for PG_reserved changes

2019-11-05 Thread David Hildenbrand

On 05.11.19 02:37, Dan Williams wrote:

On Thu, Oct 24, 2019 at 5:10 AM David Hildenbrand  wrote:


Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has this weird use case that you can map anything from /dev/mem
into the guest. pfn_valid() is not a reliable check whether the memmap
was initialized and can be touched. pfn_to_online_page() makes sure
that we have an initialized memmap (and don't have ZONE_DEVICE memory).

Rewrite kvm_is_mmio_pfn() to make sure the function produces the
same result once we stop setting ZONE_DEVICE pages PG_reserved.

Cc: Paolo Bonzini 
Cc: "Radim Krčmář" 
Cc: Sean Christopherson 
Cc: Vitaly Kuznetsov 
Cc: Wanpeng Li 
Cc: Jim Mattson 
Cc: Joerg Roedel 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: KarimAllah Ahmed 
Cc: Michal Hocko 
Cc: Dan Williams 
Signed-off-by: David Hildenbrand 
---
  arch/x86/kvm/mmu.c | 29 +
  1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 24c23c66b226..f03089a336de 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2962,20 +2962,25 @@ static bool mmu_need_write_protect(struct kvm_vcpu 
*vcpu, gfn_t gfn,

  static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
  {
+   struct page *page = pfn_to_online_page(pfn);
+
+   /*
+    * ZONE_DEVICE pages are never online. Online pages that are reserved
+    * either indicate the zero page or MMIO pages.
+    */
+   if (page)
+   return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
+
+   /*
+    * Anything with a valid (but not online) memmap could be ZONE_DEVICE.
+    * Treat only UC/UC-/WC pages as MMIO.


You might clarify that ZONE_DEVICE pages that belong to WB cacheable
pmem have the correct memtype established by devm_memremap_pages().


/*
 * Anything with a valid (but not online) memmap could be ZONE_DEVICE.
 * Treat only UC/UC-/WC pages as MMIO. devm_memremap_pages() established
 * the correct memtype for WB cacheable ZONE_DEVICE pages.
 */

Thanks!



Other than that, feel free to add:

Reviewed-by: Dan Williams 




--

Thanks,

David / dhildenb



Re: [PATCH v1 09/10] mm/memory_hotplug: Don't mark pages PG_reserved when initializing the memmap

2019-11-05 Thread David Hildenbrand

On 04.11.19 23:44, Boris Ostrovsky wrote:

On 10/24/19 8:09 AM, David Hildenbrand wrote:

diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 4f2e78a5e4db..af69f057913a 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -374,6 +374,13 @@ static void xen_online_page(struct page *page, unsigned 
int order)
mutex_lock(&balloon_mutex);
for (i = 0; i < size; i++) {
p = pfn_to_page(start_pfn + i);
+   /*
+    * TODO: The core used to mark the pages reserved. Most probably
+    * we can stop doing that now. However, especially
+    * alloc_xenballooned_pages() left PG_reserved set
+    * on pages that can get mapped to user space.
+    */
+   __SetPageReserved(p);


I suspect this is not needed. Pages can get into balloon either from
here or from non-hotplug path (e.g. decrease_reservation()) and so when
we get a page from the balloon we would get a random page that may or
may not have Reserved bit set.


Yeah, I also think it is fine. If you agree, I'll drop this hunk and add 
details to the patch description.



--

Thanks,

David / dhildenb



Re: [PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes

2019-11-05 Thread David Hildenbrand

On 05.11.19 10:49, David Hildenbrand wrote:

On 05.11.19 10:17, David Hildenbrand wrote:

On 05.11.19 05:38, Dan Williams wrote:

On Thu, Oct 24, 2019 at 5:11 AM David Hildenbrand  wrote:


Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has this weird use case that you can map anything from /dev/mem
into the guest. pfn_valid() is not a reliable check whether the memmap
was initialized and can be touched. pfn_to_online_page() makes sure
that we have an initialized memmap (and don't have ZONE_DEVICE memory).

Rewrite kvm_is_reserved_pfn() to make sure the function produces the
same result once we stop setting ZONE_DEVICE pages PG_reserved.

Cc: Paolo Bonzini 
Cc: "Radim Krčmář" 
Cc: Michal Hocko 
Cc: Dan Williams 
Cc: KarimAllah Ahmed 
Signed-off-by: David Hildenbrand 
---
virt/kvm/kvm_main.c | 10 --
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e9eb666eb6e8..9d18cc67d124 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -151,9 +151,15 @@ __weak int kvm_arch_mmu_notifier_invalidate_range(struct 
kvm *kvm,

bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
{
-   if (pfn_valid(pfn))
-   return PageReserved(pfn_to_page(pfn));
+   struct page *page = pfn_to_online_page(pfn);

+   /*
+    * We treat any pages that are not online (not managed by the buddy)
+    * as reserved - this includes ZONE_DEVICE pages and pages without
+    * a memmap (e.g., mapped via /dev/mem).
+    */
+   if (page)
+   return PageReserved(page);
   return true;
}


So after this all the pfn_valid() usage in kvm_main.c is replaced with
pfn_to_online_page()? Looks correct to me.

However, I'm worried that kvm is taking reference on ZONE_DEVICE pages
through some other path resulting in this:

   https://lore.kernel.org/linux-nvdimm/20190919154708.ga24...@angband.pl/

I'll see if this patch set modulates or maintains that failure mode.



I'd assume that the behavior is unchanged. I think we get a reference to
these ZONE_DEVICE pages via __get_user_pages_fast() and friends in
hva_to_pfn_fast() and friends in virt/kvm/kvm_main.c



I think I know what's going wrong:

Pages that are pinned via gfn_to_pfn() and friends take a reference,
however are often released via
kvm_release_pfn_clean()/kvm_release_pfn_dirty()/kvm_release_page_clean()...


E.g., in arch/x86/kvm/x86.c:reexecute_instruction()

...
pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
...
kvm_release_pfn_clean(pfn);



void kvm_release_pfn_clean(kvm_pfn_t pfn)
{
if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
put_page(pfn_to_page(pfn));
}

This function makes perfect sense as the counterpart for kvm_get_pfn():

void kvm_get_pfn(kvm_pfn_t pfn)
{
if (!kvm_is_reserved_pfn(pfn))
get_page(pfn_to_page(pfn));
}


As all ZONE_DEVICE pages are currently reserved, pages pinned via
gfn_to_pfn() and friends will often not see a put_page() AFAIKS.

Now, my patch does not change that, the result of
kvm_is_reserved_pfn(pfn) will be unchanged. A proper fix for that would
probably be

a) To drop the reference to ZONE_DEVICE pages in gfn_to_pfn() and
friends, after you successfully pinned the pages. (not sure if that's
the right thing to do but you're the expert)

b) To not use kvm_release_pfn_clean() and friends on pages that were
definitely pinned.



(talking to myself, sorry)

Thinking again, dropping this patch from this series could effectively 
also fix that issue. E.g., kvm_release_pfn_clean() and friends would 
always do a put_page() if "pfn_valid() and !PageReserved()", so after 
patch 9 also on ZONE_DEVICE pages.


But it would have side effects that might not be desired. E.g.,:

1. kvm_pfn_to_page() would also return ZONE_DEVICE pages (might even be 
the right thing to do).


2. kvm_set_pfn_dirty() would also set ZONE_DEVICE pages dirty (might be 
okay)


--

Thanks,

David / dhildenb



Re: [PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes

2019-11-05 Thread David Hildenbrand

On 05.11.19 10:17, David Hildenbrand wrote:

On 05.11.19 05:38, Dan Williams wrote:

On Thu, Oct 24, 2019 at 5:11 AM David Hildenbrand  wrote:


Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has this weird use case that you can map anything from /dev/mem
into the guest. pfn_valid() is not a reliable check whether the memmap
was initialized and can be touched. pfn_to_online_page() makes sure
that we have an initialized memmap (and don't have ZONE_DEVICE memory).

Rewrite kvm_is_reserved_pfn() to make sure the function produces the
same result once we stop setting ZONE_DEVICE pages PG_reserved.

Cc: Paolo Bonzini 
Cc: "Radim Krčmář" 
Cc: Michal Hocko 
Cc: Dan Williams 
Cc: KarimAllah Ahmed 
Signed-off-by: David Hildenbrand 
---
   virt/kvm/kvm_main.c | 10 --
   1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e9eb666eb6e8..9d18cc67d124 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -151,9 +151,15 @@ __weak int kvm_arch_mmu_notifier_invalidate_range(struct 
kvm *kvm,

   bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
   {
-   if (pfn_valid(pfn))
-   return PageReserved(pfn_to_page(pfn));
+   struct page *page = pfn_to_online_page(pfn);

+   /*
+    * We treat any pages that are not online (not managed by the buddy)
+    * as reserved - this includes ZONE_DEVICE pages and pages without
+    * a memmap (e.g., mapped via /dev/mem).
+    */
+   if (page)
+   return PageReserved(page);
  return true;
   }


So after this all the pfn_valid() usage in kvm_main.c is replaced with
pfn_to_online_page()? Looks correct to me.

However, I'm worried that kvm is taking reference on ZONE_DEVICE pages
through some other path resulting in this:

  https://lore.kernel.org/linux-nvdimm/20190919154708.ga24...@angband.pl/

I'll see if this patch set modulates or maintains that failure mode.



I'd assume that the behavior is unchanged. I think we get a reference to
these ZONE_DEVICE pages via __get_user_pages_fast() and friends in
hva_to_pfn_fast() and friends in virt/kvm/kvm_main.c



I think I know what's going wrong:

Pages that are pinned via gfn_to_pfn() and friends take a reference,
however are often released via 
kvm_release_pfn_clean()/kvm_release_pfn_dirty()/kvm_release_page_clean()...



E.g., in arch/x86/kvm/x86.c:reexecute_instruction()

...
pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
...
kvm_release_pfn_clean(pfn);



void kvm_release_pfn_clean(kvm_pfn_t pfn)
{
if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
put_page(pfn_to_page(pfn));
}

This function makes perfect sense as the counterpart for kvm_get_pfn():

void kvm_get_pfn(kvm_pfn_t pfn)
{
if (!kvm_is_reserved_pfn(pfn))
get_page(pfn_to_page(pfn));
}


As all ZONE_DEVICE pages are currently reserved, pages pinned via 
gfn_to_pfn() and friends will often not see a put_page() AFAIKS.


Now, my patch does not change that, the result of 
kvm_is_reserved_pfn(pfn) will be unchanged. A proper fix for that would 
probably be


a) To drop the reference to ZONE_DEVICE pages in gfn_to_pfn() and 
friends, after you successfully pinned the pages. (not sure if that's 
the right thing to do but you're the expert)


b) To not use kvm_release_pfn_clean() and friends on pages that were 
definitely pinned.


--

Thanks,

David / dhildenb



Re: [PATCH v1 01/10] mm/memory_hotplug: Don't allow to online/offline memory blocks with holes

2019-11-05 Thread David Hildenbrand

On 05.11.19 02:30, Dan Williams wrote:

On Thu, Oct 24, 2019 at 5:10 AM David Hildenbrand  wrote:


Our onlining/offlining code is unnecessarily complicated. Only memory
blocks added during boot can have holes (a range that is not
IORESOURCE_SYSTEM_RAM). Hotplugged memory never has holes (e.g., see
add_memory_resource()). All boot memory is alread online.


s/alread/already/



Thanks.


...also perhaps clarify "already online" by what point in time and why
that is relevant. For example a description of the difference between
the SetPageReserved() in the bootmem path and the one in the hotplug
path.


Will add.




Therefore, when we stop allowing to offline memory blocks with holes, we
implicitly no longer have to deal with onlining memory blocks with holes.


Maybe an explicit reference of the code areas that deal with holes
would help to back up that assertion. Certainly it would have saved me
some time for the review.


I can add a reference to the onlining code that will only online pages 
that don't fall into memory holes.





This allows to simplify the code. For example, we no longer have to
worry about marking pages that fall into memory holes PG_reserved when
onlining memory. We can stop setting pages PG_reserved.


...but not for bootmem, right?


Yes, bootmem is not changed. (especially, early allocations and memory 
holes are marked PG_reserved - basically everything not given to the 
buddy after boot)






Offlining memory blocks added during boot is usually not guranteed to work


s/guranteed/guaranteed/


Thanks.




either way (unmovable data might have easily ended up on that memory during
boot). So stopping to do that should not really hurt (+ people are not
even aware of a setup where that used to work


Maybe put a "Link: https://lkml.kernel.org/r/$msg_id" to that discussion?


and that the existing code
still works correctly with memory holes). For the use case of offlining
memory to unplug DIMMs, we should see no change. (holes on DIMMs would be
weird).


However, less memory can be offlined than was theoretically allowed
previously, so I don't understand the "we should see no change"
comment. I still agree that's a price worth paying to get the code
cleanups and if someone screams we can look at adding it back, but the
fact that it was already fragile seems decent enough protection.


Hotplugged DIMMs (add_memory()) have no holes and will therefore see no 
change.




Please note that hardware errors (PG_hwpoison) are not memory holes and
not affected by this change when offlining.

Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Oscar Salvador 
Cc: Pavel Tatashin 
Cc: Dan Williams 
Cc: Anshuman Khandual 
Signed-off-by: David Hildenbrand 
---
  mm/memory_hotplug.c | 26 --
  1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 561371ead39a..8d81730cf036 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1447,10 +1447,19 @@ static void node_states_clear_node(int node, struct 
memory_notify *arg)
 node_clear_state(node, N_MEMORY);
  }

+static int count_system_ram_pages_cb(unsigned long start_pfn,
+unsigned long nr_pages, void *data)
+{
+   unsigned long *nr_system_ram_pages = data;
+
+   *nr_system_ram_pages += nr_pages;
+   return 0;
+}
+
  static int __ref __offline_pages(unsigned long start_pfn,
   unsigned long end_pfn)
  {
-   unsigned long pfn, nr_pages;
+   unsigned long pfn, nr_pages = 0;
 unsigned long offlined_pages = 0;
 int ret, node, nr_isolate_pageblock;
 unsigned long flags;
@@ -1461,6 +1470,20 @@ static int __ref __offline_pages(unsigned long start_pfn,

 mem_hotplug_begin();

+   /*
+* Don't allow to offline memory blocks that contain holes.
+* Consecuently, memory blocks with holes can never get onlined


s/Consecuently/Consequently/


Thanks.




+* (hotplugged memory has no holes and all boot memory is online).
+* This allows to simplify the onlining/offlining code quite a lot.
+*/


The last sentence of this comment makes sense in the context of this
patch, but I don't think it stands by itself in the code base after
the fact. The person reading the comment can't see the simplifications
because the code is already gone. I'd clarify it to talk about why it
is safe to not mess around with PG_Reserved in the hotplug path
because of this check.


I'll think of something. It's not just the PG_reserved handling but the 
whole pfn_valid()/walk_system_ram_range() handling that can be simplified.
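For reference, the check that the quoted comment sits above (the hunk is cut off there) is presumably along these lines; count_system_ram_pages_cb() is the helper added earlier in the patch, the rest is a sketch:

        /* count all IORESOURCE_SYSTEM_RAM pages in the range ... */
        walk_system_ram_range(start_pfn, end_pfn - start_pfn, &nr_pages,
                              count_system_ram_pages_cb);
        /* ... and refuse to offline if part of the range is a hole */
        if (nr_pages != end_pfn - start_pfn) {
                ret = -EINVAL;
                reason = "memory holes";
                goto failed_removal;
        }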




After those clarifications you can add:

Reviewed-by: Dan Williams 



Thanks!

--

Thanks,

David / dhildenb



Re: [PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes

2019-11-05 Thread David Hildenbrand

On 05.11.19 05:38, Dan Williams wrote:

On Thu, Oct 24, 2019 at 5:11 AM David Hildenbrand  wrote:


Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has this weird use case that you can map anything from /dev/mem
into the guest. pfn_valid() is not a reliable check whether the memmap
was initialized and can be touched. pfn_to_online_page() makes sure
that we have an initialized memmap (and don't have ZONE_DEVICE memory).

Rewrite kvm_is_reserved_pfn() to make sure the function produces the
same result once we stop setting ZONE_DEVICE pages PG_reserved.

Cc: Paolo Bonzini 
Cc: "Radim Krčmář" 
Cc: Michal Hocko 
Cc: Dan Williams 
Cc: KarimAllah Ahmed 
Signed-off-by: David Hildenbrand 
---
  virt/kvm/kvm_main.c | 10 --
  1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e9eb666eb6e8..9d18cc67d124 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -151,9 +151,15 @@ __weak int kvm_arch_mmu_notifier_invalidate_range(struct 
kvm *kvm,

  bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
  {
-   if (pfn_valid(pfn))
-   return PageReserved(pfn_to_page(pfn));
+   struct page *page = pfn_to_online_page(pfn);

+   /*
+    * We treat any pages that are not online (not managed by the buddy)
+    * as reserved - this includes ZONE_DEVICE pages and pages without
+    * a memmap (e.g., mapped via /dev/mem).
+    */
+   if (page)
+   return PageReserved(page);
 return true;
  }


So after this all the pfn_valid() usage in kvm_main.c is replaced with
pfn_to_online_page()? Looks correct to me.

However, I'm worried that kvm is taking reference on ZONE_DEVICE pages
through some other path resulting in this:

 https://lore.kernel.org/linux-nvdimm/20190919154708.ga24...@angband.pl/

I'll see if this patch set modulates or maintains that failure mode.



I'd assume that the behavior is unchanged. I think we get a reference to
these ZONE_DEVICE pages via __get_user_pages_fast() and friends in 
hva_to_pfn_fast() and friends in virt/kvm/kvm_main.c


--

Thanks,

David / dhildenb



Re: [PATCH v1 00/10] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)

2019-11-01 Thread David Hildenbrand

On 24.10.19 14:09, David Hildenbrand wrote:

This is the result of a recent discussion with Michal ([1], [2]). Right
now we set all pages PG_reserved when initializing hotplugged memmaps. This
includes ZONE_DEVICE memory. In case of system memory, PG_reserved is
cleared again when onlining the memory, in case of ZONE_DEVICE memory
never.

In ancient times, we needed PG_reserved, because there was no way to tell
whether the memmap was already properly initialized. We now have
SECTION_IS_ONLINE for that in the case of !ZONE_DEVICE memory. ZONE_DEVICE
memory is already initialized deferred, and there shouldn't be a visible
change in that regard.

One of the biggest fears were side effects. I went ahead and audited all
users of PageReserved(). The details can be found in "mm/memory_hotplug:
Don't mark pages PG_reserved when initializing the memmap".

This patch set adapts all relevant users of PageReserved() to keep the
existing behavior in respect to ZONE_DEVICE pages. The biggest part part
that needs changes is KVM, to keep the existing behavior (that's all I
care about in this series).

Note that this series is able to rely completely on pfn_to_online_page().
No new is_zone_device_page() calls are introduced (as requested by Dan).
We are currently discussing a way to mark also ZONE_DEVICE memmaps as
active/initialized - pfn_active() - and lightweight locking to make sure
memmaps remain active (e.g., using RCU). We might later be able to convert
some users of pfn_to_online_page() to pfn_active(). Details can be found
in [3], however, this represents yet another cleanup/fix we'll perform
on top of this cleanup.

I only gave it a quick test with DIMMs on x86-64, but didn't test the
ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Also, I didn't
test the KVM parts (especially with ZONE_DEVICE pages or no memmap at all).
Compile-tested on x86-64 and PPC.
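(On the QEMU question in the quoted cover letter: one way to get ZONE_DEVICE memory into a guest is an emulated NVDIMM; the paths and sizes below are placeholders, roughly:

qemu-system-x86_64 -machine pc,nvdimm=on \
    -m 4G,slots=4,maxmem=16G \
    -object memory-backend-file,id=mem1,share=on,mem-path=/tmp/nvdimm.img,size=4G \
    -device nvdimm,id=nvdimm1,memdev=mem1 \
    ...

# inside the guest, e.g.:
ndctl create-namespace --mode=fsdax    # memmap/ZONE_DEVICE via devm_memremap_pages()
)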



Jeff Moyer ran some NVDIMM test cases for me (thanks!!!), including 
xfstests, pmdk, and ndctl. No regressions found.


I will run some KVM tests, especially NVDIMM passthrough, but will have 
to setup a test environment first.


I would appreciate some review in the meantime. :)

--

Thanks,

David / dhildenb



Re: [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)

2019-10-24 Thread David Hildenbrand

On 23.10.19 09:26, David Hildenbrand wrote:

On 22.10.19 23:54, Dan Williams wrote:

Hi David,

Thanks for tackling this!


Thanks for having a look :)

[...]



I am probably a little bit too careful (but I don't want to break things).
In most places (besides KVM and vfio that are nuts), the
pfn_to_online_page() check could most probably be avoided by a
is_zone_device_page() check. However, I usually get suspicious when I see
a pfn_valid() check (especially after I learned that people mmap parts of
/dev/mem into user space, including memory without memmaps. Also, people
could memmap offline memory blocks this way :/). As long as this does not
hurt performance, I think we should rather do it the clean way.


I'm concerned about using is_zone_device_page() in places that are not
known to already have a reference to the page. Here's an audit of
current usages, and the ones I think need to cleaned up. The "unsafe"
ones do not appear to have any protections against the device page
being removed (get_dev_pagemap()). Yes, some of these were added by
me. The "unsafe? HMM" ones need HMM eyes because HMM leaks device
pages into anonymous memory paths and I'm not up to speed on how it
guarantees 'struct page' validity vs device shutdown without using
get_dev_pagemap().

smaps_pmd_entry(): unsafe

put_devmap_managed_page(): safe, page reference is held

is_device_private_page(): safe? gpu driver manages private page lifetime

is_pci_p2pdma_page(): safe, page reference is held

uncharge_page(): unsafe? HMM

add_to_kill(): safe, protected by get_dev_pagemap() and dax_lock_page()

soft_offline_page(): unsafe

remove_migration_pte(): unsafe? HMM

move_to_new_page(): unsafe? HMM

migrate_vma_pages() and helpers: unsafe? HMM

try_to_unmap_one(): unsafe? HMM

__put_page(): safe

release_pages(): safe

I'm hoping all the HMM ones can be converted to
is_device_private_page() directlly and have that routine grow a nice
comment about how it knows it can always safely de-reference its @page
argument.

For the rest I'd like to propose that we add a facility to determine
ZONE_DEVICE by pfn rather than page. The most straightforward why I
can think of would be to just add another bitmap to mem_section_usage
to indicate if a subsection is ZONE_DEVICE or not.


(it's a somewhat unrelated bigger discussion, but we can start discussing it in 
this thread)

I dislike this for three reasons

a) It does not protect against any races, really, it does not improve things.
b) We do have the exact same problem with pfn_to_online_page(). As long as we
don't hold the memory hotplug lock, memory can get offlined and removed at any
time. Racy.
c) We mix in ZONE specific stuff into the core. It should be "just another zone"

What I propose instead (already discussed in 
https://lkml.org/lkml/2019/10/10/87)

1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE
2. Convert SECTION_IS_ACTIVE to a subsection bitmap
3. Introduce pfn_active() that checks against the subsection bitmap
4. Once the memmap was initialized / prepared, set the subsection active
(similar to SECTION_IS_ONLINE in the buddy right now)
5. Before the memmap gets invalidated, set the subsection inactive
(similar to SECTION_IS_ONLINE in the buddy right now)
6. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE
7. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE
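For illustration, step 3 could boil down to something like the following; the function name, the per-subsection bitmap and the index helper are hypothetical, mirroring how online_section_nr() and the existing subsection map work today:

static inline bool pfn_active(unsigned long pfn)
{
        unsigned long nr = pfn_to_section_nr(pfn);
        struct mem_section *ms;

        if (nr >= NR_MEM_SECTIONS)
                return false;
        ms = __nr_to_section(nr);
        if (!ms || !ms->usage)
                return false;
        /* "subsection_active_map" would be the new bitmap */
        return test_bit(subsection_map_index(pfn),
                        ms->usage->subsection_active_map);
}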



Dan, I am suspecting that you want a pfn_to_zone() that will not touch 
the memmap, because it could potentially (altmap) lie on slow memory, right?


A modification might make this possible (but I am not yet sure if we 
want a less generic MM implementation just to fine tune slow memmap 
access here)


1. Keep SECTION_IS_ONLINE as it is with the same semantics
2. Introduce a subsection bitmap to record active ("initialized memmap")
   PFNs. E.g., also set it when setting sections online.
3. Introduce pfn_active() that checks against the subsection bitmap
4. Once the memmap was initialized / prepared, set the subsection active
   (similar to SECTION_IS_ONLINE in the buddy right now)
5. Before the memmap gets invalidated, set the subsection inactive
   (similar to SECTION_IS_ONLINE in the buddy right now)
6. pfn_to_online_page() = pfn_active() && section == SECTION_IS_ONLINE
   (or keep it as is, depends on the RCU locking we eventually
   implement)
7. pfn_to_device_page() = pfn_active() && section != SECTION_IS_ONLINE
8. use pfn_active() whenever we don't care about the zone.

Again, not really a friend of that, it hardcodes ZONE_DEVICE vs. 
!ZONE_DEVICE. When we do a random "pfn_to_page()" (e.g., a pfn walker) 
we really want to touch the memmap right away either way. So we can also 
directly read the zone from it. I really do prefer right now a more 
generic implementation.


--

Thanks,

David / dhildenb



[PATCH v1 09/10] mm/memory_hotplug: Don't mark pages PG_reserved when initializing the memmap

2019-10-24 Thread David Hildenbrand
Everything should be prepared to stop setting pages PG_reserved when
initializing the memmap on memory hotplug. Most importantly, we
stop marking ZONE_DEVICE pages PG_reserved.

a) We made sure that any code that relied on PG_reserved to detect
   ZONE_DEVICE memory will no longer rely on PG_reserved (especially,
   by relying on pfn_to_online_page() for now). Details can be found
   below.
b) We made sure that memory blocks with holes cannot be offlined and
   therefore also not onlined. We have quite some code that relies on
   memory holes being marked PG_reserved. This is now not an issue
   anymore.

generic_online_page() still calls __free_pages_core(), which performs
__ClearPageReserved(p). AFAIKS, this should not hurt.

It is worth noting that the users of online_page_callback_t might see a
change. E.g., until now, pages not freed to the buddy by the HyperV
balloon were set PG_reserved until freed via generic_online_page(). Now,
they would look like ordinarily allocated pages (refcount == 1). This
callback is used by the XEN balloon and the HyperV balloon. To not
introduce any silent errors, keep marking the pages PG_reserved. We can
most probably stop doing that, but have to double check if there are
issues (e.g., offlining code aborts right away in has_unmovable_pages()
when it runs into a PageReserved(page))

Update the documentation at various places in the MM core.

There are three PageReserved() users that might be affected by this change.
 - drivers/staging/gasket/gasket_page_table.c:gasket_release_page()
   -> We might (unlikely) set SetPageDirty() on a ZONE_DEVICE page
   -> I assume "we don't care"
 - drivers/staging/kpc2000/kpc_dma/fileops.c:transfer_complete_cb()
   -> We might (unlikely) set SetPageDirty() on a ZONE_DEVICE page
   -> I assume "we don't care"
 - mm/usercopy.c: check_page_span()
   -> According to Dan, non-HMM ZONE_DEVICE usage excluded this code since
  commit 52f476a323f9 ("libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY
  overhead")
   -> It is unclear whether we really cared about ZONE_DEVICE here (HMM) or
  simply about "PG_reserved". The worst thing that could happen is a
  false negative with CONFIG_HARDENED_USERCOPY we should be able to
  identify easily.
   -> There is a discussion to rip out that code completely
   -> I assume "not relevant" / "we don't care"

I audited the other PageReserved() users. They don't affect ZONE_DEVICE:
 - mm/page_owner.c:pagetypeinfo_showmixedcount_print()
   -> Never called for ZONE_DEVICE, (+ pfn_to_online_page(pfn))
 - mm/page_owner.c:init_pages_in_zone()
   -> Never called for ZONE_DEVICE (!populated_zone(zone))
 - mm/page_ext.c:free_page_ext()
   -> Only a BUG_ON(PageReserved(page)), not relevant
 - mm/page_ext.c:has_unmovable_pages()
   -> Not relevant for ZONE_DEVICE
 - mm/page_ext.c:pfn_range_valid_contig()
   -> pfn_to_online_page() already guards us
 - mm/mempolicy.c:queue_pages_pte_range()
   -> vm_normal_page() checks against pte_devmap()
 - mm/memory-failure.c:hwpoison_user_mappings()
   -> Not reached via memory_failure() due to pfn_to_online_page()
   -> Also not reached indirectly via memory_failure_hugetlb()
 - mm/hugetlb.c:gather_bootmem_prealloc()
   -> Only a WARN_ON(PageReserved(page)), not relevant
 - kernel/power/snapshot.c:saveable_highmem_page()
   -> pfn_to_online_page() already guards us
 - kernel/power/snapshot.c:saveable_page()
   -> pfn_to_online_page() already guards us
 - fs/proc/task_mmu.c:can_gather_numa_stats()
   -> vm_normal_page() checks against pte_devmap()
 - fs/proc/task_mmu.c:can_gather_numa_stats_pmd
   -> vm_normal_page_pmd() checks against pte_devmap()
 - fs/proc/page.c:stable_page_flags()
   -> The reserved bit is simply copied, irrelevant
 - drivers/firmware/memmap.c:release_firmware_map_entry()
   -> really only a check to detect bootmem. Not relevant for ZONE_DEVICE
 - arch/ia64/kernel/mca_drv.c
 - arch/mips/mm/init.c
 - arch/mips/mm/ioremap.c
 - arch/nios2/mm/ioremap.c
 - arch/parisc/mm/ioremap.c
 - arch/sparc/mm/tlb.c
 - arch/xtensa/mm/cache.c
   -> No ZONE_DEVICE support
 - arch/powerpc/mm/init_64.c:vmemmap_free()
   -> Special-cases memmap on altmap
   -> Only a check for bootmem
 - arch/x86/kernel/alternative.c:__text_poke()
   -> Only a WARN_ON(!PageReserved(pages[0])) to verify it is bootmem
 - arch/x86/mm/init_64.c
   -> Only a check for bootmem

Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Sasha Levin 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: Andrew Morton 
Cc: Alexander Duyck 
Cc: Pavel Tatashin 
Cc: Vlastimil Babka 
Cc: Johannes Weiner 
Cc: Anthony Yznaga 
Cc: Michal Hocko 
Cc: Oscar Salvador 
Cc: Dan Williams 
Cc: Mel Gorman 
Cc: Mike Rapoport 
Cc: Anshuman Khandual 
Cc: Matt Sickler 
Cc: Kees Cook 
Suggested-by: Michal Hocko 
Signed-off

[PATCH v1 08/10] x86/mm: Prepare __ioremap_check_ram() for PG_reserved changes

2019-10-24 Thread David Hildenbrand
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

Rewrite __ioremap_check_ram() to make sure the function produces the
same result once we stop setting ZONE_DEVICE pages PG_reserved.

Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Signed-off-by: David Hildenbrand 
---
 arch/x86/mm/ioremap.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index a39dcdb5ae34..db6913b48edf 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -77,10 +77,17 @@ static unsigned int __ioremap_check_ram(struct resource 
*res)
start_pfn = (res->start + PAGE_SIZE - 1) >> PAGE_SHIFT;
stop_pfn = (res->end + 1) >> PAGE_SHIFT;
if (stop_pfn > start_pfn) {
-   for (i = 0; i < (stop_pfn - start_pfn); ++i)
-   if (pfn_valid(start_pfn + i) &&
-   !PageReserved(pfn_to_page(start_pfn + i)))
+   for (i = 0; i < (stop_pfn - start_pfn); ++i) {
+   struct page *page;
+   /*
+    * We treat any pages that are not online (not managed
+    * by the buddy) as not being RAM. This includes
+    * ZONE_DEVICE pages.
+    */
+   page = pfn_to_online_page(start_pfn + i);
+   if (page && !PageReserved(page))
return IORES_MAP_SYSTEM_RAM;
+   }
}
 
return 0;
-- 
2.21.0



[PATCH v1 07/10] powerpc/mm: Prepare maybe_pte_to_page() for PG_reserved changes

2019-10-24 Thread David Hildenbrand
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

Rewrite maybe_pte_to_page() to make sure the function produces the
same result once we stop setting ZONE_DEVICE pages PG_reserved.

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Christophe Leroy 
Cc: "Aneesh Kumar K.V" 
Cc: Allison Randal 
Cc: Nicholas Piggin 
Cc: Thomas Gleixner 
Signed-off-by: David Hildenbrand 
---
 arch/powerpc/mm/pgtable.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index e3759b69f81b..613c98fa7dc0 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -55,10 +55,12 @@ static struct page *maybe_pte_to_page(pte_t pte)
unsigned long pfn = pte_pfn(pte);
struct page *page;
 
-   if (unlikely(!pfn_valid(pfn)))
-   return NULL;
-   page = pfn_to_page(pfn);
-   if (PageReserved(page))
+   /*
+* We reject any pages that are not online (not managed by the buddy).
+* This includes ZONE_DEVICE pages.
+*/
+   page = pfn_to_online_page(pfn);
+   if (unlikely(!page || PageReserved(page)))
return NULL;
return page;
 }
-- 
2.21.0



[PATCH v1 10/10] mm/usercopy.c: Update comment in check_page_span() regarding ZONE_DEVICE

2019-10-24 Thread David Hildenbrand
ZONE_DEVICE (a.k.a. device memory) is no longer marked PG_reserved. Update
the comment.

While at it, make it match what the code is actually doing (reject vs.
accept).

Cc: Kees Cook 
Cc: Andrew Morton 
Cc: "Isaac J. Manjarres" 
Cc: "Matthew Wilcox (Oracle)" 
Cc: Qian Cai 
Cc: Thomas Gleixner 
Signed-off-by: David Hildenbrand 
---
 mm/usercopy.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/usercopy.c b/mm/usercopy.c
index 660717a1ea5c..80f254024c97 100644
--- a/mm/usercopy.c
+++ b/mm/usercopy.c
@@ -199,9 +199,9 @@ static inline void check_page_span(const void *ptr, 
unsigned long n,
return;
 
/*
-* Reject if range is entirely either Reserved (i.e. special or
-* device memory), or CMA. Otherwise, reject since the object spans
-* several independently allocated pages.
+* Accept if the range is entirely either Reserved ("special") or
+* CMA. Otherwise, reject since the object spans several independently
+* allocated pages.
 */
is_reserved = PageReserved(page);
is_cma = is_migrate_cma_page(page);
-- 
2.21.0



[PATCH v1 06/10] powerpc/64s: Prepare hash_page_do_lazy_icache() for PG_reserved changes

2019-10-24 Thread David Hildenbrand
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

Rewrite hash_page_do_lazy_icache() to make sure the function produces the
same result once we stop setting ZONE_DEVICE pages PG_reserved.

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: "Aneesh Kumar K.V" 
Cc: Christophe Leroy 
Cc: Nicholas Piggin 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: YueHaibing 
Signed-off-by: David Hildenbrand 
---
 arch/powerpc/mm/book3s64/hash_utils.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index 6c123760164e..a1566039e747 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -1084,13 +1084,15 @@ void hash__early_init_mmu_secondary(void)
  */
 unsigned int hash_page_do_lazy_icache(unsigned int pp, pte_t pte, int trap)
 {
-   struct page *page;
+   struct page *page = pfn_to_online_page(pte_pfn(pte));
 
-   if (!pfn_valid(pte_pfn(pte)))
+   /*
+* We ignore any pages that are not online (not managed by the buddy).
+* This includes ZONE_DEVICE pages.
+*/
+   if (!page)
return pp;
 
-   page = pte_page(pte);
-
/* page is dirty */
if (!test_bit(PG_arch_1, &page->flags) && !PageReserved(page)) {
if (trap == 0x400) {
-- 
2.21.0



[PATCH v1 04/10] vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes

2019-10-24 Thread David Hildenbrand
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has this weird use case that you can map anything from /dev/mem
into the guest. pfn_valid() is not a reliable check whether the memmap
was initialized and can be touched. pfn_to_online_page() makes sure
that we have an initialized memmap (and don't have ZONE_DEVICE memory).

Rewrite is_invalid_reserved_pfn() similar to kvm_is_reserved_pfn() to make
sure the function produces the same result once we stop setting ZONE_DEVICE
pages PG_reserved.

Cc: Alex Williamson 
Cc: Cornelia Huck 
Signed-off-by: David Hildenbrand 
---
 drivers/vfio/vfio_iommu_type1.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2ada8e6cdb88..f8ce8c408ba8 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -299,9 +299,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long 
npage, bool async)
  */
 static bool is_invalid_reserved_pfn(unsigned long pfn)
 {
-   if (pfn_valid(pfn))
-   return PageReserved(pfn_to_page(pfn));
+   struct page *page = pfn_to_online_page(pfn);
 
+   /*
+* We treat any pages that are not online (not managed by the buddy)
+* as reserved - this includes ZONE_DEVICE pages and pages without
+* a memmap (e.g., mapped via /dev/mem).
+*/
+   if (page)
+   return PageReserved(page);
return true;
 }
 
-- 
2.21.0



[PATCH v1 05/10] powerpc/book3s: Prepare kvmppc_book3s_instantiate_page() for PG_reserved changes

2019-10-24 Thread David Hildenbrand
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has this weird use case that you can map anything from /dev/mem
into the guest. pfn_valid() is not a reliable check whether the memmap
was initialized and can be touched. pfn_to_online_page() makes sure
that we have an initialized memmap (and don't have ZONE_DEVICE memory).

Rewrite kvmppc_book3s_instantiate_page() similar to kvm_is_reserved_pfn()
to make sure the function produces the same result once we stop setting
ZONE_DEVICE pages PG_reserved.

Cc: Paul Mackerras 
Cc: Benjamin Herrenschmidt 
Cc: Michael Ellerman 
Signed-off-by: David Hildenbrand 
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 2d415c36a61d..05397c0561fc 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -801,12 +801,14 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
   writing, upgrade_p);
if (is_error_noslot_pfn(pfn))
return -EFAULT;
-   page = NULL;
-   if (pfn_valid(pfn)) {
-   page = pfn_to_page(pfn);
-   if (PageReserved(page))
-   page = NULL;
-   }
+   /*
+* We treat any pages that are not online (not managed by the
+* buddy) as reserved - this includes ZONE_DEVICE pages and
+* pages without a memmap (e.g., mapped via /dev/mem).
+*/
+   page = pfn_to_online_page(pfn);
+   if (page && PageReserved(page))
+   page = NULL;
}
 
/*
-- 
2.21.0



[PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes

2019-10-24 Thread David Hildenbrand
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has this weird use case that you can map anything from /dev/mem
into the guest. pfn_valid() is not a reliable check whether the memmap
was initialized and can be touched. pfn_to_online_page() makes sure
that we have an initialized memmap (and don't have ZONE_DEVICE memory).

Rewrite kvm_is_reserved_pfn() to make sure the function produces the
same result once we stop setting ZONE_DEVICE pages PG_reserved.

Cc: Paolo Bonzini 
Cc: "Radim Krčmář" 
Cc: Michal Hocko 
Cc: Dan Williams 
Cc: KarimAllah Ahmed 
Signed-off-by: David Hildenbrand 
---
 virt/kvm/kvm_main.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e9eb666eb6e8..9d18cc67d124 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -151,9 +151,15 @@ __weak int kvm_arch_mmu_notifier_invalidate_range(struct 
kvm *kvm,
 
 bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
 {
-   if (pfn_valid(pfn))
-   return PageReserved(pfn_to_page(pfn));
+   struct page *page = pfn_to_online_page(pfn);
 
+   /*
+* We treat any pages that are not online (not managed by the buddy)
+* as reserved - this includes ZONE_DEVICE pages and pages without
+* a memmap (e.g., mapped via /dev/mem).
+*/
+   if (page)
+   return PageReserved(page);
return true;
 }
 
-- 
2.21.0



[PATCH v1 00/10] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)

2019-10-24 Thread David Hildenbrand
This is the result of a recent discussion with Michal ([1], [2]). Right
now we set all pages PG_reserved when initializing hotplugged memmaps. This
includes ZONE_DEVICE memory. In case of system memory, PG_reserved is
cleared again when onlining the memory, in case of ZONE_DEVICE memory
never.

In ancient times, we needed PG_reserved, because there was no way to tell
whether the memmap was already properly initialized. We now have
SECTION_IS_ONLINE for that in the case of !ZONE_DEVICE memory. ZONE_DEVICE
memory is already initialized in a deferred fashion, and there shouldn't be a visible
change in that regard.
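
For reference, pfn_to_online_page() is roughly the following (simplified,
written as a function here; the real thing is a macro in
include/linux/memory_hotplug.h) - the online_section_nr() check is what
makes it reliable for !ZONE_DEVICE memory:

/* roughly, simplified */
static inline struct page *pfn_to_online_page(unsigned long pfn)
{
	unsigned long nr = pfn_to_section_nr(pfn);

	if (nr < NR_MEM_SECTIONS && online_section_nr(nr) &&
	    pfn_valid_within(pfn))
		return pfn_to_page(pfn);
	return NULL;
}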

One of the biggest fears was side effects. I went ahead and audited all
users of PageReserved(). The details can be found in "mm/memory_hotplug:
Don't mark pages PG_reserved when initializing the memmap".

This patch set adapts all relevant users of PageReserved() to keep the
existing behavior with respect to ZONE_DEVICE pages. The biggest part
that needs changes is KVM, to keep the existing behavior (that's all I
care about in this series).

Note that this series is able to rely completely on pfn_to_online_page().
No new is_zone_device_page() calls are introduced (as requested by Dan).
We are currently discussing a way to also mark ZONE_DEVICE memmaps as
active/initialized - pfn_active() - and lightweight locking to make sure
memmaps remain active (e.g., using RCU). We might later be able to convert
some users of pfn_to_online_page() to pfn_active(). Details can be found
in [3], however, this represents yet another cleanup/fix we'll perform
on top of this cleanup.

I only gave it a quick test with DIMMs on x86-64, but didn't test the
ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Also, I didn't
test the KVM parts (especially with ZONE_DEVICE pages or no memmap at all).
Compile-tested on x86-64 and PPC.

Based on next/master. The current version (kept updated) can be found at:
https://github.com/davidhildenbrand/linux.git online_reserved_cleanup

RFC -> v1:
- Dropped "staging/gasket: Prepare gasket_release_page() for PG_reserved
  changes"
- Dropped "staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved
  changes"
- Converted "mm/usercopy.c: Prepare check_page_span() for PG_reserved
  changes" to "mm/usercopy.c: Update comment in check_page_span()
  regarding ZONE_DEVICE"
- No new users of is_zone_device_page() are introduced.
- Rephrased comments and patch descriptions.

[1] https://lkml.org/lkml/2019/10/21/736
[2] https://lkml.org/lkml/2019/10/21/1034
[3] https://www.spinics.net/lists/linux-mm/msg194112.html

Cc: Michal Hocko 
Cc: Dan Williams 
Cc: kvm-...@vger.kernel.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: k...@vger.kernel.org
Cc: linux-hyp...@vger.kernel.org
Cc: de...@driverdev.osuosl.org
Cc: xen-de...@lists.xenproject.org
Cc: x...@kernel.org
Cc: Alexander Duyck 

David Hildenbrand (10):
  mm/memory_hotplug: Don't allow to online/offline memory blocks with
holes
  KVM: x86/mmu: Prepare kvm_is_mmio_pfn() for PG_reserved changes
  KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes
  vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes
  powerpc/book3s: Prepare kvmppc_book3s_instantiate_page() for
PG_reserved changes
  powerpc/64s: Prepare hash_page_do_lazy_icache() for PG_reserved
changes
  powerpc/mm: Prepare maybe_pte_to_page() for PG_reserved changes
  x86/mm: Prepare __ioremap_check_ram() for PG_reserved changes
  mm/memory_hotplug: Don't mark pages PG_reserved when initializing the
memmap
  mm/usercopy.c: Update comment in check_page_span() regarding
ZONE_DEVICE

 arch/powerpc/kvm/book3s_64_mmu_radix.c | 14 +
 arch/powerpc/mm/book3s64/hash_utils.c  | 10 +++---
 arch/powerpc/mm/pgtable.c  | 10 +++---
 arch/x86/kvm/mmu.c | 29 ++---
 arch/x86/mm/ioremap.c  | 13 ++--
 drivers/hv/hv_balloon.c|  6 
 drivers/vfio/vfio_iommu_type1.c| 10 --
 drivers/xen/balloon.c  |  7 +
 include/linux/page-flags.h |  8 +
 mm/memory_hotplug.c| 43 +++---
 mm/page_alloc.c| 11 ---
 mm/usercopy.c  |  6 ++--
 virt/kvm/kvm_main.c| 10 --
 13 files changed, 111 insertions(+), 66 deletions(-)

-- 
2.21.0



[PATCH v1 02/10] KVM: x86/mmu: Prepare kvm_is_mmio_pfn() for PG_reserved changes

2019-10-24 Thread David Hildenbrand
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has this weird use case that you can map anything from /dev/mem
into the guest. pfn_valid() is not a reliable check whether the memmap
was initialized and can be touched. pfn_to_online_page() makes sure
that we have an initialized memmap (and don't have ZONE_DEVICE memory).

Rewrite kvm_is_mmio_pfn() to make sure the function produces the
same result once we stop setting ZONE_DEVICE pages PG_reserved.

Cc: Paolo Bonzini 
Cc: "Radim Krčmář" 
Cc: Sean Christopherson 
Cc: Vitaly Kuznetsov 
Cc: Wanpeng Li 
Cc: Jim Mattson 
Cc: Joerg Roedel 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: KarimAllah Ahmed 
Cc: Michal Hocko 
Cc: Dan Williams 
Signed-off-by: David Hildenbrand 
---
 arch/x86/kvm/mmu.c | 29 +
 1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 24c23c66b226..f03089a336de 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2962,20 +2962,25 @@ static bool mmu_need_write_protect(struct kvm_vcpu 
*vcpu, gfn_t gfn,
 
 static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
 {
+   struct page *page = pfn_to_online_page(pfn);
+
+   /*
+* ZONE_DEVICE pages are never online. Online pages that are reserved
+* either indicate the zero page or MMIO pages.
+*/
+   if (page)
+   return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
+
+   /*
+* Anything with a valid (but not online) memmap could be ZONE_DEVICE.
+* Treat only UC/UC-/WC pages as MMIO.
+*/
if (pfn_valid(pfn))
-   return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn)) &&
-   /*
-* Some reserved pages, such as those from NVDIMM
-* DAX devices, are not for MMIO, and can be mapped
-* with cached memory type for better performance.
-* However, the above check misconceives those pages
-* as MMIO, and results in KVM mapping them with UC
-* memory type, which would hurt the performance.
-* Therefore, we check the host memory type in addition
-* and only treat UC/UC-/WC pages as MMIO.
-*/
-   (!pat_enabled() || pat_pfn_immune_to_uc_mtrr(pfn));
+   return !pat_enabled() || pat_pfn_immune_to_uc_mtrr(pfn);
 
+   /*
+* Any RAM that has no memmap (e.g., mapped via /dev/mem) is not MMIO.
+*/
return !e820__mapped_raw_any(pfn_to_hpa(pfn),
 pfn_to_hpa(pfn + 1) - 1,
 E820_TYPE_RAM);
-- 
2.21.0



[PATCH v1 01/10] mm/memory_hotplug: Don't allow to online/offline memory blocks with holes

2019-10-24 Thread David Hildenbrand
Our onlining/offlining code is unnecessarily complicated. Only memory
blocks added during boot can have holes (a range that is not
IORESOURCE_SYSTEM_RAM). Hotplugged memory never has holes (e.g., see
add_memory_resource()). All boot memory is already online.

Therefore, when we stop allowing to offline memory blocks with holes, we
implicitly no longer have to deal with onlining memory blocks with holes.

This allows to simplify the code. For example, we no longer have to
worry about marking pages that fall into memory holes PG_reserved when
onlining memory. We can stop setting pages PG_reserved.

Offlining memory blocks added during boot is usually not guaranteed to work
either way (unmovable data might have easily ended up on that memory during
boot). So stopping to do that should not really hurt (+ people are not
even aware of a setup where that used to work, and where the existing code
still deals correctly with memory holes). For the use case of offlining
memory to unplug DIMMs, we should see no change (holes on DIMMs would be
weird).

Please note that hardware errors (PG_hwpoison) are not memory holes and
not affected by this change when offlining.

Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Oscar Salvador 
Cc: Pavel Tatashin 
Cc: Dan Williams 
Cc: Anshuman Khandual 
Signed-off-by: David Hildenbrand 
---
 mm/memory_hotplug.c | 26 --
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 561371ead39a..8d81730cf036 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1447,10 +1447,19 @@ static void node_states_clear_node(int node, struct 
memory_notify *arg)
node_clear_state(node, N_MEMORY);
 }
 
+static int count_system_ram_pages_cb(unsigned long start_pfn,
+unsigned long nr_pages, void *data)
+{
+   unsigned long *nr_system_ram_pages = data;
+
+   *nr_system_ram_pages += nr_pages;
+   return 0;
+}
+
 static int __ref __offline_pages(unsigned long start_pfn,
  unsigned long end_pfn)
 {
-   unsigned long pfn, nr_pages;
+   unsigned long pfn, nr_pages = 0;
unsigned long offlined_pages = 0;
int ret, node, nr_isolate_pageblock;
unsigned long flags;
@@ -1461,6 +1470,20 @@ static int __ref __offline_pages(unsigned long start_pfn,
 
mem_hotplug_begin();
 
+   /*
+* Don't allow to offline memory blocks that contain holes.
+* Consequently, memory blocks with holes can never get onlined
+* (hotplugged memory has no holes and all boot memory is online).
+* This allows to simplify the onlining/offlining code quite a lot.
+*/
+   walk_system_ram_range(start_pfn, end_pfn - start_pfn, &nr_pages,
+ count_system_ram_pages_cb);
+   if (nr_pages != end_pfn - start_pfn) {
+   ret = -EINVAL;
+   reason = "memory holes";
+   goto failed_removal;
+   }
+
/* This makes hotplug much easier...and readable.
   we assume this for now. .*/
if (!test_pages_in_a_zone(start_pfn, end_pfn, &valid_start,
@@ -1472,7 +1495,6 @@ static int __ref __offline_pages(unsigned long start_pfn,
 
zone = page_zone(pfn_to_page(valid_start));
node = zone_to_nid(zone);
-   nr_pages = end_pfn - start_pfn;
 
/* set above range as isolated */
ret = start_isolate_page_range(start_pfn, end_pfn,
-- 
2.21.0



Re: [PATCH RFC v1 01/12] mm/memory_hotplug: Don't allow to online/offline memory blocks with holes

2019-10-24 Thread David Hildenbrand

On 24.10.19 05:53, Anshuman Khandual wrote:


On 10/22/2019 10:42 PM, David Hildenbrand wrote:

Our onlining/offlining code is unnecessarily complicated. Only memory
blocks added during boot can have holes. Hotplugged memory never has
holes. That memory is already online.


Why can hotplugged memory at runtime not have holes (e.g., a semi-bad DIMM)?


Important: HWPoison != memory hole

A memory hole is memory that is not "IORESOURCE_SYSRAM". These pages are 
currently marked PG_reserved. Such holes are sometimes used for mapping 
something into kernel space. Some archs use PG_reserved to detect such a 
memory hole ("not RAM") and ignore the memmap.


Poisoned pages are marked PG_hwpoison.


Currently, do we just abort adding that memory block if there are holes ?


There is no interface to do that.

E.g., have a look at add_memory() / add_memory_resource(). You can only 
pass one memory resource (that is, all IORESOURCE_SYSRAM | IORESOURCE_BUSY).


Hotplugging memory with holes is not supported (nor can I imagine a use 
case for that).
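
To make that concrete (minimal sketch; my_driver_add_ram() is made up,
add_memory() is the real interface): a caller hands exactly one
contiguous System RAM range to the core, there is simply no way to
describe holes:

/* sketch: hotplugging one contiguous range - holes are not expressible */
static int my_driver_add_ram(int nid, u64 start, u64 size)
{
	/* start/size must cover the whole, hole-free range */
	return add_memory(nid, start, size);
}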




When we stop allowing to offline memory blocks with holes, we implicitly
stop to online memory blocks with holes.


Reducing hotplug support for memory blocks with holes just to simplify
the code. Is it worth ?


Michal and I are not aware of any users, not even aware of a use case. 
Keeping code around that nobody really needs and that limits cleanups - no 
thanks. Similar to us not supporting offlining of memory blocks that span 
multiple nodes/zones.


E.g., have a look at the isolation code. It is full of code that jumps 
over memory holes (start_isolate_page_range() -> __first_valid_page()). 
That made sense for our complicated memory offlining code, but it is 
actually harmful when dealing with alloc_contig_range(). Allocation 
never wants to jump over memory holes. After this patch, we can just 
fail hard on any memory hole we detect, instead of ignoring it (or 
special-casing it).
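
(For reference, that hole-skipping looks roughly like this today - 
simplified from mm/page_isolation.c:)

/* roughly, simplified: silently skip over holes / offline sections */
static struct page *__first_valid_page(unsigned long pfn,
				       unsigned long nr_pages)
{
	int i;

	for (i = 0; i < nr_pages; i++) {
		struct page *page;

		page = pfn_to_online_page(pfn + i);
		if (!page)
			continue;
		return page;
	}
	return NULL;
}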






This allows to simplify the code. For example, we no longer have to
worry about marking pages that fall into memory holes PG_reserved when
onlining memory. We can stop setting pages PG_reserved.


Could there not be any other way of tracking these holes, if not the page
reserved bit - in the memory section itself, with the corresponding struct pages
just remaining poisoned? Just wondering, might be all wrong here.


Of course there could be ways (e.g., using PG_offline eventually), but 
it boils down to us having to deal with it in onlining/offlining code. 
And that is some handling nobody really seems to need.






Offlining memory blocks added during boot is usually not guranteed to work
either way. So stopping to do that (if anybody really used and tested


That guarantee does not exist right now because of how boot memory could have
been used after boot, not from a limitation of the memory hot remove itself.


Yep. However, Michal and I are not even aware of a setup that would make 
this work and guarantee that the existing code actually still is able to 
deal with holes. Are you?





this over the years) should not really hurt. For the use case of
offlining memory to unplug DIMMs, we should see no change. (holes on
DIMMs would be weird)


Holes on DIMM could be due to HW errors affecting only parts of it. By not


Again, HW errors != holes. We have PG_hwpoison for that.


allowing such DIMMs' hot add and remove, we are definitely reducing the
scope of overall hotplug functionality. Is code simplification in itself
worth this reduction in functionality?


What you describe is not affected.

Thanks!

--

Thanks,

David / dhildenb



Re: [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)

2019-10-23 Thread David Hildenbrand
On 23.10.19 21:39, Dan Williams wrote:
> On Wed, Oct 23, 2019 at 10:28 AM David Hildenbrand  wrote:
>>
>>>> I dislike this for three reasons
>>>>
>>>> a) It does not protect against any races, really, it does not improve 
>>>> things.
>>>> b) We do have the exact same problem with pfn_to_online_page(). As long as 
>>>> we
>>>>don't hold the memory hotplug lock, memory can get offlined and remove 
>>>> any time. Racy.
>>>
>>> True, we need to solve that problem too. That seems to want something
>>> lighter weight than the hotplug lock that can be held over pfn lookups
>>> +  use rather than requiring a page lookup in paths where it's not
>>> clear that a page reference would prevent unplug.
>>>
>>>> c) We mix in ZONE specific stuff into the core. It should be "just another 
>>>> zone"
>>>
>>> Not sure I grok this when the RFC is sprinkling zone-specific
>>> is_zone_device_page() throughout the core?
>>
>> Most users should not care about the zone. pfn_active() would be enough
>> in most situations, especially most PFN walkers - "this memmap is valid
>> and e.g., contains a valid zone ...".
> 
> Oh, I see, you're saying convert most users to pfn_active() (and some
> TBD rcu locking), but only pfn_to_online_page() users would need the
> zone lookup? I can get on board with that.

I guess my answer to that is simple: If we only care about "is this
memmap safe to touch", use pfn_active()

(well, with pfn_valid_within() similarly to what is done in pfn_to_online_page()
due to memory holes, but these are details - e.g., pfn_active() can
check against pfn_valid_within() right away internally). (+locking TBD
to make sure it remains active)

However, if we want to special case in addition on zones (!ZONE_DEVICE
(a.k.a., onlined via memory blocks, managed by the buddy), ZONE_DEVICE,
whatever might come in the future, ...), also access the zone stored in
the memmap. E.g., by using pfn_to_online_page().
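
IOW, something like this (sketch only - pfn_active() does not exist yet,
the rest does):

/* sketch: pfn_to_online_page() expressed via pfn_active() + the zone */
static inline struct page *pfn_to_online_page(unsigned long pfn)
{
	struct page *page;

	if (!pfn_active(pfn))
		return NULL;
	page = pfn_to_page(pfn);
	return page_zonenum(page) == ZONE_DEVICE ? NULL : page;
}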

> 
>>
>>>
>>>>
>>>> What I propose instead (already discussed in 
>>>> https://lkml.org/lkml/2019/10/10/87)
>>>
>>> Sorry I missed this earlier...
>>>
>>>>
>>>> 1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE
>>>> 2. Convert SECTION_IS_ACTIVE to a subsection bitmap
>>>> 3. Introduce pfn_active() that checks against the subsection bitmap
>>>> 4. Once the memmap was initialized / prepared, set the subsection active
>>>>(similar to SECTION_IS_ONLINE in the buddy right now)
>>>> 5. Before the memmap gets invalidated, set the subsection inactive
>>>>(similar to SECTION_IS_ONLINE in the buddy right now)
>>>> 5. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE
>>>> 6. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE
>>>
>>> This does not seem to reduce any complexity because it still requires
>>> a pfn to zone lookup at the end of the process.
>>>
>>> I.e. converting pfn_to_online_page() to use a new pfn_active()
>>> subsection map plus looking up the zone from pfn_to_page() is more
>>> steps than just doing a direct pfn to zone lookup. What am I missing?
>>
>> That a real "pfn to zone" lookup without going via the struct page will
>> require to have more than just a single bitmap. IMHO, keeping the
>> information at a single place (memmap) is the clean thing to do (not
>> replicating it somewhere else). Going via the memmap might not be as
>> fast as a direct lookup, but do we actually care? We are already looking
>> at "random PFNs we are not sure if there is a valid memmap".
> 
> True, we only care about the validity of the check, and as you pointed
> out moving the check to the pfn level does not solve the validity
> race. It needs a lock.

Let's call pfn_active() "a pfn that is active in the system and has an
initialized memmap, which contains sane values" (valid memmap sounds
like pfn_valid(), which is actually "there is a memmap which might
contain garbage"). Yes we need some sort of lightweight locking as
discussed.

[...]

>>>> However, I think we also have to clarify if we need the change in 3 at all.
>>>> It comes from
>>>>
>>>> commit f5509cc18daa7f82bcc553be70df2117c8eedc16
>>>> Author: Kees Cook 
>>>> Date:   Tue Jun 7 11:05:33 2016 -0700
>>>>
>>>> mm: Hardened usercopy
>>>>
>>>> This is the start of porting P

Re: [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)

2019-10-23 Thread David Hildenbrand
>> I dislike this for three reasons
>>
>> a) It does not protect against any races, really, it does not improve things.
>> b) We do have the exact same problem with pfn_to_online_page(). As long as we
>>don't hold the memory hotplug lock, memory can get offlined and remove 
>> any time. Racy.
> 
> True, we need to solve that problem too. That seems to want something
> lighter weight than the hotplug lock that can be held over pfn lookups
> +  use rather than requiring a page lookup in paths where it's not
> clear that a page reference would prevent unplug.
> 
>> c) We mix in ZONE specific stuff into the core. It should be "just another 
>> zone"
> 
> Not sure I grok this when the RFC is sprinkling zone-specific
> is_zone_device_page() throughout the core?

Most users should not care about the zone. pfn_active() would be enough
in most situations, especially most PFN walkers - "this memmap is valid
and e.g., contains a valid zone ...".

> 
>>
>> What I propose instead (already discussed in 
>> https://lkml.org/lkml/2019/10/10/87)
> 
> Sorry I missed this earlier...
> 
>>
>> 1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE
>> 2. Convert SECTION_IS_ACTIVE to a subsection bitmap
>> 3. Introduce pfn_active() that checks against the subsection bitmap
>> 4. Once the memmap was initialized / prepared, set the subsection active
>>(similar to SECTION_IS_ONLINE in the buddy right now)
>> 5. Before the memmap gets invalidated, set the subsection inactive
>>(similar to SECTION_IS_ONLINE in the buddy right now)
>> 5. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE
>> 6. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE
> 
> This does not seem to reduce any complexity because it still requires
> a pfn to zone lookup at the end of the process.
> 
> I.e. converting pfn_to_online_page() to use a new pfn_active()
> subsection map plus looking up the zone from pfn_to_page() is more
> steps than just doing a direct pfn to zone lookup. What am I missing?

That a real "pfn to zone" lookup without going via the struct page will
require to have more than just a single bitmap. IMHO, keeping the
information at a single place (memmap) is the clean thing to do (not
replicating it somewhere else). Going via the memmap might not be as
fast as a direct lookup, but do we actually care? We are already looking
at "random PFNs we are not sure if there is a valid memmap".

>>
>> Especially, driver-reserved device memory will not get set active in
>> the subsection bitmap.
>>
>> Now to the race. Taking the memory hotplug lock at random places is ugly. I 
>> do
>> wonder if we can use RCU:
> 
> Ah, yes, exactly what I was thinking above.
> 
>>
>> The user of pfn_active()/pfn_to_online_page()/pfn_to_device_page():
>>
>> /* the memmap is guaranteed to remain active under RCU */
>> rcu_read_lock();
>> if (pfn_active(random_pfn)) {
>> page = pfn_to_page(random_pfn);
>> ... use the page, stays valid
>> }
>> rcu_read_unlock();
>>
>> Memory offlining/memremap code:
>>
>> set_subsections_inactive(pfn, nr_pages); /* clears the bit 
>> atomically */
>> synchronize_rcu();
>> /* all users saw the bitmap update, we can invalidate the memmap */
>> remove_pfn_range_from_zone(zone, pfn, nr_pages);
> 
> Looks good to me.
> 
>>
>>>

 I only gave it a quick test with DIMMs on x86-64, but didn't test the
 ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Compile-tested
 on x86-64 and PPC.
>>>
>>> I'll give it a spin, but I don't think the kernel wants to grow more
>>> is_zone_device_page() users.
>>
>> Let's recap. In this RFC, I introduce a total of 4 (!) users only.
>> The other parts can rely on pfn_to_online_page() only.
>>
>> 1. "staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes"
>> - Basically never used with ZONE_DEVICE.
>> - We hold a reference!
>> - All it protects is a SetPageDirty(page);
>>
>> 2. "staging/gasket: Prepare gasket_release_page() for PG_reserved changes"
>> - Same as 1.
>>
>> 3. "mm/usercopy.c: Prepare check_page_span() for PG_reserved changes"
>> - We come via virt_to_head_page() / virt_to_head_page(), not sure about
>>   references (I assume this should be fine as we don't come via random
>>   PFNs)
>> - We check that we don't mix Reserved (including device memory) and CMA
>>   pages when crossing compound pages.
>>
>> I think we can drop 1. and 2., resulting in a total of 2 new users in
>> the same context. I think that is totally tolerable to finally clean
>> this up.
> 
> ...but more is_zone_device_page() doesn't "finally clean this up".
> Like we discussed above it's the missing locking that's the real
> cleanup, the pfn_to_online_page() internals are secondary.

It's a different cleanup IMHO. We can't do everything in one shot. But
maybe I can drop the is_zone_device_page() parts from this patch and
completely rely on pfn_to_online_page(). Yes, that 

Re: [PATCH RFC v1 02/12] mm/usercopy.c: Prepare check_page_span() for PG_reserved changes

2019-10-23 Thread David Hildenbrand
On 23.10.19 18:25, Kees Cook wrote:
> On Wed, Oct 23, 2019 at 10:20:14AM +0200, David Hildenbrand wrote:
>> On 22.10.19 19:12, David Hildenbrand wrote:
>>> Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
>>> change that.
>>>
>>> Let's make sure that the logic in the function won't change. Once we no
>>> longer set these pages to reserved, we can rework this function to
>>> perform separate checks for ZONE_DEVICE (split from PG_reserved checks).
>>>
>>> Cc: Kees Cook 
>>> Cc: Andrew Morton 
>>> Cc: Kate Stewart 
>>> Cc: Allison Randal 
>>> Cc: "Isaac J. Manjarres" 
>>> Cc: Qian Cai 
>>> Cc: Thomas Gleixner 
>>> Signed-off-by: David Hildenbrand 
>>> ---
>>>   mm/usercopy.c | 5 +++--
>>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/usercopy.c b/mm/usercopy.c
>>> index 660717a1ea5c..a3ac4be35cde 100644
>>> --- a/mm/usercopy.c
>>> +++ b/mm/usercopy.c
>>> @@ -203,14 +203,15 @@ static inline void check_page_span(const void *ptr, 
>>> unsigned long n,
>>>  * device memory), or CMA. Otherwise, reject since the object spans
>>>  * several independently allocated pages.
>>>  */
>>> -   is_reserved = PageReserved(page);
>>> +   is_reserved = PageReserved(page) || is_zone_device_page(page);
>>> is_cma = is_migrate_cma_page(page);
>>> if (!is_reserved && !is_cma)
>>> usercopy_abort("spans multiple pages", NULL, to_user, 0, n);
>>> for (ptr += PAGE_SIZE; ptr <= end; ptr += PAGE_SIZE) {
>>> page = virt_to_head_page(ptr);
>>> -   if (is_reserved && !PageReserved(page))
>>> +   if (is_reserved && !(PageReserved(page) ||
>>> +is_zone_device_page(page)))
>>> usercopy_abort("spans Reserved and non-Reserved pages",
>>>NULL, to_user, 0, n);
>>> if (is_cma && !is_migrate_cma_page(page))
>>>
>>
>> @Kees, would it be okay to stop checking against ZONE_DEVICE pages here or
>> is there a good rationale behind this?
>>
>> (I would turn this patch into a simple update of the comment if we agree
>> that we don't care)
> 
> There has been work to actually remove the page span checks entirely,
> but there wasn't consensus on what the right way forward was. I continue
> to lean toward just dropping it entirely, but Matthew Wilcox has some
> alternative ideas that could use some further thought/testing.

Thanks for your reply!

So, the worst thing that could happen right now, when dropping this
patch, is that we would reject some ranges when hardening is on,
correct? (sounds like that can easily be found by testing if it is
actually relevant)

Do you remember if there were real ZONE_DEVICE usecases that required
this filter to be in place for PG_reserved pages?

-- 

Thanks,

David / dhildenb



Re: [PATCH RFC v1 02/12] mm/usercopy.c: Prepare check_page_span() for PG_reserved changes

2019-10-23 Thread David Hildenbrand

On 22.10.19 19:12, David Hildenbrand wrote:

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

Let's make sure that the logic in the function won't change. Once we no
longer set these pages to reserved, we can rework this function to
perform separate checks for ZONE_DEVICE (split from PG_reserved checks).

Cc: Kees Cook 
Cc: Andrew Morton 
Cc: Kate Stewart 
Cc: Allison Randal 
Cc: "Isaac J. Manjarres" 
Cc: Qian Cai 
Cc: Thomas Gleixner 
Signed-off-by: David Hildenbrand 
---
  mm/usercopy.c | 5 +++--
  1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/usercopy.c b/mm/usercopy.c
index 660717a1ea5c..a3ac4be35cde 100644
--- a/mm/usercopy.c
+++ b/mm/usercopy.c
@@ -203,14 +203,15 @@ static inline void check_page_span(const void *ptr, 
unsigned long n,
 * device memory), or CMA. Otherwise, reject since the object spans
 * several independently allocated pages.
 */
-   is_reserved = PageReserved(page);
+   is_reserved = PageReserved(page) || is_zone_device_page(page);
is_cma = is_migrate_cma_page(page);
if (!is_reserved && !is_cma)
usercopy_abort("spans multiple pages", NULL, to_user, 0, n);
  
  	for (ptr += PAGE_SIZE; ptr <= end; ptr += PAGE_SIZE) {

page = virt_to_head_page(ptr);
-   if (is_reserved && !PageReserved(page))
+   if (is_reserved && !(PageReserved(page) ||
+is_zone_device_page(page)))
usercopy_abort("spans Reserved and non-Reserved pages",
   NULL, to_user, 0, n);
if (is_cma && !is_migrate_cma_page(page))



@Kees, would it be okay to stop checking against ZONE_DEVICE pages here 
or is there a good rationale behind this?


(I would turn this patch into a simple update of the comment if we agree 
that we don't care)


--

Thanks,

David / dhildenb



Re: [PATCH RFC v1 06/12] staging/gasket: Prepare gasket_release_page() for PG_reserved changes

2019-10-23 Thread David Hildenbrand

On 22.10.19 19:12, David Hildenbrand wrote:

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

The pages are obtained via get_user_pages_fast(). I assume these
could be ZONE_DEVICE pages. Let's just exclude them as well explicitly.

Cc: Rob Springer 
Cc: Todd Poynor 
Cc: Ben Chan 
Cc: Greg Kroah-Hartman 
Signed-off-by: David Hildenbrand 
---
  drivers/staging/gasket/gasket_page_table.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/staging/gasket/gasket_page_table.c 
b/drivers/staging/gasket/gasket_page_table.c
index f6d715787da8..d43fed58bf65 100644
--- a/drivers/staging/gasket/gasket_page_table.c
+++ b/drivers/staging/gasket/gasket_page_table.c
@@ -447,7 +447,7 @@ static bool gasket_release_page(struct page *page)
if (!page)
return false;
  
-	if (!PageReserved(page))

+   if (!PageReserved(page) && !is_zone_device_page(page))
SetPageDirty(page);
put_page(page);
  




@Dan, is SetPageDirty() on ZONE_DEVICE pages bad or do we simply not 
care? I think that ending up with ZONE_DEVICE pages here is very 
unlikely. I'd like to drop this (and the next) patch and document why it 
is okay to do so.


--

Thanks,

David / dhildenb



Re: [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)

2019-10-23 Thread David Hildenbrand
On 22.10.19 23:54, Dan Williams wrote:
> Hi David,
> 
> Thanks for tackling this!

Thanks for having a look :)

[...]


>> I am probably a little bit too careful (but I don't want to break things).
>> In most places (besides KVM and vfio that are nuts), the
>> pfn_to_online_page() check could most probably be avoided by a
>> is_zone_device_page() check. However, I usually get suspicious when I see
>> a pfn_valid() check (especially after I learned that people mmap parts of
>> /dev/mem into user space, including memory without memmaps. Also, people
>> could memmap offline memory blocks this way :/). As long as this does not
>> hurt performance, I think we should rather do it the clean way.
> 
> I'm concerned about using is_zone_device_page() in places that are not
> known to already have a reference to the page. Here's an audit of
> current usages, and the ones I think need to cleaned up. The "unsafe"
> ones do not appear to have any protections against the device page
> being removed (get_dev_pagemap()). Yes, some of these were added by
> me. The "unsafe? HMM" ones need HMM eyes because HMM leaks device
> pages into anonymous memory paths and I'm not up to speed on how it
> guarantees 'struct page' validity vs device shutdown without using
> get_dev_pagemap().
> 
> smaps_pmd_entry(): unsafe
> 
> put_devmap_managed_page(): safe, page reference is held
> 
> is_device_private_page(): safe? gpu driver manages private page lifetime
> 
> is_pci_p2pdma_page(): safe, page reference is held
> 
> uncharge_page(): unsafe? HMM
> 
> add_to_kill(): safe, protected by get_dev_pagemap() and dax_lock_page()
> 
> soft_offline_page(): unsafe
> 
> remove_migration_pte(): unsafe? HMM
> 
> move_to_new_page(): unsafe? HMM
> 
> migrate_vma_pages() and helpers: unsafe? HMM
> 
> try_to_unmap_one(): unsafe? HMM
> 
> __put_page(): safe
> 
> release_pages(): safe
> 
> I'm hoping all the HMM ones can be converted to
> is_device_private_page() directlly and have that routine grow a nice
> comment about how it knows it can always safely de-reference its @page
> argument.
> 
> For the rest I'd like to propose that we add a facility to determine
> ZONE_DEVICE by pfn rather than page. The most straightforward why I
> can think of would be to just add another bitmap to mem_section_usage
> to indicate if a subsection is ZONE_DEVICE or not.

(it's a somewhat unrelated bigger discussion, but we can start discussing it in 
this thread)

I dislike this for three reasons

a) It does not protect against any races, really, it does not improve things.
b) We do have the exact same problem with pfn_to_online_page(). As long as we
   don't hold the memory hotplug lock, memory can get offlined and remove any 
time. Racy.
c) We mix in ZONE specific stuff into the core. It should be "just another zone"

What I propose instead (already discussed in 
https://lkml.org/lkml/2019/10/10/87)

1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE
2. Convert SECTION_IS_ACTIVE to a subsection bitmap
3. Introduce pfn_active() that checks against the subsection bitmap
4. Once the memmap was initialized / prepared, set the subsection active
   (similar to SECTION_IS_ONLINE in the buddy right now)
5. Before the memmap gets invalidated, set the subsection inactive
   (similar to SECTION_IS_ONLINE in the buddy right now)
5. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE
6. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE

Especially, driver-reserved device memory will not get set active in
the subsection bitmap.

Now to the race. Taking the memory hotplug lock at random places is ugly. I do
wonder if we can use RCU:

The user of pfn_active()/pfn_to_online_page()/pfn_to_device_page():

/* the memmap is guaranteed to remain active under RCU */
rcu_read_lock();
if (pfn_active(random_pfn)) {
page = pfn_to_page(random_pfn);
... use the page, stays valid
}
rcu_read_unlock();

Memory offlining/memremap code:

set_subsections_inactive(pfn, nr_pages); /* clears the bit atomically */
synchronize_rcu();
/* all users saw the bitmap update, we can invalidate the memmap */
remove_pfn_range_from_zone(zone, pfn, nr_pages);

> 
>>
>> I only gave it a quick test with DIMMs on x86-64, but didn't test the
>> ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Compile-tested
>> on x86-64 and PPC.
> 
> I'll give it a spin, but I don't think the kernel wants to grow more
> is_zone_device_page() users.

Let's recap. In this RFC, I introduce a total of 4 (!) users only.
The other parts can rely on pfn_to_online_page() only.

1. "staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes"
- Basically never used with ZONE_DEVICE.
- We hold a reference!
- All it protects is a SetPageDirty(page);

2. "staging/gasket: Prepare gasket_release_page() for PG_reserved changes"
- Same as 1.

3. "mm/usercopy.c: Prepare check_page_span() for PG_reserved changes"
- We 

Re: [PATCH RFC v1 07/12] staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes

2019-10-22 Thread David Hildenbrand

On 22.10.19 19:55, Matt Sickler wrote:

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change that.

The pages are obtained via get_user_pages_fast(). I assume, these could be 
ZONE_DEVICE pages. Let's just exclude them as well explicitly.


I'm not sure what ZONE_DEVICE pages are, but these pages are normal system RAM, 
typically HugePages (but not always).


ZONE_DEVICE pages, a.k.a. devmem, bypass the pagecache completely (e.g., 
DAX) and will therefore never get swapped. These pages are 
not managed by any page allocator (especially not the buddy); they are 
rather "directly mapped device memory".


E.g., a NVDIMM. It is mapped into the physical address space similar to 
ordinary RAM (a DIMM). Any write to such a PFN will directly end up on 
the target device. In contrast to a DIMM, the memory is persistent 
across reboots.


Now, if you mmap such an NVDIMM into a user space process, you will end 
up with ZONE_DEVICE pages as part of the user space mapping (VMA). 
get_user_pages_fast() on this memory will result in "struct pages" that 
belong to ZONE_DEVICE. This is where this patch comes into play.
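
Something like this (sketch; assumes user_addr is backed by such a DAX 
mapping, error handling omitted):

/* sketch: pinning one page of a DAX mapping */
struct page *page;
int ret;

ret = get_user_pages_fast(user_addr, 1, FOLL_WRITE, &page);
if (ret == 1) {
	/* for a DAX/NVDIMM mapping, this is a ZONE_DEVICE page */
	if (is_zone_device_page(page))
		pr_info("pinned a ZONE_DEVICE page\n");
	put_page(page);
}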


This patch makes sure that there is absolutely no change once we stop 
setting these ZONE_DEVICE pages PG_reserved. E.g., AFAIK, setting a 
ZONE_DEVICE page dirty does not make too much sense (never swapped).


Yes, it might not be a likely setup; however, it is possible. In this 
series I collect all places that *could* be affected. Whether that change is 
really needed has to be decided. I can see that the two staging drivers 
I have patches for might be able to just live with the change - but then 
we have talked about it and are aware of the change.


Thanks!

--

Thanks,

David / dhildenb



[PATCH RFC v1 10/12] powerpc/mm: Prepare maybe_pte_to_page() for PG_reserved changes

2019-10-22 Thread David Hildenbrand
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

We could explicitly check for is_zone_device_page(page). But looking at
the pfn_valid() check, it seems safer to just use pfn_to_online_page()
here, which will skip all ZONE_DEVICE pages right away.

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Christophe Leroy 
Cc: "Aneesh Kumar K.V" 
Cc: Allison Randal 
Cc: Nicholas Piggin 
Cc: Thomas Gleixner 
Signed-off-by: David Hildenbrand 
---
 arch/powerpc/mm/pgtable.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index e3759b69f81b..613c98fa7dc0 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -55,10 +55,12 @@ static struct page *maybe_pte_to_page(pte_t pte)
unsigned long pfn = pte_pfn(pte);
struct page *page;
 
-   if (unlikely(!pfn_valid(pfn)))
-   return NULL;
-   page = pfn_to_page(pfn);
-   if (PageReserved(page))
+   /*
+* We reject any pages that are not online (not managed by the buddy).
+* This includes ZONE_DEVICE pages.
+*/
+   page = pfn_to_online_page(pfn);
+   if (unlikely(!page || PageReserved(page)))
return NULL;
return page;
 }
-- 
2.21.0



[PATCH RFC v1 09/12] powerpc/64s: Prepare hash_page_do_lazy_icache() for PG_reserved changes

2019-10-22 Thread David Hildenbrand
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

We could explicitly check for is_zone_device_page(page). But looking at
the pfn_valid() check, it seems safer to just use pfn_to_online_page()
here, which will skip all ZONE_DEVICE pages right away.

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: "Aneesh Kumar K.V" 
Cc: Christophe Leroy 
Cc: Nicholas Piggin 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: YueHaibing 
Signed-off-by: David Hildenbrand 
---
 arch/powerpc/mm/book3s64/hash_utils.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index 6c123760164e..a1566039e747 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -1084,13 +1084,15 @@ void hash__early_init_mmu_secondary(void)
  */
 unsigned int hash_page_do_lazy_icache(unsigned int pp, pte_t pte, int trap)
 {
-   struct page *page;
+   struct page *page = pfn_to_online_page(pte_pfn(pte));
 
-   if (!pfn_valid(pte_pfn(pte)))
+   /*
+* We ignore any pages that are not online (not managed by the buddy).
+* This includes ZONE_DEVICE pages.
+*/
+   if (!page)
return pp;
 
-   page = pte_page(pte);
-
/* page is dirty */
if (!test_bit(PG_arch_1, &page->flags) && !PageReserved(page)) {
if (trap == 0x400) {
-- 
2.21.0



[PATCH RFC v1 11/12] x86/mm: Prepare __ioremap_check_ram() for PG_reserved changes

2019-10-22 Thread David Hildenbrand
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

We could explicitly check for is_zone_device_page(page). But looking at
the pfn_valid() check, it seems safer to just use pfn_to_online_page()
here, which will skip all ZONE_DEVICE pages right away.

Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Signed-off-by: David Hildenbrand 
---
 arch/x86/mm/ioremap.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index a39dcdb5ae34..db6913b48edf 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -77,10 +77,17 @@ static unsigned int __ioremap_check_ram(struct resource 
*res)
start_pfn = (res->start + PAGE_SIZE - 1) >> PAGE_SHIFT;
stop_pfn = (res->end + 1) >> PAGE_SHIFT;
if (stop_pfn > start_pfn) {
-   for (i = 0; i < (stop_pfn - start_pfn); ++i)
-   if (pfn_valid(start_pfn + i) &&
-   !PageReserved(pfn_to_page(start_pfn + i)))
+   for (i = 0; i < (stop_pfn - start_pfn); ++i) {
+   struct page *page;
+   /*
+    * We treat any pages that are not online (not managed
+    * by the buddy) as not being RAM. This includes
+    * ZONE_DEVICE pages.
+    */
+   page = pfn_to_online_page(start_pfn + i);
+   if (page && !PageReserved(page))
return IORES_MAP_SYSTEM_RAM;
+   }
}
 
return 0;
-- 
2.21.0



[PATCH RFC v1 01/12] mm/memory_hotplug: Don't allow to online/offline memory blocks with holes

2019-10-22 Thread David Hildenbrand
Our onlining/offlining code is unnecessarily complicated. Only memory
blocks added during boot can have holes. Hotplugged memory never has
holes. That memory is already online.

When we stop allowing to offline memory blocks with holes, we implicitly
stop to online memory blocks with holes.

This allows to simplify the code. For example, we no longer have to
worry about marking pages that fall into memory holes PG_reserved when
onlining memory. We can stop setting pages PG_reserved.

Offlining memory blocks added during boot is usually not guaranteed to work
either way. So stopping to do that (if anybody really used and tested
this over the years) should not really hurt. For the use case of
offlining memory to unplug DIMMs, we should see no change. (holes on
DIMMs would be weird)

Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Oscar Salvador 
Cc: Pavel Tatashin 
Cc: Dan Williams 
Signed-off-by: David Hildenbrand 
---
 mm/memory_hotplug.c | 26 --
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 561371ead39a..7210f4375279 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1447,10 +1447,19 @@ static void node_states_clear_node(int node, struct 
memory_notify *arg)
node_clear_state(node, N_MEMORY);
 }
 
+static int count_system_ram_pages_cb(unsigned long start_pfn,
+unsigned long nr_pages, void *data)
+{
+   unsigned long *nr_system_ram_pages = data;
+
+   *nr_system_ram_pages += nr_pages;
+   return 0;
+}
+
 static int __ref __offline_pages(unsigned long start_pfn,
  unsigned long end_pfn)
 {
-   unsigned long pfn, nr_pages;
+   unsigned long pfn, nr_pages = 0;
unsigned long offlined_pages = 0;
int ret, node, nr_isolate_pageblock;
unsigned long flags;
@@ -1461,6 +1470,20 @@ static int __ref __offline_pages(unsigned long start_pfn,
 
mem_hotplug_begin();
 
+   /*
+* We don't allow to offline memory blocks that contain holes
+* and consequently don't allow to online memory blocks that contain
+* holes. This allows to simplify the code quite a lot and we don't
+* have to mess with PG_reserved pages for memory holes.
+*/
+   walk_system_ram_range(start_pfn, end_pfn - start_pfn, _pages,
+ count_system_ram_pages_cb);
+   if (nr_pages != end_pfn - start_pfn) {
+   ret = -EINVAL;
+   reason = "memory holes";
+   goto failed_removal;
+   }
+
/* This makes hotplug much easier...and readable.
   we assume this for now. .*/
if (!test_pages_in_a_zone(start_pfn, end_pfn, _start,
@@ -1472,7 +1495,6 @@ static int __ref __offline_pages(unsigned long start_pfn,
 
zone = page_zone(pfn_to_page(valid_start));
node = zone_to_nid(zone);
-   nr_pages = end_pfn - start_pfn;
 
/* set above range as isolated */
ret = start_isolate_page_range(start_pfn, end_pfn,
-- 
2.21.0



[PATCH RFC v1 06/12] staging/gasket: Prepare gasket_release_page() for PG_reserved changes

2019-10-22 Thread David Hildenbrand
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

The pages are obtained via get_user_pages_fast(). I assume these
could be ZONE_DEVICE pages. Let's just exclude them as well explicitly.

Cc: Rob Springer 
Cc: Todd Poynor 
Cc: Ben Chan 
Cc: Greg Kroah-Hartman 
Signed-off-by: David Hildenbrand 
---
 drivers/staging/gasket/gasket_page_table.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/staging/gasket/gasket_page_table.c 
b/drivers/staging/gasket/gasket_page_table.c
index f6d715787da8..d43fed58bf65 100644
--- a/drivers/staging/gasket/gasket_page_table.c
+++ b/drivers/staging/gasket/gasket_page_table.c
@@ -447,7 +447,7 @@ static bool gasket_release_page(struct page *page)
if (!page)
return false;
 
-   if (!PageReserved(page))
+   if (!PageReserved(page) && !is_zone_device_page(page))
SetPageDirty(page);
put_page(page);
 
-- 
2.21.0



[PATCH RFC v1 08/12] powerpc/book3s: Prepare kvmppc_book3s_instantiate_page() for PG_reserved changes

2019-10-22 Thread David Hildenbrand
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has this weird use case that you can map anything from /dev/mem
into the guest. pfn_valid() is not a reliable check whether the memmap
was initialized and can be touched. pfn_to_online_page() makes sure
that we have an initialized memmap. Note that ZONE_DEVICE memory is
never online (IOW, managed by the buddy).

Switching to pfn_to_online_page() keeps the existing behavior for
PFNs without a memmap and for ZONE_DEVICE memory.

Cc: Paul Mackerras 
Cc: Benjamin Herrenschmidt 
Cc: Michael Ellerman 
Signed-off-by: David Hildenbrand 
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 2d415c36a61d..05397c0561fc 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -801,12 +801,14 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
   writing, upgrade_p);
if (is_error_noslot_pfn(pfn))
return -EFAULT;
-   page = NULL;
-   if (pfn_valid(pfn)) {
-   page = pfn_to_page(pfn);
-   if (PageReserved(page))
-   page = NULL;
-   }
+   /*
+* We treat any pages that are not online (not managed by the
+* buddy) as reserved - this includes ZONE_DEVICE pages and
+* pages without a memmap (e.g., mapped via /dev/mem).
+*/
+   page = pfn_to_online_page(pfn);
+   if (page && PageReserved(page))
+   page = NULL;
}
 
/*
-- 
2.21.0



[PATCH RFC v1 12/12] mm/memory_hotplug: Don't mark pages PG_reserved when initializing the memmap

2019-10-22 Thread David Hildenbrand
Everything should be prepared to stop setting pages PG_reserved when
initializing the memmap on memory hotplug. Most importantly, we
stop marking ZONE_DEVICE pages PG_reserved.

a) We made sure that any code that relied on PG_reserved to detect
   ZONE_DEVICE memory will no longer rely on PG_reserved - either
   by using pfn_to_online_page() to exclude them right away or by
   checking against is_zone_device_page().
b) We made sure that memory blocks with holes cannot be offlined and
   therefore also not onlined. We have quite some code that relies on
   memory holes being marked PG_reserved. This is now not an issue
   anymore.

generic_online_page() still calls __free_pages_core(), which performs
__ClearPageReserved(p). AFAIKS, this should not hurt.

It is worth noting that the users of online_page_callback_t might see a
change. E.g., until now, pages not freed to the buddy by the HyperV
balloon were set PG_reserved until freed via generic_online_page(). Now,
they would look like ordinarily allocated pages (refcount == 1). This
callback is used by the XEN balloon and the HyperV balloon. To not
introduce any silent errors, keep marking the pages PG_reserved. We can
most probably stop doing that, but have to double check if there are
issues (e.g., offlining code aborts right away in has_unmovable_pages()
when it runs into a PageReserved(page)).

Update the documentation at various places.

Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Sasha Levin 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: Andrew Morton 
Cc: Alexander Duyck 
Cc: Pavel Tatashin 
Cc: Vlastimil Babka 
Cc: Johannes Weiner 
Cc: Anthony Yznaga 
Cc: Michal Hocko 
Cc: Oscar Salvador 
Cc: Dan Williams 
Cc: Mel Gorman 
Cc: Mike Rapoport 
Cc: Anshuman Khandual 
Suggested-by: Michal Hocko 
Signed-off-by: David Hildenbrand 
---
 drivers/hv/hv_balloon.c|  6 ++
 drivers/xen/balloon.c  |  7 +++
 include/linux/page-flags.h |  8 +---
 mm/memory_hotplug.c| 17 +++--
 mm/page_alloc.c| 11 ---
 5 files changed, 21 insertions(+), 28 deletions(-)

diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index c722079d3c24..3214b0ef5247 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -670,6 +670,12 @@ static struct notifier_block hv_memory_nb = {
 /* Check if the particular page is backed and can be onlined and online it. */
 static void hv_page_online_one(struct hv_hotadd_state *has, struct page *pg)
 {
+   /*
+* TODO: The core used to mark the pages reserved. Most probably
+* we can stop doing that now.
+*/
+   __SetPageReserved(pg);
+
if (!has_pfn_is_backed(has, page_to_pfn(pg))) {
if (!PageOffline(pg))
__SetPageOffline(pg);
diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 4f2e78a5e4db..af69f057913a 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -374,6 +374,13 @@ static void xen_online_page(struct page *page, unsigned 
int order)
mutex_lock(_mutex);
for (i = 0; i < size; i++) {
p = pfn_to_page(start_pfn + i);
+   /*
+* TODO: The core used to mark the pages reserved. Most probably
+* we can stop doing that now. However, especially
+* alloc_xenballooned_pages() left PG_reserved set
+* on pages that can get mapped to user space.
+*/
+   __SetPageReserved(p);
balloon_append(p);
}
mutex_unlock(_mutex);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f91cb8898ff0..d4f85d866b71 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -30,24 +30,18 @@
  * - Pages falling into physical memory gaps - not IORESOURCE_SYSRAM. Trying
  *   to read/write these pages might end badly. Don't touch!
  * - The zero page(s)
- * - Pages not added to the page allocator when onlining a section because
- *   they were excluded via the online_page_callback() or because they are
- *   PG_hwpoison.
  * - Pages allocated in the context of kexec/kdump (loaded kernel image,
  *   control pages, vmcoreinfo)
  * - MMIO/DMA pages. Some architectures don't allow to ioremap pages that are
  *   not marked PG_reserved (as they might be in use by somebody else who does
  *   not respect the caching strategy).
- * - Pages part of an offline section (struct pages of offline sections should
- *   not be trusted as they will be initialized when first onlined).
  * - MCA pages on ia64
  * - Pages holding CPU notes for POWER Firmware Assisted Dump
- * - Device memory (e.g. PMEM, DAX, HMM)
  * Some PG_reserved pages will be excluded from the hibernation image.
  * PG_reserved does in general not hinder anybody from dumping or swapping
  * and is no longer required for remap_pfn_range(). ioremap might require it.
  * Consequently

[PATCH RFC v1 07/12] staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes

2019-10-22 Thread David Hildenbrand
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

The pages are obtained via get_user_pages_fast(). I assume these
could be ZONE_DEVICE pages. Let's just exclude them as well explicitly.

Cc: Greg Kroah-Hartman 
Cc: Vandana BN 
Cc: "Simon Sandström" 
Cc: Dan Carpenter 
Cc: Nishka Dasgupta 
Cc: Madhumitha Prabakaran 
Cc: Fabio Estevam 
Cc: Matt Sickler 
Cc: Jeremy Sowden 
Signed-off-by: David Hildenbrand 
---
 drivers/staging/kpc2000/kpc_dma/fileops.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/staging/kpc2000/kpc_dma/fileops.c 
b/drivers/staging/kpc2000/kpc_dma/fileops.c
index cb52bd9a6d2f..457adcc81fe6 100644
--- a/drivers/staging/kpc2000/kpc_dma/fileops.c
+++ b/drivers/staging/kpc2000/kpc_dma/fileops.c
@@ -212,7 +212,8 @@ void  transfer_complete_cb(struct aio_cb_data *acd, size_t 
xfr_count, u32 flags)
BUG_ON(acd->ldev->pldev == NULL);
 
for (i = 0 ; i < acd->page_count ; i++) {
-   if (!PageReserved(acd->user_pages[i])) {
+   if (!PageReserved(acd->user_pages[i]) &&
+   !is_zone_device_page(acd->user_pages[i])) {
set_page_dirty(acd->user_pages[i]);
}
}
-- 
2.21.0



[PATCH RFC v1 05/12] vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes

2019-10-22 Thread David Hildenbrand
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has this weird use case that you can map anything from /dev/mem
into the guest. pfn_valid() is not a reliable check whether the memmap
was initialized and can be touched. pfn_to_online_page() makes sure
that we have an initialized memmap. Note that ZONE_DEVICE memory is
never online (IOW, managed by the buddy).

Switching to pfn_to_online_page() keeps the existing behavior for
PFNs without a memmap and for ZONE_DEVICE memory. They are treated as
reserved and the page is not touched (e.g., to set it dirty or accessed).

Cc: Alex Williamson 
Cc: Cornelia Huck 
Signed-off-by: David Hildenbrand 
---
 drivers/vfio/vfio_iommu_type1.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2ada8e6cdb88..f8ce8c408ba8 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -299,9 +299,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long 
npage, bool async)
  */
 static bool is_invalid_reserved_pfn(unsigned long pfn)
 {
-   if (pfn_valid(pfn))
-   return PageReserved(pfn_to_page(pfn));
+   struct page *page = pfn_to_online_page(pfn);
 
+   /*
+* We treat any pages that are not online (not managed by the buddy)
+* as reserved - this includes ZONE_DEVICE pages and pages without
+* a memmap (e.g., mapped via /dev/mem).
+*/
+   if (page)
+   return PageReserved(page);
return true;
 }
 
-- 
2.21.0



[PATCH RFC v1 04/12] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes

2019-10-22 Thread David Hildenbrand
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has this weird use case that you can map anything from /dev/mem
into the guest. pfn_valid() is not a reliable check whether the memmap
was initialized and can be touched. pfn_to_online_page() makes sure
that we have an initialized memmap. Note that ZONE_DEVICE memory is
never online (IOW, managed by the buddy).

Switching to pfn_to_online_page() keeps the existing behavior for
PFNs without a memmap and for ZONE_DEVICE memory. They are treated as
reserved and the page is not touched (e.g., to set it dirty or accessed).

Cc: Paolo Bonzini 
Cc: "Radim Krčmář" 
Cc: Michal Hocko 
Cc: Dan Williams 
Cc: KarimAllah Ahmed 
Signed-off-by: David Hildenbrand 
---
 virt/kvm/kvm_main.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 66a977472a1c..b233d4129014 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -151,9 +151,15 @@ __weak int kvm_arch_mmu_notifier_invalidate_range(struct 
kvm *kvm,
 
 bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
 {
-   if (pfn_valid(pfn))
-   return PageReserved(pfn_to_page(pfn));
+   struct page *page = pfn_to_online_page(pfn);
 
+   /*
+* We treat any pages that are not online (not managed by the buddy)
+* as reserved - this includes ZONE_DEVICE pages and pages without
+* a memmap (e.g., mapped via /dev/mem).
+*/
+   if (page)
+   return PageReserved(page);
return true;
 }
 
-- 
2.21.0



[PATCH RFC v1 03/12] KVM: x86/mmu: Prepare kvm_is_mmio_pfn() for PG_reserved changes

2019-10-22 Thread David Hildenbrand
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has this weird use case that you can map anything from /dev/mem
into the guest. pfn_valid() is not a reliable check whether the memmap
was initialized and can be touched. pfn_to_online_page() makes sure
that we have an initialized memmap - however, there is no reliable and
fast check to detect memmaps that were initialized and are ZONE_DEVICE.

Let's rewrite kvm_is_mmio_pfn() so we really only touch initialized
memmaps that are guaranteed to not contain garbage. Make sure that
RAM without a memmap is still not detected as MMIO and that ZONE_DEVICE
that is not UC/UC-/WC is not detected as MMIO.

Cc: Paolo Bonzini 
Cc: "Radim Krčmář" 
Cc: Sean Christopherson 
Cc: Vitaly Kuznetsov 
Cc: Wanpeng Li 
Cc: Jim Mattson 
Cc: Joerg Roedel 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: KarimAllah Ahmed 
Cc: Michal Hocko 
Cc: Dan Williams 
Signed-off-by: David Hildenbrand 
---
 arch/x86/kvm/mmu.c | 30 ++
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 24c23c66b226..795869ffd4bb 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2962,20 +2962,26 @@ static bool mmu_need_write_protect(struct kvm_vcpu 
*vcpu, gfn_t gfn,
 
 static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
 {
+   struct page *page = pfn_to_online_page(pfn);
+
+   /*
+* Online pages consist of pages managed by the buddy. Especially,
+* ZONE_DEVICE pages are never online. Online pages that are reserved
+* indicate the zero page and MMIO pages.
+*/
+   if (page)
+   return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
+
+   /*
+* Anything with a valid memmap could be ZONE_DEVICE - or the
+* memmap could be uninitialized. Treat only UC/UC-/WC pages as MMIO.
+*/
if (pfn_valid(pfn))
-   return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn)) &&
-   /*
-* Some reserved pages, such as those from NVDIMM
-* DAX devices, are not for MMIO, and can be mapped
-* with cached memory type for better performance.
-* However, the above check misconceives those pages
-* as MMIO, and results in KVM mapping them with UC
-* memory type, which would hurt the performance.
-* Therefore, we check the host memory type in addition
-* and only treat UC/UC-/WC pages as MMIO.
-*/
-   (!pat_enabled() || pat_pfn_immune_to_uc_mtrr(pfn));
+   return !pat_enabled() || pat_pfn_immune_to_uc_mtrr(pfn);
 
+   /*
+* Any RAM that has no memmap (e.g., mapped via /dev/mem) is not MMIO.
+*/
return !e820__mapped_raw_any(pfn_to_hpa(pfn),
 pfn_to_hpa(pfn + 1) - 1,
 E820_TYPE_RAM);
-- 
2.21.0



[PATCH RFC v1 02/12] mm/usercopy.c: Prepare check_page_span() for PG_reserved changes

2019-10-22 Thread David Hildenbrand
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

Let's make sure that the logic in the function won't change. Once we no
longer set these pages to reserved, we can rework this function to
perform separate checks for ZONE_DEVICE (split from PG_reserved checks).

Cc: Kees Cook 
Cc: Andrew Morton 
Cc: Kate Stewart 
Cc: Allison Randal 
Cc: "Isaac J. Manjarres" 
Cc: Qian Cai 
Cc: Thomas Gleixner 
Signed-off-by: David Hildenbrand 
---
 mm/usercopy.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/usercopy.c b/mm/usercopy.c
index 660717a1ea5c..a3ac4be35cde 100644
--- a/mm/usercopy.c
+++ b/mm/usercopy.c
@@ -203,14 +203,15 @@ static inline void check_page_span(const void *ptr, 
unsigned long n,
 * device memory), or CMA. Otherwise, reject since the object spans
 * several independently allocated pages.
 */
-   is_reserved = PageReserved(page);
+   is_reserved = PageReserved(page) || is_zone_device_page(page);
is_cma = is_migrate_cma_page(page);
if (!is_reserved && !is_cma)
usercopy_abort("spans multiple pages", NULL, to_user, 0, n);
 
for (ptr += PAGE_SIZE; ptr <= end; ptr += PAGE_SIZE) {
page = virt_to_head_page(ptr);
-   if (is_reserved && !PageReserved(page))
+   if (is_reserved && !(PageReserved(page) ||
+is_zone_device_page(page)))
usercopy_abort("spans Reserved and non-Reserved pages",
   NULL, to_user, 0, n);
if (is_cma && !is_migrate_cma_page(page))
-- 
2.21.0



[PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)

2019-10-22 Thread David Hildenbrand
This series is based on [2], which should pop up in linux/next soon:
https://lkml.org/lkml/2019/10/21/1034

This is the result of a recent discussion with Michal ([1], [2]). Right
now we set all pages PG_reserved when initializing hotplugged memmaps. This
includes ZONE_DEVICE memory. In case of system memory, PG_reserved is
cleared again when onlining the memory, in case of ZONE_DEVICE memory
never. In ancient times, we needed PG_reserved, because there was no way
to tell whether the memmap was already properly initialized. We now have
SECTION_IS_ONLINE for that in the case of !ZONE_DEVICE memory. ZONE_DEVICE
memory is already initialized deferred, and there shouldn't be a visible
change in that regard.

I remember that some time ago, we already talked on the list about no longer
setting ZONE_DEVICE pages PG_reserved, but I never saw any patches.
Also, I forgot who was part of the discussion :)

One of the biggest fears was side effects. I went ahead and audited all
users of PageReserved(). The ones that don't need any care (patches)
can be found below. I will double check and hope I am not missing something
important.

I am probably a little bit too careful (but I don't want to break things).
In most places (besides KVM and vfio that are nuts), the
pfn_to_online_page() check could most probably be avoided by a
is_zone_device_page() check. However, I usually get suspicious when I see
a pfn_valid() check (especially after I learned that people mmap parts of
/dev/mem into user space, including memory without memmaps. Also, people
could memmap offline memory blocks this way :/). As long as this does not
hurt performance, I think we should rather do it the clean way.
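
As a sketch, the two checks mentioned above boil down to something like this
(the function names are made up):

static bool pfn_usable_v1(unsigned long pfn)
{
	/* assumes a valid pfn implies an initialized memmap */
	return pfn_valid(pfn) && !is_zone_device_page(pfn_to_page(pfn));
}

static bool pfn_usable_v2(unsigned long pfn)
{
	/* only trusts memmaps of online memory (managed by the buddy) */
	return pfn_to_online_page(pfn) != NULL;
}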

I only gave it a quick test with DIMMs on x86-64, but didn't test the
ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Compile-tested
on x86-64 and PPC.

Other users of PageReserved() that should be fine:
- mm/page_owner.c:pagetypeinfo_showmixedcount_print()
  -> Never called for ZONE_DEVICE, (+ pfn_to_online_page(pfn))
- mm/page_owner.c:init_pages_in_zone()
  -> Never called for ZONE_DEVICE (!populated_zone(zone))
- mm/page_ext.c:free_page_ext()
  -> Only a BUG_ON(PageReserved(page)), not relevant
- mm/page_ext.c:has_unmovable_pages()
  -> Not relevant for ZONE_DEVICE
- mm/page_ext.c:pfn_range_valid_contig()
  -> pfn_to_online_page() already guards us
- mm/mempolicy.c:queue_pages_pte_range()
  -> vm_normal_page() checks against pte_devmap()
- mm/memory-failure.c:hwpoison_user_mappings()
  -> Not reached via memory_failure() due to pfn_to_online_page()
  -> Also not reached indirectly via memory_failure_hugetlb()
- mm/hugetlb.c:gather_bootmem_prealloc()
  -> Only a WARN_ON(PageReserved(page)), not relevant
- kernel/power/snapshot.c:saveable_highmem_page()
  -> pfn_to_online_page() already guards us
- kernel/power/snapshot.c:saveable_page()
  -> pfn_to_online_page() already guards us
- fs/proc/task_mmu.c:can_gather_numa_stats()
  -> vm_normal_page() checks against pte_devmap()
- fs/proc/task_mmu.c:can_gather_numa_stats_pmd
  -> vm_normal_page_pmd() checks against pte_devmap()
- fs/proc/page.c:stable_page_flags()
  -> The reserved bit is simply copied, irrelevant
- drivers/firmware/memmap.c:release_firmware_map_entry()
  -> really only a check to detect bootmem. Not relevant for ZONE_DEVICE
- arch/ia64/kernel/mca_drv.c
- arch/mips/mm/init.c
- arch/mips/mm/ioremap.c
- arch/nios2/mm/ioremap.c
- arch/parisc/mm/ioremap.c
- arch/sparc/mm/tlb.c
- arch/xtensa/mm/cache.c
  -> No ZONE_DEVICE support
- arch/powerpc/mm/init_64.c:vmemmap_free()
  -> Special-cases memmap on altmap
  -> Only a check for bootmem
- arch/x86/kernel/alternative.c:__text_poke()
  -> Only a WARN_ON(!PageReserved(pages[0])) to verify it is bootmem
- arch/x86/mm/init_64.c
  -> Only a check for bootmem

[1] https://lkml.org/lkml/2019/10/21/736
[2] https://lkml.org/lkml/2019/10/21/1034

Cc: Michal Hocko 
Cc: Dan Williams 
Cc: kvm-...@vger.kernel.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: k...@vger.kernel.org
Cc: linux-hyp...@vger.kernel.org
Cc: de...@driverdev.osuosl.org
Cc: xen-de...@lists.xenproject.org
Cc: x...@kernel.org
Cc: Alexander Duyck 

David Hildenbrand (12):
  mm/memory_hotplug: Don't allow to online/offline memory blocks with
holes
  mm/usercopy.c: Prepare check_page_span() for PG_reserved changes
  KVM: x86/mmu: Prepare kvm_is_mmio_pfn() for PG_reserved changes
  KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes
  vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes
  staging/gasket: Prepare gasket_release_page() for PG_reserved changes
  staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved
changes
  powerpc/book3s: Prepare kvmppc_book3s_instantiate_page() for
PG_reserved changes
  powerpc/64s: Prepare hash_page_do_lazy_icache() for PG_reserved
changes
  powerpc/mm: Prepare maybe_pte_to_page() for PG_reserved changes
  x86/mm: Prepare __ior

Re: [PATCH RFCv2 0/4] mm/memory_hotplug: Introduce memory block types

2019-03-27 Thread David Hildenbrand
On 20.12.18 14:08, Michal Hocko wrote:
> On Thu 20-12-18 13:58:16, David Hildenbrand wrote:
>> On 30.11.18 18:59, David Hildenbrand wrote:
>>> This is the second approach, introducing more meaningful memory block
>>> types and not changing online behavior in the kernel. It is based on
>>> latest linux-next.
>>>
>>> As we found out during discussion, user space should always handle onlining
>>> of memory, in any case. However in order to make smart decisions in user
>>> space about if and how to online memory, we have to export more information
>>> about memory blocks. This way, we can formulate rules in user space.
>>>
>>> One such information is the type of memory block we are talking about.
>>> This helps to answer some questions like:
>>> - Does this memory block belong to a DIMM?
>>> - Can this DIMM theoretically ever be unplugged again?
>>> - Was this memory added by a balloon driver that will rely on balloon
>>>   inflation to remove chunks of that memory again? Which zone is advised?
>>> - Is this special standby memory on s390x that is usually not automatically
>>>   onlined?
>>>
>>> And in short it helps to answer to some extent (excluding zone imbalances)
>>> - Should I online this memory block?
>>> - To which zone should I online this memory block?
>>> ... of course special use cases will result in different answers. But that's
>>> why user space has control of onlining memory.
>>>
>>> More details can be found in Patch 1 and Patch 3.
>>> Tested on x86 with hotplugged DIMMs. Cross-compiled for PPC and s390x.
>>>
>>>
>>> Example:
>>> $ udevadm info -q all -a /sys/devices/system/memory/memory0
>>> KERNEL=="memory0"
>>> SUBSYSTEM=="memory"
>>> DRIVER==""
>>> ATTR{online}=="1"
>>> ATTR{phys_device}=="0"
>>> ATTR{phys_index}==""
>>> ATTR{removable}=="0"
>>> ATTR{state}=="online"
>>> ATTR{type}=="boot"
>>> ATTR{valid_zones}=="none"
>>> $ udevadm info -q all -a /sys/devices/system/memory/memory90
>>> KERNEL=="memory90"
>>> SUBSYSTEM=="memory"
>>> DRIVER==""
>>> ATTR{online}=="1"
>>> ATTR{phys_device}=="0"
>>> ATTR{phys_index}=="005a"
>>> ATTR{removable}=="1"
>>> ATTR{state}=="online"
>>> ATTR{type}=="dimm"
>>> ATTR{valid_zones}=="Normal"
>>>
>>>
>>> RFC -> RFCv2:
>>> - Now also taking care of PPC (somehow missed it :/ )
>>> - Split the series up to some degree (some ideas on how to split up patch 3
>>>   would be very welcome)
>>> - Introduce more memory block types. Turns out abstracting too much was
>>>   rather confusing and not helpful. Properly document them.
>>>
>>> Notes:
>>> - I wanted to convert the enum of types into a named enum but this
>>>   provoked all kinds of different errors. For now, I am doing it just like
>>>   the other types (e.g. online_type) we are using in that context.
>>> - The "removable" property should never have been named like that. It
>>>   should have been "offlinable". Can we still rename that? E.g. boot memory
>>>   is sometimes marked as removable ...
>>>
>>
>>
>> Any feedback regarding the suggested block types would be very much
>> appreciated!
> 
> I still do not like this much to be honest. I just didn't get to think
> through this properly. My fear is that this is conflating an actual API
> with the current implementation and as such will cause problems in
> future. But I haven't really looked into your patches closely so I might
> be wrong. Anyway I won't be able to look into it by the end of year.
> 

So I started to think about this again, and I guess somehow exposing an
identification of the device driver that added the memory section could
be sufficient.

E.g. "hyperv", "xen", "acpi", "sclp", "virtio-mem" ...

Via separate device driver interfaces, other information about the
memory could be exposed. (e.g. for ACPI: which memory devices belong to
one physical device). So stuff would not have to centered around
/sys/devices/system/memory/ , uglifying it for special cases.

We would have to write udev rules to deal with these values, should be
easy. If no DRIVER is given, it is simply memory detected and detected
during boot. ACPI changing the DRIVER might be tricky (from no DRIVER ->
ACPI), but I guess it could be done.

Now, the question would be how to get the DRIVER value in there. Adding
a bunch of fake device drivers would work, however this might get a
little messy ... and then there is unbinding and rebinding which can be
triggered by userspace. Things to care about? Most probably not.
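
A minimal sketch of that alternative (the "added_by" field and attribute are
made up, nothing like this exists): instead of fake drivers, add_memory()
callers could hand over a string that gets exposed per memory block:

static ssize_t added_by_show(struct device *dev,
			     struct device_attribute *attr, char *buf)
{
	struct memory_block *mem = to_memory_block(dev);

	/* mem->added_by would have to be set by the add_memory() caller */
	return sprintf(buf, "%s\n", mem->added_by ? mem->added_by : "");
}
static DEVICE_ATTR_RO(added_by);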

-- 

Thanks,

David / dhildenb


Re: [PATCH v2 3/8] kexec: export PG_offline to VMCOREINFO

2019-03-11 Thread David Hildenbrand
On 11.03.19 10:04, Dave Young wrote:
> Hi David,
> On 11/22/18 at 11:06am, David Hildenbrand wrote:
>> Right now, pages inflated as part of a balloon driver will be dumped
>> by dump tools like makedumpfile. While XEN is able to check in the
>> crash kernel whether a certain pfn is actuall backed by memory in the
>> hypervisor (see xen_oldmem_pfn_is_ram) and optimize this case, dumps of
>> other balloon inflated memory will essentially result in zero pages getting
>> allocated by the hypervisor and the dump getting filled with this data.
>>
>> The allocation and reading of zero pages can directly be avoided if a
>> dumping tool could know which pages only contain stale information not to
>> be dumped.
>>
>> We now have PG_offline which can be (and already is by virtio-balloon)
>> used for marking pages as logically offline. Follow up patches will
>> make use of this flag also in other balloon implementations.
>>
>> Let's export PG_offline via PAGE_OFFLINE_MAPCOUNT_VALUE, so
>> makedumpfile can directly skip pages that are logically offline and the
>> content therefore stale. (we export it as a macro to match how it is
>> done for PG_buddy. This way it is clearer that this is not actually a flag
>> but only a very specific mapcount value to represent page types).
>>
>> Please note that this is also helpful for a problem we were seeing under
>> Hyper-V: Dumping logically offline memory (pages kept fake offline while
>> onlining a section via online_page_callback) would under some conditions
>> result in a kernel panic when dumping them.
>>
>> Cc: Andrew Morton 
>> Cc: Dave Young 
>> Cc: "Kirill A. Shutemov" 
>> Cc: Baoquan He 
>> Cc: Omar Sandoval 
>> Cc: Arnd Bergmann 
>> Cc: Matthew Wilcox 
>> Cc: Michal Hocko 
>> Cc: "Michael S. Tsirkin" 
>> Cc: Lianbo Jiang 
>> Cc: Borislav Petkov 
>> Cc: Kazuhito Hagio 
>> Acked-by: Michael S. Tsirkin 
>> Acked-by: Dave Young 
>> Signed-off-by: David Hildenbrand 
>> ---
>>  kernel/crash_core.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
>> index 933cb3e45b98..093c9f917ed0 100644
>> --- a/kernel/crash_core.c
>> +++ b/kernel/crash_core.c
>> @@ -464,6 +464,8 @@ static int __init crash_save_vmcoreinfo_init(void)
>>  VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE);
>>  #ifdef CONFIG_HUGETLB_PAGE
>>  VMCOREINFO_NUMBER(HUGETLB_PAGE_DTOR);
>> +#define PAGE_OFFLINE_MAPCOUNT_VALUE (~PG_offline)
>> +VMCOREINFO_NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE);
>>  #endif
>>  
>>  arch_crash_save_vmcoreinfo();
> 
> The patch has been merged, would you mind to send a documentation patch
> for the vmcoreinfo, which is added recently in 
> Documentation/kdump/vmcoreinfo.txt
> 
> A brief description about how this vmcoreinfo field is used is good to
> have.
> 

Turns out, it was already documented

PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|PG_hwpoision
|PG_head_mask|PAGE_BUDDY_MAPCOUNT_VALUE(~PG_buddy)
|PAGE_OFFLINE_MAPCOUNT_VALUE(~PG_offline)
-

Page attributes. These flags are used to filter various unnecessary for
dumping pages.
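
For reference, the producer side of PAGE_OFFLINE_MAPCOUNT_VALUE is just the
page type being set and cleared around inflation/deflation - roughly (a
sketch, the hypervisor specifics are omitted):

static void balloon_hide_page(struct page *page)
{
	__SetPageOffline(page);	/* content is stale, dump tools skip it */
	/* ... hand the backing memory over to the hypervisor ... */
}

static void balloon_show_page(struct page *page)
{
	/* ... get the backing memory back from the hypervisor ... */
	__ClearPageOffline(page);
}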


Thanks!

> Thanks
> Dave
> 


-- 

Thanks,

David / dhildenb


Re: [PATCH v2] makedumpfile: exclude pages that are logically offline

2019-03-07 Thread David Hildenbrand
On 27.11.18 17:32, Kazuhito Hagio wrote:
>> Linux marks pages that are logically offline via a page flag (map count).
>> Such pages e.g. include pages inflated as part of a balloon driver or
>> pages that were not actually onlined when onlining the whole section.
>>
>> While the hypervisor usually allows to read such inflated memory, we
>> basically read and dump data that is completely irrelevant. Also, this
>> might result in quite some overhead in the hypervisor. In addition,
>> we saw some problems under Hyper-V, whereby we can crash the kernel by
>> dumping, when reading memory of a partially onlined memory segment
>> (for memory added by the Hyper-V balloon driver).
>>
>> Therefore, don't read and dump pages that are marked as being logically
>> offline.
>>
>> Signed-off-by: David Hildenbrand 
> 
> Thanks for the v2 update.
> I'm going to merge this patch after the kernel patches are merged
> and it tests fine with the kernel.
> 
> Kazu

Hi Kazu,

the patches are now upstream. Thanks!

-- 

Thanks,

David / dhildenb


Re: [PATCH] hv_balloon: avoid touching uninitialized struct page during tail onlining

2019-01-08 Thread David Hildenbrand
On 07.01.19 14:44, Vitaly Kuznetsov wrote:
> David Hildenbrand  writes:
> 
>> On 04.01.19 15:19, Vitaly Kuznetsov wrote:
>>> Hyper-V memory hotplug protocol has 2M granularity and in Linux x86 we use
>>> 128M. To deal with it we implement partial section onlining by registering
>>> custom page onlining callback (hv_online_page()). Later, when more memory
>>> arrives we try to online the 'tail' (see hv_bring_pgs_online()).
>>>
>>> It was found that in some cases this 'tail' onlining causes issues:
>>>
>>>  BUG: Bad page state in process kworker/0:2  pfn:109e3a
>>>  page:e08344278e80 count:0 mapcount:1 mapping: index:0x0
>>>  flags: 0xf8000()
>>>  raw: 000f8000 dead0100 dead0200 
>>>  raw:    
>>>  page dumped because: nonzero mapcount
>>>  ...
>>>  Workqueue: events hot_add_req [hv_balloon]
>>>  Call Trace:
>>>   dump_stack+0x5c/0x80
>>>   bad_page.cold.112+0x7f/0xb2
>>>   free_pcppages_bulk+0x4b8/0x690
>>>   free_unref_page+0x54/0x70
>>>   hv_page_online_one+0x5c/0x80 [hv_balloon]
>>>   hot_add_req.cold.24+0x182/0x835 [hv_balloon]
>>>   ...
>>>
>>> Turns out that we now have deferred struct page initialization for memory
>>> hotplug so e.g. memory_block_action() in drivers/base/memory.c does
>>> pages_correctly_probed() check and in that check it avoids inspecting
>>> struct pages and checks sections instead. But in Hyper-V balloon driver we
>>> do PageReserved(pfn_to_page()) check and this is now wrong.
>>>
>>> Switch to checking online_section_nr() instead.
>>>
>>> Signed-off-by: Vitaly Kuznetsov 
>>> ---
>>>  drivers/hv/hv_balloon.c | 10 ++
>>>  1 file changed, 6 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
>>> index 5301fef16c31..7c6349a50ef1 100644
>>> --- a/drivers/hv/hv_balloon.c
>>> +++ b/drivers/hv/hv_balloon.c
>>> @@ -888,12 +888,14 @@ static unsigned long handle_pg_range(unsigned long 
>>> pg_start,
>>> pfn_cnt -= pgs_ol;
>>> /*
>>>  * Check if the corresponding memory block is already
>>> -* online by checking its last previously backed page.
>>> -* In case it is we need to bring rest (which was not
>>> -* backed previously) online too.
>>> +* online. It is possible to observe struct pages still
>>> +* being uninitialized here so check section instead.
>>> +* In case the section is online we need to bring the
>>> +* rest of pfns (which were not backed previously)
>>> +* online too.
>>>  */
>>> if (start_pfn > has->start_pfn &&
>>> -   !PageReserved(pfn_to_page(start_pfn - 1)))
>>> +   online_section_nr(pfn_to_section_nr(start_pfn)))
>>> hv_bring_pgs_online(has, start_pfn, pgs_ol);
>>>  
>>> }
>>>
>>
>> I wonder if you should use pfn_to_online_page() and check for PageOffline().
>>
>> (I guess online_section_nr() should also do the trick)
> 
> I'm worried a bit about racing with mm code here as we're not doing
> mem_hotplug_begin()/done() so I'd slightly prefer keeping
> online_section_nr() (pfn_to_online_page() also uses it but then it gets
> to the particular struct page). Moreover, with pfn_to_online_page() we
> will be looking at some other pfn - because the start_pfn is definitelly
> offline (pre-patch we were looking at start_pfn-1). Just looking at the
> whole section seems cleaner.

Fine with me. I guess the section can never be offlined as it still
contains reserved pages if not fully "fake-onlined" by HV code already.

But we could still consider mem_hotplug_begin()/done() as we could have
an online section although online_pages() has not completed yet. So we
could actually touch some "semi onlined section".
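
Something like the following, as a sketch only (whether the balloon driver
can take the hotplug lock itself is the separate discussion mentioned below):

	mem_hotplug_begin();
	if (start_pfn > has->start_pfn &&
	    online_section_nr(pfn_to_section_nr(start_pfn)))
		hv_bring_pgs_online(has, start_pfn, pgs_ol);
	mem_hotplug_done();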

Acked-by: David Hildenbrand 

> 
> P.S. I still think about bringing mem_hotplug_begin()/done() to
> hv_balloon but that's going to be a separate discussion, here I want to
> have a small fix backportable to stable.
> 
> Thanks,
> 


-- 

Thanks,

David / dhildenb


Re: [PATCH] hv_balloon: avoid touching uninitialized struct page during tail onlining

2019-01-07 Thread David Hildenbrand
On 04.01.19 15:19, Vitaly Kuznetsov wrote:
> Hyper-V memory hotplug protocol has 2M granularity and in Linux x86 we use
> 128M. To deal with it we implement partial section onlining by registering
> custom page onlining callback (hv_online_page()). Later, when more memory
> arrives we try to online the 'tail' (see hv_bring_pgs_online()).
> 
> It was found that in some cases this 'tail' onlining causes issues:
> 
>  BUG: Bad page state in process kworker/0:2  pfn:109e3a
>  page:e08344278e80 count:0 mapcount:1 mapping: index:0x0
>  flags: 0xf8000()
>  raw: 000f8000 dead0100 dead0200 
>  raw:    
>  page dumped because: nonzero mapcount
>  ...
>  Workqueue: events hot_add_req [hv_balloon]
>  Call Trace:
>   dump_stack+0x5c/0x80
>   bad_page.cold.112+0x7f/0xb2
>   free_pcppages_bulk+0x4b8/0x690
>   free_unref_page+0x54/0x70
>   hv_page_online_one+0x5c/0x80 [hv_balloon]
>   hot_add_req.cold.24+0x182/0x835 [hv_balloon]
>   ...
> 
> Turns out that we now have deferred struct page initialization for memory
> hotplug so e.g. memory_block_action() in drivers/base/memory.c does
> pages_correctly_probed() check and in that check it avoids inspecting
> struct pages and checks sections instead. But in Hyper-V balloon driver we
> do PageReserved(pfn_to_page()) check and this is now wrong.
> 
> Switch to checking online_section_nr() instead.
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  drivers/hv/hv_balloon.c | 10 ++
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
> index 5301fef16c31..7c6349a50ef1 100644
> --- a/drivers/hv/hv_balloon.c
> +++ b/drivers/hv/hv_balloon.c
> @@ -888,12 +888,14 @@ static unsigned long handle_pg_range(unsigned long 
> pg_start,
>   pfn_cnt -= pgs_ol;
>   /*
>* Check if the corresponding memory block is already
> -  * online by checking its last previously backed page.
> -  * In case it is we need to bring rest (which was not
> -  * backed previously) online too.
> +  * online. It is possible to observe struct pages still
> +  * being uninitialized here so check section instead.
> +  * In case the section is online we need to bring the
> +  * rest of pfns (which were not backed previously)
> +  * online too.
>*/
>   if (start_pfn > has->start_pfn &&
> - !PageReserved(pfn_to_page(start_pfn - 1)))
> + online_section_nr(pfn_to_section_nr(start_pfn)))
>   hv_bring_pgs_online(has, start_pfn, pgs_ol);
>  
>   }
> 

I wonder if you should use pfn_to_online_page() and check for PageOffline().

(I guess online_section_nr() should also do the trick)

-- 

Thanks,

David / dhildenb


Re: [PATCH RFCv2 0/4] mm/memory_hotplug: Introduce memory block types

2018-12-20 Thread David Hildenbrand
On 20.12.18 14:08, Michal Hocko wrote:
> On Thu 20-12-18 13:58:16, David Hildenbrand wrote:
>> On 30.11.18 18:59, David Hildenbrand wrote:
>>> This is the second approach, introducing more meaningful memory block
>>> types and not changing online behavior in the kernel. It is based on
>>> latest linux-next.
>>>
>>> As we found out during discussion, user space should always handle onlining
>>> of memory, in any case. However in order to make smart decisions in user
>>> space about if and how to online memory, we have to export more information
>>> about memory blocks. This way, we can formulate rules in user space.
>>>
>>> One such information is the type of memory block we are talking about.
>>> This helps to answer some questions like:
>>> - Does this memory block belong to a DIMM?
>>> - Can this DIMM theoretically ever be unplugged again?
>>> - Was this memory added by a balloon driver that will rely on balloon
>>>   inflation to remove chunks of that memory again? Which zone is advised?
>>> - Is this special standby memory on s390x that is usually not automatically
>>>   onlined?
>>>
>>> And in short it helps to answer to some extent (excluding zone imbalances)
>>> - Should I online this memory block?
>>> - To which zone should I online this memory block?
>>> ... of course special use cases will result in different answers. But that's
>>> why user space has control of onlining memory.
>>>
>>> More details can be found in Patch 1 and Patch 3.
>>> Tested on x86 with hotplugged DIMMs. Cross-compiled for PPC and s390x.
>>>
>>>
>>> Example:
>>> $ udevadm info -q all -a /sys/devices/system/memory/memory0
>>> KERNEL=="memory0"
>>> SUBSYSTEM=="memory"
>>> DRIVER==""
>>> ATTR{online}=="1"
>>> ATTR{phys_device}=="0"
>>> ATTR{phys_index}==""
>>> ATTR{removable}=="0"
>>> ATTR{state}=="online"
>>> ATTR{type}=="boot"
>>> ATTR{valid_zones}=="none"
>>> $ udevadm info -q all -a /sys/devices/system/memory/memory90
>>> KERNEL=="memory90"
>>> SUBSYSTEM=="memory"
>>> DRIVER==""
>>> ATTR{online}=="1"
>>> ATTR{phys_device}=="0"
>>> ATTR{phys_index}=="005a"
>>> ATTR{removable}=="1"
>>> ATTR{state}=="online"
>>> ATTR{type}=="dimm"
>>> ATTR{valid_zones}=="Normal"
>>>
>>>
>>> RFC -> RFCv2:
>>> - Now also taking care of PPC (somehow missed it :/ )
>>> - Split the series up to some degree (some ideas on how to split up patch 3
>>>   would be very welcome)
>>> - Introduce more memory block types. Turns out abstracting too much was
>>>   rather confusing and not helpful. Properly document them.
>>>
>>> Notes:
>>> - I wanted to convert the enum of types into a named enum but this
>>>   provoked all kinds of different errors. For now, I am doing it just like
>>>   the other types (e.g. online_type) we are using in that context.
>>> - The "removable" property should never have been named like that. It
>>>   should have been "offlinable". Can we still rename that? E.g. boot memory
>>>   is sometimes marked as removable ...
>>>
>>
>>
>> Any feedback regarding the suggested block types would be very much
>> appreciated!
> 
> I still do not like this much to be honest. I just didn't get to think
> through this properly. My fear is that this is conflating an actual API
> with the current implementation and as such will cause problems in
> future. But I haven't really looked into your patches closely so I might
> be wrong. Anyway I won't be able to look into it by the end of year.
> 

I guess as long as we have memory block devices and we expect user space
to make a decision we will have this API and the involved problems.

I am open for alternatives, and as I said, any feedback on how to sort
this out will be highly appreciated.

I'll be on vacation for the next two weeks, so this can wait. Just
wanted to note that I am still interested in feedback :)

-- 

Thanks,

David / dhildenb


Re: [PATCH RFCv2 0/4] mm/memory_hotplug: Introduce memory block types

2018-12-20 Thread David Hildenbrand
On 30.11.18 18:59, David Hildenbrand wrote:
> This is the second approach, introducing more meaningful memory block
> types and not changing online behavior in the kernel. It is based on
> latest linux-next.
> 
> As we found out during discussion, user space should always handle onlining
> of memory, in any case. However in order to make smart decisions in user
> space about if and how to online memory, we have to export more information
> about memory blocks. This way, we can formulate rules in user space.
> 
> One such information is the type of memory block we are talking about.
> This helps to answer some questions like:
> - Does this memory block belong to a DIMM?
> - Can this DIMM theoretically ever be unplugged again?
> - Was this memory added by a balloon driver that will rely on balloon
>   inflation to remove chunks of that memory again? Which zone is advised?
> - Is this special standby memory on s390x that is usually not automatically
>   onlined?
> 
> And in short it helps to answer to some extent (excluding zone imbalances)
> - Should I online this memory block?
> - To which zone should I online this memory block?
> ... of course special use cases will result in different answers. But that's
> why user space has control of onlining memory.
> 
> More details can be found in Patch 1 and Patch 3.
> Tested on x86 with hotplugged DIMMs. Cross-compiled for PPC and s390x.
> 
> 
> Example:
> $ udevadm info -q all -a /sys/devices/system/memory/memory0
>   KERNEL=="memory0"
>   SUBSYSTEM=="memory"
>   DRIVER==""
>   ATTR{online}=="1"
>   ATTR{phys_device}=="0"
>   ATTR{phys_index}==""
>   ATTR{removable}=="0"
>   ATTR{state}=="online"
>   ATTR{type}=="boot"
>   ATTR{valid_zones}=="none"
> $ udevadm info -q all -a /sys/devices/system/memory/memory90
>   KERNEL=="memory90"
>   SUBSYSTEM=="memory"
>   DRIVER==""
>   ATTR{online}=="1"
>   ATTR{phys_device}=="0"
>   ATTR{phys_index}=="005a"
>   ATTR{removable}=="1"
>   ATTR{state}=="online"
>   ATTR{type}=="dimm"
>   ATTR{valid_zones}=="Normal"
> 
> 
> RFC -> RFCv2:
> - Now also taking care of PPC (somehow missed it :/ )
> - Split the series up to some degree (some ideas on how to split up patch 3
>   would be very welcome)
> - Introduce more memory block types. Turns out abstracting too much was
>   rather confusing and not helpful. Properly document them.
> 
> Notes:
> - I wanted to convert the enum of types into a named enum but this
>   provoked all kinds of different errors. For now, I am doing it just like
>   the other types (e.g. online_type) we are using in that context.
> - The "removable" property should never have been named like that. It
>   should have been "offlinable". Can we still rename that? E.g. boot memory
>   is sometimes marked as removable ...
> 


Any feedback regarding the suggested block types would be very much
appreciated!


-- 

Thanks,

David / dhildenb


Re: [PATCH RFCv2 3/4] mm/memory_hotplug: Introduce and use more memory types

2018-12-04 Thread David Hildenbrand
On 04.12.18 10:44, Michal Suchánek wrote:
> On Fri, 30 Nov 2018 18:59:21 +0100
> David Hildenbrand  wrote:
> 
>> Let's introduce new types for different kinds of memory blocks and use
>> them in existing code. As I don't see an easy way to split this up,
>> do it in one hunk for now.
>>
>> acpi:
>>  Use DIMM or DIMM_UNREMOVABLE depending on hotremove support in the kernel.
>>  Properly change the type when trying to add memory that was already
>>  detected and used during boot (so this memory will correctly end up as
>>  "acpi" in user space).
>>
>> pseries:
>>  Use DIMM or DIMM_UNREMOVABLE depending on hotremove support in the kernel.
>>  As far as I see, handling like in the acpi case for existing blocks is
>>  not required.
>>
>> probed memory from user space:
>>  Use DIMM_UNREMOVABLE as there is no interface to get rid of this code
>>  again.
>>
>> hv_balloon,xen/balloon:
>>  Use BALLOON. As simple as that :)
>>
>> s390x/sclp:
>>  Use a dedicated type S390X_STANDBY as this type of memory and its
>>  semantics are very s390x specific.
>>
>> powernv/memtrace:
>>  Only allow to use BOOT memory for memtrace. I consider this code in
>>  general dangerous, but we have to keep it working ... most probably just
>>  a debug feature.
> 
> I don't think it should be arbitrarily restricted like that.
> 

Well, code that "randomly" offlines/onlines/removes/adds memory blocks
that it does not own (hint: nobody else in the kernel does that) should
be restricted to types we can guarantee to work.

> Thanks
> 
> Michal
> 


-- 

Thanks,

David / dhildenb


Re: [PATCH RFCv2 2/4] mm/memory_hotplug: Replace "bool want_memblock" by "int type"

2018-12-03 Thread David Hildenbrand
On 01.12.18 02:50, Wei Yang wrote:
> On Fri, Nov 30, 2018 at 06:59:20PM +0100, David Hildenbrand wrote:
>> Let's pass a memory block type instead. Pass "MEMORY_BLOCK_NONE" for device
>> memory and for now "MEMORY_BLOCK_UNSPECIFIED" for anything else. No
>> functional change.
> 
> I would suggest to put more words to this.

Sure, makes sense, I'll add more details. Thanks!

> 
> "
> Function arch_add_memory()'s last parameter *want_memblock* is used to
> determine whether it is necessary to create a corresponding memory block
> device. After introducing the memory block type, this patch replaces the
> bool type *want_memblock* with memory block type with following rules
> for now:
> 
>   * Pass "MEMORY_BLOCK_NONE" for device memory
>   * Pass "MEMORY_BLOCK_UNSPECIFIED" for anything else 
> 
> Since this parameter is passed deep to __add_section(), all its
> descendants are affected. Below lists those descendants.
> 
>   arch_add_memory()
> add_pages()
>   __add_pages()
> __add_section()
> 
> "

[...]


-- 

Thanks,

David / dhildenb


Re: [PATCH RFCv2 1/4] mm/memory_hotplug: Introduce memory block types

2018-12-03 Thread David Hildenbrand
On 01.12.18 02:25, Wei Yang wrote:
> On Fri, Nov 30, 2018 at 06:59:19PM +0100, David Hildenbrand wrote:
>> Memory onlining should always be handled by user space, because only user
>> space knows which use cases it wants to satisfy. E.g. memory might be
>> onlined to the MOVABLE zone even if it can never be removed from the
>> system, e.g. to make usage of huge pages more reliable.
>>
>> However to implement such rules (especially default rules in distributions)
>> we need more information about the memory that was added in user space.
>>
>> E.g. on x86 we want to online memory provided by balloon devices (e.g.
>> XEN, Hyper-V) differently (-> will not be unplugged by offlining the whole
>> block) than ordinary DIMMs (-> might eventually be unplugged by offlining
>> the whole block). This might also become relevant for other architectures.
>>
>> Also, udev rules right now check if running on s390x and treat all added
>> memory blocks as standby memory (-> don't online automatically). As soon as
>> we support other memory hotplug mechanism (e.g. virtio-mem) checks would
>> have to get more involved (e.g. also check if under KVM) but eventually
>> also wrong (e.g. if KVM ever supports standby memory we are doomed).
>>
>> I decided to allow to specify the type of memory that is getting added
>> to the system. Let's start with two types, BOOT and UNSPECIFIED to get the
>> basic infrastructure running. We'll introduce and use further types in
>> follow-up patches. For now we classify any hotplugged memory temporarily
>> as UNSPECIFIED (which will eventually be dropped later on).
>>
>> Cc: Greg Kroah-Hartman 
>> Cc: "Rafael J. Wysocki" 
>> Cc: Andrew Morton 
>> Cc: Ingo Molnar 
>> Cc: Pavel Tatashin 
>> Cc: Stephen Rothwell 
>> Cc: Andrew Banman 
>> Cc: "mike.tra...@hpe.com" 
>> Cc: Oscar Salvador 
>> Cc: Dave Hansen 
>> Cc: Michal Hocko 
>> Cc: Michal Such??nek 
>> Cc: Vitaly Kuznetsov 
>> Cc: Dan Williams 
>> Cc: Pavel Tatashin 
>> Cc: Martin Schwidefsky 
>> Cc: Heiko Carstens 
>> Signed-off-by: David Hildenbrand 
>> ---
>> drivers/base/memory.c  | 38 +++---
>> include/linux/memory.h | 27 +++
>> 2 files changed, 62 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>> index 0c290f86ab20..17f2985c07c5 100644
>> --- a/drivers/base/memory.c
>> +++ b/drivers/base/memory.c
>> @@ -381,6 +381,29 @@ static ssize_t show_phys_device(struct device *dev,
>>  return sprintf(buf, "%d\n", mem->phys_device);
>> }
>>
>> +static ssize_t type_show(struct device *dev, struct device_attribute *attr,
>> + char *buf)
>> +{
>> +struct memory_block *mem = to_memory_block(dev);
>> +ssize_t len = 0;
>> +
>> +switch (mem->type) {
>> +case MEMORY_BLOCK_UNSPECIFIED:
>> +len = sprintf(buf, "unspecified\n");
>> +break;
>> +case MEMORY_BLOCK_BOOT:
>> +len = sprintf(buf, "boot\n");
>> +break;
>> +default:
>> +len = sprintf(buf, "ERROR-UNKNOWN-%ld\n",
>> +mem->state);
>> +WARN_ON(1);
>> +break;
>> +}
>> +
>> +return len;
>> +}
>> +
>> #ifdef CONFIG_MEMORY_HOTREMOVE
>> static void print_allowed_zone(char *buf, int nid, unsigned long start_pfn,
>>  unsigned long nr_pages, int online_type,
>> @@ -442,6 +465,7 @@ static DEVICE_ATTR(phys_index, 0444, 
>> show_mem_start_phys_index, NULL);
>> static DEVICE_ATTR(state, 0644, show_mem_state, store_mem_state);
>> static DEVICE_ATTR(phys_device, 0444, show_phys_device, NULL);
>> static DEVICE_ATTR(removable, 0444, show_mem_removable, NULL);
>> +static DEVICE_ATTR_RO(type);
> 
> This is correct, while looks not consistent with other attributes.
> 
> Not that beautiful :-)

I might change the other ones first, too (or keep this one consistent with
the existing ones). Thanks!

> 
>>
>> /*
>>  * Block size attribute stuff
>> @@ -620,6 +644,7 @@ static struct attribute *memory_memblk_attrs[] = {
>>  _attr_state.attr,
>>  _attr_phys_device.attr,
>>  _attr_removable.attr,
>> +_attr_type.attr,
>> #ifdef CONFIG_MEMORY_HOTREMOVE
>>  _attr_valid_zones.attr,
>> #endif
>> @@ -657,13 +682,17 @@ int register_memory(struct memory_blo

[PATCH RFCv2 3/4] mm/memory_hotplug: Introduce and use more memory types

2018-11-30 Thread David Hildenbrand
Let's introduce new types for different kinds of memory blocks and use
them in existing code. As I don't see an easy way to split this up,
do it in one hunk for now.

acpi:
 Use DIMM or DIMM_UNREMOVABLE depending on hotremove support in the kernel.
 Properly change the type when trying to add memory that was already
 detected and used during boot (so this memory will correctly end up as
 "acpi" in user space).

pseries:
 Use DIMM or DIMM_UNREMOVABLE depending on hotremove support in the kernel.
 As far as I see, handling like in the acpi case for existing blocks is
 not required.

probed memory from user space:
 Use DIMM_UNREMOVABLE as there is no interface to get rid of this code
 again.

hv_balloon,xen/balloon:
 Use BALLOON. As simple as that :)

s390x/sclp:
 Use a dedicated type S390X_STANDBY as this type of memory and its
 semantics are very s390x specific.

powernv/memtrace:
 Only allow to use BOOT memory for memtrace. I consider this code in
 general dangerous, but we have to keep it working ... most probably just
 a debug feature.

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: Greg Kroah-Hartman 
Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: Rashmica Gupta 
Cc: Andrew Morton 
Cc: Pavel Tatashin 
Cc: Balbir Singh 
Cc: Michael Neuling 
Cc: Nathan Fontenot 
Cc: YueHaibing 
Cc: Vasily Gorbik 
Cc: Ingo Molnar 
Cc: Stephen Rothwell 
Cc: "mike.tra...@hpe.com" 
Cc: Oscar Salvador 
Cc: Joonsoo Kim 
Cc: Mathieu Malaterre 
Cc: Michal Hocko 
Cc: Arun KS 
Cc: Andrew Banman 
Cc: Dave Hansen 
Cc: Michal Suchánek 
Cc: Vitaly Kuznetsov 
Cc: Dan Williams 
Signed-off-by: David Hildenbrand 

---

At first I tried to abstract the types quite a lot, but I think there
are subtle differences that are worth differentiating. More details about
the types can be found in the excessive documentation.

It is worth noting that BALLOON_MOVABLE has no user yet, but I have
something in mind that might want to make use of that (virtio-mem).
Just included it to discuss the general approach. I can drop it from
this patch.
---
 arch/powerpc/platforms/powernv/memtrace.c |  9 ++--
 .../platforms/pseries/hotplug-memory.c|  7 ++-
 drivers/acpi/acpi_memhotplug.c| 16 ++-
 drivers/base/memory.c | 18 ++-
 drivers/hv/hv_balloon.c   |  3 +-
 drivers/s390/char/sclp_cmd.c  |  3 +-
 drivers/xen/balloon.c |  2 +-
 include/linux/memory.h| 47 ++-
 include/linux/memory_hotplug.h|  6 +--
 mm/memory_hotplug.c   | 15 +++---
 10 files changed, 104 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/memtrace.c 
b/arch/powerpc/platforms/powernv/memtrace.c
index 248a38ad25c7..5d08db87091e 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -54,9 +54,9 @@ static const struct file_operations memtrace_fops = {
.open   = simple_open,
 };
 
-static int check_memblock_online(struct memory_block *mem, void *arg)
+static int check_memblock_boot_and_online(struct memory_block *mem, void *arg)
 {
-   if (mem->state != MEM_ONLINE)
+   if (mem->type != MEM_BLOCK_BOOT || mem->state != MEM_ONLINE)
return -1;
 
return 0;
@@ -77,7 +77,7 @@ static bool memtrace_offline_pages(u32 nid, u64 start_pfn, 
u64 nr_pages)
u64 end_pfn = start_pfn + nr_pages - 1;
 
if (walk_memory_range(start_pfn, end_pfn, NULL,
-   check_memblock_online))
+   check_memblock_boot_and_online))
return false;
 
walk_memory_range(start_pfn, end_pfn, (void *)MEM_GOING_OFFLINE,
@@ -233,7 +233,8 @@ static int memtrace_online(void)
ent->mem = 0;
}
 
-   if (add_memory(ent->nid, ent->start, ent->size)) {
+   if (add_memory(ent->nid, ent->start, ent->size,
+  MEMORY_BLOCK_BOOT)) {
pr_err("Failed to add trace memory to node %d\n",
ent->nid);
ret += 1;
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 2a983b5a52e1..5f91359c7993 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -651,7 +651,7 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, 
u32 drc_index)
 static int dlpar_add_lmb(struct drmem_lmb *lmb)
 {
unsigned long block_sz;
-   int nid, rc;
+   int nid, rc, type = MEMORY_BLOCK_DIMM;
 
if (lmb->flags & DRCONF_MEM_ASSI

[PATCH RFCv2 4/4] mm/memory_hotplug: Drop MEMORY_TYPE_UNSPECIFIED

2018-11-30 Thread David Hildenbrand
We now have proper types for all users, so we can drop this one.

Cc: Greg Kroah-Hartman 
Cc: "Rafael J. Wysocki" 
Cc: Andrew Morton 
Cc: Ingo Molnar 
Cc: Pavel Tatashin 
Cc: Stephen Rothwell 
Cc: Andrew Banman 
Cc: "mike.tra...@hpe.com" 
Cc: Oscar Salvador 
Cc: Dave Hansen 
Cc: Michal Hocko 
Cc: Michal Suchánek 
Cc: Vitaly Kuznetsov 
Cc: Dan Williams 
Cc: Pavel Tatashin 
Signed-off-by: David Hildenbrand 
---
 drivers/base/memory.c  | 3 ---
 include/linux/memory.h | 5 -
 2 files changed, 8 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index c5fdca7a3009..a6e524f0ea38 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -388,9 +388,6 @@ static ssize_t type_show(struct device *dev, struct 
device_attribute *attr,
ssize_t len = 0;
 
switch (mem->type) {
-   case MEMORY_BLOCK_UNSPECIFIED:
-   len = sprintf(buf, "unspecified\n");
-   break;
case MEMORY_BLOCK_BOOT:
len = sprintf(buf, "boot\n");
break;
diff --git a/include/linux/memory.h b/include/linux/memory.h
index a3a1e9764805..11679622f743 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -50,10 +50,6 @@ int set_memory_block_size_order(unsigned int order);
  *  No memory block is to be created (e.g. device memory). Not exposed to
  *  user space.
  *
- * MEMORY_BLOCK_UNSPECIFIED:
- *  The type of memory block was not further specified when adding the
- *  memory block.
- *
  * MEMORY_BLOCK_BOOT:
  *  This memory block was added during boot by the basic system. No
  *  specific device driver takes care of this memory block. This memory
@@ -103,7 +99,6 @@ int set_memory_block_size_order(unsigned int order);
  */
 enum {
MEMORY_BLOCK_NONE = 0,
-   MEMORY_BLOCK_UNSPECIFIED,
MEMORY_BLOCK_BOOT,
MEMORY_BLOCK_DIMM,
MEMORY_BLOCK_DIMM_UNREMOVABLE,
-- 
2.17.2



[PATCH RFCv2 1/4] mm/memory_hotplug: Introduce memory block types

2018-11-30 Thread David Hildenbrand
Memory onlining should always be handled by user space, because only user
space knows which use cases it wants to satisfy. E.g. memory might be
onlined to the MOVABLE zone even if it can never be removed from the
system, e.g. to make usage of huge pages more reliable.

However, to implement such rules (especially default rules in distributions),
we need more information in user space about the memory that was added.

E.g. on x86 we want to online memory provided by balloon devices (e.g.
XEN, Hyper-V) differently (-> will not be unplugged by offlining the whole
block) than ordinary DIMMs (-> might eventually be unplugged by offlining
the whole block). This might also become relevant for other architectures.

Also, udev rules right now check if running on s390x and treat all added
memory blocks as standby memory (-> don't online automatically). As soon as
we support other memory hotplug mechanisms (e.g. virtio-mem), the checks would
have to get more involved (e.g. also check if running under KVM) and would
eventually even become wrong (e.g. if KVM ever supports standby memory we are
doomed).

I decided to allow specifying the type of memory that is getting added
to the system. Let's start with two types, BOOT and UNSPECIFIED, to get the
basic infrastructure running. We'll introduce and use further types in
follow-up patches. For now, we temporarily classify any hotplugged memory
as UNSPECIFIED (which will eventually be dropped later on).

Cc: Greg Kroah-Hartman 
Cc: "Rafael J. Wysocki" 
Cc: Andrew Morton 
Cc: Ingo Molnar 
Cc: Pavel Tatashin 
Cc: Stephen Rothwell 
Cc: Andrew Banman 
Cc: "mike.tra...@hpe.com" 
Cc: Oscar Salvador 
Cc: Dave Hansen 
Cc: Michal Hocko 
Cc: Michal Suchánek 
Cc: Vitaly Kuznetsov 
Cc: Dan Williams 
Cc: Pavel Tatashin 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Signed-off-by: David Hildenbrand 
---
 drivers/base/memory.c  | 38 +++---
 include/linux/memory.h | 27 +++
 2 files changed, 62 insertions(+), 3 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 0c290f86ab20..17f2985c07c5 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -381,6 +381,29 @@ static ssize_t show_phys_device(struct device *dev,
return sprintf(buf, "%d\n", mem->phys_device);
 }
 
+static ssize_t type_show(struct device *dev, struct device_attribute *attr,
+char *buf)
+{
+   struct memory_block *mem = to_memory_block(dev);
+   ssize_t len = 0;
+
+   switch (mem->type) {
+   case MEMORY_BLOCK_UNSPECIFIED:
+   len = sprintf(buf, "unspecified\n");
+   break;
+   case MEMORY_BLOCK_BOOT:
+   len = sprintf(buf, "boot\n");
+   break;
+   default:
+   len = sprintf(buf, "ERROR-UNKNOWN-%ld\n",
+   mem->state);
+   WARN_ON(1);
+   break;
+   }
+
+   return len;
+}
+
 #ifdef CONFIG_MEMORY_HOTREMOVE
 static void print_allowed_zone(char *buf, int nid, unsigned long start_pfn,
unsigned long nr_pages, int online_type,
@@ -442,6 +465,7 @@ static DEVICE_ATTR(phys_index, 0444, 
show_mem_start_phys_index, NULL);
 static DEVICE_ATTR(state, 0644, show_mem_state, store_mem_state);
 static DEVICE_ATTR(phys_device, 0444, show_phys_device, NULL);
 static DEVICE_ATTR(removable, 0444, show_mem_removable, NULL);
+static DEVICE_ATTR_RO(type);
 
 /*
  * Block size attribute stuff
@@ -620,6 +644,7 @@ static struct attribute *memory_memblk_attrs[] = {
_attr_state.attr,
_attr_phys_device.attr,
_attr_removable.attr,
+   _attr_type.attr,
 #ifdef CONFIG_MEMORY_HOTREMOVE
_attr_valid_zones.attr,
 #endif
@@ -657,13 +682,17 @@ int register_memory(struct memory_block *memory)
 }
 
 static int init_memory_block(struct memory_block **memory,
-struct mem_section *section, unsigned long state)
+struct mem_section *section, unsigned long state,
+int type)
 {
struct memory_block *mem;
unsigned long start_pfn;
int scn_nr;
int ret = 0;
 
+   if (type == MEMORY_BLOCK_NONE)
+   return -EINVAL;
+
mem = kzalloc(sizeof(*mem), GFP_KERNEL);
if (!mem)
return -ENOMEM;
@@ -675,6 +704,7 @@ static int init_memory_block(struct memory_block **memory,
mem->state = state;
start_pfn = section_nr_to_pfn(mem->start_section_nr);
mem->phys_device = arch_get_memory_phys_device(start_pfn);
+   mem->type = type;
 
ret = register_memory(mem);
 
@@ -699,7 +729,8 @@ static int add_memory_block(int base_section_nr)
 
if (section_count == 0)
return 0;
-   ret = init_memory_block(, __nr_to_section(section_nr), MEM_ONLINE);
+   ret = init_memory_block(, __nr_to_section(se

[PATCH RFCv2 2/4] mm/memory_hotplug: Replace "bool want_memblock" by "int type"

2018-11-30 Thread David Hildenbrand
Let's pass a memory block type instead. Pass "MEMORY_BLOCK_NONE" for device
memory and for now "MEMORY_BLOCK_UNSPECIFIED" for anything else. No
functional change.
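
Purely for illustration (this helper is not part of the patch, the call
sites pass the constants directly), the mapping performed at every
converted call site is:

static inline int memory_block_type_from_bool(bool want_memblock)
{
        return want_memblock ? MEMORY_BLOCK_UNSPECIFIED : MEMORY_BLOCK_NONE;
}

The per-arch arch_add_memory() implementations below just forward the value
to __add_pages(); kernel/memremap.c (device memory) is the caller that ends
up passing MEMORY_BLOCK_NONE.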

Cc: Tony Luck 
Cc: Fenghua Yu 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: Yoshinori Sato 
Cc: Rich Felker 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: x...@kernel.org
Cc: Greg Kroah-Hartman 
Cc: "Rafael J. Wysocki" 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: Michal Hocko 
Cc: Dan Williams 
Cc: "Kirill A. Shutemov" 
Cc: Oscar Salvador 
Cc: Nicholas Piggin 
Cc: Stephen Rothwell 
Cc: Christophe Leroy 
Cc: "Jonathan Neuschäfer" 
Cc: Mauricio Faria de Oliveira 
Cc: Vasily Gorbik 
Cc: Arun KS 
Cc: Rob Herring 
Cc: Pavel Tatashin 
Cc: "mike.tra...@hpe.com" 
Cc: Joonsoo Kim 
Cc: Wei Yang 
Cc: Logan Gunthorpe 
Cc: "Jérôme Glisse" 
Cc: "Jan H. Schönherr" 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Mathieu Malaterre 
Signed-off-by: David Hildenbrand 
---
 arch/ia64/mm/init.c|  4 ++--
 arch/powerpc/mm/mem.c  |  4 ++--
 arch/s390/mm/init.c|  4 ++--
 arch/sh/mm/init.c  |  4 ++--
 arch/x86/mm/init_32.c  |  4 ++--
 arch/x86/mm/init_64.c  |  8 
 drivers/base/memory.c  | 11 +++
 include/linux/memory.h |  2 +-
 include/linux/memory_hotplug.h | 12 ++--
 kernel/memremap.c  |  6 --
 mm/memory_hotplug.c| 16 
 11 files changed, 40 insertions(+), 35 deletions(-)

diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index 904fe55e10fc..408635d2902f 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -646,13 +646,13 @@ mem_init (void)
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-   bool want_memblock)
+   int type)
 {
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
int ret;
 
-   ret = __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+   ret = __add_pages(nid, start_pfn, nr_pages, altmap, type);
if (ret)
printk("%s: Problem encountered in __add_pages() as ret=%d\n",
   __func__,  ret);
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index b3c9ee5c4f78..e394637da270 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -118,7 +118,7 @@ int __weak remove_section_mapping(unsigned long start, 
unsigned long end)
 }
 
 int __meminit arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap 
*altmap,
-   bool want_memblock)
+ int type)
 {
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
@@ -135,7 +135,7 @@ int __meminit arch_add_memory(int nid, u64 start, u64 size, 
struct vmem_altmap *
}
flush_inval_dcache_range(start, start + size);
 
-   return __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+   return __add_pages(nid, start_pfn, nr_pages, altmap, type);
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index 3e82f66d5c61..ba2c56328e6d 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -225,7 +225,7 @@ device_initcall(s390_cma_mem_init);
 #endif /* CONFIG_CMA */
 
 int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-   bool want_memblock)
+   int type)
 {
unsigned long start_pfn = PFN_DOWN(start);
unsigned long size_pages = PFN_DOWN(size);
@@ -235,7 +235,7 @@ int arch_add_memory(int nid, u64 start, u64 size, struct 
vmem_altmap *altmap,
if (rc)
return rc;
 
-   rc = __add_pages(nid, start_pfn, size_pages, altmap, want_memblock);
+   rc = __add_pages(nid, start_pfn, size_pages, altmap, type);
if (rc)
vmem_remove_mapping(start, size);
return rc;
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index 1a483a008872..5fbb8724e0f2 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -419,14 +419,14 @@ void free_initrd_mem(unsigned long start, unsigned long 
end)
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-   bool want_memblock)
+   int type)
 {
unsigned long start_pfn = PFN_DOWN(start);
unsigned long nr_pages = size >> PAGE_SHIFT;
int ret;
 
/* We only have ZONE_NORMAL, so this is easy.. */
-   ret = __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+   ret = __add_pages(nid, start_pfn, nr_pages, altmap, typ

[PATCH RFCv2 0/4] mm/memory_hotplug: Introduce memory block types

2018-11-30 Thread David Hildenbrand
This is the second approach, introducing more meaningful memory block
types and not changing online behavior in the kernel. It is based on
latest linux-next.

As we found out during discussion, user space should always handle onlining
of memory, in any case. However, in order to make smart decisions in user
space about if and how to online memory, we have to export more information
about memory blocks. This way, we can formulate rules in user space.

One such piece of information is the type of memory block we are talking
about. This helps to answer questions like:
- Does this memory block belong to a DIMM?
- Can this DIMM theoretically ever be unplugged again?
- Was this memory added by a balloon driver that will rely on balloon
  inflation to remove chunks of that memory again? Which zone is advised?
- Is this special standby memory on s390x that is usually not automatically
  onlined?

And in short it helps to answer, to some extent (excluding zone imbalances):
- Should I online this memory block?
- To which zone should I online this memory block?
... of course special use cases will result in different answers. But that's
why user space has control of onlining memory.
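
To make that concrete, a minimal user space helper could look like the
sketch below. This is illustration only: the "type" attribute and the
"boot"/"dimm" strings are taken from the example further down, while the
exact string used for s390x standby memory is an assumption on my side.

#include <stdio.h>
#include <string.h>

/* decide whether a udev helper should auto-online a memory block */
static int should_auto_online(const char *block_dir)
{
        char path[256], type[64] = "";
        FILE *f;

        snprintf(path, sizeof(path), "%s/type", block_dir);
        f = fopen(path, "r");
        if (!f)
                return 1;       /* old kernel without types: keep old behavior */
        if (fgets(type, sizeof(type), f))
                type[strcspn(type, "\n")] = '\0';
        fclose(f);

        /* e.g. never auto-online standby memory, online everything else */
        return strcmp(type, "s390x-standby") != 0;
}

int main(void)
{
        printf("%d\n", should_auto_online("/sys/devices/system/memory/memory90"));
        return 0;
}

Such a rule could additionally online "dimm" blocks to the MOVABLE zone and
everything else to NORMAL -- exactly the kind of policy that is hard to
express reliably today.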

More details can be found in Patch 1 and Patch 3.
Tested on x86 with hotplugged DIMMs. Cross-compiled for PPC and s390x.


Example:
$ udevadm info -q all -a /sys/devices/system/memory/memory0
KERNEL=="memory0"
SUBSYSTEM=="memory"
DRIVER==""
ATTR{online}=="1"
ATTR{phys_device}=="0"
ATTR{phys_index}==""
ATTR{removable}=="0"
ATTR{state}=="online"
ATTR{type}=="boot"
ATTR{valid_zones}=="none"
$ udevadm info -q all -a /sys/devices/system/memory/memory90
KERNEL=="memory90"
SUBSYSTEM=="memory"
DRIVER==""
ATTR{online}=="1"
ATTR{phys_device}=="0"
ATTR{phys_index}=="005a"
ATTR{removable}=="1"
ATTR{state}=="online"
ATTR{type}=="dimm"
ATTR{valid_zones}=="Normal"


RFC -> RFCv2:
- Now also taking care of PPC (somehow missed it :/ )
- Split the series up to some degree (some ideas on how to split up patch 3
  would be very welcome)
- Introduce more memory block types. Turns out abstracting too much was
  rather confusing and not helpful. Properly document them.

Notes:
- I wanted to convert the enum of types into a named enum but this
  provoked all kinds of different errors. For now, I am doing it just like
  the other types (e.g. online_type) we are using in that context.
- The "removable" property should never have been named like that. It
  should have been "offlinable". Can we still rename that? E.g. boot memory
  is sometimes marked as removable ...

David Hildenbrand (4):
  mm/memory_hotplug: Introduce memory block types
  mm/memory_hotplug: Replace "bool want_memblock" by "int type"
  mm/memory_hotplug: Introduce and use more memory types
  mm/memory_hotplug: Drop MEMORY_TYPE_UNSPECIFIED

 arch/ia64/mm/init.c   |  4 +-
 arch/powerpc/mm/mem.c |  4 +-
 arch/powerpc/platforms/powernv/memtrace.c |  9 +--
 .../platforms/pseries/hotplug-memory.c|  7 +-
 arch/s390/mm/init.c   |  4 +-
 arch/sh/mm/init.c |  4 +-
 arch/x86/mm/init_32.c |  4 +-
 arch/x86/mm/init_64.c |  8 +--
 drivers/acpi/acpi_memhotplug.c| 16 -
 drivers/base/memory.c | 60 ++--
 drivers/hv/hv_balloon.c   |  3 +-
 drivers/s390/char/sclp_cmd.c  |  3 +-
 drivers/xen/balloon.c |  2 +-
 include/linux/memory.h| 69 ++-
 include/linux/memory_hotplug.h| 18 ++---
 kernel/memremap.c |  6 +-
 mm/memory_hotplug.c   | 29 
 17 files changed, 194 insertions(+), 56 deletions(-)

-- 
2.17.2



Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-11-27 Thread David Hildenbrand
On 27.11.18 17:32, Michal Suchánek wrote:
> On Mon, 26 Nov 2018 16:59:14 +0100
> David Hildenbrand  wrote:
> 
>> On 26.11.18 15:20, Michal Suchánek wrote:
>>> On Mon, 26 Nov 2018 14:33:29 +0100
>>> David Hildenbrand  wrote:
>>>   
>>>> On 26.11.18 13:30, David Hildenbrand wrote:  
>>>>> On 23.11.18 19:06, Michal Suchánek wrote:
>>>   
>>>>>>
>>>>>> If we are going to fake the driver information we may as well add the
>>>>>> type attribute and be done with it.
>>>>>>
>>>>>> I think the problem with the patch was more with the semantic than the
>>>>>> attribute itself.
>>>>>>
>>>>>> What is normal, paravirtualized, and standby memory?
>>>>>>
>>>>>> I can understand DIMM device, baloon device, or whatever mechanism for
>>>>>> adding memory you might have.
>>>>>>
>>>>>> I can understand "memory designated as standby by the cluster
>>>>>> administrator".
>>>>>>
>>>>>> However, DIMM vs baloon is orthogonal to standby and should not be
>>>>>> conflated into one property.
>>>>>>
>>>>>> paravirtualized means nothing at all in relationship to memory type and
>>>>>> the desired online policy to me.
>>>>>
>>>>> Right, so with whatever we come up, it should allow to make a decision
>>>>> in user space about
>>>>> - if memory is to be onlined automatically
>>>>
>>>> And I will think about if we really should model standby memory. Maybe
>>>> it is really better to have in user space something like (as Dan noted)  
>>>
>>> If it is possible to designate the memory as standby or online in the
>>> s390 admin interface and the kernel does have access to this
>>> information it makes sense to forward it to userspace (as separate
>>> s390-specific property). If not then you need to make some kind of
>>> assumption like below and the user can tune the script according to
>>> their usecase.  
>>
>> Also true, standby memory really represents a distinct type of memory
>> block (memory seems to be there but really isn't). Right now I am
>> thinking about something like this (tried to formulate it on a very
>> generic level because we can't predict which mechanism might want to
>> make use of these types in the future).
>>
>>
>> /*
>>  * Memory block types allow user space to formulate rules if and how to
>>  * online memory blocks. The types are exposed to user space as text
>>  * strings in sysfs. While the typical online strategies are described
>>  * along with the types, there are use cases where that can differ (e.g.
>>  * use MOVABLE zone for more reliable huge page usage, use NORMAL zone
>>  * due to zone imbalance or because memory unplug is not intended).
>>  *
>>  * MEMORY_BLOCK_NONE:
>>  *  No memory block is to be created (e.g. device memory). Used internally
>>  *  only.
>>  *
>>  * MEMORY_BLOCK_REMOVABLE:
>>  *  This memory block type should be treated as if it can be
>>  *  removed/unplugged from the system again. E.g. there is a hardware
>>  *  interface to unplug such memory. This memory block type is usually
>>  *  onlined to the MOVABLE zone, to e.g. make offlining of it more
>>  *  reliable. Examples include ACPI and PPC DIMMs.
>>  *
>>  * MEMORY_BLOCK_UNREMOVABLE:
>>  *  This memory block type should be treated as if it can not be
>>  *  removed/unplugged again. E.g. there is no hardware interface to
>>  *  unplug such memory. This memory block type is usually onlined to
>>  *  the NORMAL zone, as offlining is not beneficial. Examples include boot
>>  *  memory on most architectures and memory added via balloon devices.
> 
> AFAIK baloon device can be inflated as well so this does not really
> describe how this memory type works in any meaningful way. Also it
> should not be possible to see this kind of memory from userspace. The
> baloon driver just takes existing memory that is properly backed,
> allocates it for itself, and allows the hypervisor to use it. Thus it
> creates the equivalent to s390 standby memory which is not backed in
> the VM. When memory is reclaimed from hypervisor the baloon driver
> frees it making it available to the VM kernel again. However, the whole
> time the memory appears present in the machine and no hotplug events
> should be visible unl

Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-11-26 Thread David Hildenbrand
On 26.11.18 15:20, Michal Suchánek wrote:
> On Mon, 26 Nov 2018 14:33:29 +0100
> David Hildenbrand  wrote:
> 
>> On 26.11.18 13:30, David Hildenbrand wrote:
>>> On 23.11.18 19:06, Michal Suchánek wrote:  
> 
>>>>
>>>> If we are going to fake the driver information we may as well add the
>>>> type attribute and be done with it.
>>>>
>>>> I think the problem with the patch was more with the semantic than the
>>>> attribute itself.
>>>>
>>>> What is normal, paravirtualized, and standby memory?
>>>>
>>>> I can understand DIMM device, baloon device, or whatever mechanism for
>>>> adding memory you might have.
>>>>
>>>> I can understand "memory designated as standby by the cluster
>>>> administrator".
>>>>
>>>> However, DIMM vs baloon is orthogonal to standby and should not be
>>>> conflated into one property.
>>>>
>>>> paravirtualized means nothing at all in relationship to memory type and
>>>> the desired online policy to me.  
>>>
>>> Right, so with whatever we come up, it should allow to make a decision
>>> in user space about
>>> - if memory is to be onlined automatically  
>>
>> And I will think about if we really should model standby memory. Maybe
>> it is really better to have in user space something like (as Dan noted)
> 
> If it is possible to designate the memory as standby or online in the
> s390 admin interface and the kernel does have access to this
> information it makes sense to forward it to userspace (as separate
> s390-specific property). If not then you need to make some kind of
> assumption like below and the user can tune the script according to
> their usecase.

Also true, standby memory really represents a distinct type of memory
block (memory seems to be there but really isn't). Right now I am
thinking about something like this (tried to formulate it on a very
generic level because we can't predict which mechanism might want to
make use of these types in the future).


/*
 * Memory block types allow user space to formulate rules if and how to
 * online memory blocks. The types are exposed to user space as text
 * strings in sysfs. While the typical online strategies are described
 * along with the types, there are use cases where that can differ (e.g.
 * use MOVABLE zone for more reliable huge page usage, use NORMAL zone
 * due to zone imbalance or because memory unplug is not intended).
 *
 * MEMORY_BLOCK_NONE:
 *  No memory block is to be created (e.g. device memory). Used internally
 *  only.
 *
 * MEMORY_BLOCK_REMOVABLE:
 *  This memory block type should be treated as if it can be
 *  removed/unplugged from the system again. E.g. there is a hardware
 *  interface to unplug such memory. This memory block type is usually
 *  onlined to the MOVABLE zone, to e.g. make offlining of it more
 *  reliable. Examples include ACPI and PPC DIMMs.
 *
 * MEMORY_BLOCK_UNREMOVABLE:
 *  This memory block type should be treated as if it can not be
 *  removed/unplugged again. E.g. there is no hardware interface to
 *  unplug such memory. This memory block type is usually onlined to
 *  the NORMAL zone, as offlining is not beneficial. Examples include boot
 *  memory on most architectures and memory added via balloon devices.
 *
 * MEMORY_BLOCK_STANDBY:
 *  The memory block type should be treated as if it can be
 *  removed/unplugged again, however the actual memory hot(un)plug is
 *  performed by onlining/offlining. In virtual environments, such memory
 *  is usually added during boot and never removed. Onlining memory will
 *  result in memory getting allocated to a VM. This memory type is usually
 *  not onlined automatically but explicitly by the administrator. One
 *  example is standby memory on s390x.
 */
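
As a plain enum this would simply be (sketch only, names as proposed above,
values not carved in stone):

enum {
        MEMORY_BLOCK_NONE = 0,
        MEMORY_BLOCK_REMOVABLE,
        MEMORY_BLOCK_UNREMOVABLE,
        MEMORY_BLOCK_STANDBY,
};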

> 
>>
>> if (isS390x() && type == "dimm") {
>>  /* don't online, on s390x system DIMMs are standby memory */
>> }
> 
> Thanks
> 
> Michal
> 


-- 

Thanks,

David / dhildenb


Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-11-26 Thread David Hildenbrand
On 26.11.18 13:30, David Hildenbrand wrote:
> On 23.11.18 19:06, Michal Suchánek wrote:
>> On Fri, 23 Nov 2018 12:13:58 +0100
>> David Hildenbrand  wrote:
>>
>>> On 28.09.18 17:03, David Hildenbrand wrote:
>>>> How to/when to online hotplugged memory is hard to manage for
>>>> distributions because different memory types are to be treated differently.
>>>> Right now, we need complicated udev rules that e.g. check if we are
>>>> running on s390x, on a physical system or on a virtualized system. But
>>>> there is also sometimes the demand to really online memory immediately
>>>> while adding in the kernel and not to wait for user space to make a
>>>> decision. And on virtualized systems there might be different
>>>> requirements, depending on "how" the memory was added (and if it will
>>>> eventually get unplugged again - DIMM vs. paravirtualized mechanisms).
>>>>
>>>> On the one hand, we have physical systems where we sometimes
>>>> want to be able to unplug memory again - e.g. a DIMM - so we have to online
>>>> it to the MOVABLE zone optionally. That decision is usually made in user
>>>> space.
>>>>
>>>> On the other hand, we have memory that should never be onlined
>>>> automatically, only when asked for by an administrator. Such memory only
>>>> applies to virtualized environments like s390x, where the concept of
>>>> "standby" memory exists. Memory is detected and added during boot, so it
>>>> can be onlined when requested by the admininistrator or some tooling.
>>>> Only when onlining, memory will be allocated in the hypervisor.
>>>>
>>>> But then, we also have paravirtualized devices (namely xen and hyper-v
>>>> balloons), that hotplug memory that will never ever be removed from a
>>>> system right now using offline_pages/remove_memory. If at all, this memory
>>>> is logically unplugged and handed back to the hypervisor via ballooning.
>>>>
>>>> For paravirtualized devices it is relevant that memory is onlined as
>>>> quickly as possible after adding - and that it is added to the NORMAL
>>>> zone. Otherwise, it could happen that too much memory in a row is added
>>>> (but not onlined), resulting in out-of-memory conditions due to the
>>>> additional memory for "struct pages" and friends. MOVABLE zone as well
>>>> as delays might be very problematic and lead to crashes (e.g. zone
>>>> imbalance).
>>>>
>>>> Therefore, introduce memory block types and online memory depending on
>>>> it when adding the memory. Expose the memory type to user space, so user
>>>> space handlers can start to process only "normal" memory. Other memory
>>>> block types can be ignored. One thing less to worry about in user space.
>>>>   
>>>
>>> So I was looking into alternatives.
>>>
>>> 1. Provide only "normal" and "standby" memory types to user space. This
>>> way user space can make smarter decisions about how to online memory.
>>> Not really sure if this is the right way to go.
>>>
>>>
>>> 2. Use device driver information (as mentioned by Michal S.).
>>>
>>> The problem right now is that there are no drivers for memory block
>>> devices. The "memory" subsystem has no drivers, so the KOBJ_ADD uevent
>>> will not contain a "DRIVER" information and we ave no idea what kind of
>>> memory block device we hold in our hands.
>>>
>>> $ udevadm info -q all -a /sys/devices/system/memory/memory0
>>>
>>>   looking at device '/devices/system/memory/memory0':
>>> KERNEL=="memory0"
>>> SUBSYSTEM=="memory"
>>> DRIVER==""
>>> ATTR{online}=="1"
>>> ATTR{phys_device}=="0"
>>> ATTR{phys_index}==""
>>> ATTR{removable}=="0"
>>> ATTR{state}=="online"
>>> ATTR{valid_zones}=="none"
>>>
>>>
>>> If we would provide "fake" drivers for the memory block devices we want
>>> to treat in a special way in user space (e.g. standby memory on s390x),
>>> user space could use that information to make smarter decisions.
>>>
>>> Adding such drivers might work. My suggestion would be to let ordinary
>>> DIMMs be without a driver for n

Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-11-26 Thread David Hildenbrand
On 23.11.18 19:06, Michal Suchánek wrote:
> On Fri, 23 Nov 2018 12:13:58 +0100
> David Hildenbrand  wrote:
> 
>> On 28.09.18 17:03, David Hildenbrand wrote:
>>> How to/when to online hotplugged memory is hard to manage for
>>> distributions because different memory types are to be treated differently.
>>> Right now, we need complicated udev rules that e.g. check if we are
>>> running on s390x, on a physical system or on a virtualized system. But
>>> there is also sometimes the demand to really online memory immediately
>>> while adding in the kernel and not to wait for user space to make a
>>> decision. And on virtualized systems there might be different
>>> requirements, depending on "how" the memory was added (and if it will
>>> eventually get unplugged again - DIMM vs. paravirtualized mechanisms).
>>>
>>> On the one hand, we have physical systems where we sometimes
>>> want to be able to unplug memory again - e.g. a DIMM - so we have to online
>>> it to the MOVABLE zone optionally. That decision is usually made in user
>>> space.
>>>
>>> On the other hand, we have memory that should never be onlined
>>> automatically, only when asked for by an administrator. Such memory only
>>> applies to virtualized environments like s390x, where the concept of
>>> "standby" memory exists. Memory is detected and added during boot, so it
>>> can be onlined when requested by the admininistrator or some tooling.
>>> Only when onlining, memory will be allocated in the hypervisor.
>>>
>>> But then, we also have paravirtualized devices (namely xen and hyper-v
>>> balloons), that hotplug memory that will never ever be removed from a
>>> system right now using offline_pages/remove_memory. If at all, this memory
>>> is logically unplugged and handed back to the hypervisor via ballooning.
>>>
>>> For paravirtualized devices it is relevant that memory is onlined as
>>> quickly as possible after adding - and that it is added to the NORMAL
>>> zone. Otherwise, it could happen that too much memory in a row is added
>>> (but not onlined), resulting in out-of-memory conditions due to the
>>> additional memory for "struct pages" and friends. MOVABLE zone as well
>>> as delays might be very problematic and lead to crashes (e.g. zone
>>> imbalance).
>>>
>>> Therefore, introduce memory block types and online memory depending on
>>> it when adding the memory. Expose the memory type to user space, so user
>>> space handlers can start to process only "normal" memory. Other memory
>>> block types can be ignored. One thing less to worry about in user space.
>>>   
>>
>> So I was looking into alternatives.
>>
>> 1. Provide only "normal" and "standby" memory types to user space. This
>> way user space can make smarter decisions about how to online memory.
>> Not really sure if this is the right way to go.
>>
>>
>> 2. Use device driver information (as mentioned by Michal S.).
>>
>> The problem right now is that there are no drivers for memory block
>> devices. The "memory" subsystem has no drivers, so the KOBJ_ADD uevent
>> will not contain a "DRIVER" information and we ave no idea what kind of
>> memory block device we hold in our hands.
>>
>> $ udevadm info -q all -a /sys/devices/system/memory/memory0
>>
>>   looking at device '/devices/system/memory/memory0':
>> KERNEL=="memory0"
>> SUBSYSTEM=="memory"
>> DRIVER==""
>> ATTR{online}=="1"
>> ATTR{phys_device}=="0"
>> ATTR{phys_index}==""
>> ATTR{removable}=="0"
>> ATTR{state}=="online"
>> ATTR{valid_zones}=="none"
>>
>>
>> If we would provide "fake" drivers for the memory block devices we want
>> to treat in a special way in user space (e.g. standby memory on s390x),
>> user space could use that information to make smarter decisions.
>>
>> Adding such drivers might work. My suggestion would be to let ordinary
>> DIMMs be without a driver for now and only special case standby memory
>> and eventually paravirtualized memory devices (XEN and Hyper-V).
>>
>> Any thoughts?
> 
> If we are going to fake the driver information we may as well add the
> type attribute and be done with it.
> 
> I think the problem with the patch was more with the semantic than the
> att

Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-11-23 Thread David Hildenbrand
On 28.09.18 17:03, David Hildenbrand wrote:
> How to/when to online hotplugged memory is hard to manage for
> distributions because different memory types are to be treated differently.
> Right now, we need complicated udev rules that e.g. check if we are
> running on s390x, on a physical system or on a virtualized system. But
> there is also sometimes the demand to really online memory immediately
> while adding in the kernel and not to wait for user space to make a
> decision. And on virtualized systems there might be different
> requirements, depending on "how" the memory was added (and if it will
> eventually get unplugged again - DIMM vs. paravirtualized mechanisms).
> 
> On the one hand, we have physical systems where we sometimes
> want to be able to unplug memory again - e.g. a DIMM - so we have to online
> it to the MOVABLE zone optionally. That decision is usually made in user
> space.
> 
> On the other hand, we have memory that should never be onlined
> automatically, only when asked for by an administrator. Such memory only
> applies to virtualized environments like s390x, where the concept of
> "standby" memory exists. Memory is detected and added during boot, so it
> can be onlined when requested by the admininistrator or some tooling.
> Only when onlining, memory will be allocated in the hypervisor.
> 
> But then, we also have paravirtualized devices (namely xen and hyper-v
> balloons), that hotplug memory that will never ever be removed from a
> system right now using offline_pages/remove_memory. If at all, this memory
> is logically unplugged and handed back to the hypervisor via ballooning.
> 
> For paravirtualized devices it is relevant that memory is onlined as
> quickly as possible after adding - and that it is added to the NORMAL
> zone. Otherwise, it could happen that too much memory in a row is added
> (but not onlined), resulting in out-of-memory conditions due to the
> additional memory for "struct pages" and friends. MOVABLE zone as well
> as delays might be very problematic and lead to crashes (e.g. zone
> imbalance).
> 
> Therefore, introduce memory block types and online memory depending on
> it when adding the memory. Expose the memory type to user space, so user
> space handlers can start to process only "normal" memory. Other memory
> block types can be ignored. One thing less to worry about in user space.
> 

So I was looking into alternatives.

1. Provide only "normal" and "standby" memory types to user space. This
way user space can make smarter decisions about how to online memory.
Not really sure if this is the right way to go.


2. Use device driver information (as mentioned by Michal S.).

The problem right now is that there are no drivers for memory block
devices. The "memory" subsystem has no drivers, so the KOBJ_ADD uevent
will not contain any "DRIVER" information and we have no idea what kind of
memory block device we hold in our hands.

$ udevadm info -q all -a /sys/devices/system/memory/memory0

  looking at device '/devices/system/memory/memory0':
KERNEL=="memory0"
SUBSYSTEM=="memory"
DRIVER==""
ATTR{online}=="1"
ATTR{phys_device}=="0"
ATTR{phys_index}==""
ATTR{removable}=="0"
ATTR{state}=="online"
ATTR{valid_zones}=="none"


If we were to provide "fake" drivers for the memory block devices we want
to treat in a special way in user space (e.g. standby memory on s390x),
user space could use that information to make smarter decisions.

Adding such drivers might work. My suggestion would be to let ordinary
DIMMs be without a driver for now and only special case standby memory
and eventually paravirtualized memory devices (XEN and Hyper-V).

Any thoughts?


-- 

Thanks,

David / dhildenb


[PATCH v2] makedumpfile: exclude pages that are logically offline

2018-11-22 Thread David Hildenbrand
Linux marks pages that are logically offline via a page flag (map count).
Such pages e.g. include pages inflated as part of a balloon driver or
pages that were not actually onlined when onlining the whole section.

While the hypervisor usually allows reading such inflated memory, we
basically read and dump data that is completely irrelevant. Also, this
might result in quite some overhead in the hypervisor. In addition, we saw
some problems under Hyper-V, whereby dumping can crash the kernel when
reading memory of a partially onlined memory segment (for memory added by
the Hyper-V balloon driver).

Therefore, don't read and dump pages that are marked as being logically
offline.

Signed-off-by: David Hildenbrand 
---

v1 -> v2:
- Fix PAGE_BUDDY_MAPCOUNT_VALUE vs. PAGE_OFFLINE_MAPCOUNT_VALUE

 makedumpfile.c | 34 ++
 makedumpfile.h |  1 +
 2 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/makedumpfile.c b/makedumpfile.c
index 8923538..a5f2ea9 100644
--- a/makedumpfile.c
+++ b/makedumpfile.c
@@ -88,6 +88,7 @@ mdf_pfn_t pfn_cache_private;
 mdf_pfn_t pfn_user;
 mdf_pfn_t pfn_free;
 mdf_pfn_t pfn_hwpoison;
+mdf_pfn_t pfn_offline;
 
 mdf_pfn_t num_dumped;
 
@@ -249,6 +250,21 @@ isHugetlb(unsigned long dtor)
 && (SYMBOL(free_huge_page) == dtor));
 }
 
+static int
+isOffline(unsigned long flags, unsigned int _mapcount)
+{
+   if (NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE) == NOT_FOUND_NUMBER)
+   return FALSE;
+
+   if (flags & (1UL << NUMBER(PG_slab)))
+   return FALSE;
+
+   if (_mapcount == (int)NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE))
+   return TRUE;
+
+   return FALSE;
+}
+
 static int
 is_cache_page(unsigned long flags)
 {
@@ -2287,6 +2303,8 @@ write_vmcoreinfo_data(void)
WRITE_NUMBER("PG_hwpoison", PG_hwpoison);
 
WRITE_NUMBER("PAGE_BUDDY_MAPCOUNT_VALUE", PAGE_BUDDY_MAPCOUNT_VALUE);
+   WRITE_NUMBER("PAGE_OFFLINE_MAPCOUNT_VALUE",
+PAGE_OFFLINE_MAPCOUNT_VALUE);
WRITE_NUMBER("phys_base", phys_base);
 
WRITE_NUMBER("HUGETLB_PAGE_DTOR", HUGETLB_PAGE_DTOR);
@@ -2687,6 +2705,7 @@ read_vmcoreinfo(void)
READ_SRCFILE("pud_t", pud_t);
 
READ_NUMBER("PAGE_BUDDY_MAPCOUNT_VALUE", PAGE_BUDDY_MAPCOUNT_VALUE);
+   READ_NUMBER("PAGE_OFFLINE_MAPCOUNT_VALUE", PAGE_OFFLINE_MAPCOUNT_VALUE);
READ_NUMBER("phys_base", phys_base);
 #ifdef __aarch64__
READ_NUMBER("VA_BITS", VA_BITS);
@@ -6041,6 +6060,12 @@ __exclude_unnecessary_pages(unsigned long mem_map,
else if (isHWPOISON(flags)) {
pfn_counter = _hwpoison;
}
+   /*
+* Exclude pages that are logically offline.
+*/
+   else if (isOffline(flags, _mapcount)) {
+   pfn_counter = _offline;
+   }
/*
 * Unexcludable page
 */
@@ -7522,7 +7547,7 @@ write_elf_pages_cyclic(struct cache_data *cd_header, 
struct cache_data *cd_page)
 */
if (info->flag_cyclic) {
pfn_zero = pfn_cache = pfn_cache_private = 0;
-   pfn_user = pfn_free = pfn_hwpoison = 0;
+   pfn_user = pfn_free = pfn_hwpoison = pfn_offline = 0;
pfn_memhole = info->max_mapnr;
}
 
@@ -8804,7 +8829,7 @@ write_kdump_pages_and_bitmap_cyclic(struct cache_data 
*cd_header, struct cache_d
 * Reset counter for debug message.
 */
pfn_zero = pfn_cache = pfn_cache_private = 0;
-   pfn_user = pfn_free = pfn_hwpoison = 0;
+   pfn_user = pfn_free = pfn_hwpoison = pfn_offline = 0;
pfn_memhole = info->max_mapnr;
 
/*
@@ -9749,7 +9774,7 @@ print_report(void)
pfn_original = info->max_mapnr - pfn_memhole;
 
pfn_excluded = pfn_zero + pfn_cache + pfn_cache_private
-   + pfn_user + pfn_free + pfn_hwpoison;
+   + pfn_user + pfn_free + pfn_hwpoison + pfn_offline;
shrinking = (pfn_original - pfn_excluded) * 100;
shrinking = shrinking / pfn_original;
 
@@ -9763,6 +9788,7 @@ print_report(void)
REPORT_MSG("User process data pages : 0x%016llx\n", pfn_user);
REPORT_MSG("Free pages  : 0x%016llx\n", pfn_free);
REPORT_MSG("Hwpoison pages  : 0x%016llx\n", pfn_hwpoison);
+   REPORT_MSG("Offline pages   : 0x%016llx\n", pfn_offline);
REPORT_MSG("  Remaining pages  : 0x%016llx\n",
pfn_original - pfn_excluded);
REPORT_MSG("  (The number of pages is reduced to %lld%%.)\n",
@@ -9790,7 +9816,7 @@ print_mem_usage(void)
pfn_original = info->max_map

[PATCH v2 6/8] vmw_balloon: mark inflated pages PG_offline

2018-11-22 Thread David Hildenbrand
Mark inflated and never onlined pages PG_offline, to tell the world that
the content is stale and should not be dumped.

Cc: Xavier Deguillard 
Cc: Nadav Amit 
Cc: Arnd Bergmann 
Cc: Greg Kroah-Hartman 
Cc: Julien Freche 
Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: "Michael S. Tsirkin" 
Acked-by: Nadav Amit 
Signed-off-by: David Hildenbrand 
---
 drivers/misc/vmw_balloon.c | 32 
 1 file changed, 32 insertions(+)

diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c
index e6126a4b95d3..877611b5659b 100644
--- a/drivers/misc/vmw_balloon.c
+++ b/drivers/misc/vmw_balloon.c
@@ -556,6 +556,36 @@ vmballoon_page_in_frames(enum vmballoon_page_size_type 
page_size)
return 1 << vmballoon_page_order(page_size);
 }
 
+/**
+ * vmballoon_mark_page_offline() - mark a page as offline
+ * @page: pointer for the page.
+ * @page_size: the size of the page.
+ */
+static void
+vmballoon_mark_page_offline(struct page *page,
+   enum vmballoon_page_size_type page_size)
+{
+   int i;
+
+   for (i = 0; i < vmballoon_page_in_frames(page_size); i++)
+   __SetPageOffline(page + i);
+}
+
+/**
+ * vmballoon_mark_page_online() - mark a page as online
+ * @page: pointer for the page.
+ * @page_size: the size of the page.
+ */
+static void
+vmballoon_mark_page_online(struct page *page,
+  enum vmballoon_page_size_type page_size)
+{
+   int i;
+
+   for (i = 0; i < vmballoon_page_in_frames(page_size); i++)
+   __ClearPageOffline(page + i);
+}
+
 /**
  * vmballoon_send_get_target() - Retrieve desired balloon size from the host.
  *
@@ -612,6 +642,7 @@ static int vmballoon_alloc_page_list(struct vmballoon *b,
 ctl->page_size);
 
if (page) {
+   vmballoon_mark_page_offline(page, ctl->page_size);
/* Success. Add the page to the list and continue. */
list_add(>lru, >pages);
continue;
@@ -850,6 +881,7 @@ static void vmballoon_release_page_list(struct list_head 
*page_list,
 
list_for_each_entry_safe(page, tmp, page_list, lru) {
list_del(>lru);
+   vmballoon_mark_page_online(page, page_size);
__free_pages(page, vmballoon_page_order(page_size));
}
 
-- 
2.17.2



[PATCH v2 7/8] PM / Hibernate: use pfn_to_online_page()

2018-11-22 Thread David Hildenbrand
Let's use pfn_to_online_page() instead of pfn_to_page() when checking
for saveable pages to not save/restore offline memory sections.

Cc: "Rafael J. Wysocki" 
Cc: Pavel Machek 
Cc: Len Brown 
Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: "Michael S. Tsirkin" 
Suggested-by: Michal Hocko 
Acked-by: Michal Hocko 
Acked-by: Pavel Machek 
Acked-by: Rafael J. Wysocki 
Signed-off-by: David Hildenbrand 
---
 kernel/power/snapshot.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 640b2034edd6..87e6dd57819f 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -1215,8 +1215,8 @@ static struct page *saveable_highmem_page(struct zone 
*zone, unsigned long pfn)
if (!pfn_valid(pfn))
return NULL;
 
-   page = pfn_to_page(pfn);
-   if (page_zone(page) != zone)
+   page = pfn_to_online_page(pfn);
+   if (!page || page_zone(page) != zone)
return NULL;
 
BUG_ON(!PageHighMem(page));
@@ -1277,8 +1277,8 @@ static struct page *saveable_page(struct zone *zone, 
unsigned long pfn)
if (!pfn_valid(pfn))
return NULL;
 
-   page = pfn_to_page(pfn);
-   if (page_zone(page) != zone)
+   page = pfn_to_online_page(pfn);
+   if (!page || page_zone(page) != zone)
return NULL;
 
BUG_ON(PageHighMem(page));
-- 
2.17.2



[PATCH v2 4/8] xen/balloon: mark inflated pages PG_offline

2018-11-22 Thread David Hildenbrand
Mark inflated and never onlined pages PG_offline, to tell the world that
the content is stale and should not be dumped.

Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: "Michael S. Tsirkin" 
Signed-off-by: David Hildenbrand 
---
 drivers/xen/balloon.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 12148289debd..14dd6b814db3 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -425,6 +425,7 @@ static int xen_bring_pgs_online(struct page *pg, unsigned 
int order)
for (i = 0; i < size; i++) {
p = pfn_to_page(start_pfn + i);
__online_page_set_limits(p);
+   __SetPageOffline(p);
__balloon_append(p);
}
mutex_unlock(_mutex);
@@ -493,6 +494,7 @@ static enum bp_state increase_reservation(unsigned long 
nr_pages)
xenmem_reservation_va_mapping_update(1, , _list[i]);
 
/* Relinquish the page back to the allocator. */
+   __ClearPageOffline(page);
free_reserved_page(page);
}
 
@@ -519,6 +521,7 @@ static enum bp_state decrease_reservation(unsigned long 
nr_pages, gfp_t gfp)
state = BP_EAGAIN;
break;
}
+   __SetPageOffline(page);
adjust_managed_page_count(page, -1);
xenmem_reservation_scrub_page(page);
list_add(>lru, );
-- 
2.17.2



[PATCH v2 8/8] PM / Hibernate: exclude all PageOffline() pages

2018-11-22 Thread David Hildenbrand
The content of pages that are marked PG_offline is not of interest
(e.g. inflated by a balloon driver); let's skip these pages.

In saveable_highmem_page(), move the PageReserved() check to a new
check along with the PageOffline() check to separate it from the
swsusp checks.

Cc: "Rafael J. Wysocki" 
Cc: Pavel Machek 
Cc: Len Brown 
Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: "Michael S. Tsirkin" 
Acked-by: Pavel Machek 
Acked-by: Rafael J. Wysocki 
Signed-off-by: David Hildenbrand 
---
 kernel/power/snapshot.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 87e6dd57819f..4802b039b89f 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -1221,8 +1221,10 @@ static struct page *saveable_highmem_page(struct zone 
*zone, unsigned long pfn)
 
BUG_ON(!PageHighMem(page));
 
-   if (swsusp_page_is_forbidden(page) ||  swsusp_page_is_free(page) ||
-   PageReserved(page))
+   if (swsusp_page_is_forbidden(page) ||  swsusp_page_is_free(page))
+   return NULL;
+
+   if (PageReserved(page) || PageOffline(page))
return NULL;
 
if (page_is_guard(page))
@@ -1286,6 +1288,9 @@ static struct page *saveable_page(struct zone *zone, 
unsigned long pfn)
if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page))
return NULL;
 
+   if (PageOffline(page))
+   return NULL;
+
if (PageReserved(page)
&& (!kernel_page_present(page) || pfn_is_nosave(pfn)))
return NULL;
-- 
2.17.2



[PATCH v2 5/8] hv_balloon: mark inflated pages PG_offline

2018-11-22 Thread David Hildenbrand
Mark inflated and never onlined pages PG_offline, to tell the world that
the content is stale and should not be dumped.

Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Kairui Song 
Cc: Vitaly Kuznetsov 
Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: "Michael S. Tsirkin" 
Acked-by: Pankaj gupta 
Signed-off-by: David Hildenbrand 
---
 drivers/hv/hv_balloon.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index 211f3fe3a038..47719862e57f 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -681,8 +681,13 @@ static struct notifier_block hv_memory_nb = {
 /* Check if the particular page is backed and can be onlined and online it. */
 static void hv_page_online_one(struct hv_hotadd_state *has, struct page *pg)
 {
-   if (!has_pfn_is_backed(has, page_to_pfn(pg)))
+   if (!has_pfn_is_backed(has, page_to_pfn(pg))) {
+   if (!PageOffline(pg))
+   __SetPageOffline(pg);
return;
+   }
+   if (PageOffline(pg))
+   __ClearPageOffline(pg);
 
/* This frame is currently backed; online the page. */
__online_page_set_limits(pg);
@@ -1201,6 +1206,7 @@ static void free_balloon_pages(struct hv_dynmem_device 
*dm,
 
for (i = 0; i < num_pages; i++) {
pg = pfn_to_page(i + start_frame);
+   __ClearPageOffline(pg);
__free_page(pg);
dm->num_pages_ballooned--;
}
@@ -1213,7 +1219,7 @@ static unsigned int alloc_balloon_pages(struct 
hv_dynmem_device *dm,
struct dm_balloon_response *bl_resp,
int alloc_unit)
 {
-   unsigned int i = 0;
+   unsigned int i, j;
struct page *pg;
 
if (num_pages < alloc_unit)
@@ -1245,6 +1251,10 @@ static unsigned int alloc_balloon_pages(struct 
hv_dynmem_device *dm,
if (alloc_unit != 1)
split_page(pg, get_order(alloc_unit << PAGE_SHIFT));
 
+   /* mark all pages offline */
+   for (j = 0; j < (1 << get_order(alloc_unit << PAGE_SHIFT)); j++)
+   __SetPageOffline(pg + j);
+
bl_resp->range_count++;
bl_resp->range_array[i].finfo.start_page =
page_to_pfn(pg);
-- 
2.17.2



[PATCH v2 3/8] kexec: export PG_offline to VMCOREINFO

2018-11-22 Thread David Hildenbrand
Right now, pages inflated as part of a balloon driver will be dumped
by dump tools like makedumpfile. While XEN is able to check in the
crash kernel whether a certain pfn is actually backed by memory in the
hypervisor (see xen_oldmem_pfn_is_ram) and optimize this case, dumps of
other balloon inflated memory will essentially result in zero pages getting
allocated by the hypervisor and the dump getting filled with this data.

The allocation and reading of zero pages can directly be avoided if a
dumping tool could know which pages only contain stale information not to
be dumped.

We now have PG_offline which can be (and already is by virtio-balloon)
used for marking pages as logically offline. Follow up patches will
make use of this flag also in other balloon implementations.

Let's export PG_offline via PAGE_OFFLINE_MAPCOUNT_VALUE, so
makedumpfile can directly skip pages that are logically offline and the
content therefore stale. (We export it as a macro to match how it is done
for PG_buddy. This way it is clearer that this is not actually a flag but
only a very specific mapcount value used to represent page types.)

Please note that this is also helpful for a problem we were seeing under
Hyper-V: Dumping logically offline memory (pages kept fake offline while
onlining a section via online_page_callback) would under some conditions
result in a kernel panic.

Cc: Andrew Morton 
Cc: Dave Young 
Cc: "Kirill A. Shutemov" 
Cc: Baoquan He 
Cc: Omar Sandoval 
Cc: Arnd Bergmann 
Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: "Michael S. Tsirkin" 
Cc: Lianbo Jiang 
Cc: Borislav Petkov 
Cc: Kazuhito Hagio 
Acked-by: Michael S. Tsirkin 
Acked-by: Dave Young 
Signed-off-by: David Hildenbrand 
---
 kernel/crash_core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 933cb3e45b98..093c9f917ed0 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -464,6 +464,8 @@ static int __init crash_save_vmcoreinfo_init(void)
VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE);
 #ifdef CONFIG_HUGETLB_PAGE
VMCOREINFO_NUMBER(HUGETLB_PAGE_DTOR);
+#define PAGE_OFFLINE_MAPCOUNT_VALUE (~PG_offline)
+   VMCOREINFO_NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE);
 #endif
 
arch_crash_save_vmcoreinfo();
-- 
2.17.2



[PATCH v2 1/8] mm: balloon: update comment about isolation/migration/compaction

2018-11-22 Thread David Hildenbrand
Commit b1123ea6d3b3 ("mm: balloon: use general non-lru movable page
feature") reworked balloon handling to make use of the general
non-lru movable page feature. The big comment block in
balloon_compaction.h contains quite some outdated information. Let's fix
this.

Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: "Michael S. Tsirkin" 
Acked-by: Michael S. Tsirkin 
Signed-off-by: David Hildenbrand 
---
 include/linux/balloon_compaction.h | 26 +-
 1 file changed, 9 insertions(+), 17 deletions(-)

diff --git a/include/linux/balloon_compaction.h 
b/include/linux/balloon_compaction.h
index 53051f3d8f25..cbe50da5a59d 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -4,15 +4,18 @@
  *
  * Common interface definitions for making balloon pages movable by compaction.
  *
- * Despite being perfectly possible to perform ballooned pages migration, they
- * make a special corner case to compaction scans because balloon pages are not
- * enlisted at any LRU list like the other pages we do compact / migrate.
+ * Balloon page migration makes use of the general non-lru movable page
+ * feature.
+ *
+ * page->private is used to reference the responsible balloon device.
+ * page->mapping is used in context of non-lru page migration to reference
+ * the address space operations for page isolation/migration/compaction.
  *
  * As the page isolation scanning step a compaction thread does is a lockless
  * procedure (from a page standpoint), it might bring some racy situations 
while
  * performing balloon page compaction. In order to sort out these racy 
scenarios
  * and safely perform balloon's page compaction and migration we must, always,
- * ensure following these three simple rules:
+ * ensure following these simple rules:
  *
  *   i. when updating a balloon's page ->mapping element, strictly do it under
  *  the following lock order, independently of the far superior
@@ -21,19 +24,8 @@
  *   +--spin_lock_irq(_dev_info->pages_lock);
  * ... page->mapping updates here ...
  *
- *  ii. before isolating or dequeueing a balloon page from the balloon device
- *  pages list, the page reference counter must be raised by one and the
- *  extra refcount must be dropped when the page is enqueued back into
- *  the balloon device page list, thus a balloon page keeps its reference
- *  counter raised only while it is under our special handling;
- *
- * iii. after the lockless scan step have selected a potential balloon page for
- *  isolation, re-test the PageBalloon mark and the PagePrivate flag
- *  under the proper page lock, to ensure isolating a valid balloon page
- *  (not yet isolated, nor under release procedure)
- *
- *  iv. isolation or dequeueing procedure must clear PagePrivate flag under
- *  page lock together with removing page from balloon device page list.
+ *  ii. isolation or dequeueing procedure must remove the page from balloon
+ *  device page list under b_dev_info->pages_lock.
  *
  * The functions provided by this interface are placed to help on coping with
  * the aforementioned balloon page corner case, as well as to ensure the simple
-- 
2.17.2



[PATCH v2 0/8] mm/kdump: allow to exclude pages that are logically offline

2018-11-22 Thread David Hildenbrand
Right now, pages inflated as part of a balloon driver will be dumped
by dump tools like makedumpfile. While XEN is able to check in the
crash kernel whether a certain pfn is actually backed by memory in the
hypervisor (see xen_oldmem_pfn_is_ram) and optimize this case, dumps of
virtio-balloon, hv-balloon and VMWare balloon inflated memory will
essentially result in zero pages getting allocated by the hypervisor and
the dump getting filled with this data.

The allocation and reading of zero pages can directly be avoided if a
dumping tool could know which pages only contain stale information not to
be dumped.

Also for XEN, calling into the kernel and asking the hypervisor if a
pfn is backed can be avoided if the dumping tool skips such pages right
from the beginning.

Dumping tools have no idea whether a given page is part of a balloon driver
and shall not be dumped. Esp. PG_reserved cannot be used for that purpose
as all memory allocated during early boot is also PG_reserved, see
discussion at [1]. So some other way of indication is required and a new
page flag is frowned upon.

We have PG_balloon (MAPCOUNT value), which is essentially unused now. I
suggest renaming it to something more generic (PG_offline) to mark pages as
logically offline. This flag can then e.g. also be used by virtio-mem in
the future to mark subsections as offline. Or by other code that wants to
put pages logically offline (e.g. later maybe poisoned pages that shall
no longer be used).

This series converts PG_balloon to PG_offline, allows dumping tools to
query the value to detect such pages and marks pages in the hv-balloon
and XEN balloon properly as PG_offline. Note that virtio-balloon already
set pages to PG_balloon (and now PG_offline).

Please note that this is also helpful for a problem we were seeing under
Hyper-V: Dumping logically offline memory (pages kept fake offline while
onlining a section via online_page_callback) would under some conditions
result in a kernel panic when dumping them.

As I don't have access to XEN, Hyper-V, or VMWare installations,
this was only tested with the virtio-balloon and pages were properly
skipped when dumping. I'll also attach the makedumpfile patch to this
series.

[1] https://lkml.org/lkml/2018/7/20/566

v1 -> v2:
- "kexec: export PG_offline to VMCOREINFO"
-- Add description why it is exported as a macro
- "vmw_balloon: mark inflated pages PG_offline"
-- Use helper function + adapt comments
- "PM / Hibernate: exclude all PageOffline() pages"
-- Perform the check separate from swsusp checks.
- Added RBs/ACKs


David Hildenbrand (8):
  mm: balloon: update comment about isolation/migration/compaction
  mm: convert PG_balloon to PG_offline
  kexec: export PG_offline to VMCOREINFO
  xen/balloon: mark inflated pages PG_offline
  hv_balloon: mark inflated pages PG_offline
  vmw_balloon: mark inflated pages PG_offline
  PM / Hibernate: use pfn_to_online_page()
  PM / Hibernate: exclude all PageOffline() pages

 Documentation/admin-guide/mm/pagemap.rst |  9 ---
 drivers/hv/hv_balloon.c  | 14 --
 drivers/misc/vmw_balloon.c   | 32 ++
 drivers/xen/balloon.c|  3 +++
 fs/proc/page.c   |  4 +--
 include/linux/balloon_compaction.h   | 34 +---
 include/linux/page-flags.h   | 11 +---
 include/uapi/linux/kernel-page-flags.h   |  2 +-
 kernel/crash_core.c  |  2 ++
 kernel/power/snapshot.c  | 17 +++-
 tools/vm/page-types.c|  2 +-
 11 files changed, 90 insertions(+), 40 deletions(-)

-- 
2.17.2



[PATCH v2 2/8] mm: convert PG_balloon to PG_offline

2018-11-22 Thread David Hildenbrand
PG_balloon was introduced to implement page migration/compaction for pages
inflated in virtio-balloon. Nowadays, it is only a marker that a page is
part of virtio-balloon and therefore logically offline.

We also want to make use of this flag in other balloon drivers - for
inflated pages or when onlining a section but keeping some pages offline
(e.g. used right now by XEN and Hyper-V via set_online_page_callback()).

We are going to expose this flag to dump tools like makedumpfile. But
instead of exposing PG_balloon, let's generalize the concept of marking
pages as logically offline, so it can be reused for other purposes
later on.

Rename PG_balloon to PG_offline. This is an indicator that the page is
logically offline, the content stale and that it should not be touched
(e.g. a hypervisor would have to allocate backing storage in order for the
guest to dump an unused page).  We can then e.g. exclude such pages from
dumps.

We replace and reuse KPF_BALLOON (23), as this shouldn't really harm
(and for now the semantics stay the same).  In following patches, we will
make use of this bit also in other balloon drivers. While at it, document
PGTABLE.

Cc: Jonathan Corbet 
Cc: Alexey Dobriyan 
Cc: Mike Rapoport 
Cc: Andrew Morton 
Cc: Christian Hansen 
Cc: Vlastimil Babka 
Cc: "Kirill A. Shutemov" 
Cc: Stephen Rothwell 
Cc: Matthew Wilcox 
Cc: "Michael S. Tsirkin" 
Cc: Michal Hocko 
Cc: Pavel Tatashin 
Cc: Alexander Duyck 
Cc: Naoya Horiguchi 
Cc: Miles Chen 
Cc: David Rientjes 
Cc: Konstantin Khlebnikov 
Cc: Kazuhito Hagio 
Acked-by: Konstantin Khlebnikov 
Acked-by: Michael S. Tsirkin 
Acked-by: Pankaj gupta 
Signed-off-by: David Hildenbrand 
---
 Documentation/admin-guide/mm/pagemap.rst |  9 ++---
 fs/proc/page.c   |  4 ++--
 include/linux/balloon_compaction.h   |  8 
 include/linux/page-flags.h   | 11 +++
 include/uapi/linux/kernel-page-flags.h   |  2 +-
 tools/vm/page-types.c|  2 +-
 6 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/Documentation/admin-guide/mm/pagemap.rst 
b/Documentation/admin-guide/mm/pagemap.rst
index 3f7bade2c231..340a5aee9b80 100644
--- a/Documentation/admin-guide/mm/pagemap.rst
+++ b/Documentation/admin-guide/mm/pagemap.rst
@@ -75,9 +75,10 @@ number of times a page is mapped.
 20. NOPAGE
 21. KSM
 22. THP
-23. BALLOON
+23. OFFLINE
 24. ZERO_PAGE
 25. IDLE
+26. PGTABLE
 
  * ``/proc/kpagecgroup``.  This file contains a 64-bit inode number of the
memory cgroup each page is charged to, indexed by PFN. Only available when
@@ -118,8 +119,8 @@ Short descriptions to the page flags
 identical memory pages dynamically shared between one or more processes
 22 - THP
 contiguous pages which construct transparent hugepages
-23 - BALLOON
-balloon compaction page
+23 - OFFLINE
+page is logically offline
 24 - ZERO_PAGE
 zero page for pfn_zero or huge_zero page
 25 - IDLE
@@ -128,6 +129,8 @@ Short descriptions to the page flags
 Note that this flag may be stale in case the page was accessed via
 a PTE. To make sure the flag is up-to-date one has to read
 ``/sys/kernel/mm/page_idle/bitmap`` first.
+26 - PGTABLE
+page is in use as a page table
 
 IO related page flags
 -
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 6c517b11acf8..378401af4d9d 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -152,8 +152,8 @@ u64 stable_page_flags(struct page *page)
else if (page_count(page) == 0 && is_free_buddy_page(page))
u |= 1 << KPF_BUDDY;
 
-   if (PageBalloon(page))
-   u |= 1 << KPF_BALLOON;
+   if (PageOffline(page))
+   u |= 1 << KPF_OFFLINE;
if (PageTable(page))
u |= 1 << KPF_PGTABLE;
 
diff --git a/include/linux/balloon_compaction.h 
b/include/linux/balloon_compaction.h
index cbe50da5a59d..f111c780ef1d 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -95,7 +95,7 @@ extern int balloon_page_migrate(struct address_space *mapping,
 static inline void balloon_page_insert(struct balloon_dev_info *balloon,
   struct page *page)
 {
-   __SetPageBalloon(page);
+   __SetPageOffline(page);
__SetPageMovable(page, balloon->inode->i_mapping);
set_page_private(page, (unsigned long)balloon);
list_add(>lru, >pages);
@@ -111,7 +111,7 @@ static inline void balloon_page_insert(struct 
balloon_dev_info *balloon,
  */
 static inline void balloon_page_delete(struct page *page)
 {
-   __ClearPageBalloon(page);
+   __ClearPageOffline(page);
__ClearPageMovable(page);
set_page_private(page, 0);
/*
@@ -141,13 +141,13 @@ static inline gfp_t balloon_mapping_gfp_mask(void)
 static inline void balloon_page_insert(struct balloon_dev_info *b

Re: [PATCH v1] makedumpfile: exclude pages that are logically offline

2018-11-21 Thread David Hildenbrand
On 21.11.18 15:58, Kazuhito Hagio wrote:
> Hi David,
> 
>> Linux marks pages that are logically offline via a page flag (map count).
>> Such pages e.g. include pages inflated as part of a balloon driver or
>> pages that were not actually onlined when onlining the whole section.
>>
>> While the hypervisor usually allows reading such inflated memory, we
>> basically read and dump data that is completely irrelevant. Also, this
>> might result in quite some overhead in the hypervisor. In addition,
>> we saw some problems under Hyper-V, whereby we can crash the kernel by
>> dumping, when reading memory of a partially onlined memory segment
>> (for memory added by the Hyper-V balloon driver).
>>
>> Therefore, don't read and dump pages that are marked as being logically
>> offline.
>>
>> Signed-off-by: David Hildenbrand 
>> ---
>>  makedumpfile.c | 34 ++
>>  makedumpfile.h |  1 +
>>  2 files changed, 31 insertions(+), 4 deletions(-)
>>
>> diff --git a/makedumpfile.c b/makedumpfile.c
>> index 8923538..b8bfd4c 100644
>> --- a/makedumpfile.c
>> +++ b/makedumpfile.c
>> @@ -88,6 +88,7 @@ mdf_pfn_t pfn_cache_private;
>>  mdf_pfn_t pfn_user;
>>  mdf_pfn_t pfn_free;
>>  mdf_pfn_t pfn_hwpoison;
>> +mdf_pfn_t pfn_offline;
>>
>>  mdf_pfn_t num_dumped;
>>
>> @@ -249,6 +250,21 @@ isHugetlb(unsigned long dtor)
>>  && (SYMBOL(free_huge_page) == dtor));
>>  }
>>
>> +static int
>> +isOffline(unsigned long flags, unsigned int _mapcount)
>> +{
>> +if (NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE) == NOT_FOUND_NUMBER)
>> +return FALSE;
> 
> This is NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE), isn't it?
> If so, I will correct it when merging.
> 
> Otherwise, looks good to me.
> 
> Thanks!
> Kazu

Indeed,

I will most probably resend either way along with a new mm series!

Thanks a lot!


-- 

Thanks,

David / dhildenb


Re: [PATCH v1 8/8] PM / Hibernate: exclude all PageOffline() pages

2018-11-21 Thread David Hildenbrand
On 21.11.18 12:35, William Kucharski wrote:
> If you are adding PageOffline(page) to the condition list of the already 
> existing if in
> saveable_highmem_page(), why explicitly add it as a separate statement in 
> saveable_page()?
> 
> It would seem more consistent to make the second check:
> 
> - if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page))
> + if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page) ||
> + PageOffline(page))
> 
> instead.
> 
> It's admittedly a nit but it just seems cleaner to either do that or, if your 
> intention
> was to separate the Page checks from the swsusp checks, to break the calls to
> PageReserved() and PageOffline() into their own check in 
> saveable_highmem_page().

I'll split PageReserved() and PageOffline() off from the swsusp checks,
thanks for your comment!
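
I.e. saveable_highmem_page() would then have the page checks separated from
the swsusp checks, roughly like this (a sketch of the planned v2, not the
final code):

	if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page))
		return NULL;

	if (PageReserved(page) || PageOffline(page))
		return NULL;

saveable_page() will keep its PageOffline() check separate as well, next to
the existing conditional PageReserved() handling.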

> 
> Thanks!
> -- Bill


-- 

Thanks,

David / dhildenb


Re: [PATCH v1 6/8] vmw_balloon: mark inflated pages PG_offline

2018-11-21 Thread David Hildenbrand
On 21.11.18 04:22, Nadav Amit wrote:
> Thanks for this patch!
> 
>> On Nov 19, 2018, at 2:16 AM, David Hildenbrand  wrote:
>>
>> Mark inflated and never onlined pages PG_offline, to tell the world that
>> the content is stale and should not be dumped.
>>
>> Cc: Xavier Deguillard 
>> Cc: Nadav Amit 
>> Cc: Arnd Bergmann 
>> Cc: Greg Kroah-Hartman 
>> Cc: Julien Freche 
>> Cc: Andrew Morton 
>> Cc: Matthew Wilcox 
>> Cc: Michal Hocko 
>> Cc: "Michael S. Tsirkin" 
>> Signed-off-by: David Hildenbrand 
>> ---
>> drivers/misc/vmw_balloon.c | 32 
>> 1 file changed, 32 insertions(+)
>>
>> diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c
>> index e6126a4b95d3..8cc8bd9a4e32 100644
>> --- a/drivers/misc/vmw_balloon.c
>> +++ b/drivers/misc/vmw_balloon.c
>> @@ -544,6 +544,36 @@ unsigned int vmballoon_page_order(enum 
>> vmballoon_page_size_type page_size)
>>  return page_size == VMW_BALLOON_2M_PAGE ? VMW_BALLOON_2M_ORDER : 0;
>> }
>>
>> +/**
>> + * vmballoon_mark_page_offline() - mark a page as offline
>> + * @page: pointer for the page
> 
> If possible, please add a period at the end of the sentence (yes, I know I
> got it wrong in some places too).

Sure :)

> 
>> + * @page_size: the size of the page.
>> + */
>> +static void
>> +vmballoon_mark_page_offline(struct page *page,
>> +enum vmballoon_page_size_type page_size)
>> +{
>> +int i;
>> +
>> +for (i = 0; i < 1ULL << vmballoon_page_order(page_size); i++)
> 
> Can you please do instead:
> 
>   unsigned int;
> 
>   for (i = 0; i < vmballoon_page_in_frames(page_size); i++)
> 

Will do. I will have to move both functions a little bit further down in
the file (by exactly one function).
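
For reference, the reworked helper would then look roughly like this (just
a sketch; vmballoon_page_in_frames() is the existing helper right below in
the file):

static void
vmballoon_mark_page_offline(struct page *page,
			    enum vmballoon_page_size_type page_size)
{
	unsigned int i;

	for (i = 0; i < vmballoon_page_in_frames(page_size); i++)
		__SetPageOffline(page + i);
}

and vmballoon_mark_page_online() accordingly with __ClearPageOffline().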


> We would like to test it in the next few days, but in the meanwhile, after
> you address these minor issues:
> 
> Acked-by: Nadav Amit 

Thanks!

> 
> Thanks again,
> Nadav 
> 


-- 

Thanks,

David / dhildenb


Re: [PATCH v1 3/8] kexec: export PG_offline to VMCOREINFO

2018-11-21 Thread David Hildenbrand
On 21.11.18 07:04, Baoquan He wrote:
> On 11/19/18 at 11:16am, David Hildenbrand wrote:
>> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
>> index 933cb3e45b98..093c9f917ed0 100644
>> --- a/kernel/crash_core.c
>> +++ b/kernel/crash_core.c
>> @@ -464,6 +464,8 @@ static int __init crash_save_vmcoreinfo_init(void)
>>  VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE);
>>  #ifdef CONFIG_HUGETLB_PAGE
>>  VMCOREINFO_NUMBER(HUGETLB_PAGE_DTOR);
>> +#define PAGE_OFFLINE_MAPCOUNT_VALUE (~PG_offline)
>> +VMCOREINFO_NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE);
>>  #endif
> 
> This solution looks good to me. One small concern is why we don't
> export PG_offline to vmcoreinfo directly, then define
> PAGE_OFFLINE_MAPCOUNT_VALUE in makedumpfile. We have been exporting
> kernel data/MACRO directly, why this one is exceptional.
> 

1. PG_offline is much more similar to PG_buddy (in contrast to actual page
flags), and PG_buddy has historically been handled like this (and I
think it makes sense to expose these as actual MAPCOUNT_VALUEs).

2. Right now only one page type per page is supported. Therefore only
exactly one value in mapcount indicates e.g. PageBuddy()/PageOffline().

Now, if we ever decide to change this (e.g. treat them like real flags),
it is much easier to switch to exporting PG_offline/PG_buddy then: we can
directly see in makedumpfile that .*_MAPCOUNT_VALUE is no longer available
and that e.g. PG_offline and PG_buddy are exported instead. Otherwise we
would not see a change in makedumpfile and would have to rely on other
properties.
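
To illustrate why exactly one raw _mapcount value corresponds to
PageOffline() (a minimal userspace sketch, not kernel code; it assumes the
untruncated page-flags.h values PAGE_TYPE_BASE = 0xf0000000 and
PG_offline = 0x00000100 quoted elsewhere in this thread):

#include <stdio.h>
#include <stdint.h>

#define PAGE_TYPE_BASE	0xf0000000u
#define PG_offline	0x00000100u

int main(void)
{
	/* page_type aliases _mapcount; a page without a type holds -1 here */
	uint32_t page_type = 0xffffffffu;

	page_type &= ~PG_offline;	/* what __SetPageOffline() does */

	/* kernel-side test, i.e. PageOffline() */
	int offline = (page_type & (PAGE_TYPE_BASE | PG_offline)) == PAGE_TYPE_BASE;

	/* dump-tool side: the raw _mapcount equals the exported ~PG_offline */
	int matches = (page_type == (uint32_t)~PG_offline);

	printf("raw _mapcount 0x%08x PageOffline=%d matches exported value=%d\n",
	       page_type, offline, matches);
	return 0;
}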

If there are no strong opinions I will leave it like this.

Thanks!

> Thanks
> Baoquan
> 


-- 

Thanks,

David / dhildenb


Re: [PATCH v1 5/8] hv_balloon: mark inflated pages PG_offline

2018-11-20 Thread David Hildenbrand
On 20.11.18 09:45, Pankaj Gupta wrote:
> 
> Hi David,
> 
>>
>> Mark inflated and never onlined pages PG_offline, to tell the world that
>> the content is stale and should not be dumped.
>>
>> Cc: "K. Y. Srinivasan" 
>> Cc: Haiyang Zhang 
>> Cc: Stephen Hemminger 
>> Cc: Kairui Song 
>> Cc: Vitaly Kuznetsov 
>> Cc: Andrew Morton 
>> Cc: Matthew Wilcox 
>> Cc: Michal Hocko 
>> Cc: "Michael S. Tsirkin" 
>> Signed-off-by: David Hildenbrand 
>> ---
>>  drivers/hv/hv_balloon.c | 14 --
>>  1 file changed, 12 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
>> index 211f3fe3a038..47719862e57f 100644
>> --- a/drivers/hv/hv_balloon.c
>> +++ b/drivers/hv/hv_balloon.c
>> @@ -681,8 +681,13 @@ static struct notifier_block hv_memory_nb = {
>>  /* Check if the particular page is backed and can be onlined and online it.
>>  */
>>  static void hv_page_online_one(struct hv_hotadd_state *has, struct page *pg)
>>  {
>> -if (!has_pfn_is_backed(has, page_to_pfn(pg)))
>> +if (!has_pfn_is_backed(has, page_to_pfn(pg))) {
>> +if (!PageOffline(pg))
>> +__SetPageOffline(pg);
>>  return;
>> +}
>> +if (PageOffline(pg))
>> +__ClearPageOffline(pg);
>>  
>>  /* This frame is currently backed; online the page. */
>>  __online_page_set_limits(pg);
>> @@ -1201,6 +1206,7 @@ static void free_balloon_pages(struct hv_dynmem_device
>> *dm,
>>  
>>  for (i = 0; i < num_pages; i++) {
>>  pg = pfn_to_page(i + start_frame);
>> +__ClearPageOffline(pg);
> 
> Just thinking, do we need to care for clearing PageOffline flag before freeing
> a balloon'd page?

Yes, we have to, otherwise the code will crash when trying to set PageBuddy.

(only one page type at a time may be set right now, and it makes sense.
A page that is offline cannot e.g. be a buddy page)

So PageOffline is completely managed by the page owner.

> 
> Thanks,
> Pankaj


-- 

Thanks,

David / dhildenb


Re: [PATCH v1 2/8] mm: convert PG_balloon to PG_offline

2018-11-19 Thread David Hildenbrand
>  
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 50ce1bddaf56..f91da3d0a67e 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -670,7 +670,7 @@ PAGEFLAG_FALSE(DoubleMap)
>  #define PAGE_TYPE_BASE   0xf000
>  /* Reserve   0x007f to catch underflows of page_mapcount */
>  #define PG_buddy 0x0080
> -#define PG_balloon   0x0100
> +#define PG_offline   0x0100
>  #define PG_kmemcg0x0200
>  #define PG_table 0x0400
>  
> @@ -700,10 +700,13 @@ static __always_inline void __ClearPage##uname(struct 
> page *page)   \
>  PAGE_TYPE_OPS(Buddy, buddy)
>  
>  /*
> - * PageBalloon() is true for pages that are on the balloon page list
> - * (see mm/balloon_compaction.c).
> + * PageOffline() indicates that the pages is logically offline although the

s/pages/page/

:)


-- 

Thanks,

David / dhildenb


Re: [PATCH v1 4/8] xen/balloon: mark inflated pages PG_offline

2018-11-19 Thread David Hildenbrand
On 19.11.18 13:22, Juergen Gross wrote:
> On 19/11/2018 11:16, David Hildenbrand wrote:
>> Mark inflated and never onlined pages PG_offline, to tell the world that
>> the content is stale and should not be dumped.
>>
>> Cc: Boris Ostrovsky 
>> Cc: Juergen Gross 
>> Cc: Stefano Stabellini 
>> Cc: Andrew Morton 
>> Cc: Matthew Wilcox 
>> Cc: Michal Hocko 
>> Cc: "Michael S. Tsirkin" 
>> Signed-off-by: David Hildenbrand 
>> ---
>>  drivers/xen/balloon.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
>> index 12148289debd..14dd6b814db3 100644
>> --- a/drivers/xen/balloon.c
>> +++ b/drivers/xen/balloon.c
>> @@ -425,6 +425,7 @@ static int xen_bring_pgs_online(struct page *pg, 
>> unsigned int order)
>>  for (i = 0; i < size; i++) {
>>  p = pfn_to_page(start_pfn + i);
>>  __online_page_set_limits(p);
>> +__SetPageOffline(p);
>>  __balloon_append(p);
>>  }
> 
> This seems not to be based on current master. Could you please tell
> against which tree this should be reviewed?
> 
Hi Juergen,

this is based on linux-next/master.

Thanks!

> 
> Juergen
> 


-- 

Thanks,

David / dhildenb


[PATCH v1] makedumpfile: exclude pages that are logically offline

2018-11-19 Thread David Hildenbrand
Linux marks pages that are logically offline via a page flag (map count).
Such pages e.g. include pages inflated as part of a balloon driver or
pages that were not actually onlined when onlining the whole section.

While the hypervisor usually allows reading such inflated memory, we
basically read and dump data that is completely irrelevant. Also, this
might result in quite some overhead in the hypervisor. In addition,
we saw some problems under Hyper-V, whereby we can crash the kernel by
dumping, when reading memory of a partially onlined memory segment
(for memory added by the Hyper-V balloon driver).

Therefore, don't read and dump pages that are marked as being logically
offline.

Signed-off-by: David Hildenbrand 
---
 makedumpfile.c | 34 ++
 makedumpfile.h |  1 +
 2 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/makedumpfile.c b/makedumpfile.c
index 8923538..b8bfd4c 100644
--- a/makedumpfile.c
+++ b/makedumpfile.c
@@ -88,6 +88,7 @@ mdf_pfn_t pfn_cache_private;
 mdf_pfn_t pfn_user;
 mdf_pfn_t pfn_free;
 mdf_pfn_t pfn_hwpoison;
+mdf_pfn_t pfn_offline;
 
 mdf_pfn_t num_dumped;
 
@@ -249,6 +250,21 @@ isHugetlb(unsigned long dtor)
 && (SYMBOL(free_huge_page) == dtor));
 }
 
+static int
+isOffline(unsigned long flags, unsigned int _mapcount)
+{
+   if (NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE) == NOT_FOUND_NUMBER)
+   return FALSE;
+
+   if (flags & (1UL << NUMBER(PG_slab)))
+   return FALSE;
+
+   if (_mapcount == (int)NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE))
+   return TRUE;
+
+   return FALSE;
+}
+
 static int
 is_cache_page(unsigned long flags)
 {
@@ -2287,6 +2303,8 @@ write_vmcoreinfo_data(void)
WRITE_NUMBER("PG_hwpoison", PG_hwpoison);
 
WRITE_NUMBER("PAGE_BUDDY_MAPCOUNT_VALUE", PAGE_BUDDY_MAPCOUNT_VALUE);
+   WRITE_NUMBER("PAGE_OFFLINE_MAPCOUNT_VALUE",
+PAGE_OFFLINE_MAPCOUNT_VALUE);
WRITE_NUMBER("phys_base", phys_base);
 
WRITE_NUMBER("HUGETLB_PAGE_DTOR", HUGETLB_PAGE_DTOR);
@@ -2687,6 +2705,7 @@ read_vmcoreinfo(void)
READ_SRCFILE("pud_t", pud_t);
 
READ_NUMBER("PAGE_BUDDY_MAPCOUNT_VALUE", PAGE_BUDDY_MAPCOUNT_VALUE);
+   READ_NUMBER("PAGE_OFFLINE_MAPCOUNT_VALUE", PAGE_OFFLINE_MAPCOUNT_VALUE);
READ_NUMBER("phys_base", phys_base);
 #ifdef __aarch64__
READ_NUMBER("VA_BITS", VA_BITS);
@@ -6041,6 +6060,12 @@ __exclude_unnecessary_pages(unsigned long mem_map,
else if (isHWPOISON(flags)) {
pfn_counter = _hwpoison;
}
+   /*
+* Exclude pages that are logically offline.
+*/
+   else if (isOffline(flags, _mapcount)) {
+   pfn_counter = _offline;
+   }
/*
 * Unexcludable page
 */
@@ -7522,7 +7547,7 @@ write_elf_pages_cyclic(struct cache_data *cd_header, 
struct cache_data *cd_page)
 */
if (info->flag_cyclic) {
pfn_zero = pfn_cache = pfn_cache_private = 0;
-   pfn_user = pfn_free = pfn_hwpoison = 0;
+   pfn_user = pfn_free = pfn_hwpoison = pfn_offline = 0;
pfn_memhole = info->max_mapnr;
}
 
@@ -8804,7 +8829,7 @@ write_kdump_pages_and_bitmap_cyclic(struct cache_data 
*cd_header, struct cache_d
 * Reset counter for debug message.
 */
pfn_zero = pfn_cache = pfn_cache_private = 0;
-   pfn_user = pfn_free = pfn_hwpoison = 0;
+   pfn_user = pfn_free = pfn_hwpoison = pfn_offline = 0;
pfn_memhole = info->max_mapnr;
 
/*
@@ -9749,7 +9774,7 @@ print_report(void)
pfn_original = info->max_mapnr - pfn_memhole;
 
pfn_excluded = pfn_zero + pfn_cache + pfn_cache_private
-   + pfn_user + pfn_free + pfn_hwpoison;
+   + pfn_user + pfn_free + pfn_hwpoison + pfn_offline;
shrinking = (pfn_original - pfn_excluded) * 100;
shrinking = shrinking / pfn_original;
 
@@ -9763,6 +9788,7 @@ print_report(void)
REPORT_MSG("User process data pages : 0x%016llx\n", pfn_user);
REPORT_MSG("Free pages  : 0x%016llx\n", pfn_free);
REPORT_MSG("Hwpoison pages  : 0x%016llx\n", pfn_hwpoison);
+   REPORT_MSG("Offline pages   : 0x%016llx\n", pfn_offline);
REPORT_MSG("  Remaining pages  : 0x%016llx\n",
pfn_original - pfn_excluded);
REPORT_MSG("  (The number of pages is reduced to %lld%%.)\n",
@@ -9790,7 +9816,7 @@ print_mem_usage(void)
pfn_original = info->max_mapnr - pfn_memhole;
 
pfn_excluded = pfn_zero + pfn_cache + pfn_cache_private
-  

[PATCH v1 8/8] PM / Hibernate: exclude all PageOffline() pages

2018-11-19 Thread David Hildenbrand
The content of pages that are marked PG_offline is not of interest
(e.g. inflated by a balloon driver), let's skip these pages.

Cc: "Rafael J. Wysocki" 
Cc: Pavel Machek 
Cc: Len Brown 
Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: "Michael S. Tsirkin" 
Acked-by: Pavel Machek 
Signed-off-by: David Hildenbrand 
---
 kernel/power/snapshot.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 87e6dd57819f..8d7b4d458842 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -1222,7 +1222,7 @@ static struct page *saveable_highmem_page(struct zone 
*zone, unsigned long pfn)
BUG_ON(!PageHighMem(page));
 
if (swsusp_page_is_forbidden(page) ||  swsusp_page_is_free(page) ||
-   PageReserved(page))
+   PageReserved(page) || PageOffline(page))
return NULL;
 
if (page_is_guard(page))
@@ -1286,6 +1286,9 @@ static struct page *saveable_page(struct zone *zone, 
unsigned long pfn)
if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page))
return NULL;
 
+   if (PageOffline(page))
+   return NULL;
+
if (PageReserved(page)
&& (!kernel_page_present(page) || pfn_is_nosave(pfn)))
return NULL;
-- 
2.17.2



[PATCH v1 7/8] PM / Hibernate: use pfn_to_online_page()

2018-11-19 Thread David Hildenbrand
Let's use pfn_to_online_page() instead of pfn_to_page() when checking
for saveable pages to not save/restore offline memory sections.

Cc: "Rafael J. Wysocki" 
Cc: Pavel Machek 
Cc: Len Brown 
Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: "Michael S. Tsirkin" 
Suggested-by: Michal Hocko 
Signed-off-by: David Hildenbrand 
---
 kernel/power/snapshot.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 640b2034edd6..87e6dd57819f 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -1215,8 +1215,8 @@ static struct page *saveable_highmem_page(struct zone 
*zone, unsigned long pfn)
if (!pfn_valid(pfn))
return NULL;
 
-   page = pfn_to_page(pfn);
-   if (page_zone(page) != zone)
+   page = pfn_to_online_page(pfn);
+   if (!page || page_zone(page) != zone)
return NULL;
 
BUG_ON(!PageHighMem(page));
@@ -1277,8 +1277,8 @@ static struct page *saveable_page(struct zone *zone, 
unsigned long pfn)
if (!pfn_valid(pfn))
return NULL;
 
-   page = pfn_to_page(pfn);
-   if (page_zone(page) != zone)
+   page = pfn_to_online_page(pfn);
+   if (!page || page_zone(page) != zone)
return NULL;
 
BUG_ON(PageHighMem(page));
-- 
2.17.2



[PATCH v1 6/8] vmw_balloon: mark inflated pages PG_offline

2018-11-19 Thread David Hildenbrand
Mark inflated and never onlined pages PG_offline, to tell the world that
the content is stale and should not be dumped.

Cc: Xavier Deguillard 
Cc: Nadav Amit 
Cc: Arnd Bergmann 
Cc: Greg Kroah-Hartman 
Cc: Julien Freche 
Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: "Michael S. Tsirkin" 
Signed-off-by: David Hildenbrand 
---
 drivers/misc/vmw_balloon.c | 32 
 1 file changed, 32 insertions(+)

diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c
index e6126a4b95d3..8cc8bd9a4e32 100644
--- a/drivers/misc/vmw_balloon.c
+++ b/drivers/misc/vmw_balloon.c
@@ -544,6 +544,36 @@ unsigned int vmballoon_page_order(enum 
vmballoon_page_size_type page_size)
return page_size == VMW_BALLOON_2M_PAGE ? VMW_BALLOON_2M_ORDER : 0;
 }
 
+/**
+ * vmballoon_mark_page_offline() - mark a page as offline
+ * @page: pointer for the page
+ * @page_size: the size of the page.
+ */
+static void
+vmballoon_mark_page_offline(struct page *page,
+   enum vmballoon_page_size_type page_size)
+{
+   int i;
+
+   for (i = 0; i < 1ULL << vmballoon_page_order(page_size); i++)
+   __SetPageOffline(page + i);
+}
+
+/**
+ * vmballoon_mark_page_online() - mark a page as online
+ * @page: pointer for the page
+ * @page_size: the size of the page.
+ */
+static void
+vmballoon_mark_page_online(struct page *page,
+  enum vmballoon_page_size_type page_size)
+{
+   int i;
+
+   for (i = 0; i < 1ULL << vmballoon_page_order(page_size); i++)
+   __ClearPageOffline(page + i);
+}
+
 /**
  * vmballoon_page_in_frames() - returns the number of frames in a page.
  * @page_size: the size of the page.
@@ -612,6 +642,7 @@ static int vmballoon_alloc_page_list(struct vmballoon *b,
 ctl->page_size);
 
if (page) {
+   vmballoon_mark_page_offline(page, ctl->page_size);
/* Success. Add the page to the list and continue. */
list_add(>lru, >pages);
continue;
@@ -850,6 +881,7 @@ static void vmballoon_release_page_list(struct list_head 
*page_list,
 
list_for_each_entry_safe(page, tmp, page_list, lru) {
list_del(>lru);
+   vmballoon_mark_page_online(page, page_size);
__free_pages(page, vmballoon_page_order(page_size));
}
 
-- 
2.17.2



[PATCH v1 4/8] xen/balloon: mark inflated pages PG_offline

2018-11-19 Thread David Hildenbrand
Mark inflated and never onlined pages PG_offline, to tell the world that
the content is stale and should not be dumped.

Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: "Michael S. Tsirkin" 
Signed-off-by: David Hildenbrand 
---
 drivers/xen/balloon.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 12148289debd..14dd6b814db3 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -425,6 +425,7 @@ static int xen_bring_pgs_online(struct page *pg, unsigned 
int order)
for (i = 0; i < size; i++) {
p = pfn_to_page(start_pfn + i);
__online_page_set_limits(p);
+   __SetPageOffline(p);
__balloon_append(p);
}
mutex_unlock(_mutex);
@@ -493,6 +494,7 @@ static enum bp_state increase_reservation(unsigned long 
nr_pages)
xenmem_reservation_va_mapping_update(1, , _list[i]);
 
/* Relinquish the page back to the allocator. */
+   __ClearPageOffline(page);
free_reserved_page(page);
}
 
@@ -519,6 +521,7 @@ static enum bp_state decrease_reservation(unsigned long 
nr_pages, gfp_t gfp)
state = BP_EAGAIN;
break;
}
+   __SetPageOffline(page);
adjust_managed_page_count(page, -1);
xenmem_reservation_scrub_page(page);
list_add(>lru, );
-- 
2.17.2



[PATCH v1 3/8] kexec: export PG_offline to VMCOREINFO

2018-11-19 Thread David Hildenbrand
Right now, pages inflated as part of a balloon driver will be dumped
by dump tools like makedumpfile. While XEN is able to check in the
crash kernel whether a certain pfn is actually backed by memory in the
hypervisor (see xen_oldmem_pfn_is_ram) and optimize this case, dumps of
other balloon inflated memory will essentially result in zero pages getting
allocated by the hypervisor and the dump getting filled with this data.

The allocation and reading of zero pages can directly be avoided if a
dumping tool could know which pages only contain stale information not to
be dumped.

We now have PG_offline which can be (and already is by virtio-balloon)
used for marking pages as logically offline. Follow up patches will
make use of this flag also in other balloon implementations.

Let's export PG_offline via PAGE_OFFLINE_MAPCOUNT_VALUE, so
makedumpfile can directly skip pages that are logically offline and the
content therefore stale.
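
For reference, a dump tool consumes the value roughly as follows (this
mirrors the isOffline() check of the companion makedumpfile patch; just a
sketch):

	if (NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE) != NOT_FOUND_NUMBER
	    && !(flags & (1UL << NUMBER(PG_slab)))
	    && _mapcount == (int)NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE))
		/* exclude the page from the dump */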

Please note that this is also helpful for a problem we were seeing under
Hyper-V: Dumping logically offline memory (pages kept fake offline while
onlining a section via online_page_callback) would under some conditions
result in a kernel panic when dumping them.

Cc: Andrew Morton 
Cc: Dave Young 
Cc: "Kirill A. Shutemov" 
Cc: Baoquan He 
Cc: Omar Sandoval 
Cc: Arnd Bergmann 
Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: "Michael S. Tsirkin" 
Cc: Lianbo Jiang 
Cc: Borislav Petkov 
Cc: Kazuhito Hagio 
Signed-off-by: David Hildenbrand 
---
 kernel/crash_core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 933cb3e45b98..093c9f917ed0 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -464,6 +464,8 @@ static int __init crash_save_vmcoreinfo_init(void)
VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE);
 #ifdef CONFIG_HUGETLB_PAGE
VMCOREINFO_NUMBER(HUGETLB_PAGE_DTOR);
+#define PAGE_OFFLINE_MAPCOUNT_VALUE(~PG_offline)
+   VMCOREINFO_NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE);
 #endif
 
arch_crash_save_vmcoreinfo();
-- 
2.17.2



[PATCH v1 1/8] mm: balloon: update comment about isolation/migration/compaction

2018-11-19 Thread David Hildenbrand
Commit b1123ea6d3b3 ("mm: balloon: use general non-lru movable page
feature") reworked balloon handling to make use of the general
non-lru movable page feature. The big comment block in
balloon_compaction.h contains quite some outdated information. Let's fix
this.

Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: "Michael S. Tsirkin" 
Signed-off-by: David Hildenbrand 
---
 include/linux/balloon_compaction.h | 26 +-
 1 file changed, 9 insertions(+), 17 deletions(-)

diff --git a/include/linux/balloon_compaction.h 
b/include/linux/balloon_compaction.h
index 53051f3d8f25..cbe50da5a59d 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -4,15 +4,18 @@
  *
  * Common interface definitions for making balloon pages movable by compaction.
  *
- * Despite being perfectly possible to perform ballooned pages migration, they
- * make a special corner case to compaction scans because balloon pages are not
- * enlisted at any LRU list like the other pages we do compact / migrate.
+ * Balloon page migration makes use of the general non-lru movable page
+ * feature.
+ *
+ * page->private is used to reference the responsible balloon device.
+ * page->mapping is used in context of non-lru page migration to reference
+ * the address space operations for page isolation/migration/compaction.
  *
  * As the page isolation scanning step a compaction thread does is a lockless
  * procedure (from a page standpoint), it might bring some racy situations 
while
  * performing balloon page compaction. In order to sort out these racy 
scenarios
  * and safely perform balloon's page compaction and migration we must, always,
- * ensure following these three simple rules:
+ * ensure following these simple rules:
  *
  *   i. when updating a balloon's page ->mapping element, strictly do it under
  *  the following lock order, independently of the far superior
@@ -21,19 +24,8 @@
  *   +--spin_lock_irq(_dev_info->pages_lock);
  * ... page->mapping updates here ...
  *
- *  ii. before isolating or dequeueing a balloon page from the balloon device
- *  pages list, the page reference counter must be raised by one and the
- *  extra refcount must be dropped when the page is enqueued back into
- *  the balloon device page list, thus a balloon page keeps its reference
- *  counter raised only while it is under our special handling;
- *
- * iii. after the lockless scan step have selected a potential balloon page for
- *  isolation, re-test the PageBalloon mark and the PagePrivate flag
- *  under the proper page lock, to ensure isolating a valid balloon page
- *  (not yet isolated, nor under release procedure)
- *
- *  iv. isolation or dequeueing procedure must clear PagePrivate flag under
- *  page lock together with removing page from balloon device page list.
+ *  ii. isolation or dequeueing procedure must remove the page from balloon
+ *  device page list under b_dev_info->pages_lock.
  *
  * The functions provided by this interface are placed to help on coping with
  * the aforementioned balloon page corner case, as well as to ensure the simple
-- 
2.17.2



[PATCH v1 2/8] mm: convert PG_balloon to PG_offline

2018-11-19 Thread David Hildenbrand
PG_balloon was introduced to implement page migration/compaction for pages
inflated in virtio-balloon. Nowadays, it is only a marker that a page is
part of virtio-balloon and therefore logically offline.

We also want to make use of this flag in other balloon drivers - for
inflated pages or when onlining a section but keeping some pages offline
(e.g. used right now by XEN and Hyper-V via set_online_page_callback()).

We are going to expose this flag to dump tools like makedumpfile. But
instead of exposing PG_balloon, let's generalize the concept of marking
pages as logically offline, so it can be reused for other purposes
later on.

Rename PG_balloon to PG_offline. This is an indicator that the page is
logically offline, the content stale and that it should not be touched
(e.g. a hypervisor would have to allocate backing storage in order for the
guest to dump an unused page).  We can then e.g. exclude such pages from
dumps.

We replace and reuse KPF_BALLOON (23), as this shouldn't really harm
(and for now the semantics stay the same).  In following patches, we will
make use of this bit also in other balloon drivers. While at it, document
PGTABLE.

Cc: Jonathan Corbet 
Cc: Alexey Dobriyan 
Cc: Mike Rapoport 
Cc: Andrew Morton 
Cc: Christian Hansen 
Cc: Vlastimil Babka 
Cc: "Kirill A. Shutemov" 
Cc: Stephen Rothwell 
Cc: Matthew Wilcox 
Cc: "Michael S. Tsirkin" 
Cc: Michal Hocko 
Cc: Pavel Tatashin 
Cc: Alexander Duyck 
Cc: Naoya Horiguchi 
Cc: Miles Chen 
Cc: David Rientjes 
Cc: Konstantin Khlebnikov 
Cc: Kazuhito Hagio 
Signed-off-by: David Hildenbrand 
---
 Documentation/admin-guide/mm/pagemap.rst |  9 ++---
 fs/proc/page.c   |  4 ++--
 include/linux/balloon_compaction.h   |  8 
 include/linux/page-flags.h   | 11 +++
 include/uapi/linux/kernel-page-flags.h   |  2 +-
 tools/vm/page-types.c|  2 +-
 6 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/Documentation/admin-guide/mm/pagemap.rst 
b/Documentation/admin-guide/mm/pagemap.rst
index 3f7bade2c231..340a5aee9b80 100644
--- a/Documentation/admin-guide/mm/pagemap.rst
+++ b/Documentation/admin-guide/mm/pagemap.rst
@@ -75,9 +75,10 @@ number of times a page is mapped.
 20. NOPAGE
 21. KSM
 22. THP
-23. BALLOON
+23. OFFLINE
 24. ZERO_PAGE
 25. IDLE
+26. PGTABLE
 
  * ``/proc/kpagecgroup``.  This file contains a 64-bit inode number of the
memory cgroup each page is charged to, indexed by PFN. Only available when
@@ -118,8 +119,8 @@ Short descriptions to the page flags
 identical memory pages dynamically shared between one or more processes
 22 - THP
 contiguous pages which construct transparent hugepages
-23 - BALLOON
-balloon compaction page
+23 - OFFLINE
+page is logically offline
 24 - ZERO_PAGE
 zero page for pfn_zero or huge_zero page
 25 - IDLE
@@ -128,6 +129,8 @@ Short descriptions to the page flags
 Note that this flag may be stale in case the page was accessed via
 a PTE. To make sure the flag is up-to-date one has to read
 ``/sys/kernel/mm/page_idle/bitmap`` first.
+26 - PGTABLE
+page is in use as a page table
 
 IO related page flags
 -
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 6c517b11acf8..378401af4d9d 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -152,8 +152,8 @@ u64 stable_page_flags(struct page *page)
else if (page_count(page) == 0 && is_free_buddy_page(page))
u |= 1 << KPF_BUDDY;
 
-   if (PageBalloon(page))
-   u |= 1 << KPF_BALLOON;
+   if (PageOffline(page))
+   u |= 1 << KPF_OFFLINE;
if (PageTable(page))
u |= 1 << KPF_PGTABLE;
 
diff --git a/include/linux/balloon_compaction.h 
b/include/linux/balloon_compaction.h
index cbe50da5a59d..f111c780ef1d 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -95,7 +95,7 @@ extern int balloon_page_migrate(struct address_space *mapping,
 static inline void balloon_page_insert(struct balloon_dev_info *balloon,
   struct page *page)
 {
-   __SetPageBalloon(page);
+   __SetPageOffline(page);
__SetPageMovable(page, balloon->inode->i_mapping);
set_page_private(page, (unsigned long)balloon);
list_add(>lru, >pages);
@@ -111,7 +111,7 @@ static inline void balloon_page_insert(struct 
balloon_dev_info *balloon,
  */
 static inline void balloon_page_delete(struct page *page)
 {
-   __ClearPageBalloon(page);
+   __ClearPageOffline(page);
__ClearPageMovable(page);
set_page_private(page, 0);
/*
@@ -141,13 +141,13 @@ static inline gfp_t balloon_mapping_gfp_mask(void)
 static inline void balloon_page_insert(struct balloon_dev_info *balloon,
   struct page *page)
 {
-   __SetPageBalloon(page);
+   __Se

[PATCH v1 5/8] hv_balloon: mark inflated pages PG_offline

2018-11-19 Thread David Hildenbrand
Mark inflated and never onlined pages PG_offline, to tell the world that
the content is stale and should not be dumped.

Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Kairui Song 
Cc: Vitaly Kuznetsov 
Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: "Michael S. Tsirkin" 
Signed-off-by: David Hildenbrand 
---
 drivers/hv/hv_balloon.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index 211f3fe3a038..47719862e57f 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -681,8 +681,13 @@ static struct notifier_block hv_memory_nb = {
 /* Check if the particular page is backed and can be onlined and online it. */
 static void hv_page_online_one(struct hv_hotadd_state *has, struct page *pg)
 {
-   if (!has_pfn_is_backed(has, page_to_pfn(pg)))
+   if (!has_pfn_is_backed(has, page_to_pfn(pg))) {
+   if (!PageOffline(pg))
+   __SetPageOffline(pg);
return;
+   }
+   if (PageOffline(pg))
+   __ClearPageOffline(pg);
 
/* This frame is currently backed; online the page. */
__online_page_set_limits(pg);
@@ -1201,6 +1206,7 @@ static void free_balloon_pages(struct hv_dynmem_device 
*dm,
 
for (i = 0; i < num_pages; i++) {
pg = pfn_to_page(i + start_frame);
+   __ClearPageOffline(pg);
__free_page(pg);
dm->num_pages_ballooned--;
}
@@ -1213,7 +1219,7 @@ static unsigned int alloc_balloon_pages(struct 
hv_dynmem_device *dm,
struct dm_balloon_response *bl_resp,
int alloc_unit)
 {
-   unsigned int i = 0;
+   unsigned int i, j;
struct page *pg;
 
if (num_pages < alloc_unit)
@@ -1245,6 +1251,10 @@ static unsigned int alloc_balloon_pages(struct 
hv_dynmem_device *dm,
if (alloc_unit != 1)
split_page(pg, get_order(alloc_unit << PAGE_SHIFT));
 
+   /* mark all pages offline */
+   for (j = 0; j < (1 << get_order(alloc_unit << PAGE_SHIFT)); j++)
+   __SetPageOffline(pg + j);
+
bl_resp->range_count++;
bl_resp->range_array[i].finfo.start_page =
page_to_pfn(pg);
-- 
2.17.2



[PATCH v1 0/8] mm/kdump: allow to exclude pages that are logically offline

2018-11-19 Thread David Hildenbrand
Right now, pages inflated as part of a balloon driver will be dumped
by dump tools like makedumpfile. While XEN is able to check in the
crash kernel whether a certain pfn is actually backed by memory in the
hypervisor (see xen_oldmem_pfn_is_ram) and optimize this case, dumps of
virtio-balloon, hv-balloon and VMWare balloon inflated memory will
essentially result in zero pages getting allocated by the hypervisor and
the dump getting filled with this data.

The allocation and reading of zero pages can directly be avoided if a
dumping tool could know which pages only contain stale information not to
be dumped.

Also for XEN, calling into the kernel and asking the hypervisor if a
pfn is backed can be avoided if the dumping tool would skip such pages
right from the beginning.

Dumping tools have no idea whether a given page is part of a balloon driver
and shall not be dumped. Esp. PG_reserved cannot be used for that purpose
as all memory allocated during early boot is also PG_reserved, see
discussion at [1]. So some other way of indication is required and a new
page flag is frowned upon.

We have PG_balloon (MAPCOUNT value), which is essentially unused now. I
suggest renaming it to something more generic (PG_offline) to mark pages as
logically offline. This flag can then e.g. also be used by virtio-mem in
the future to mark subsections as offline. Or by other code that wants to
put pages logically offline (e.g. later maybe poisoned pages that shall
no longer be used).

This series converts PG_balloon to PG_offline, allows dumping tools to
query the value to detect such pages and marks pages in the hv-balloon
and XEN balloon properly as PG_offline. Note that virtio-balloon already
set pages to PG_balloon (and now PG_offline).

Please note that this is also helpful for a problem we were seeing under
Hyper-V: Dumping logically offline memory (pages kept fake offline while
onlining a section via online_page_callback) would under some conditions
result in a kernel panic when dumping them.

As I don't have access to XEN, Hyper-V, or VMWare installations,
this was only tested with the virtio-balloon and pages were properly
skipped when dumping. I'll also attach the makedumpfile patch to this
series.

[1] https://lkml.org/lkml/2018/7/20/566

RFC -> v1:
- Add "PM / Hibernate: use pfn_to_online_page()"
- Add "vmw_balloon: mark inflated pages PG_offline"
- "mm: convert PG_balloon to PG_offline"
-- After discussions, also rename the UAPI bit name (KPF_BALLOON -> KPF_OFFLINE)

David Hildenbrand (8):
  mm: balloon: update comment about isolation/migration/compaction
  mm: convert PG_balloon to PG_offline
  kexec: export PG_offline to VMCOREINFO
  xen/balloon: mark inflated pages PG_offline
  hv_balloon: mark inflated pages PG_offline
  vmw_balloon: mark inflated pages PG_offline
  PM / Hibernate: use pfn_to_online_page()
  PM / Hibernate: exclude all PageOffline() pages

 Documentation/admin-guide/mm/pagemap.rst |  9 ---
 drivers/hv/hv_balloon.c  | 14 --
 drivers/misc/vmw_balloon.c   | 32 ++
 drivers/xen/balloon.c|  3 +++
 fs/proc/page.c   |  4 +--
 include/linux/balloon_compaction.h   | 34 +---
 include/linux/page-flags.h   | 11 +---
 include/uapi/linux/kernel-page-flags.h   |  2 +-
 kernel/crash_core.c  |  2 ++
 kernel/power/snapshot.c  | 13 +
 tools/vm/page-types.c|  2 +-
 11 files changed, 87 insertions(+), 39 deletions(-)

-- 
2.17.2



Re: [PATCH RFC 0/6] mm/kdump: allow to exclude pages that are logically offline

2018-11-16 Thread David Hildenbrand
On 14.11.18 22:16, David Hildenbrand wrote:
> Right now, pages inflated as part of a balloon driver will be dumped
> by dump tools like makedumpfile. While XEN is able to check in the
> crash kernel whether a certain pfn is actually backed by memory in the
> hypervisor (see xen_oldmem_pfn_is_ram) and optimize this case, dumps of
> virtio-balloon and hv-balloon inflated memory will essentially result in
> zero pages getting allocated by the hypervisor and the dump getting
> filled with this data.
> 
> The allocation and reading of zero pages can directly be avoided if a
> dumping tool could know which pages only contain stale information not to
> be dumped.
> 
> Also for XEN, calling into the kernel and asking the hypervisor if a
> pfn is backed can be avoided if the dumping tool would skip such pages
> right from the beginning.
> 
> Dumping tools have no idea whether a given page is part of a balloon driver
> and shall not be dumped. Esp. PG_reserved cannot be used for that purpose
> as all memory allocated during early boot is also PG_reserved, see
> discussion at [1]. So some other way of indication is required and a new
> page flag is frowned upon.
> 
> We have PG_balloon (MAPCOUNT value), which is essentially unused now. I
> suggest renaming it to something more generic (PG_offline) to mark pages as
> logically offline. This flag can then e.g. also be used by virtio-mem in
> the future to mark subsections as offline. Or by other code that wants to
> put pages logically offline (e.g. later maybe poisoned pages that shall
> no longer be used).
> 
> This series converts PG_balloon to PG_offline, allows dumping tools to
> query the value to detect such pages and marks pages in the hv-balloon
> and XEN balloon properly as PG_offline. Note that virtio-balloon already
> set pages to PG_balloon (and now PG_offline).
> 
> Please note that this is also helpful for a problem we were seeing under
> Hyper-V: Dumping logically offline memory (pages kept fake offline while
> onlining a section via online_page_callback) would under some conditions
> result in a kernel panic when dumping them.
> 
> As I don't have access to XEN or Hyper-V installations, this was
> not tested yet (and a makedumpfile change will be required to skip
> dumping these pages).
> 
> [1] https://lkml.org/lkml/2018/7/20/566
> 
> David Hildenbrand (6):
>   mm: balloon: update comment about isolation/migration/compaction
>   mm: convert PG_balloon to PG_offline
>   kexec: export PG_offline to VMCOREINFO
>   xen/balloon: mark inflated pages PG_offline
>   hv_balloon: mark inflated pages PG_offline
>   PM / Hibernate: exclude all PageOffline() pages
> 
>  Documentation/admin-guide/mm/pagemap.rst |  6 +
>  drivers/hv/hv_balloon.c  | 14 --
>  drivers/xen/balloon.c|  3 +++
>  fs/proc/page.c   |  4 +--
>  include/linux/balloon_compaction.h   | 34 +---
>  include/linux/page-flags.h   | 11 +---
>  include/uapi/linux/kernel-page-flags.h   |  1 +
>  kernel/crash_core.c  |  2 ++
>  kernel/power/snapshot.c  |  5 +++-
>  tools/vm/page-types.c|  1 +
>  10 files changed, 51 insertions(+), 30 deletions(-)
> 

I just did a test with virtio-balloon (and a very simple makedumpfile
patch which I can supply on demand).

1. Guest with 8GB. Inflate balloon to 4GB via
 sudo virsh setmem f29 --size 4096M --live

2. Trigger a kernel panic in the guest
echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger

Original pages  : 0x001e1da8
  Excluded pages   : 0x001c9221
Pages filled with zero  : 0x50b0
Non-private cache pages : 0x00046547
Private cache pages : 0x2165
User process data pages : 0x48cf
Free pages  : 0x000771f6
Hwpoison pages  : 0x
Offline pages   : 0x0010
  Remaining pages  : 0x00018b87
  (The number of pages is reduced to 5%.)
Memory Hole : 0x0009e258
--
Total pages : 0x0028

(Offline pages match the 4GB)

-- 

Thanks,

David / dhildenb


Re: [PATCH RFC 6/6] PM / Hibernate: exclude all PageOffline() pages

2018-11-15 Thread David Hildenbrand
On 15.11.18 13:23, Michal Hocko wrote:
> On Wed 14-11-18 22:17:04, David Hildenbrand wrote:
> [...]
>> diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
>> index b0308a2c6000..01db1d13481a 100644
>> --- a/kernel/power/snapshot.c
>> +++ b/kernel/power/snapshot.c
>> @@ -1222,7 +1222,7 @@ static struct page *saveable_highmem_page(struct zone 
>> *zone, unsigned long pfn)
>>  BUG_ON(!PageHighMem(page));
>>  
>>  if (swsusp_page_is_forbidden(page) ||  swsusp_page_is_free(page) ||
>> -PageReserved(page))
>> +PageReserved(page) || PageOffline(page))
>>  return NULL;
>>  
>>  if (page_is_guard(page))
>> @@ -1286,6 +1286,9 @@ static struct page *saveable_page(struct zone *zone, 
>> unsigned long pfn)
>>  if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page))
>>  return NULL;
>>  
>> +if (PageOffline(page))
>> +return NULL;
>> +
>>  if (PageReserved(page)
>>  && (!kernel_page_present(page) || pfn_is_nosave(pfn)))
>>  return NULL;
> 
> Btw. now that you are touching this file could you also make
> s@pfn_to_page@pfn_to_online_page@ please? We really do not want to touch
> offline pfn ranges in general. A separate patch for that of course.
> 
> Thanks!
> 

Sure thing, will look into that!

Thanks!

-- 

Thanks,

David / dhildenb


Re: [PATCH RFC 3/6] kexec: export PG_offline to VMCOREINFO

2018-11-15 Thread David Hildenbrand
On 15.11.18 12:52, Borislav Petkov wrote:
> On Thu, Nov 15, 2018 at 12:20:40PM +0100, David Hildenbrand wrote:
>> Sorry to say, but that is the current practice without which
>> makedumpfile would not be able to work at all. (exclude user pages,
>> exclude page cache, exclude buddy pages). Let's not reinvent the wheel
>> here. This is how dumping has worked forever.
> 
> Sorry, but "we've always done this in the past" doesn't make it better.

Just saying that "I'm not the first to do it, don't hit me with a stick" :)

> 
>> I don't see how there should be "set of pages which do not have
>> PG_offline".
> 
> It doesn't have to be a set of pages. Think a (mmconfig perhaps) region
> which the kdump kernel should completely skip because poking in it in
> the kdump kernel, causes all kinds of havoc like machine checks. etc.
> We've had and still have one issue like that.

Indeed. And we still have that issue even without makedumpfile. I think you are aware of
this, but I'll explain it just for consistency: PG_hwpoison

At some point we detect a HW error and mark a page as PG_hwpoison.

makedumpfile knows how to treat that flag and can exclude it from the
dump (== not access it). No crash.

kdump itself has no clue about old "struct pages". Especially:
a) Where they are located in memory (e.g. SPARSE)
b) What their format is ("where are the flags")
c) What the meaning of flags is ("what does bit X mean")

In order to know such information, kdump would have to parse quite some
information of the old kernel itself, which is basically what makedumpfile
does just now. Is this feasible? I don't think so.

So we would need another approach to communicate such information as you
said. I can't think of any, but if anybody reading this has an idea,
please speak up. I am interested.


The *only* way right now we would have to handle such scenarios:

1. While dumping memory and we get a machine check, fake reading a zero
page instead of crashing.
2. While dumping memory and we get a fault, fake reading a zero page
instead of crashing.

> 
> But let me clarify my note: I don't want to be discussing with you the
> design of makedumpfile and how it should or should not work - that ship
> has already sailed. Apparently there are valid reasons to do it this
> way.

Indeed, and the basic design is to export these flags. (let's say
"unfortunately", being able to handle such stuff in kdump directly would
be the dream).

> I was *simply* stating that it feels wrong to export mm flags like that.
> 
> But as I said already, that is mm guys' call and looking at how we're
> already exporting a bunch of stuff in the vmcoreinfo - including other
> mm flags - I guess one more flag doesn't matter anymore.

Fair enough, noted. If you have an idea how to handle this in kdump,
please let me know.

-- 

Thanks,

David / dhildenb


Re: [PATCH RFC 3/6] kexec: export PG_offline to VMCOREINFO

2018-11-15 Thread David Hildenbrand
On 15.11.18 12:10, Borislav Petkov wrote:
> On Thu, Nov 15, 2018 at 02:19:23PM +0800, Dave Young wrote:
>> It would be good to copy some background info from cover letter to the
>> patch description so that we can get better understanding why this is
>> needed now.
>>
>> BTW, Lianbo is working on a documentation of the vmcoreinfo exported
>> fields. Ccing him so that he is aware of this.
>>
>> Also cc Boris,  although I do not want the doc changes blocks this
>> he might have different opinion :)
> 
> Yeah, my initial reaction is that exporting an mm-internal flag to
> userspace is a no-no.

Sorry to say, but that is the current practice without which
makedumpfile would not be able to work at all. (exclude user pages,
exclude page cache, exclude buddy pages). Let's not reinvent the wheel
here. This is how dumping has worked forever.

Also see how hwpoisoned pages are handled.

(please note that most distributions only support dumping via
makedumpfile, for said reasons)

> 
> What would be better, IMHO, is having a general method of telling the
> kdump kernel - maybe ranges of physical addresses - which to skip.

And that has to be updated whenever we change a page flag. But then the
dump kernel would have to be aware of the "struct page" location and
format of some old kernel to be dumped. Let's just not even discuss going
down that path.

> 
> Because the moment there's a set of pages which do not have PG_offline
> set but kdump would still like to skip, this breaks.

I don't understand your concern. PG_offline is only an optimization for
sections that are online. Offline sections can already be completely
ignored when dumping.

I don't see how there should be a "set of pages which do not have
PG_offline". I mean, if they don't have PG_offline they will simply be
dumped like before, which has worked forever. Sorry, I don't get your point.

-- 

Thanks,

David / dhildenb


Re: [PATCH RFC 3/6] kexec: export PG_offline to VMCOREINFO

2018-11-15 Thread David Hildenbrand
On 15.11.18 07:19, Dave Young wrote:
> Hi David,
> 
> On 11/14/18 at 10:17pm, David Hildenbrand wrote:
>> Let's export PG_offline via PAGE_OFFLINE_MAPCOUNT_VALUE, so
>> makedumpfile can directly skip pages that are logically offline and the
>> content therefore stale.
> 
> It would be good to copy some background info from cover letter to the
> patch description so that we can get better understanding why this is
> needed now.

Yes, will add more detail!

> 
> BTW, Lianbo is working on a documentation of the vmcoreinfo exported
> fields. Ccing him so that he is aware of this.
> 
> Also cc Boris, although I do not want the doc changes to block this,
> he might have a different opinion :)

I'll be happy to help update the documentation (or to update it myself
if the doc updates go in first).

-- 

Thanks,

David / dhildenb


Re: [PATCH RFC 2/6] mm: convert PG_balloon to PG_offline

2018-11-15 Thread David Hildenbrand
On 15.11.18 03:07, Mike Rapoport wrote:
> On Wed, Nov 14, 2018 at 11:49:15PM +0100, David Hildenbrand wrote:
>> On 14.11.18 23:23, Matthew Wilcox wrote:
>>> On Wed, Nov 14, 2018 at 10:17:00PM +0100, David Hildenbrand wrote:
>>>> Rename PG_balloon to PG_offline. This is an indicator that the page is
>>>> logically offline, the content stale and that it should not be touched
>>>> (e.g. a hypervisor would have to allocate backing storage in order for the
>>>> guest to dump an unused page).  We can then e.g. exclude such pages from
>>>> dumps.
>>>>
>>>> In following patches, we will make use of this bit also in other balloon
>>>> drivers.  While at it, document PGTABLE.
>>>
>>> Thank you for documenting PGTABLE.  I didn't realise I also had this
>>> document to update when I added PGTABLE.
>>
>> Thank you for looking into this :)
>>
>>>
>>>> +++ b/Documentation/admin-guide/mm/pagemap.rst
>>>> @@ -78,6 +78,8 @@ number of times a page is mapped.
>>>>  23. BALLOON
>>>>  24. ZERO_PAGE
>>>>  25. IDLE
>>>> +26. PGTABLE
>>>> +27. OFFLINE
>>>
>>> So the offline *user* bit is new ... even though the *kernel* bit
>>> just renames the balloon bit.  I'm not sure how I feel about this.
>>> I'm going to think about it some more.  Could you share your decision
>>> process with us?
>>
>> BALLOON was/is documented as
>>
>> "23 - BALLOON
>> balloon compaction page
>> "
>>
>> and only came to include all virtio-balloon pages after the non-lru
>> migration feature was implemented for ballooned pages. Since then, this
>> flag basically no longer stands for what it was originally supposed to do.
> 
> Perhaps I missing something, but how the user should interpret "23" when he
> reads /proc/kpageflags?

Looking at the history in more detail:

commit 09316c09dde33aae14f34489d9e3d243ec0d5938
Author: Konstantin Khlebnikov 
Date:   Thu Oct 9 15:29:32 2014 -0700

mm/balloon_compaction: add vmstat counters and kpageflags bit

Always mark pages with PageBalloon even if balloon compaction is disabled
and expose this mark in /proc/kpageflags as KPF_BALLOON.


So KPF_BALLOON was exposed when virtio-balloon pages were always marked
with PG_balloon. So the documentation is actually wrong ("balloon page"
vs. "balloon compaction page").

I have no idea who actually used that information. I suspect this was
just some debugging aid.
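
For reference, such a debugging aid would read /proc/kpageflags, which
is an array of one u64 flag mask per pfn; a minimal sketch (bit numbers
as in the pagemap.rst hunk quoted above, helper name mine):

#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define KPF_BALLOON	23
#define KPF_OFFLINE	27	/* as proposed in this series */

/* Returns 1/0 if the bit is set/clear for the given pfn, -1 on error. */
static int pfn_has_kpf_bit(int fd, unsigned long pfn, int bit)
{
	uint64_t flags;

	if (pread(fd, &flags, sizeof(flags), pfn * sizeof(flags)) !=
	    (ssize_t)sizeof(flags))
		return -1;
	return !!(flags & (1ULL << bit));
}

/* usage: fd = open("/proc/kpageflags", O_RDONLY); */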

> 
>> To not break uapi I decided to not rename it but instead to add a new flag.
> 
> I've got a feeling that the uapi was changed anyway for the BALLOON
> flag's meaning.

Yes. If we *replace* KPF_BALLOON with KPF_OFFLINE:

a) Some applications might no longer compile (I guess that's ok)
b) Some old applications will treat KPF_OFFLINE like KPF_BALLOON (which
should, at least for virtio-balloon usage so far, be fine - it is just
more generic)

So I guess it's up to Maintainers/Matthew to decide :)

-- 

Thanks,

David / dhildenb

