Re: [PATCH] Revert "net: linkwatch: add check for netdevice being present to linkwatch_do_dev"
On Wed, 2020-09-23 at 17:23 -0700, David Miller wrote:
> From: David Miller
> Date: Wed, 23 Sep 2020 17:21:25 -0700 (PDT)
>
> > If an async code path tests 'present', gets true, and then the RTNL
> > holding synchronous code path puts the device into D3hot immediately
> > afterwards, the async code path will still continue and access the
> > chip's registers and fault.
>
> Wait, is the sequence:
>
>   ->ndo_stop()
>      mark device not present and put into D3hot
>      triggers linkwatch event
>   ...
>   ->ndo_get_stats64()
>
> ???

I assume it is, since normally device drivers do carrier_off() on ndo_stop().

1) One problematic sequence would be (for drivers doing D3hot on ndo_stop()):

   __dev_close_many()
     ->ndo_stop()
        netif_device_detach()  // Mark !present
        ...                    // D3hot
        carrier_off() -> linkwatch_event()
        ...                    // !present && IFF_UP

2) Another problematic scenario which I see repeated in many drivers:

   shutdown/suspend()
     rtnl_lock()
     netif_device_detach()  // Mark !present
     stop() -> carrier_off() -> linkwatch_event()
     // At this point the device is still IFF_UP and !present
     // due to the early detach above.
     rtnl_unlock()

For scenario 1) we can fix this by marking IFF_UP at the beginning, but for
2), I think we need to fix the drivers to detach only after stop :(

> Then yeah we might have to clear IFF_UP at the beginning of taking
> a netdev down.
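For scenario 2), the detach-after-stop ordering being suggested would look roughly like this in a driver's suspend path. This is a minimal sketch with hypothetical names (`foo_suspend`, `foo_stop`), not any specific in-tree driver:

```c
/*
 * Hypothetical driver suspend path, sketching the ordering fix for
 * scenario 2): run the stop path (and its carrier_off() -> linkwatch
 * event) while the device is still marked present, and only mark the
 * netdevice !present afterwards.
 */
static int foo_suspend(struct device *dev)
{
	struct net_device *ndev = dev_get_drvdata(dev);

	rtnl_lock();
	if (netif_running(ndev))
		foo_stop(ndev);		/* carrier_off() happens here */
	netif_device_detach(ndev);	/* mark !present only afterwards */
	rtnl_unlock();
	return 0;
}
```

With this ordering, the linkwatch event triggered by carrier_off() runs while the device is still present, so linkwatch_do_dev() never sees the !present && IFF_UP combination from scenario 2).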
Re: [PATCH] ARM: dts: document pinctrl-single,pins when #pinctrl-cells = 2
On Wed, Sep 23, 2020 at 10:43 PM Tony Lindgren wrote:
> * Trent Piepho [200924 01:34]:
> > On Tue, Sep 22, 2020 at 11:57 PM Tony Lindgren wrote:
> > >
> > > Also FYI, folks have also complained for a long time that the pinctrl-single
> > > binding mixes mux and conf values while they should be handled separately.
> >
> > Instead of combining two fields when the dts is generated they are now
> > combined when the pinctrl-single driver reads the dts. Other than
> > this detail, the result is the same. The board dts source is the
> > same. The value programmed into the pinctrl register is the same.
> > There is no mechanism currently that can alter that value in any way.
> >
> > What does combining them later allow that is not possible now?
>
> It now allows further driver changes to manage conf and mux separately :)

The pinctrl-single driver? How will that work with boards that are not
am335x and don't use conf and mux fields in the same manner as am335x?
Re: [Linux-stm32] [PATCH 3/3] ARM: dts: stm32: update stm32mp151 for remote proc synchronisation support
Hello Arnaud,

On 8/27/20 9:21 AM, Arnaud Pouliquen wrote:
> Two backup registers are used to store the Cortex-M4 state and the resource
> table address.
> Declare the tamp node and add associated properties in m4_rproc node
> to allow Linux to attach to a firmware loaded by the first boot stages.
>
> Associated driver implementation is available in commit 9276536f455b3
> ("remoteproc: stm32: Parse syscon that will manage M4 synchronisation").
>
> Signed-off-by: Arnaud Pouliquen
> ---
>  arch/arm/boot/dts/stm32mp151.dtsi | 7 +++
>  1 file changed, 7 insertions(+)
>
> diff --git a/arch/arm/boot/dts/stm32mp151.dtsi b/arch/arm/boot/dts/stm32mp151.dtsi
> index bfe29023fbd5..842ecffae73a 100644
> --- a/arch/arm/boot/dts/stm32mp151.dtsi
> +++ b/arch/arm/boot/dts/stm32mp151.dtsi
> @@ -1541,6 +1541,11 @@
>  			status = "disabled";
>  		};
>
> +		tamp: tamp@5c00a000 {
> +			compatible = "st,stm32-tamp", "syscon";
> +			reg = <0x5c00a000 0x400>;
> +		};
> +

Just saw this now. I have a pending patch adding this node as well:
https://lore.kernel.org/patchwork/patch/1306971/

For my use case, I need a "simple-mfd" compatible to allow child nodes to
be probed. Could you CC me when you send out your v2, so I can rebase?
(Or if you don't mind, just add the "simple-mfd" into the compatible list
yourself :-)

Cheers
Ahmad

>  		/*
>  		 * Break node order to solve dependency probe issue between
>  		 * pinctrl and exti.
> @@ -1717,6 +1722,8 @@
>  			st,syscfg-holdboot = < 0x10C 0x1>;
>  			st,syscfg-tz = < 0x000 0x1>;
>  			st,syscfg-pdds = <_mcu 0x0 0x1>;
> +			st,syscfg-rsc-tbl = < 0x144 0x>;
> +			st,syscfg-m4-state = < 0x148 0x>;
>  			status = "disabled";
>  		};
>  	};

-- 
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-     |
Re: [PATCH] ARM: dts: document pinctrl-single,pins when #pinctrl-cells = 2
* Trent Piepho [200924 01:34]:
> On Tue, Sep 22, 2020 at 11:57 PM Tony Lindgren wrote:
> >
> > Also FYI, folks have also complained for a long time that the pinctrl-single
> > binding mixes mux and conf values while they should be handled separately.
>
> Instead of combining two fields when the dts is generated they are now
> combined when the pinctrl-single driver reads the dts. Other than
> this detail, the result is the same. The board dts source is the
> same. The value programmed into the pinctrl register is the same.
> There is no mechanism currently that can alter that value in any way.
>
> What does combining them later allow that is not possible now?

It now allows further driver changes to manage conf and mux separately :)

Regards,

Tony
Re: [PATCH v2 0/6] clk: axi-clk-gen: misc updates to the driver
On Thu, Sep 24, 2020 at 7:53 AM Moritz Fischer wrote:
> Hi Stephen,
>
> On Wed, Sep 23, 2020 at 04:58:33PM -0700, Stephen Boyd wrote:
> > Quoting Alexandru Ardelean (2020-09-22 23:22:33)
> > > On Tue, Sep 22, 2020 at 10:42 PM Stephen Boyd wrote:
> > > >
> > > > Quoting Moritz Fischer (2020-09-14 19:41:38)
> > > > > On Mon, Sep 14, 2020 at 11:11:05AM +0300, Alexandru Ardelean wrote:
> > > > > > On Mon, Aug 10, 2020 at 4:41 PM Alexandru Ardelean wrote:
> > > > > > >
> > > > > > > These patches synchronize the driver with the current state in the
> > > > > > > Analog Devices Linux tree:
> > > > > > > https://github.com/analogdevicesinc/linux/
> > > > > > >
> > > > > > > They have been in the tree for about 2-3, so they did receive some
> > > > > > > testing.
> > > > > >
> > > > > > Ping on this series.
> > > > > > Do I need to do a re-send?
> > > >
> > > > I got this patch series twice. Not sure why.
> > >
> > > My fault here.
> > > Some Ctrl + R usage and not being attentive with the arguments.
> > > I think I added "*.patch" twice on the send-mail command.
> > > I did something similar [by accident] for some DMA patches.
> > > Apologies.
> > >
> > > I can do a re-send for this, if it helps.
> >
> > Sure. Please resend it.
> >
> > > > > I've applied the FPGA one, the other ones should go through the clock
> > > > > tree I think?
> > > >
> > > > Doesn't patch 6 rely on the FPGA patch? How can that driver build
> > > > without the header file?
> > >
> > > Yes it does depend on the FPGA patch.
> > > We can drop patch 6 for now, pending a merge to Linus' tree and then
> > > wait for the trickle-down.
> > > I don't mind waiting for these patches.
> > > I have plenty of backlog that I want to run through, and cleanup and
> > > then upstream.
> > > So, there is no hurry.
> >
> > Can you send me a signed tag with that patch? I can base this patch
> > series on top of that. Or I can just apply it to the clk tree and if
> > nobody changes it in the meantime the merge should work out in
> > linux-next and Linus' tree upstream.
>
> Long story short I messed up my pull-request to Greg and had to back out
> the patch anyways. In retrospect I think the patch should have gone
> through your tree anyways, so here's our chance to get it right.
>
> Feel free to take it with the rest of the changes through your tree.
>
> Note: When I applied the patch I fixed up the whitespace that checkpatch
> complained about so you might want to do that (or ask Alexandru to
> resend the patch).

I'll fix up the checkpatch stuff, re-send as a V3, and add your Acked-by.
Thanks & apologies for the mess-up on my part.

> Acked-by: Moritz Fischer
>
> Sorry for the confusion and let me know if you still prefer a signed
> tag.
>
> - Moritz
Re: [PATCH printk 3/5] printk: use buffer pool for sprint buffers
On (20/09/23 17:11), Petr Mladek wrote:
> AFAIK, there is one catch. We need to use va_copy() around
> the 1st call because va_format can be processed only once.

Current printk() should be good enough for reporting, say, "Kernel stack
overflow" errors. Is the extra pressure that va_copy() adds something that
we need to consider?

-ss
[PATCH] Input: trackpoint - enable Synaptics trackpoints
Add Synaptics IDs in trackpoint_start_protocol() to mark them as valid.

Signed-off-by: Vincent Huang
---
 drivers/input/mouse/trackpoint.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/input/mouse/trackpoint.c b/drivers/input/mouse/trackpoint.c
index 854d5e758724..ef2fa0905208 100644
--- a/drivers/input/mouse/trackpoint.c
+++ b/drivers/input/mouse/trackpoint.c
@@ -282,6 +282,8 @@ static int trackpoint_start_protocol(struct psmouse *psmouse,
 	case TP_VARIANT_ALPS:
 	case TP_VARIANT_ELAN:
 	case TP_VARIANT_NXP:
+	case TP_VARIANT_JYT_SYNAPTICS:
+	case TP_VARIANT_SYNAPTICS:
 		if (variant_id)
 			*variant_id = param[0];
 		if (firmware_id)
-- 
2.25.1
Re: [PATCH] KVM: Enable hardware before doing arch VM initialization
On 23.09.20 20:57, Sean Christopherson wrote:
> Swap the order of hardware_enable_all() and kvm_arch_init_vm() to
> accommodate Intel's Trust Domain Extension (TDX), which needs VMX to be
> fully enabled during VM init in order to make SEAMCALLs.
>
> This also provides consistent ordering between kvm_create_vm() and
> kvm_destroy_vm() with respect to calling kvm_arch_destroy_vm() and
> hardware_disable_all().
>
> Cc: Marc Zyngier
> Cc: James Morse
> Cc: Julien Thierry
> Cc: Suzuki K Poulose
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: Huacai Chen
> Cc: Aleksandar Markovic
> Cc: linux-m...@vger.kernel.org
> Cc: Paul Mackerras
> Cc: kvm-...@vger.kernel.org
> Cc: Christian Borntraeger
> Cc: Janosch Frank
> Cc: David Hildenbrand
> Cc: Cornelia Huck
> Cc: Claudio Imbrenda
> Cc: Vitaly Kuznetsov
> Cc: Wanpeng Li
> Cc: Jim Mattson
> Cc: Joerg Roedel
> Signed-off-by: Sean Christopherson
> ---
>
> Obviously not required until the TDX series comes along, but IMO KVM
> should be consistent with respect to enabling and disabling virt support
> in hardware.
>
> Tested only on Intel hardware. Unless I missed something, this only
> affects x86, Arm and MIPS as hardware enabling is a nop for s390 and PPC.

Yes, looks fine from an s390 perspective.

Reviewed-by: Christian Borntraeger
Re: [PATCH] doc: zh_CN: add translatation for btrfs
Hi Qing,

It looks like all the patches from vivo folks have a 'charset=y' problem
and garbled characters, which breaks 'git am'. I have to repeat the same
reminder again and again... Could you double-check your patches before
sending them out, and make sure the docs look good on the webpage, like:

https://www.kernel.org/doc/html/v5.9-rc3/translations/zh_CN/filesystems/debugfs.html

Thanks
Alex

On 2020/9/22 8:03 PM, Wang Qing wrote:
> Translate Documentation/filesystems/btrfs.rst into Chinese.
>
> Signed-off-by: Wang Qing
> ---
[PATCH v3] mm: cma: indefinitely retry allocations in cma_alloc
V1: Introduces a retry loop that attempts a CMA allocation a finite number
of times before giving up:

https://lkml.org/lkml/2020/8/5/1097
https://lkml.org/lkml/2020/8/11/893

V2: Introduces an indefinite retry for CMA allocations. David Hildenbrand
raised a page-pinning example which precludes doing this infinite retrying
for all CMA users:

https://lkml.org/lkml/2020/9/17/984

V3: Re-introduce a GFP mask argument for cma_alloc(), that can take in
__GFP_NOFAIL as an argument to indicate that a CMA allocation should be
retried indefinitely. This lets callers of cma_alloc() decide if they want
to perform indefinite retries. Also introduces a config option for
controlling the duration of the sleep between retries.

Chris Goldsworthy (1):
  mm: cma: indefinitely retry allocations in cma_alloc

 arch/powerpc/kvm/book3s_hv_builtin.c       |  2 +-
 drivers/dma-buf/heaps/cma_heap.c           |  2 +-
 drivers/s390/char/vmcp.c                   |  2 +-
 drivers/staging/android/ion/ion_cma_heap.c |  2 +-
 include/linux/cma.h                        |  2 +-
 kernel/dma/contiguous.c                    |  4 ++--
 mm/Kconfig                                 | 11 ++
 mm/cma.c                                   | 35 +-
 mm/cma_debug.c                             |  2 +-
 mm/hugetlb.c                               |  4 ++--
 10 files changed, 50 insertions(+), 16 deletions(-)

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
[PATCH v3] mm: cma: indefinitely retry allocations in cma_alloc
CMA allocations will fail if 'pinned' pages are in a CMA area, since we
cannot migrate pinned pages. The _refcount of a struct page being greater
than _mapcount for that page can cause pinning for anonymous pages. This
is because try_to_unmap(), which (1) is called in the CMA allocation path,
and (2) decrements both _refcount and _mapcount for a page, will stop
unmapping a page from VMAs once the _mapcount for a page reaches 0. This
implies that after try_to_unmap() has finished successfully for a page
where _refcount > _mapcount, _refcount will be greater than 0. Later in
the CMA allocation path, in migrate_page_move_mapping(), we will have one
more reference count than intended for anonymous pages, meaning the
allocation will fail for that page.

If a process ends up causing _refcount > _mapcount for a page (by either
incrementing _refcount or decrementing _mapcount), such that the process
is context switched out after modifying one refcount but before modifying
the other, the page will be temporarily pinned.

One example of where _refcount can be greater than _mapcount is inside of
zap_pte_range(), which is called for all the entries of a PMD when a
process is exiting, to unmap the process's memory. Inside of
zap_pte_range(), after unmapping a page with page_remove_rmap(), we have
that _refcount > _mapcount. _refcount can only be decremented after a TLB
flush is performed for the page - this doesn't occur until enough pages
have been batched together for flushing. The flush can either occur inside
of zap_pte_range() (during the same invocation or a later one), or if
there aren't enough pages collected by the time we unmap all of the pages
in a process, the flush will occur in tlb_finish_mmu() in exit_mmap().
After the flush has occurred, tlb_batch_pages_flush() will decrement the
references on the flushed pages.

Another such example like the above is inside of copy_one_pte(), which is
called during a fork. For PTEs for which pte_present(pte) == true,
copy_one_pte() will increment the _refcount field followed by the
_mapcount field of a page.

So, inside of cma_alloc(), add the option of letting users pass in
__GFP_NOFAIL to indicate that we should retry CMA allocations
indefinitely, in the event that alloc_contig_range() returns -EBUSY after
having scanned a whole CMA-region bitmap.

Signed-off-by: Chris Goldsworthy
Co-developed-by: Vinayak Menon
Signed-off-by: Vinayak Menon
---
 arch/powerpc/kvm/book3s_hv_builtin.c       |  2 +-
 drivers/dma-buf/heaps/cma_heap.c           |  2 +-
 drivers/s390/char/vmcp.c                   |  2 +-
 drivers/staging/android/ion/ion_cma_heap.c |  2 +-
 include/linux/cma.h                        |  2 +-
 kernel/dma/contiguous.c                    |  4 ++--
 mm/Kconfig                                 | 11 ++
 mm/cma.c                                   | 35 +-
 mm/cma_debug.c                             |  2 +-
 mm/hugetlb.c                               |  4 ++--
 10 files changed, 50 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c b/arch/powerpc/kvm/book3s_hv_builtin.c
index 073617c..21c3f6a 100644
--- a/arch/powerpc/kvm/book3s_hv_builtin.c
+++ b/arch/powerpc/kvm/book3s_hv_builtin.c
@@ -74,7 +74,7 @@ struct page *kvm_alloc_hpt_cma(unsigned long nr_pages)
 	VM_BUG_ON(order_base_2(nr_pages) < KVM_CMA_CHUNK_ORDER - PAGE_SHIFT);

 	return cma_alloc(kvm_cma, nr_pages, order_base_2(HPT_ALIGN_PAGES),
-			 false);
+			 0);
 }
 EXPORT_SYMBOL_GPL(kvm_alloc_hpt_cma);
diff --git a/drivers/dma-buf/heaps/cma_heap.c b/drivers/dma-buf/heaps/cma_heap.c
index 626cf7f..7657359 100644
--- a/drivers/dma-buf/heaps/cma_heap.c
+++ b/drivers/dma-buf/heaps/cma_heap.c
@@ -66,7 +66,7 @@ static int cma_heap_allocate(struct dma_heap *heap,
 	helper_buffer->heap = heap;
 	helper_buffer->size = len;

-	cma_pages = cma_alloc(cma_heap->cma, nr_pages, align, false);
+	cma_pages = cma_alloc(cma_heap->cma, nr_pages, align, 0);
 	if (!cma_pages)
 		goto free_buf;
diff --git a/drivers/s390/char/vmcp.c b/drivers/s390/char/vmcp.c
index 9e06628..11c4e3b 100644
--- a/drivers/s390/char/vmcp.c
+++ b/drivers/s390/char/vmcp.c
@@ -70,7 +70,7 @@ static void vmcp_response_alloc(struct vmcp_session *session)
 	 * anymore the system won't work anyway.
 	 */
 	if (order > 2)
-		page = cma_alloc(vmcp_cma, nr_pages, 0, false);
+		page = cma_alloc(vmcp_cma, nr_pages, 0, 0);
 	if (page) {
 		session->response = (char *)page_to_phys(page);
 		session->cma_alloc = 1;
diff --git a/drivers/staging/android/ion/ion_cma_heap.c b/drivers/staging/android/ion/ion_cma_heap.c
index bf65e67..128d3a5 100644
--- a/drivers/staging/android/ion/ion_cma_heap.c
+++ b/drivers/staging/android/ion/ion_cma_heap.c
@@ -39,7 +39,7 @@ static int ion_cma_allocate(struct ion_heap *heap, struct
[PATCH] rpadlpar_io:Add MODULE_DESCRIPTION entries to kernel modules
This patch adds a brief MODULE_DESCRIPTION to the rpadlpar_io kernel
module (description taken from the Kconfig file).

Signed-off-by: Mamatha Inamdar
---
 drivers/pci/hotplug/rpadlpar_core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/pci/hotplug/rpadlpar_core.c b/drivers/pci/hotplug/rpadlpar_core.c
index f979b70..bac65ed 100644
--- a/drivers/pci/hotplug/rpadlpar_core.c
+++ b/drivers/pci/hotplug/rpadlpar_core.c
@@ -478,3 +478,4 @@ static void __exit rpadlpar_io_exit(void)
 module_init(rpadlpar_io_init);
 module_exit(rpadlpar_io_exit);
 MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("RPA Dynamic Logical Partitioning driver for I/O slots");
Re: [PATCH v2] mm: cma: indefinitely retry allocations in cma_alloc
On 2020-09-17 10:54, Chris Goldsworthy wrote:
On 2020-09-15 00:53, David Hildenbrand wrote:
On 14.09.20 20:33, Chris Goldsworthy wrote:
On 2020-09-14 02:31, David Hildenbrand wrote:
On 11.09.20 21:17, Chris Goldsworthy wrote:

So, inside of cma_alloc(), instead of giving up when alloc_contig_range()
returns -EBUSY after having scanned a whole CMA-region bitmap, perform
retries indefinitely, with sleeps, to give the system an opportunity to
unpin any pinned pages.

Signed-off-by: Chris Goldsworthy
Co-developed-by: Vinayak Menon
Signed-off-by: Vinayak Menon
---
 mm/cma.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/mm/cma.c b/mm/cma.c
index 7f415d7..90bb505 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -442,8 +443,28 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
 				bitmap_maxno, start, bitmap_count, mask, offset);
 		if (bitmap_no >= bitmap_maxno) {
-			mutex_unlock(>lock);
-			break;
+			if (ret == -EBUSY) {
+				mutex_unlock(>lock);
+
+				/*
+				 * Page may be momentarily pinned by some other
+				 * process which has been scheduled out, e.g.
+				 * in exit path, during unmap call, or process
+				 * fork and so cannot be freed there. Sleep
+				 * for 100ms and retry the allocation.
+				 */
+				start = 0;
+				ret = -ENOMEM;
+				msleep(100);
+				continue;
+			} else {
+				/*
+				 * ret == -ENOMEM - all bits in cma->bitmap are
+				 * set, so we break accordingly.
+				 */
+				mutex_unlock(>lock);
+				break;
+			}
 		}
 		bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
 		/*

What about long-term pinnings? IIRC, that can happen easily e.g., with
vfio (and I remember there is a way via vmsplice). Not convinced trying
forever is a sane approach in the general case ...

V1:
[1] https://lkml.org/lkml/2020/8/5/1097
[2] https://lkml.org/lkml/2020/8/6/1040
[3] https://lkml.org/lkml/2020/8/11/893
[4] https://lkml.org/lkml/2020/8/21/1490
[5] https://lkml.org/lkml/2020/9/11/1072

We're fine with doing indefinite retries, on the grounds that if there is
some long-term pinning that occurs when alloc_contig_range returns -EBUSY,
that it should be debugged and fixed. Would it be possible to make this
infinite-retrying something that could be enabled or disabled by a
defconfig option?

Two thoughts: This means I strongly prefer something like [3] if feasible.

_Resending so that this ends up on LKML_

I can give [3] some further thought then. Also, I realized [3] will not
completely solve the problem, it just reduces the window in which
_refcount > _mapcount (as mentioned in earlier threads, we encountered the
pinning when a task in copy_one_pte() or in the exit_mmap() path gets
context switched out). If we were to try a sleeping-lock based solution,
do you think it would be permissible to add another lock to struct page?

I have not been able to think of a clean way of introducing calls to
preempt_disable() in exit_mmap(), which is the more problematic case. We
would need to track state across multiple invocations of zap_pte_range()
(which is called for each entry in a PMD when a process's memory is being
unmapped), and would also need to extend this to tlb_finish_mmu(), which
is called after all the process's memory has been unmapped:
https://elixir.bootlin.com/linux/v5.8.10/source/mm/mmap.c#L3164

As a follow-up to this patch, I'm submitting a patch that re-introduces
the GFP mask for cma_alloc(), that will perform indefinite retries if
__GFP_NOFAIL is passed to the function.

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
Re: [PATCH v2] powerpc/pci: unmap legacy INTx interrupts when a PHB is removed
On 23/09/2020 17:06, Cédric Le Goater wrote:
> On 9/23/20 2:33 AM, Qian Cai wrote:
>> On Fri, 2020-08-07 at 12:18 +0200, Cédric Le Goater wrote:
>>> When a passthrough IO adapter is removed from a pseries machine using
>>> hash MMU and the XIVE interrupt mode, the POWER hypervisor expects the
>>> guest OS to clear all page table entries related to the adapter. If
>>> some are still present, the RTAS call which isolates the PCI slot
>>> returns error 9001 "valid outstanding translations" and the removal of
>>> the IO adapter fails. This is because when the PHBs are scanned, Linux
>>> maps automatically the INTx interrupts in the Linux interrupt number
>>> space but these are never removed.
>>>
>>> To solve this problem, we introduce a PPC platform specific
>>> pcibios_remove_bus() routine which clears all interrupt mappings when
>>> the bus is removed. This also clears the associated page table entries
>>> of the ESB pages when using XIVE.
>>>
>>> For this purpose, we record the logical interrupt numbers of the
>>> mapped interrupts under the PHB structure and let pcibios_remove_bus()
>>> do the clean up.
>>>
>>> Since some PCI adapters, like GPUs, use the "interrupt-map" property
>>> to describe interrupt mappings other than the legacy INTx interrupts,
>>> we can not restrict the size of the mapping array to PCI_NUM_INTX. The
>>> number of interrupt mappings is computed from the "interrupt-map"
>>> property and the mapping array is allocated accordingly.
>>>
>>> Cc: "Oliver O'Halloran"
>>> Cc: Alexey Kardashevskiy
>>> Signed-off-by: Cédric Le Goater
>>
>> Some syscall fuzzing will trigger this on POWER9 NV where the traces
>> pointed to this patch.
>>
>> .config: https://gitlab.com/cailca/linux-mm/-/blob/master/powerpc.config
>
> OK. The patch is missing a NULL assignment after kfree() and that
> might be the issue.
>
> I did try PHB removal under PowerNV, so I would like to understand
> how we managed to remove twice the PCI bus and possibly reproduce.
> Any chance we could grab what the syscall fuzzer (syzkaller) did?

My guess would be it is doing this in parallel to provoke races.

-- 
Alexey
Re: [PATCH v3 2/3] misc: bcm-vk: add Broadcom VK driver
On Wed, Sep 23, 2020 at 09:43:55PM -0700, Scott Branden wrote:
> >> +struct bcm_vk_tty {
> >> +	struct tty_port port;
> >> +	uint32_t to_offset;	/* bar offset to use */
> >> +	uint32_t to_size;	/* to VK buffer size */
> >> +	uint32_t wr;		/* write offset shadow */
> >> +	uint32_t from_offset;	/* bar offset to use */
> >> +	uint32_t from_size;	/* from VK buffer size */
> >> +	uint32_t rd;		/* read offset shadow */
> > nit, these "uint32_t" stuff really doesn't matter in the kernel, 'u32'
> > is a better choice overall. Same for u8 and others, for this whole
> > driver.
> Other than personal preference, I don't understand how 'u32' is better.
> uint32_t follows the ANSI stdint.h. It allows for portable code without
> the need to define custom u32 types.

The ANSI namespace does not work in the kernel, which is why we have our
own types that pre-date those, and work properly everywhere in the kernel.

> stdint types are used in many drivers in the linux kernel already.
> We would prefer to keep our code as portable as possible and use
> stdint types in the driver.

You aren't porting this code to other operating systems easily, please
use the kernel types :)

And yes, these types are used in other parts, but when you have 25
million lines of code, some crud does slip in at times...

> >> +	pid_t pid;
> >> +	bool irq_enabled;
> >> +	bool is_opened;		/* tracks tty open/close */
> > Why do you need to track this? Doesn't the tty core handle this for
> > you?
> I have tried using tty_port_kopened() and it doesn't seem to work.
> Will need to debug some more unless you have another suggested function
> to use.

You didn't answer _why_ you need to track this. A tty driver shouldn't
care about this type of thing.

> >> +	struct workqueue_struct *tty_wq_thread;
> >> +	struct work_struct tty_wq_work;
> >> +
> >> +	/* Reference-counting to handle file operations */
> >> +	struct kref kref;
> > And a kref?
> >
> > What is controlling the lifetime rules of your structure?
> >
> > Why a kref?
> >
> > Why the tty ports?
> >
> > Why the misc device?
> >
> > This feels really crazy to me...
> Comments mostly from Desmond here:
>
> Yes, we have created a PCIe centric driver that combines with both a misc
> device on top (for the read/write/ioctl), and also ttys.
> The device sits on PCIe but we are using the misc device for accessing it.
> tty is just another on top. I don't think this is that uncommon to have a
> hybrid driver.

Ugh, yes, it is uncommon because those are two different things. Why do
you need/want a misc driver to control a tty device? Why do you need a
tty device? What really is this beast?

We got rid of the old "control path" device nodes for tty devices a long
time ago, this feels like a return to that old model, is that why you are
doing this?

But again, I really don't understand what this driver is trying to
control/manage, so it's hard to review it without that knowledge.

> Since we have a hybrid of PCIe + misc + tty, it means that we could
> simultaneously have opening dev/node to read/write (multiple) + tty going.

That's almost always a bad idea.

> Since the struct is embedded inside the primary PCIe structure, we need
> a way to know when all the references are done, and then at that point
> we could free the primary structure.
> That is the reason for the kref. On PCIe device removal, we signal the
> user space process to stop first, but the data structure can not be
> freed until the ref goes to 0.

Again, you can not have multiple reference count objects controlling a
single object. That way is madness and buggy and will never work
properly.

You can have different objects with different lifespans, which, if you
really really want to do this, is the correct way. Otherwise, stick with
one object and one reference count please.

thanks,

greg k-h
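The single-reference-count pattern being asked for here is the usual embedded-kref idiom: one kref in the top-level structure, every opener takes a reference, and the release callback frees the whole thing. A rough sketch with illustrative names (`file_to_vk` is a hypothetical helper, not the driver's actual code):

```c
/* One kref, embedded in the device structure, owns its lifetime. */
struct bcm_vk {
	struct kref kref;
	/* ... PCIe, misc-device and tty state ... */
};

static void bcm_vk_release(struct kref *kref)
{
	struct bcm_vk *vk = container_of(kref, struct bcm_vk, kref);

	kfree(vk);	/* last reference dropped: free everything */
}

static int bcm_vk_open(struct inode *inode, struct file *file)
{
	struct bcm_vk *vk = file_to_vk(file);	/* hypothetical helper */

	kref_get(&vk->kref);	/* each open pins the structure */
	return 0;
}

static int bcm_vk_file_release(struct inode *inode, struct file *file)
{
	struct bcm_vk *vk = file_to_vk(file);

	kref_put(&vk->kref, bcm_vk_release);
	return 0;
}
```

With this shape, PCIe removal just drops its own reference; whichever of the misc or tty paths closes last triggers bcm_vk_release(), so no second counter is needed.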
Re: [PATCH -next] crypto: qat - remove unnecessary mutex_init()
On Wed, Sep 16, 2020 at 07:21:21AM +0100, Qinglang Miao wrote:
> The mutex adf_ctl_lock is initialized statically. It is
> unnecessary to initialize by mutex_init().
>
> Signed-off-by: Qinglang Miao

Acked-by: Giovanni Cabiddu

> ---
>  drivers/crypto/qat/qat_common/adf_ctl_drv.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/drivers/crypto/qat/qat_common/adf_ctl_drv.c b/drivers/crypto/qat/qat_common/adf_ctl_drv.c
> index 71d0c44aa..eb9b3be9d 100644
> --- a/drivers/crypto/qat/qat_common/adf_ctl_drv.c
> +++ b/drivers/crypto/qat/qat_common/adf_ctl_drv.c
> @@ -416,8 +416,6 @@ static long adf_ctl_ioctl(struct file *fp, unsigned int cmd, unsigned long arg)
>
>  static int __init adf_register_ctl_device_driver(void)
>  {
> -	mutex_init(_ctl_lock);
> -
>  	if (adf_chr_drv_create())
>  		goto err_chr_dev;
>
> --
> 2.23.0
>
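For context, "initialized statically" here presumably means the lock is defined with DEFINE_MUTEX(), whose initializer already does everything mutex_init() would, making the runtime call redundant. A short sketch of the two idioms:

```c
/*
 * Static definition: fully initialized at compile time, no runtime
 * mutex_init() needed.
 */
static DEFINE_MUTEX(adf_ctl_lock);

/*
 * mutex_init() is only required for mutexes embedded in dynamically
 * allocated structures, e.g. (hypothetical example):
 */
struct adf_thing {
	struct mutex lock;
};

static struct adf_thing *adf_thing_alloc(void)
{
	struct adf_thing *t = kzalloc(sizeof(*t), GFP_KERNEL);

	if (t)
		mutex_init(&t->lock);
	return t;
}
```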
Re: [PATCH -next] crypto: qat - convert to use DEFINE_SEQ_ATTRIBUTE macro
On Wed, Sep 16, 2020 at 03:50:17AM +0100, Liu Shixin wrote:
> Use DEFINE_SEQ_ATTRIBUTE macro to simplify the code.
>
> Signed-off-by: Liu Shixin

Acked-by: Giovanni Cabiddu

> ---
>  drivers/crypto/qat/qat_common/adf_cfg.c       | 19 +
>  .../qat/qat_common/adf_transport_debug.c      | 42 ++-
>  2 files changed, 5 insertions(+), 56 deletions(-)
>
> diff --git a/drivers/crypto/qat/qat_common/adf_cfg.c b/drivers/crypto/qat/qat_common/adf_cfg.c
> index ac462796cefc..22ae32838113 100644
> --- a/drivers/crypto/qat/qat_common/adf_cfg.c
> +++ b/drivers/crypto/qat/qat_common/adf_cfg.c
> @@ -52,24 +52,7 @@ static const struct seq_operations qat_dev_cfg_sops = {
>  	.show = qat_dev_cfg_show
>  };
>
> -static int qat_dev_cfg_open(struct inode *inode, struct file *file)
> -{
> -	int ret = seq_open(file, _dev_cfg_sops);
> -
> -	if (!ret) {
> -		struct seq_file *seq_f = file->private_data;
> -
> -		seq_f->private = inode->i_private;
> -	}
> -	return ret;
> -}
> -
> -static const struct file_operations qat_dev_cfg_fops = {
> -	.open = qat_dev_cfg_open,
> -	.read = seq_read,
> -	.llseek = seq_lseek,
> -	.release = seq_release
> -};
> +DEFINE_SEQ_ATTRIBUTE(qat_dev_cfg);
>
>  /**
>   * adf_cfg_dev_add() - Create an acceleration device configuration table.
> diff --git a/drivers/crypto/qat/qat_common/adf_transport_debug.c b/drivers/crypto/qat/qat_common/adf_transport_debug.c
> index 2a2eccbf56ec..dac25ba47260 100644
> --- a/drivers/crypto/qat/qat_common/adf_transport_debug.c
> +++ b/drivers/crypto/qat/qat_common/adf_transport_debug.c
> @@ -77,31 +77,14 @@ static void adf_ring_stop(struct seq_file *sfile, void *v)
>  	mutex_unlock(_read_lock);
>  }
>
> -static const struct seq_operations adf_ring_sops = {
> +static const struct seq_operations adf_ring_debug_sops = {
>  	.start = adf_ring_start,
>  	.next = adf_ring_next,
>  	.stop = adf_ring_stop,
>  	.show = adf_ring_show
>  };
>
> -static int adf_ring_open(struct inode *inode, struct file *file)
> -{
> -	int ret = seq_open(file, _ring_sops);
> -
> -	if (!ret) {
> -		struct seq_file *seq_f = file->private_data;
> -
> -		seq_f->private = inode->i_private;
> -	}
> -	return ret;
> -}
> -
> -static const struct file_operations adf_ring_debug_fops = {
> -	.open = adf_ring_open,
> -	.read = seq_read,
> -	.llseek = seq_lseek,
> -	.release = seq_release
> -};
> +DEFINE_SEQ_ATTRIBUTE(adf_ring_debug);
>
>  int adf_ring_debugfs_add(struct adf_etr_ring_data *ring, const char *name)
>  {
> @@ -188,31 +171,14 @@ static void adf_bank_stop(struct seq_file *sfile, void *v)
>  	mutex_unlock(_read_lock);
>  }
>
> -static const struct seq_operations adf_bank_sops = {
> +static const struct seq_operations adf_bank_debug_sops = {
>  	.start = adf_bank_start,
>  	.next = adf_bank_next,
>  	.stop = adf_bank_stop,
>  	.show = adf_bank_show
>  };
>
> -static int adf_bank_open(struct inode *inode, struct file *file)
> -{
> -	int ret = seq_open(file, _bank_sops);
> -
> -	if (!ret) {
> -		struct seq_file *seq_f = file->private_data;
> -
> -		seq_f->private = inode->i_private;
> -	}
> -	return ret;
> -}
> -
> -static const struct file_operations adf_bank_debug_fops = {
> -	.open = adf_bank_open,
> -	.read = seq_read,
> -	.llseek = seq_lseek,
> -	.release = seq_release
> -};
> +DEFINE_SEQ_ATTRIBUTE(adf_bank_debug);
>
>  int adf_bank_debugfs_add(struct adf_etr_bank_data *bank)
>  {
> --
> 2.25.1
>
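For reference, DEFINE_SEQ_ATTRIBUTE(name) from <linux/seq_file.h> expects a `name##_sops` seq_operations to already exist (which is why the patch renames adf_ring_sops to adf_ring_debug_sops) and generates roughly the boilerplate being deleted above:

```c
/* Rough sketch of what DEFINE_SEQ_ATTRIBUTE(adf_ring_debug) expands to: */
static int adf_ring_debug_open(struct inode *inode, struct file *file)
{
	int ret = seq_open(file, &adf_ring_debug_sops);

	if (!ret && inode->i_private) {
		struct seq_file *seq_f = file->private_data;

		seq_f->private = inode->i_private;
	}
	return ret;
}

static const struct file_operations adf_ring_debug_fops = {
	.owner		= THIS_MODULE,
	.open		= adf_ring_debug_open,
	.read		= seq_read,
	.llseek		= seq_lseek,
	.release	= seq_release,
};
```

One behavioral nuance of the conversion: the generated open() only copies inode->i_private into seq_file::private when it is non-NULL, and the generated fops set .owner, which the hand-rolled versions did not.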
Re: [PATCH V2] rcu/tree: Correctly handle single cpu check in rcu_blocking_is_gp
Hi Paul, On 9/24/2020 2:33 AM, Paul E. McKenney wrote: On Wed, Sep 23, 2020 at 12:59:33PM +0530, Neeraj Upadhyay wrote: Currently, for non-preempt kernels (with CONFIG_PREEMPTION=n), rcu_blocking_is_gp() checks (with preemption disabled), whether there is only one cpu online. It uses num_online_cpus() to decide whether only one cpu is online. If there is only single cpu online, synchronize_rcu() is optimized to return without doing all the work to wait for grace period. However, there are few issues with the num_online_cpus() check used, for transition of __num_online_cpus from 2 -> 1 for cpu down path and 1 -> 2 for cpu up path. Again, good catch! As usual, I could not resist editing the commit log and comments, so could you please look the following over to make sure that I did not mess anything up? The commit log and comments look very good! Thanks! Thanks Neeraj Thanx, Paul commit 7af8c1c8d13c6c9c7013fbfef77f843dbc430c4b Author: Neeraj Upadhyay Date: Wed Sep 23 12:59:33 2020 +0530 rcu: Fix single-CPU check in rcu_blocking_is_gp() Currently, for CONFIG_PREEMPTION=n kernels, rcu_blocking_is_gp() uses num_online_cpus() to determine whether there is only one CPU online. When there is only single CPU online, the simple fact that synchronize_rcu() could be legally called implies that a full grace period has elapsed. Therefore, in the single-CPU case, synchronize_rcu() simply returns immediately. Unfortunately, num_online_cpus() is unreliable while a CPU-hotplug operation is transitioning to or from single-CPU operation because: 1. num_online_cpus() uses atomic_read(&__num_online_cpus) to locklessly sample the number of online CPUs. The hotplug locks are not held, which means that an incoming CPU can concurrently update this count. This in turn means that an RCU read-side critical section on the incoming CPU might observe updates prior to the grace period, but also that this critical section might extend beyond the end of the optimized synchronize_rcu(). 
    This breaks RCU's fundamental guarantee.

    2. In addition, num_online_cpus() does no ordering, thus providing another way that RCU's fundamental guarantee can be broken by the current code.

    3. The most probable failure mode happens on outgoing CPUs. The outgoing CPU updates the count of online CPUs in the CPUHP_TEARDOWN_CPU stop-machine handler, which is fine in and of itself due to preemption being disabled at the call to num_online_cpus(). Unfortunately, after that stop-machine handler returns, the CPU takes one last trip through the scheduler (which has RCU readers) and, after the resulting context switch, one final dive into the idle loop. During this time, RCU needs to keep track of two CPUs, but num_online_cpus() will say that there is only one, which in turn means that the surviving CPU will incorrectly ignore the outgoing CPU's RCU read-side critical sections.

    This problem is illustrated by the following litmus test in which P0() corresponds to synchronize_rcu() and P1() corresponds to the incoming CPU. The herd7 tool confirms that the "exists" clause can be satisfied, thus demonstrating that this breakage can happen according to the Linux kernel memory model.

    {
        int x = 0;
        atomic_t numonline = ATOMIC_INIT(1);
    }

    P0(int *x, atomic_t *numonline)
    {
        int r0;

        WRITE_ONCE(*x, 1);
        r0 = atomic_read(numonline);
        if (r0 == 1) {
            smp_mb();
        } else {
            synchronize_rcu();
        }
        WRITE_ONCE(*x, 2);
    }

    P1(int *x, atomic_t *numonline)
    {
        int r0;
        int r1;

        atomic_inc(numonline);
        smp_mb();
        rcu_read_lock();
        r0 = READ_ONCE(*x);
        smp_rmb();
        r1 = READ_ONCE(*x);
        rcu_read_unlock();
    }

    locations [x;numonline;]
    exists (1:r0=0 /\ 1:r1=2)

    It is important to note that these problems arise only when the system is transitioning to or from single-CPU operation.

    One solution would be to hold the CPU-hotplug locks while sampling num_online_cpus(), which was in fact the intent of the (redundant) preempt_disable() and preempt_enable() surrounding this call to num_online_cpus().
    Actually blocking CPU hotplug would not only result in excessive overhead, but would also unnecessarily impede CPU-hotplug operations.
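The direction of the fix follows from point 3 above: instead of the raw __num_online_cpus, sample a CPU count that RCU itself maintains at hotplug points it already participates in, so the count changes only at RCU-meaningful times. The following is a non-runnable kernel-context sketch based on this commit's description; the counter name `rcu_state.n_online_cpus` and its update sites are assumptions, not a verbatim copy of the merged patch:

```c
/*
 * Sketch (assumption-based): RCU maintains its own online-CPU count,
 * updated in hooks RCU already owns (e.g. rcutree_prepare_cpu() on the
 * way up, rcutree_dead_cpu() on the way down), so a reading of 1 here
 * really does mean single-CPU operation from RCU's point of view.
 */
static int rcu_blocking_is_gp(void)
{
	int ret;

	if (IS_ENABLED(CONFIG_PREEMPTION))
		return rcu_scheduler_active == RCU_SCHEDULER_INACTIVE;
	might_sleep();  /* Check for RCU read-side critical section. */
	preempt_disable();
	/*
	 * With only one CPU from RCU's perspective, any synchronize_rcu()
	 * caller has trivially waited out all readers: no concurrently
	 * incoming or outgoing CPU can hold an unseen read-side critical
	 * section.
	 */
	ret = READ_ONCE(rcu_state.n_online_cpus) <= 1;
	preempt_enable();
	return ret;
}
```

The key property is that the counter moves only at points where RCU is already tracking the affected CPU, which closes the window described in point 3 where num_online_cpus() said "one" while RCU still had to watch two CPUs.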
[PATCH v2 1/2] dt-bindings: arm: fsl: add Protonic WD3 board
Add Protonic Holland WD3 iMX6qp based board

Signed-off-by: Oleksij Rempel
---
 Documentation/devicetree/bindings/arm/fsl.yaml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Documentation/devicetree/bindings/arm/fsl.yaml b/Documentation/devicetree/bindings/arm/fsl.yaml
index e94a455eeab9..57dcd061d4c9 100644
--- a/Documentation/devicetree/bindings/arm/fsl.yaml
+++ b/Documentation/devicetree/bindings/arm/fsl.yaml
@@ -161,6 +161,7 @@ properties:
       - enum:
           - fsl,imx6qp-sabreauto    # i.MX6 Quad Plus SABRE Automotive Board
           - fsl,imx6qp-sabresd      # i.MX6 Quad Plus SABRE Smart Device Board
+          - prt,prtwd3              # Protonic WD3 board
       - const: fsl,imx6qp

   - description: i.MX6DL based Boards
--
2.28.0
[PATCH v2 0/2] mainline Protonic WD3 board
changes v2:
- fix comment: WD2 -> WD3

Oleksij Rempel (2):
  dt-bindings: arm: fsl: add Protonic WD3 board
  ARM: dts: add Protonic WD3 board

 .../devicetree/bindings/arm/fsl.yaml |   1 +
 arch/arm/boot/dts/Makefile           |   1 +
 arch/arm/boot/dts/imx6qp-prtwd3.dts  | 553 ++
 3 files changed, 555 insertions(+)
 create mode 100644 arch/arm/boot/dts/imx6qp-prtwd3.dts

--
2.28.0
[PATCH v2 2/2] ARM: dts: add Protonic WD3 board
Protonic WD3 is a proof of concept platform for tractor e-cockpit applications Signed-off-by: Oleksij Rempel --- arch/arm/boot/dts/Makefile | 1 + arch/arm/boot/dts/imx6qp-prtwd3.dts | 553 2 files changed, 554 insertions(+) create mode 100644 arch/arm/boot/dts/imx6qp-prtwd3.dts diff --git a/arch/arm/boot/dts/Makefile b/arch/arm/boot/dts/Makefile index 0730cbfe13db..2355b43a0f4b 100644 --- a/arch/arm/boot/dts/Makefile +++ b/arch/arm/boot/dts/Makefile @@ -578,6 +578,7 @@ dtb-$(CONFIG_SOC_IMX6Q) += \ imx6qp-nitrogen6_max.dtb \ imx6qp-nitrogen6_som2.dtb \ imx6qp-phytec-mira-rdk-nand.dtb \ + imx6qp-prtwd3.dtb \ imx6qp-sabreauto.dtb \ imx6qp-sabresd.dtb \ imx6qp-tx6qp-8037.dtb \ diff --git a/arch/arm/boot/dts/imx6qp-prtwd3.dts b/arch/arm/boot/dts/imx6qp-prtwd3.dts new file mode 100644 index ..1957d5686dee --- /dev/null +++ b/arch/arm/boot/dts/imx6qp-prtwd3.dts @@ -0,0 +1,553 @@ +// SPDX-License-Identifier: GPL-2.0-or-later OR MIT +/* + * Copyright (c) 2018 Protonic Holland + */ + +/dts-v1/; +#include +#include "imx6qp.dtsi" + +/ { + model = "Protonic WD3 board"; + compatible = "prt,prtwd3", "fsl,imx6qp"; + + chosen { + stdout-path = + }; + + memory@1000 { + device_type = "memory"; + reg = <0x1000 0x2000>; + }; + + memory@8000 { + device_type = "memory"; + reg = <0x8000 0x2000>; + }; + + clock_ksz8081: clock-ksz8081 { + compatible = "fixed-clock"; + #clock-cells = <0>; + clock-frequency = <5000>; + }; + + clock_ksz9031: clock-ksz9031 { + compatible = "fixed-clock"; + #clock-cells = <0>; + clock-frequency = <2500>; + }; + + clock_mcp25xx: clock-mcp25xx { + compatible = "fixed-clock"; + #clock-cells = <0>; + clock-frequency = <2000>; + }; + + clock_sja1105: clock-sja1105 { + compatible = "fixed-clock"; + #clock-cells = <0>; + clock-frequency = <2500>; + }; + + mdio { + compatible = "virtual,mdio-gpio"; + pinctrl-names = "default"; + pinctrl-0 = <_mdio>; + + #address-cells = <1>; + #size-cells = <0>; + gpios = < 6 GPIO_ACTIVE_HIGH + 7 GPIO_ACTIVE_HIGH>; + + /* Microchip 
KSZ8081 */ + usbeth_phy: ethernet-phy@3 { + reg = <0x3>; + + interrupts-extended = < 12 IRQ_TYPE_LEVEL_LOW>; + reset-gpios = < 11 GPIO_ACTIVE_LOW>; + reset-assert-us = <500>; + reset-deassert-us = <1000>; + clocks = <_ksz8081>; + clock-names = "rmii-ref"; + micrel,led-mode = <0>; + }; + + tja1102_phy0: ethernet-phy@4 { + compatible = "ethernet-phy-id0180.dc80"; + reg = <0x4>; + + interrupts-extended = < 8 IRQ_TYPE_LEVEL_LOW>; + reset-gpios = < 9 GPIO_ACTIVE_LOW>; + reset-assert-us = <20>; + reset-deassert-us = <2000>; + #address-cells = <1>; + #size-cells = <0>; + + tja1102_phy1: ethernet-phy@5 { + reg = <0x5>; + + interrupts-extended = < 8 IRQ_TYPE_LEVEL_LOW>; + }; + }; + }; + + reg_5v0: regulator-5v0 { + compatible = "regulator-fixed"; + regulator-name = "5v0"; + regulator-min-microvolt = <500>; + regulator-max-microvolt = <500>; + }; + + reg_otg_vbus: regulator-otg-vbus { + compatible = "regulator-fixed"; + regulator-name = "otg-vbus"; + regulator-min-microvolt = <500>; + regulator-max-microvolt = <500>; + gpio = < 22 GPIO_ACTIVE_HIGH>; + enable-active-high; + }; + + usdhc2_wifi_pwrseq: usdhc2-wifi-pwrseq { + compatible = "mmc-pwrseq-simple"; + pinctrl-names = "default"; + pinctrl-0 = <_wifi_npd>; + reset-gpios = < 10 GPIO_ACTIVE_LOW>; + }; +}; + + { + pinctrl-names = "default"; + pinctrl-0 = <_can1>; + xceiver-supply = <_5v0>; + status = "okay"; +}; + + { + cs-gpios = < 26 GPIO_ACTIVE_HIGH>; + pinctrl-names = "default"; + pinctrl-0 = <_ecspi2>; + status = "okay"; + + switch@0 { + compatible = "nxp,sja1105q"; +
[PATCH 2/2] Register IPI_CPU_CRASH_STOP IPI as pseudo-NMI
Register IPI_CPU_CRASH_STOP IPI as pseudo-NMI. For systems that do not support pseudo-NMI, register as a normal IRQ.

Signed-off-by: Yuichi Ito
---
 arch/arm64/kernel/smp.c | 39 +++
 1 file changed, 31 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index b6bde2675ccc..d929dd7221ff 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -79,6 +79,8 @@ enum ipi_msg_type {
 static int ipi_irq_base __read_mostly;
 static int nr_ipi __read_mostly = NR_IPI;
 static struct irq_desc *ipi_desc[NR_IPI] __read_mostly;
+static int ipi_crash_stop = -1;
+static int ipi_crash_stop_enable_nmi;

 static void ipi_setup(int cpu);
 static void ipi_teardown(int cpu);
@@ -954,8 +956,16 @@ static void ipi_setup(int cpu)
 	if (WARN_ON_ONCE(!ipi_irq_base))
 		return;

-	for (i = 0; i < nr_ipi; i++)
-		enable_percpu_irq(ipi_irq_base + i, 0);
+	for (i = 0; i < nr_ipi; i++) {
+		if (ipi_irq_base + i == ipi_crash_stop) {
+			if (!prepare_percpu_nmi(ipi_irq_base + i)) {
+				enable_percpu_nmi(ipi_irq_base + i, 0);
+				ipi_crash_stop_enable_nmi = 1;
+			} else
+				pr_crit("CPU%u: IPI_CPU_CRASH_STOP cannot be enabled NMI.\n", cpu);
+		} else
+			enable_percpu_irq(ipi_irq_base + i, 0);
+	}
 }

 static void ipi_teardown(int cpu)
@@ -965,23 +975,37 @@ static void ipi_teardown(int cpu)
 	if (WARN_ON_ONCE(!ipi_irq_base))
 		return;

-	for (i = 0; i < nr_ipi; i++)
-		disable_percpu_irq(ipi_irq_base + i);
+	for (i = 0; i < nr_ipi; i++) {
+		if (ipi_irq_base + i == ipi_crash_stop) {
+			if (ipi_crash_stop_enable_nmi) {
+				disable_percpu_nmi(ipi_irq_base + i);
+				teardown_percpu_nmi(ipi_irq_base + i);
+			}
+		} else
+			disable_percpu_irq(ipi_irq_base + i);
+	}
 }

 void __init set_smp_ipi_range(int ipi_base, int n)
 {
-	int i;
+	int i, ret;

 	WARN_ON(n < NR_IPI);
 	nr_ipi = min(n, NR_IPI);

+	ret = request_percpu_nmi(ipi_base + IPI_CPU_CRASH_STOP,
+				 ipi_handler, "IPI", _number);
+	if (!ret)
+		ipi_crash_stop = ipi_base + IPI_CPU_CRASH_STOP;
+
 	for (i = 0; i < nr_ipi; i++) {
 		int err;

-		err = request_percpu_irq(ipi_base + i, ipi_handler,
-					 "IPI", _number);
-		WARN_ON(err);
+		if (ipi_base + i != ipi_crash_stop) {
+			err = request_percpu_irq(ipi_base + i, ipi_handler,
+						 "IPI", _number);
+			WARN_ON(err);
+		}

 		ipi_desc[i] = irq_to_desc(ipi_base + i);
 		irq_set_status_flags(ipi_base + i, IRQ_HIDDEN);
--
2.25.1
[PATCH 0/2] Enable support IPI_CPU_CRASH_STOP to be pseudo-NMI
Enable support IPI_CPU_CRASH_STOP to be pseudo-NMI

This patchset enables the IPI_CPU_CRASH_STOP IPI to be a pseudo-NMI. This allows kdump to collect system information even when the CPU is in a HARDLOCKUP state. Only IPI_CPU_CRASH_STOP uses NMI; the other IPIs remain normal IRQs.

The patch has been tested on ThunderX.

This patch assumes Marc's latest IPIs patch-set. [1] It also uses some of Sumit's IPI patch set for NMI. [2]

[1] https://lore.kernel.org/linux-arm-kernel/20200901144324.1071694-1-...@kernel.org/
[2] https://lore.kernel.org/linux-arm-kernel/1599830924-13990-3-git-send-email-sumit.g...@linaro.org/

$ echo 1 > /proc/sys/kernel/panic_on_rcu_stall
$ echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
 : kernel panics and crash kernel boots
 : makedumpfile saves the system state at HARDLOCKUP in vmcore.

crash utility:
crash> bt
PID: 3213 TASK: fd001adc5940 CPU: 8 COMMAND: "bash"
 #0 [fe0022fefcf0] lkdtm_HARDLOCKUP at fe0010888ab4
 #1 [fe0022fefd10] lkdtm_do_action at fe00108882bc
 #2 [fe0022fefd20] direct_entry at fe0010888720
 #3 [fe0022fefd70] full_proxy_write at fe001058cfe4
 #4 [fe0022fefdb0] vfs_write at fe00104a4c2c
 #5 [fe0022fefdf0] ksys_write at fe00104a4f0c
 #6 [fe0022fefe40] __arm64_sys_write at fe00104a4fbc
 #7 [fe0022fefe50] el0_svc_common.constprop.0 at fe0010159e38
 #8 [fe0022fefe80] do_el0_svc at fe0010159fa0
 #9 [fe0022fefe90] el0_svc at fe00101481d0
#10 [fe0022fefea0] el0_sync_handler at fe00101484b4
#11 [fe0022fefff0] el0_sync at fe0010142b7c

Sumit Garg (1):
  irqchip/gic-v3: Enable support for SGIs to act as NMIs

Yuichi Ito (1):
  Register IPI_CPU_CRASH_STOP IPI as pseudo-NMI

 arch/arm64/kernel/smp.c      | 39 
 drivers/irqchip/irq-gic-v3.c | 13 ++--
 2 files changed, 42 insertions(+), 10 deletions(-)

--
2.25.1
[PATCH 1/2] irqchip/gic-v3: Enable support for SGIs to act as NMIs
From: Sumit Garg

Add support to handle SGIs as regular NMIs. As SGIs (IPIs) default to a special flow handler, handle_percpu_devid_fasteoi_ipi(), skip the NMI handler update in the case of SGIs. Also, enable NMI support prior to gic_smp_init(), as allocation of SGIs as IRQs/NMIs happens as part of this routine.

Signed-off-by: Sumit Garg
---
 drivers/irqchip/irq-gic-v3.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index 262bc7377abd..2eda18d1df59 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -477,6 +477,11 @@ static int gic_irq_nmi_setup(struct irq_data *d)
 	if (WARN_ON(gic_irq(d) >= 8192))
 		return -EINVAL;

+	if (get_intid_range(d) == SGI_RANGE) {
+		gic_irq_set_prio(d, GICD_INT_NMI_PRI);
+		return 0;
+	}
+
 	/* desc lock should already be held */
 	if (gic_irq_in_rdist(d)) {
 		u32 idx = gic_get_ppi_index(d);
@@ -514,6 +519,11 @@ static void gic_irq_nmi_teardown(struct irq_data *d)
 	if (WARN_ON(gic_irq(d) >= 8192))
 		return;

+	if (get_intid_range(d) == SGI_RANGE) {
+		gic_irq_set_prio(d, GICD_INT_DEF_PRI);
+		return;
+	}
+
 	/* desc lock should already be held */
 	if (gic_irq_in_rdist(d)) {
 		u32 idx = gic_get_ppi_index(d);
@@ -1708,6 +1718,7 @@ static int __init gic_init_bases(void __iomem *dist_base,
 	gic_dist_init();
 	gic_cpu_init();
+	gic_enable_nmi_support();
 	gic_smp_init();
 	gic_cpu_pm_init();

@@ -1719,8 +1730,6 @@ static int __init gic_init_bases(void __iomem *dist_base,
 		gicv2m_init(handle, gic_data.domain);
 	}

-	gic_enable_nmi_support();
-
 	return 0;

 out_free:
--
2.25.1
Re: [PATCH v2 0/6] clk: axi-clk-gen: misc updates to the driver
Hi Stephen,

On Wed, Sep 23, 2020 at 04:58:33PM -0700, Stephen Boyd wrote:
> Quoting Alexandru Ardelean (2020-09-22 23:22:33)
> > On Tue, Sep 22, 2020 at 10:42 PM Stephen Boyd wrote:
> > >
> > > Quoting Moritz Fischer (2020-09-14 19:41:38)
> > > > On Mon, Sep 14, 2020 at 11:11:05AM +0300, Alexandru Ardelean wrote:
> > > > > On Mon, Aug 10, 2020 at 4:41 PM Alexandru Ardelean wrote:
> > > > > >
> > > > > > These patches synchronize the driver with the current state in the
> > > > > > Analog Devices Linux tree:
> > > > > > https://github.com/analogdevicesinc/linux/
> > > > > >
> > > > > > They have been in the tree for about 2-3, so they did receive some
> > > > > > testing.
> > > > >
> > > > > Ping on this series.
> > > > > Do I need to do a re-send?
> > >
> > > I got this patch series twice. Not sure why.
> >
> > My fault here.
> > Some Ctrl + R usage and not being attentive with the arguments.
> > I think I added "*.patch" twice on the send-mail command.
> > I did something similar [by accident] for some DMA patches.
> > Apologies.
> >
> > I can do a re-send for this, if it helps.
>
> Sure. Please resend it.
>
> > > > I've applied the FPGA one, the other ones should go through the clock
> > > > tree I think?
> > >
> > > Doesn't patch 6 rely on the FPGA patch? How can that driver build
> > > without the header file?
> >
> > Yes it does depend on the FPGA patch.
> > We can drop patch 6 for now, pending a merge to Linus' tree and then
> > wait for the trickle-down.
> > I don't mind waiting for these patches.
> > I have plenty of backlog that I want to run through, and cleanup and
> > then upstream.
> > So, there is no hurry.
>
> Can you send me a signed tag with that patch? I can base this patch
> series on top of that. Or I can just apply it to clk tree and if nobody
> changes it in the meantime merge should work out in linux-next and
> linus' tree upstream.
Long story short I messed up my pull-request to Greg and had to back out the patch anyways. In retrospect I think the patch should have gone through your tree anyways, so here's our chance to get it right. Feel free to take it with the rest of the changes through your tree. Note: When I applied the patch I fixed up the whitespace that checkpatch complained about so you might want to do that (or ask Alexandru to resend the patch). Acked-by: Moritz Fischer Sorry for the confusion and let me know if you still prefer a signed tag. - Moritz
Re: [PATCH] iommu: of: skip iommu_device_list traversal in of_iommu_xlate()
On 9/23/2020 9:54 PM, Robin Murphy wrote:
> On 2020-09-23 15:53, Charan Teja Reddy wrote:
>> In of_iommu_xlate(), check whether the iommu device is enabled before
>> traversing the iommu_device_list through iommu_ops_from_fwnode(). There
>> is no use in traversing the iommu_device_list only to return NO_IOMMU
>> because the iommu device node is disabled.
>
> Well, the "use" is that it keeps the code that much smaller and simpler
> to have a single path for returning this condition. This whole callstack
> isn't exactly a high-performance code path to begin with, and we've
> always assumed that IOMMUs present but disabled in DT would be a pretty
> rare exception.

Fine.. I thought that it is logical to return when the IOMMU DT node is disabled, over code simplicity. And agreed that it is not a high-performance path.

> Do you have a system that challenges those assumptions
> and shows any benefit from this change?

No, I don't have a system that challenges these assumptions.

>
> Robin.
>
>> Signed-off-by: Charan Teja Reddy
>> ---
>>   drivers/iommu/of_iommu.c | 5 +++--
>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/iommu/of_iommu.c b/drivers/iommu/of_iommu.c
>> index e505b91..225598c 100644
>> --- a/drivers/iommu/of_iommu.c
>> +++ b/drivers/iommu/of_iommu.c
>> @@ -94,9 +94,10 @@ static int of_iommu_xlate(struct device *dev,
>>       struct fwnode_handle *fwnode = _spec->np->fwnode;
>>       int ret;
>>
>> +     if (!of_device_is_available(iommu_spec->np))
>> +             return NO_IOMMU;
>>       ops = iommu_ops_from_fwnode(fwnode);
>> -     if ((ops && !ops->of_xlate) ||
>> -         !of_device_is_available(iommu_spec->np))
>> +     if (ops && !ops->of_xlate)
>>               return NO_IOMMU;
>>       ret = iommu_fwspec_init(dev, _spec->np->fwnode, ops);

--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
drivers/gpu/drm/ingenic/ingenic-drm-drv.c:(.text.ingenic_drm_bind+0x790): undefined reference to `clk_get_parent'
Hi Paul,

FYI, the error/warning still remains.

tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
head: c9c9e6a49f8998e9334507378c08cc16cb3ec0e5
commit: fc1acf317b01083d47228c0d21cfc0764f37a04e drm/ingenic: Add support for the IPU
date: 10 weeks ago
config: mips-randconfig-c004-20200922 (attached as .config)
compiler: mips-linux-gcc (GCC) 9.3.0

reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        git checkout fc1acf317b01083d47228c0d21cfc0764f37a04e
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=mips

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot

All errors (new ones prefixed by >>):

   mips-linux-ld: drivers/gpu/drm/ingenic/ingenic-drm-drv.o: in function `ingenic_drm_bind':
>> drivers/gpu/drm/ingenic/ingenic-drm-drv.c:(.text.ingenic_drm_bind+0x790): undefined reference to `clk_get_parent'
   mips-linux-ld: drivers/mmc/host/mtk-sd.o: in function `msdc_ops_set_ios':
   arch/mips/include/asm/barrier.h:(.text.msdc_ops_set_ios+0x79c): undefined reference to `clk_get_parent'
   mips-linux-ld: arch/mips/include/asm/barrier.h:(.text.msdc_ops_set_ios+0x7cc): undefined reference to `clk_get_parent'
   mips-linux-ld: drivers/iio/adc/ingenic-adc.o: in function `jz4770_adc_init_clk_div':
   arch/mips/include/asm/barrier.h:(.text.jz4770_adc_init_clk_div+0x2c): undefined reference to `clk_get_parent'
   mips-linux-ld: drivers/iio/adc/ingenic-adc.o: in function `jz4725b_adc_init_clk_div':
   arch/mips/include/asm/barrier.h:(.text.jz4725b_adc_init_clk_div+0x2c): undefined reference to `clk_get_parent'

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org
Re: possible deadlock in xfrm_policy_delete
On Thu, Sep 24, 2020 at 06:44:12AM +0200, Dmitry Vyukov wrote:
>
> FWIW one of the dups of this issue was bisected to:
>
> commit 1909760f5fc3f123e47b4e24e0ccdc0fc8f3f106
> Author: Ahmed S. Darwish
> Date: Fri Sep 4 15:32:31 2020 +
>
>     seqlock: PREEMPT_RT: Do not starve seqlock_t writers
>
> Can it be related?

Whether there is a genuine problem or not, the lockdep report that I quoted is simply nonsensical because it's considering two distinct seqlocks to be the same lock.

I don't think the lockdep issue has anything to do with the commit in question, because this commit is specific to seqlocks. There is another syzbot report in this pile that mixed the SCTP socket lock with the TCP socket lock, and those are not seqlocks.

It's almost as if, when a spinlock is freed and reallocated, lockdep is not clearing the existing state.

Cheers,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: possible deadlock in xfrm_policy_delete
On Thu, Sep 24, 2020 at 6:36 AM Herbert Xu wrote: > > On Sun, Sep 20, 2020 at 01:22:14PM -0700, syzbot wrote: > > Hello, > > > > syzbot found the following issue on: > > > > HEAD commit:5fa35f24 Add linux-next specific files for 20200916 > > git tree: linux-next > > console output: https://syzkaller.appspot.com/x/log.txt?x=1122e2d990 > > kernel config: https://syzkaller.appspot.com/x/.config?x=6bdb7e39caf48f53 > > dashboard link: https://syzkaller.appspot.com/bug?extid=c32502fd255cb3a44048 > > compiler: gcc (GCC) 10.1.0-syz 20200507 > > > > Unfortunately, I don't have any reproducer for this issue yet. > > > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > > Reported-by: syzbot+c32502fd255cb3a44...@syzkaller.appspotmail.com > > > > = > > WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected > > 5.9.0-rc5-next-20200916-syzkaller #0 Not tainted > > - > > syz-executor.1/13775 [HC0[0]:SC0[4]:HE1:SE0] is trying to acquire: > > 88805ee15a58 (>xfrm.xfrm_policy_lock){+...}-{2:2}, at: > > spin_lock_bh include/linux/spinlock.h:359 [inline] > > 88805ee15a58 (>xfrm.xfrm_policy_lock){+...}-{2:2}, at: > > xfrm_policy_delete+0x3a/0x90 net/xfrm/xfrm_policy.c:2236 > > > > and this task is already holding: > > 8880a811b1e0 (k-slock-AF_INET6){+.-.}-{2:2}, at: spin_lock > > include/linux/spinlock.h:354 [inline] > > 8880a811b1e0 (k-slock-AF_INET6){+.-.}-{2:2}, at: tcp_close+0x6e3/0x1200 > > net/ipv4/tcp.c:2503 > > which would create a new lock dependency: > > (k-slock-AF_INET6){+.-.}-{2:2} -> (>xfrm.xfrm_policy_lock){+...}-{2:2} > > > > but this new dependency connects a SOFTIRQ-irq-safe lock: > > (k-slock-AF_INET6){+.-.}-{2:2} > > > > ... 
which became SOFTIRQ-irq-safe at:
> >   lock_acquire+0x1f2/0xaa0 kernel/locking/lockdep.c:5398
> >   __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
> >   _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
> >   spin_lock include/linux/spinlock.h:354 [inline]
> >   sctp_rcv+0xd96/0x2d50 net/sctp/input.c:231
>
> What's going on with all these bogus lockdep reports?
>
> These are two completely different locks, one is for TCP and the
> other is for SCTP. Why is lockdep suddenly becoming confused about
> this?
>
> FWIW this flood of bogus reports started on 16/Sep.

FWIW one of the dups of this issue was bisected to:

commit 1909760f5fc3f123e47b4e24e0ccdc0fc8f3f106
Author: Ahmed S. Darwish
Date: Fri Sep 4 15:32:31 2020 +

    seqlock: PREEMPT_RT: Do not starve seqlock_t writers

Can it be related?

A number of other new lockdep reports were bisected to the following one, which was the true, intentional root cause of those, but it looks a bit too old to cause the xfrm reports:

commit f08e3888574d490b31481eef6d84c61bedba7a47
Author: Boqun Feng
Date: Fri Aug 7 07:42:30 2020 +

    lockdep: Fix recursive read lock related safe->unsafe detection
Re: inconsistent lock state in xfrm_policy_lookup_inexact_addr
On Wed, Sep 16, 2020 at 01:51:14AM -0700, syzbot wrote: > Hello, > > syzbot found the following issue on: > > HEAD commit:6b02addb Add linux-next specific files for 20200915 > git tree: linux-next > console output: https://syzkaller.appspot.com/x/log.txt?x=15888efd90 > kernel config: https://syzkaller.appspot.com/x/.config?x=7086d0e9e44d4a14 > dashboard link: https://syzkaller.appspot.com/bug?extid=f4d0f0f7c860608404c4 > compiler: gcc (GCC) 10.1.0-syz 20200507 > > Unfortunately, I don't have any reproducer for this issue yet. > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > Reported-by: syzbot+f4d0f0f7c86060840...@syzkaller.appspotmail.com > > > WARNING: inconsistent lock state > 5.9.0-rc5-next-20200915-syzkaller #0 Not tainted > > inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-R} usage. > kworker/1:1/23 [HC0[0]:SC1[1]:HE0:SE0] takes: > 8880a6ff5f28 (>seqcount#12){+.+-}-{0:0}, at: > xfrm_policy_lookup_inexact_addr+0x57/0x200 net/xfrm/xfrm_policy.c:1909 > {SOFTIRQ-ON-W} state was registered at: > lock_acquire+0x1f2/0xaa0 kernel/locking/lockdep.c:5398 > write_seqcount_t_begin_nested include/linux/seqlock.h:509 [inline] > write_seqcount_t_begin include/linux/seqlock.h:535 [inline] > write_seqlock include/linux/seqlock.h:883 [inline] > xfrm_set_spdinfo+0x302/0x660 net/xfrm/xfrm_user.c:1185 This is >xfrm.policy_hthresh.lock. ... 
> stack backtrace: > CPU: 1 PID: 23 Comm: kworker/1:1 Not tainted > 5.9.0-rc5-next-20200915-syzkaller #0 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > Workqueue: wg-crypt-wg0 wg_packet_tx_worker > Call Trace: > > __dump_stack lib/dump_stack.c:77 [inline] > dump_stack+0x198/0x1fb lib/dump_stack.c:118 > print_usage_bug kernel/locking/lockdep.c:3694 [inline] > valid_state kernel/locking/lockdep.c:3705 [inline] > mark_lock_irq kernel/locking/lockdep.c:3908 [inline] > mark_lock.cold+0x13/0x10d kernel/locking/lockdep.c:4375 > mark_usage kernel/locking/lockdep.c:4252 [inline] > __lock_acquire+0x1402/0x55d0 kernel/locking/lockdep.c:4750 > lock_acquire+0x1f2/0xaa0 kernel/locking/lockdep.c:5398 > seqcount_lockdep_reader_access+0x139/0x1a0 include/linux/seqlock.h:103 > xfrm_policy_lookup_inexact_addr+0x57/0x200 net/xfrm/xfrm_policy.c:1909 And this is a completely different seqlock. Again lockdep is creating a bogus report by lumping two unrelated locks (but of the same type, in this case seqlock) together. Cheers, -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: possible deadlock in xfrm_user_rcv_msg
#syz dup: inconsistent lock state in xfrm_policy_lookup_inexact_addr -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: possible deadlock in xfrm_policy_delete
On Sun, Sep 20, 2020 at 01:22:14PM -0700, syzbot wrote:
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit:    5fa35f24 Add linux-next specific files for 20200916
> git tree:       linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=1122e2d990
> kernel config:  https://syzkaller.appspot.com/x/.config?x=6bdb7e39caf48f53
> dashboard link: https://syzkaller.appspot.com/bug?extid=c32502fd255cb3a44048
> compiler:       gcc (GCC) 10.1.0-syz 20200507
>
> Unfortunately, I don't have any reproducer for this issue yet.
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+c32502fd255cb3a44...@syzkaller.appspotmail.com
>
> =
> WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
> 5.9.0-rc5-next-20200916-syzkaller #0 Not tainted
> -
> syz-executor.1/13775 [HC0[0]:SC0[4]:HE1:SE0] is trying to acquire:
> 88805ee15a58 (>xfrm.xfrm_policy_lock){+...}-{2:2}, at: spin_lock_bh include/linux/spinlock.h:359 [inline]
> 88805ee15a58 (>xfrm.xfrm_policy_lock){+...}-{2:2}, at: xfrm_policy_delete+0x3a/0x90 net/xfrm/xfrm_policy.c:2236
>
> and this task is already holding:
> 8880a811b1e0 (k-slock-AF_INET6){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:354 [inline]
> 8880a811b1e0 (k-slock-AF_INET6){+.-.}-{2:2}, at: tcp_close+0x6e3/0x1200 net/ipv4/tcp.c:2503
> which would create a new lock dependency:
> (k-slock-AF_INET6){+.-.}-{2:2} -> (>xfrm.xfrm_policy_lock){+...}-{2:2}
>
> but this new dependency connects a SOFTIRQ-irq-safe lock:
> (k-slock-AF_INET6){+.-.}-{2:2}
>
> ... which became SOFTIRQ-irq-safe at:
>   lock_acquire+0x1f2/0xaa0 kernel/locking/lockdep.c:5398
>   __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
>   _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
>   spin_lock include/linux/spinlock.h:354 [inline]
>   sctp_rcv+0xd96/0x2d50 net/sctp/input.c:231

What's going on with all these bogus lockdep reports?

These are two completely different locks, one is for TCP and the other is for SCTP. Why is lockdep suddenly becoming confused about this?

FWIW this flood of bogus reports started on 16/Sep.

Cheers,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH] mmap_lock: add tracepoints around lock acquisition
On Thu, Sep 24, 2020 at 12:09 AM Steven Rostedt wrote:
>
> On Wed, 23 Sep 2020 18:04:17 +0800
> Yafang Shao wrote:
>
> > > What you can do, and what we have done is the following:
> > >
> > > (see include/linux/page_ref.h)
> > >
> > > #ifdef CONFIG_TRACING
> > > extern struct tracepoint __tracepoint_mmap_lock_start_locking;
> > > extern struct tracepoint __tracepoint_mmap_lock_acquire_returned;
> > >
> > > #define mmap_lock_tracepoint_active(t) static_key_false(&(__tracepoint_mmap_lock_##t).key)
> > >
> > > #else
> > > #define mmap_lock_tracepoint_active(t) false
> > > #endif
> > >
> > > static inline void mmap_write_lock(struct mm_struct *mm)
> > > {
> > >         if (mmap_lock_tracepoint_active(start_locking))
> > >                 mmap_lock_start_trace_wrapper();
> > >         down_write(>mmap_lock);
> > >         if (mmap_lock_tracepoint_active(acquire_returned))
> > >                 mmap_lock_acquire_trace_wrapper();
> > > }
> > >
> > > -- Steve
> >
> > Great!
> >
> > Thanks Steve.
>
> If the above becomes useful, I may just add helper functions into a
> header file that you can include. Perhaps call it: tracepoint_active()
> and you need to pass in as an argument the name of the tracepoint.
>

Yes, please. The new helper would be useful.

--
Thanks
Yafang
Re: [GIT PULL rcu-tasks-trace] 50x speedup for synchronize_rcu_tasks_trace()
On Wed, Sep 23, 2020 at 07:46:15PM -0700, Alexei Starovoitov wrote:
> On Tue, Sep 22, 2020 at 9:25 AM Paul E. McKenney wrote:
> >
> > Hello, Alexei,
> >
> > This pull request contains eight commits that speed up RCU Tasks Trace
> > grace periods by a factor of 50, fix a few race conditions exposed
> > by this speedup, and clean up a couple of minor issues. These have
> > been exposed to 0day and -next testing, and have passed well over 1,000
> > hours of rcutorture testing, some of which has contained ad-hoc changes
> > to further increase race probabilities. So they should be solid!
> > (Famous last words...)
> >
> > I would normally have sent this series up through -tip, but as we
> > discussed, going up through the BPF and networking trees provides the
> > needed exposure to real-world testing of these changes. Please note
> > that the first patch is already in mainline, but given identical SHA-1
> > commit IDs, git should have no problem figuring this out. I will also
> > be retaining these commits in -rcu in order to continue exposing them
> > to rcutorture testing, but again the identical SHA-1 commit IDs will
> > make everything work out.
>
> Pulled into bpf-next. Thanks a lot.
>
> Also confirming 50x speedup.
> Really nice to see that selftests/bpf are now fast again.
>
> Not only all bpf developers will be running these patches now,
> but the bpf CI system will be exercising them as well.

Sounds very good, thank you! If problems arise, you know where to find me. As do most of the people running these patches, I would guess. ;-)

Thanx, Paul
Re: [PATCH net-next v3 1/2] net: dsa: untag the bridge pvid from rx skbs
On 9/23/2020 4:08 PM, Vladimir Oltean wrote: On Wed, Sep 23, 2020 at 03:59:46PM -0700, Florian Fainelli wrote: On 9/23/20 3:58 PM, Vladimir Oltean wrote: On Wed, Sep 23, 2020 at 03:54:59PM -0700, Florian Fainelli wrote: Not having much luck with using __vlan_find_dev_deep_rcu() for a reason I don't understand: we trip over the proto value being neither of the two supported Ethertypes and hit the BUG(). + upper_dev = __vlan_find_dev_deep_rcu(br, htons(proto), vid); + if (upper_dev) + return skb; Any ideas? Damn... Yes, of course, the skb->protocol is still ETH_P_XDSA which is where eth_type_trans() on the master left it. proto was obtained from br_vlan_get_proto() a few lines above, and br_vlan_get_proto() just returns br->vlan_proto which defaults to htons(ETH_P_8021Q) from br_vlan_init(). This is not skb->protocol that we are looking at AFAICT. Ok, my mistake. So what is the value of proto in vlan_proto_idx when it fails? To me, the call path looks pretty pass-through for vlan_proto. At the time we crash the proto value is indeed ETH_P_XDSA, but it is not because of the __vlan_find_dev_deep_rcu() call as I was misled by the traces I was looking at (on ARMv7 the LR was not pointing where I was expecting it to), it is because of the following call trace: netif_receive_skb_list_internal -> __netif_receive_skb_list_core -> __netif_receive_skb_core -> vlan_do_receive() That function does use skb->vlan_proto to determine the VLAN group, and at that point we have not set it but we did inherit skb->protocol instead, which is ETH_P_XDSA. The following does work though, tested with both br0 and a br0.1 upper: + upper_dev = __vlan_find_dev_deep_rcu(br, htons(proto), vid); + if (upper_dev) { + skb->vlan_proto = vlan_dev_vlan_proto(upper_dev); + return skb; + } I should have re-tested v2 and v3 with a bridge upper but I did not, otherwise I would have caught that. If that sounds acceptable to you as well, I will submit that tomorrow.
Let me know what you think about the 802.1Q upper of a physical switch port in the other email. -- Florian
linux-next: manual merge of the vfio tree with the s390 tree
Hi all, Today's linux-next merge of the vfio tree got a conflict in: arch/s390/pci/pci_bus.c between commit: abb95b7550f8 ("s390/pci: consolidate SR-IOV specific code") from the s390 tree and commit: 08b6e22b850c ("s390/pci: Mark all VFs as not implementing PCI_COMMAND_MEMORY") from the vfio tree. I fixed it up (see below) and can carry the fix as necessary. This is now fixed as far as linux-next is concerned, but any non trivial conflicts should be mentioned to your upstream maintainer when your tree is submitted for merging. You may also want to consider cooperating with the maintainer of the conflicting tree to minimise any particularly complex conflicts. -- Cheers, Stephen Rothwell diff --cc arch/s390/pci/pci_bus.c index 0c0db7c3a404,c93486a9989b.. --- a/arch/s390/pci/pci_bus.c +++ b/arch/s390/pci/pci_bus.c @@@ -135,9 -197,10 +135,10 @@@ void pcibios_bus_add_device(struct pci_ * With pdev->no_vf_scan the common PCI probing code does not * perform PF/VF linking. */ - if (zdev->vfn) + if (zdev->vfn) { - zpci_bus_setup_virtfn(zdev->zbus, pdev, zdev->vfn); + zpci_iov_setup_virtfn(zdev->zbus, pdev, zdev->vfn); - + pdev->no_command_memory = 1; + } } static int zpci_bus_add_device(struct zpci_bus *zbus, struct zpci_dev *zdev)
Re: possible deadlock in xfrm_policy_lookup_bytype
#syz dup: inconsistent lock state in xfrm_policy_lookup_inexact_addr -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
RE: [PATCH] PCI: dwc: Added link up check in map_bus of dw_child_pcie_ops
Hi Rob, Thanks a lot for your comments! > -Original Message- > From: Rob Herring > Sent: 2020年9月18日 23:28 > To: Z.q. Hou > Cc: linux-kernel@vger.kernel.org; PCI ; Lorenzo > Pieralisi ; Bjorn Helgaas > ; Gustavo Pimentel > ; Michael Walle ; > Ard Biesheuvel > Subject: Re: [PATCH] PCI: dwc: Added link up check in map_bus of > dw_child_pcie_ops > > On Fri, Sep 18, 2020 at 5:02 AM Z.q. Hou wrote: > > > > Hi Rob, > > > > Thanks a lot for your comments! > > > > > -Original Message- > > > From: Rob Herring > > > Sent: 2020年9月17日 4:29 > > > To: Z.q. Hou > > > Cc: linux-kernel@vger.kernel.org; PCI ; > > > Lorenzo Pieralisi ; Bjorn Helgaas > > > ; Gustavo Pimentel > > > ; Michael Walle > ; > > > Ard Biesheuvel > > > Subject: Re: [PATCH] PCI: dwc: Added link up check in map_bus of > > > dw_child_pcie_ops > > > > > > On Tue, Sep 15, 2020 at 11:49 PM Zhiqiang Hou > > > > wrote: > > > > > > > > From: Hou Zhiqiang > > > > > > > > On NXP Layerscape platforms, it results in SError in the > > > > enumeration of the PCIe controller, which is not connecting with > > > > an Endpoint device. And it doesn't make sense to enumerate the > > > > Endpoints when the PCIe link is down. So this patch added the link > > > > up check to avoid to fire configuration transactions on link down bus. > > > > > > Michael reported the same issue as well. > > > > > > What happens if the link goes down between the check and the access? > > > > This patch cannot cover this case, and will get the SError. > > But I think it makes sense to avoid firing transactions on link down bus. > > That's impossible to do without a race even in h/w. Agree. > > > > It's a racy check. I'd like to find an alternative solution. It's > > > even worse if Layerscape is used in ECAM mode. I looked at the EDK2 > > > setup for layerscape[1] and it looks like root ports are just skipped if > > > link > is down. > > > Maybe a link down just never happens once up, but if so, then we > > > only need to check it once and fail probe. 
> > > > Many customers connect the FPGA Endpoint, which may establish PCIe > > link after the PCIe enumeration and then rescan the PCIe bus, so I > > think it should not exit the probe of root port even if there is not link up > during enumeration. > > That's a good reason. I want to unify the behavior here as it varies per > platform currently and wasn't sure which way to go. > > > > > I've dug into this a bit more and am curious about the PCIE_ABSERR > > > register setting which is set to: > > > > > > #define PCIE_ABSERR_SETTING 0x9401 /* Forward error of non-posted > > > request */ > > > > > > It seems to me this is not what we want at least for config > > > accesses, but commit 84d897d6993 where this was added seems to say > > > otherwise. Is it not possible to configure the response per access type? > > > > Thanks a lot for your investigation! > > The story is like this: Some customers worry about these silent error > > (DWC PCIe IP won't forward the error of outbound non-post request by > > default), so we were pushed to enable the error forwarding to AXI in > > the commit > > 84d897d6993 as you saw. But it cannot differentiate the config > > transactions from the MEM_rd, except the Vendor ID access, which is > > controlled by a separate bit and it was set to not forward error of access > of Vendor ID. > > So we think it's okay to enable the error forwarding, the SError > > should not occur, because after the enumeration it won't access the > non-existent functions. > > We've rejected upstream support for platforms aborting on config > accesses[1]. I think there's clear consensus that aborting is the wrong > behavior. > > Do MEM_wr errors get forwarded? Seems like that would be enough. Also, > wouldn't page faults catch most OOB accesses anyways? You need things > page aligned anyways with an IOMMU and doing userspace access or guest > assignment. Yes, errors of MEM_wr can be forwarded. 
> > Here's another idea, how about only enabling forwarding errors if the link is > up? It really would need to be configured any time the link state changes > rather than just at probe. I'm not sure if you have a way to disable it on > link > down though. Dug deeper into this issue and found that the setting of not forwarding the error of a non-existent Vendor ID access depends on the link partner: 1. When there is a link partner (namely link up), it will return 0xFFFFFFFF when reading a non-existent function's Vendor ID and won't forward an error to AXI. 2. When there is no link partner (link down), it will forward the error of reading a non-existent function's Vendor ID to AXI and result in SError. I think this is a DWC PCIe IP specific issue but I have not got feedback from the design team. I'm thinking of disabling this error forwarding just like other platforms, since when these errors (UR, CA and CT) are detected, the AER driver can also report the error and try to recover. Thanks, Zhiqiang > > Rob
Re: [RFC 2/2] printk: Add more information about the printk caller
On Wed, Sep 23, 2020 at 03:56:17PM +0200, Petr Mladek wrote: ... > > -static inline u32 printk_caller_id(void) > +static enum printk_caller_ctx get_printk_caller_ctx(void) > +{ > + if (in_nmi()) > + return printk_ctx_nmi; > + > + if (in_irq()) > + return printk_ctx_hardirq; > + > + if (in_softirq()) > + return printk_ctx_softirq; > + > + return printk_ctx_task; > +} > + in_softirq() here will be true for both softirq contexts *and* BH-disabled regions. Did you mean in_serving_softirq() instead? Thanks, -- Ahmed S. Darwish Linutronix GmbH
ARC: allmodconfig: Error: inappropriate arguments for opcode 'mpyd'
arc:allmodconfig build failed with gcc-8, gcc-9 and gcc-10 on Linus's mainline tree. Build log: make -sk KBUILD_BUILD_USER=TuxBuild -C/linux ARCH=arc CROSS_COMPILE=arc-elf32- HOSTCC=gcc CC="sccache arc-elf32-gcc" O=build allmodconfig make -sk KBUILD_BUILD_USER=TuxBuild -C/linux -j16 ARCH=arc CROSS_COMPILE=arc-elf32- HOSTCC=gcc CC="sccache arc-elf32-gcc" O=build uImage In file included from :31: ./usr/include/sound/hdspm.h:48:2: error: unknown type name ‘__u32’ __u32 input_peaks[64]; ^ ./usr/include/sound/hdspm.h:49:2: error: unknown type name ‘__u32’ __u32 playback_peaks[64]; ^ ./usr/include/sound/hdspm.h:50:2: error: unknown type name ‘__u32’ __u32 output_peaks[64]; ^ ./usr/include/sound/hdspm.h:52:2: error: unknown type name ‘__u64’ __u64 input_rms[64]; ^ ./usr/include/sound/hdspm.h:53:2: error: unknown type name ‘__u64’ __u64 playback_rms[64]; ^ ./usr/include/sound/hdspm.h:54:2: error: unknown type name ‘__u64’ __u64 output_rms[64]; ^ ./usr/include/sound/hdspm.h:56:2: error: unknown type name ‘__u8’ __u8 speed; /* enum {ss, ds, qs} */ ^~~~ ./usr/include/sound/hdspm.h:157:2: error: unknown type name ‘__u8’ __u8 card_type; /* enum hdspm_io_type */ ^~~~ ./usr/include/sound/hdspm.h:160:2: error: unknown type name ‘__u64’ __u64 card_clock; ^ ./usr/include/sound/hdspm.h:161:2: error: unknown type name ‘__u32’ __u32 master_period; ^ ./usr/include/sound/hdspm.h:165:4: error: unknown type name ‘__u8’ __u8 sync_wc; /* enum hdspm_sync */ ^~~~ ./usr/include/sound/hdspm.h:166:4: error: unknown type name ‘__u8’ __u8 sync_madi; /* enum hdspm_sync */ ^~~~ ./usr/include/sound/hdspm.h:167:4: error: unknown type name ‘__u8’ __u8 sync_tco; /* enum hdspm_sync */ ^~~~ ./usr/include/sound/hdspm.h:168:4: error: unknown type name ‘__u8’ __u8 sync_in; /* enum hdspm_sync */ ^~~~ ./usr/include/sound/hdspm.h:169:4: error: unknown type name ‘__u8’ __u8 madi_input; /* enum hdspm_madi_input */ ^~~~ ./usr/include/sound/hdspm.h:170:4: error: unknown type name ‘__u8’ __u8 channel_format; /* enum 
hdspm_madi_channel_format */ ^~~~ ./usr/include/sound/hdspm.h:171:4: error: unknown type name ‘__u8’ __u8 frame_format; /* enum hdspm_madi_frame_format */ ^~~~ ./usr/include/sound/hdspm.h:186:2: error: unknown type name ‘__u8’ __u8 card_type; /* enum hdspm_io_type */ ^~~~ make[3]: *** [../usr/include/Makefile:108: usr/include/sound/hdspm.hdrtest] Error 1 In file included from <command-line>:31: ./usr/include/sound/hdsp.h:39:2: error: unknown type name ‘__u32’ __u32 input_peaks[26]; ^ ./usr/include/sound/hdsp.h:40:2: error: unknown type name ‘__u32’ __u32 playback_peaks[26]; ^ ./usr/include/sound/hdsp.h:41:2: error: unknown type name ‘__u32’ __u32 output_peaks[28]; ^ ./usr/include/sound/hdsp.h:42:2: error: unknown type name ‘__u64’ __u64 input_rms[26]; ^ ./usr/include/sound/hdsp.h:43:2: error: unknown type name ‘__u64’ __u64 playback_rms[26]; ^ ./usr/include/sound/hdsp.h:45:2: error: unknown type name ‘__u64’ __u64 output_rms[26]; ^ make[3]: *** [../usr/include/Makefile:108: usr/include/sound/hdsp.hdrtest] Error 1 {standard input}: Assembler messages: {standard input}:9360: Error: inappropriate arguments for opcode 'mpyd' make[3]: *** [../scripts/Makefile.build:283: kernel/sched/core.o] Error 1 ../arch/arc/kernel/kgdb.c: In function ‘kgdb_arch_handle_exception’: ../arch/arc/kernel/kgdb.c:141:6: warning: this statement may fall through [-Wimplicit-fallthrough=] if (kgdb_hex2long(&ptr, &addr)) ^ ../arch/arc/kernel/kgdb.c:144:2: note: here case 'D': ^~~~ In file included from ./usr/include/linux/netfilter_ipv6/ip6_tables.h:21, from <command-line>:31: ./usr/include/linux/if.h:28:10: fatal error: sys/socket.h: No such file or directory #include <sys/socket.h> /* for struct sockaddr. */ ^~ compilation terminated.
make[3]: *** [../usr/include/Makefile:108: usr/include/linux/netfilter_ipv6/ip6_tables.hdrtest] Error 1 ../arch/arc/kernel/perf_event.c: In function ‘arc_pmu_device_probe’: ../arch/arc/kernel/perf_event.c:645:3: warning: ignoring return value of ‘request_percpu_irq’, declared with attribute warn_unused_result [-Wunused-result] request_percpu_irq(irq, arc_pmu_intr, "ARC perf counters", ^~ this_cpu_ptr(&arc_pmu_cpu)); ~~~ In file included from ./usr/include/linux/if_bonding.h:47, from <command-line>:31: ./usr/include/linux/if.h:28:10: fatal error: sys/socket.h: No such file or directory #include <sys/socket.h> /* for struct sockaddr. */ ^~ compilation terminated. make[3]: *** [../usr/include/Makefile:108: usr/include/linux/if_bonding.hdrtest] Error 1 In file included from ./usr/include/linux/netfilter_ipv4/ip_tables.h:21, from <command-line>:31: ./usr/include/linux/if.h:28:10: fatal error: sys/socket.h: No such file or directory
[PATCH] alpha: switch defconfig from the legacy ide driver to libata
Switch from the soon to be removed legacy ide driver to libata. Signed-off-by: Christoph Hellwig --- arch/alpha/configs/defconfig | 12 +--- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/arch/alpha/configs/defconfig b/arch/alpha/configs/defconfig index 6293675db1644a..387174e2422f6b 100644 --- a/arch/alpha/configs/defconfig +++ b/arch/alpha/configs/defconfig @@ -26,13 +26,11 @@ CONFIG_PNP=y CONFIG_ISAPNP=y CONFIG_BLK_DEV_FD=y CONFIG_BLK_DEV_LOOP=m -CONFIG_IDE=y -CONFIG_BLK_DEV_IDECD=y -CONFIG_IDE_GENERIC=y -CONFIG_BLK_DEV_GENERIC=y -CONFIG_BLK_DEV_ALI15X3=y -CONFIG_BLK_DEV_CMD64X=y -CONFIG_BLK_DEV_CY82C693=y +CONFIG_ATA=y +CONFIG_ATA_GENERIC=y +CONFIG_PATA_ALI=y +CONFIG_PATA_CMD64X=y +CONFIG_PATA_CYPRESS=y CONFIG_SCSI=y CONFIG_BLK_DEV_SD=y CONFIG_BLK_DEV_SR=y -- 2.28.0
Re: [PATCH v5 5/6] dt-bindings: dw-apb-ictl: convert to json-schema
On 2020/9/24 11:26, Leizhen (ThunderTown) wrote: > > > On 2020/9/24 4:49, Rob Herring wrote: >> On Fri, Sep 18, 2020 at 07:22:01PM +0800, Zhen Lei wrote: >>> Convert the Synopsys DesignWare APB interrupt controller (dw_apb_ictl) >>> binding to DT schema format using json-schema. >>> >>> Signed-off-by: Zhen Lei >>> --- >>> .../interrupt-controller/snps,dw-apb-ictl.txt | 43 - >>> .../interrupt-controller/snps,dw-apb-ictl.yaml | 75 >>> ++ >>> 2 files changed, 75 insertions(+), 43 deletions(-) >>> delete mode 100644 >>> Documentation/devicetree/bindings/interrupt-controller/snps,dw-apb-ictl.txt >>> create mode 100644 >>> Documentation/devicetree/bindings/interrupt-controller/snps,dw-apb-ictl.yaml >>> >>> diff --git >>> a/Documentation/devicetree/bindings/interrupt-controller/snps,dw-apb-ictl.txt >>> >>> b/Documentation/devicetree/bindings/interrupt-controller/snps,dw-apb-ictl.txt >>> deleted file mode 100644 >>> index 2db59df9408f4c6..000 >>> --- >>> a/Documentation/devicetree/bindings/interrupt-controller/snps,dw-apb-ictl.txt >>> +++ /dev/null >>> @@ -1,43 +0,0 @@ >>> -Synopsys DesignWare APB interrupt controller (dw_apb_ictl) >>> - >>> -Synopsys DesignWare provides interrupt controller IP for APB known as >>> -dw_apb_ictl. The IP is used as secondary interrupt controller in some SoCs >>> with >>> -APB bus, e.g. Marvell Armada 1500. It can also be used as primary interrupt >>> -controller in some SoCs, e.g. Hisilicon SD5203. 
>>> - >>> -Required properties: >>> -- compatible: shall be "snps,dw-apb-ictl" >>> -- reg: physical base address of the controller and length of memory mapped >>> - region starting with ENABLE_LOW register >>> -- interrupt-controller: identifies the node as an interrupt controller >>> -- #interrupt-cells: number of cells to encode an interrupt-specifier, >>> shall be 1 >>> - >>> -Additional required property when it's used as secondary interrupt >>> controller: >>> -- interrupts: interrupt reference to primary interrupt controller >>> - >>> -The interrupt sources map to the corresponding bits in the interrupt >>> -registers, i.e. >>> -- 0 maps to bit 0 of low interrupts, >>> -- 1 maps to bit 1 of low interrupts, >>> -- 32 maps to bit 0 of high interrupts, >>> -- 33 maps to bit 1 of high interrupts, >>> -- (optional) fast interrupts start at 64. >>> - >>> -Example: >>> - /* dw_apb_ictl is used as secondary interrupt controller */ >>> - aic: interrupt-controller@3000 { >>> - compatible = "snps,dw-apb-ictl"; >>> - reg = <0x3000 0xc00>; >>> - interrupt-controller; >>> - #interrupt-cells = <1>; >>> - interrupt-parent = <>; >>> - interrupts = ; >>> - }; >>> - >>> - /* dw_apb_ictl is used as primary interrupt controller */ >>> - vic: interrupt-controller@1013 { >>> - compatible = "snps,dw-apb-ictl"; >>> - reg = <0x1013 0x1000>; >>> - interrupt-controller; >>> - #interrupt-cells = <1>; >>> - }; >>> diff --git >>> a/Documentation/devicetree/bindings/interrupt-controller/snps,dw-apb-ictl.yaml >>> >>> b/Documentation/devicetree/bindings/interrupt-controller/snps,dw-apb-ictl.yaml >>> new file mode 100644 >>> index 000..70c12979c843bf0 >>> --- /dev/null >>> +++ >>> b/Documentation/devicetree/bindings/interrupt-controller/snps,dw-apb-ictl.yaml >>> @@ -0,0 +1,75 @@ >>> +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) >>> +%YAML 1.2 >>> +--- >>> +$id: >>> http://devicetree.org/schemas/interrupt-controller/snps,dw-apb-ictl.yaml# >>> +$schema: 
http://devicetree.org/meta-schemas/core.yaml# >>> + >>> +title: Synopsys DesignWare APB interrupt controller (dw_apb_ictl) >>> + >>> +maintainers: >>> + - Marc Zyngier >> >> Usually this would be an owner for this IP block, not the subsystem >> maintainer. > > OK, I will change it to the author of the file "snps,dw-apb-ictl.txt". > > >> >>> + >>> +description: >>> + Synopsys DesignWare provides interrupt controller IP for APB known as >>> + dw_apb_ictl. The IP is used as secondary interrupt controller in some >>> SoCs >>> + with APB bus, e.g. Marvell Armada 1500. It can also be used as primary >>> + interrupt controller in some SoCs, e.g. Hisilicon SD5203. >>> + >>> +allOf: >>> + - $ref: /schemas/interrupt-controller.yaml# >> >> You can drop this, it's already applied based on node name. > But if we drop this, the "snps,dw-apb-ictl.yaml" can not require that the > node name > must match '^interrupt-controller(@[0-9a-f,]+)*$'. The problem of Patch 6/6 > was > discovered by this. > >> >>> + >>> +properties: >>> + compatible: >>> +const: snps,dw-apb-ictl >>> + >>> + interrupt-controller: true >>> + >>> + reg: >>> +description: >>> + Physical base address of the controller and length of memory mapped >>> + region starting with ENABLE_LOW register. >> >> Need to define how many reg regions (maxItems: 1). > > OK, I
Re: WARNING: SOFTIRQ-READ-safe -> SOFTIRQ-READ-unsafe lock order detected
#syz dup: inconsistent lock state in xfrm_policy_lookup_inexact_addr
Re: [PATCH v2] PCI: Cadence: Add quirk for Gen2 controller to do autonomous speed change.
Hi Nadeem, On 24/09/20 12:04 am, Nadeem Athani wrote: > Cadence controller will not initiate autonomous speed change if > strapped as Gen2. The Retrain bit is set as a quirk to trigger > this speed change. > > Signed-off-by: Nadeem Athani > --- > drivers/pci/controller/cadence/pcie-cadence-host.c | 14 ++ > drivers/pci/controller/cadence/pcie-cadence.h | 15 +++ > 2 files changed, 29 insertions(+) > > diff --git a/drivers/pci/controller/cadence/pcie-cadence-host.c > b/drivers/pci/controller/cadence/pcie-cadence-host.c > index 4550e0d469ca..a2317614268d 100644 > --- a/drivers/pci/controller/cadence/pcie-cadence-host.c > +++ b/drivers/pci/controller/cadence/pcie-cadence-host.c > @@ -83,6 +83,9 @@ static int cdns_pcie_host_init_root_port(struct > cdns_pcie_rc *rc) > struct cdns_pcie *pcie = >pcie; > u32 value, ctrl; > u32 id; > + u32 link_cap = CDNS_PCIE_LINK_CAP_OFFSET; > + u8 sls; > + u16 lnk_ctl; > > /* >* Set the root complex BAR configuration register: > @@ -111,6 +114,17 @@ static int cdns_pcie_host_init_root_port(struct > cdns_pcie_rc *rc) > if (rc->device_id != 0x) > cdns_pcie_rp_writew(pcie, PCI_DEVICE_ID, rc->device_id); > > + /* Quirk to enable autonomous speed change for GEN2 controller */ > + /* Reading Supported Link Speed value */ > + sls = PCI_EXP_LNKCAP_SLS & > + cdns_pcie_rp_readb(pcie, link_cap + PCI_EXP_LNKCAP); > + if (sls == PCI_EXP_LNKCAP_SLS_5_0GB) { > + /* Since this a Gen2 controller, set Retrain Link(RL) bit */ > + lnk_ctl = cdns_pcie_rp_readw(pcie, link_cap + PCI_EXP_LNKCTL); > + lnk_ctl |= PCI_EXP_LNKCTL_RL; > + cdns_pcie_rp_writew(pcie, link_cap + PCI_EXP_LNKCTL, lnk_ctl); > + } Is this workaround required for all Cadence controller? If not, enable this workaround only for versions which doesn't do autonomous speed change. I think this workaround should also be applied only after checking for link status (cdns_pcie_link_up()). And this is also applicable for GEN3/GEN4 controller. 
So the check should be to see the capability of the connected PCIe device and not the controller itself. Thanks Kishon
Re: inconsistent lock state in xfrm_user_rcv_msg
#syz dup: inconsistent lock state in xfrm_policy_lookup_inexact_addr
Re: possible deadlock in xfrm_policy_lookup_inexact_addr
#syz dup: inconsistent lock state in xfrm_policy_lookup_inexact_addr
[PATCH v4 6/6] kasan: update documentation for generic kasan
Generic KASAN now also records the last two timer and workqueue call stacks and prints them in the KASAN report, so the documentation needs to be updated accordingly. Signed-off-by: Walter Wu Suggested-by: Marco Elver Acked-by: Marco Elver Reviewed-by: Dmitry Vyukov Reviewed-by: Andrey Konovalov Cc: Andrey Ryabinin Cc: Alexander Potapenko Cc: Jonathan Corbet --- v3: - Thanks for Marco suggestion --- Documentation/dev-tools/kasan.rst | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/Documentation/dev-tools/kasan.rst b/Documentation/dev-tools/kasan.rst index 38fd5681fade..698ccb65e634 100644 --- a/Documentation/dev-tools/kasan.rst +++ b/Documentation/dev-tools/kasan.rst @@ -190,8 +190,9 @@ function calls GCC directly inserts the code to check the shadow memory. This option significantly enlarges kernel but it gives x1.1-x2 performance boost over outline instrumented kernel. -Generic KASAN prints up to 2 call_rcu() call stacks in reports, the last one -and the second to last. +Generic KASAN also reports the last 2 call stacks to creation of work that +potentially has access to an object. Call stacks for the following are shown: +call_rcu(), timer and workqueue queuing. Software tag-based KASAN -- 2.18.0
[PATCH v4 3/6] kasan: print timer and workqueue stack
The two aux_stack[] slots are reused to record the call_rcu() call stack, the timer init call stack, and the work-enqueue call stacks, so change the auxiliary stack titles printed in the KASAN report to a common wording. Signed-off-by: Walter Wu Suggested-by: Marco Elver Acked-by: Marco Elver Reviewed-by: Dmitry Vyukov Reviewed-by: Andrey Konovalov Cc: Andrey Ryabinin Cc: Alexander Potapenko --- v2: - Thanks for Marco suggestion. - We modify aux stack title name in KASAN report in order to print call_rcu()/timer/workqueue stack. --- mm/kasan/report.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/kasan/report.c b/mm/kasan/report.c index 4f49fa6cd1aa..886809d0a8dd 100644 --- a/mm/kasan/report.c +++ b/mm/kasan/report.c @@ -183,12 +183,12 @@ static void describe_object(struct kmem_cache *cache, void *object, #ifdef CONFIG_KASAN_GENERIC if (alloc_info->aux_stack[0]) { - pr_err("Last call_rcu():\n"); + pr_err("Last potentially related work creation:\n"); print_stack(alloc_info->aux_stack[0]); pr_err("\n"); } if (alloc_info->aux_stack[1]) { - pr_err("Second to last call_rcu():\n"); + pr_err("Second to last potentially related work creation:\n"); print_stack(alloc_info->aux_stack[1]); pr_err("\n"); } -- 2.18.0
[PATCH v4 2/6] workqueue: kasan: record workqueue stack
Record the last two work-enqueue call stacks so they can be printed in the KASAN report. This helps programmers track down use-after-free and double-free issues involving workqueues: from the report alone, users can see where the affected work was queued and judge whether that is the root cause, without having to enable debugobjects. Signed-off-by: Walter Wu Suggested-by: Marco Elver Acked-by: Marco Elver Acked-by: Tejun Heo Reviewed-by: Dmitry Vyukov Reviewed-by: Andrey Konovalov Cc: Andrey Ryabinin Cc: Alexander Potapenko Cc: Lai Jiangshan --- v2: - Thanks for Marco suggestion. - Remove unnecessary code - reuse kasan_record_aux_stack() and aux_stack to record timer and workqueue stack --- kernel/workqueue.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index c41c3c17b86a..5fea7dc9180f 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -1324,6 +1324,9 @@ static void insert_work(struct pool_workqueue *pwq, struct work_struct *work, { struct worker_pool *pool = pwq->pool; + /* record the work call stack in order to print it in KASAN reports */ + kasan_record_aux_stack(work); + /* we own @work, set data and link */ set_work_pwq(work, pwq, extra_flags); list_add_tail(&work->entry, head); -- 2.18.0
[PATCH v4 1/6] timer: kasan: record timer stack
When analyzing a use-after-free or double-free issue, recorded timer stacks help preserve usage history and give a hint about the affected code. Record the most recent two timer init calls in KASAN and print them on failure in the KASAN report. For timers it has turned out to be useful to record the stack trace of the timer init call: if the UAF root cause is in timer init, the KASAN report shows where the timer was registered, pointing at the root cause. Users don't need to enable DEBUG_OBJECTS_TIMERS, yet still have a chance to find the root cause. Signed-off-by: Walter Wu Suggested-by: Marco Elver Suggested-by: Thomas Gleixner Acked-by: Marco Elver Reviewed-by: Dmitry Vyukov Reviewed-by: Andrey Konovalov Cc: Andrey Ryabinin Cc: Alexander Potapenko Cc: John Stultz Cc: Thomas Gleixner Cc: Stephen Boyd --- v2: - Thanks for Marco and Thomas suggestion. - Remove unnecessary code and fix commit log - reuse kasan_record_aux_stack() and aux_stack to record timer and workqueue stack. --- kernel/time/timer.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/kernel/time/timer.c b/kernel/time/timer.c index a16764b0116e..1ed8f8aca7f5 100644 --- a/kernel/time/timer.c +++ b/kernel/time/timer.c @@ -796,6 +796,9 @@ static void do_init_timer(struct timer_list *timer, timer->function = func; timer->flags = flags | raw_smp_processor_id(); lockdep_init_map(&timer->lockdep_map, name, key, 0); + + /* record the timer stack in order to print it in KASAN report */ + kasan_record_aux_stack(timer); } /** -- 2.18.0
[PATCH v4 5/6] kasan: add tests for workqueue stack recording
Adds a test to verify workqueue stack recording and print it in KASAN report. The KASAN report was as follows(cleaned up slightly): BUG: KASAN: use-after-free in kasan_workqueue_uaf Freed by task 54: kasan_save_stack+0x24/0x50 kasan_set_track+0x24/0x38 kasan_set_free_info+0x20/0x40 __kasan_slab_free+0x10c/0x170 kasan_slab_free+0x10/0x18 kfree+0x98/0x270 kasan_workqueue_work+0xc/0x18 Last potentially related work creation: kasan_save_stack+0x24/0x50 kasan_record_wq_stack+0xa8/0xb8 insert_work+0x48/0x288 __queue_work+0x3e8/0xc40 queue_work_on+0xf4/0x118 kasan_workqueue_uaf+0xfc/0x190 Signed-off-by: Walter Wu Acked-by: Marco Elver Reviewed-by: Dmitry Vyukov Reviewed-by: Andrey Konovalov Cc: Andrey Ryabinin Cc: Alexander Potapenko Cc: Matthias Brugger --- v4: - testcase has merge conflict, so that rebase onto the KASAN-KUNIT --- lib/test_kasan_module.c | 30 ++ 1 file changed, 30 insertions(+) diff --git a/lib/test_kasan_module.c b/lib/test_kasan_module.c index 2e5e7be96955..c3a2d113e757 100644 --- a/lib/test_kasan_module.c +++ b/lib/test_kasan_module.c @@ -115,6 +115,35 @@ static noinline void __init kasan_timer_uaf(void) ((volatile struct timer_list *)timer)->expires; } +static noinline void __init kasan_workqueue_work(struct work_struct *work) +{ + kfree(work); +} + +static noinline void __init kasan_workqueue_uaf(void) +{ + struct workqueue_struct *workqueue; + struct work_struct *work; + + workqueue = create_workqueue("kasan_wq_test"); + if (!workqueue) { + pr_err("Allocation failed\n"); + return; + } + work = kmalloc(sizeof(struct work_struct), GFP_KERNEL); + if (!work) { + pr_err("Allocation failed\n"); + return; + } + + INIT_WORK(work, kasan_workqueue_work); + queue_work(workqueue, work); + destroy_workqueue(workqueue); + + pr_info("use-after-free on workqueue\n"); + ((volatile struct work_struct *)work)->data; +} + static int __init test_kasan_module_init(void) { /* @@ -126,6 +155,7 @@ static int __init test_kasan_module_init(void) copy_user_test(); 
kasan_rcu_uaf(); kasan_timer_uaf(); + kasan_workqueue_uaf(); kasan_restore_multi_shot(multishot); return -EAGAIN; -- 2.18.0
Re: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call
On 9/23/20 8:13 PM, Sinan Kaya wrote: On 9/23/2020 10:51 PM, Kuppuswamy, Sathyanarayanan wrote: I see. Can I assume that your system supports DPC? DPC is supposed to recover the link via dpc_reset_link(). Yes. But the affected device/drivers cleanup during error recovery is handled by hotplug handler. So we are facing issue when dealing with non hotplug capable ports. This is confusing. Why would hotplug driver be involved unless port supports hotplug and the link goes down? You said that DLLSC is only supported on hotplug capable ports. hotplug driver is *only* involved when dealing with recovery of hotplug capable ports. For hotplug capable ports, when DPC is triggered and link goes down, DLLSC handler in pciehp driver will remove the affected devices/drivers. Once the link comes back it will re-attach them. Need a better description of symptoms and what triggers hotplug driver to activate. For problem description, please check the following details Current pcie_do_recovery() implementation has following two issues: 1. Fatal (DPC) error recovery is currently broken for non-hotplug capable devices. Current fatal error recovery implementation relies on PCIe hotplug (pciehp) handler for detaching and re-enumerating the affected devices/drivers. pciehp handler listens for DLLSC state changes and handles device/driver detachment on DLLSC_LINK_DOWN event and re-enumeration on DLLSC_LINK_UP event. So when dealing with non-hotplug capable devices, recovery code does not restore the state of the affected devices correctly. Correct implementation should restore the device state and call report_slot_reset() function after resetting the link to restore the state of the device/driver. 
You can find the fatal non-hotplug related issues reported in the following links: https://lore.kernel.org/linux-pci/20200527083130.4137-1-zhiqiang@nxp.com/ https://lore.kernel.org/linux-pci/12115.1588207324@famine/ https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.158584.git.sathyanarayanan.kuppusw...@linux.intel.com/ 2. For non-fatal errors, if the report_error_detected() or report_mmio_enabled() functions request PCI_ERS_RESULT_NEED_RESET, then the current pcie_do_recovery() implementation does not do the requested explicit device reset, and instead just calls report_slot_reset() on all affected devices. Notifying about the reset via report_slot_reset() without doing the actual device reset is incorrect. To fix the above issues, use PCI_ERS_RESULT_NEED_RESET as the error state after a successful reset_link() operation. This will ensure ->slot_reset() is called after the reset_link() operation for fatal errors. Also call pci_bus_reset() to do the slot/bus reset before triggering the device-specific ->slot_reset() callback. Also, using pci_bus_reset() will restore the state of the devices after performing the reset operation. Even though using pci_bus_reset() will do a redundant reset operation after ->reset_link() for fatal errors, it should not affect the functional behavior. Can you expand this a little bit? -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
[PATCH v4 4/6] kasan: add tests for timer stack recording
Adds a test to verify timer stack recording and print it in KASAN report. The KASAN report was as follows(cleaned up slightly): BUG: KASAN: use-after-free in kasan_timer_uaf Freed by task 0: kasan_save_stack+0x24/0x50 kasan_set_track+0x24/0x38 kasan_set_free_info+0x20/0x40 __kasan_slab_free+0x10c/0x170 kasan_slab_free+0x10/0x18 kfree+0x98/0x270 kasan_timer_function+0x1c/0x28 Last potentially related work creation: kasan_save_stack+0x24/0x50 kasan_record_tmr_stack+0xa8/0xb8 init_timer_key+0xf0/0x248 kasan_timer_uaf+0x5c/0xd8 Signed-off-by: Walter Wu Acked-by: Marco Elver Reviewed-by: Dmitry Vyukov Reviewed-by: Andrey Konovalov Cc: Andrey Ryabinin Cc: Alexander Potapenko Cc: Matthias Brugger --- v4: - testcase has merge conflict, so that rebase onto the KASAN-KUNIT --- lib/test_kasan_module.c | 25 + 1 file changed, 25 insertions(+) diff --git a/lib/test_kasan_module.c b/lib/test_kasan_module.c index 2d68db6ae67b..d8234a1db8c9 100644 --- a/lib/test_kasan_module.c +++ b/lib/test_kasan_module.c @@ -12,6 +12,7 @@ #include #include #include +#include #include "../mm/kasan/kasan.h" @@ -91,6 +92,29 @@ static noinline void __init kasan_rcu_uaf(void) call_rcu(_rcu_ptr->rcu, kasan_rcu_reclaim); } +static noinline void __init kasan_timer_function(struct timer_list *timer) +{ + del_timer(timer); + kfree(timer); +} + +static noinline void __init kasan_timer_uaf(void) +{ + struct timer_list *timer; + + timer = kmalloc(sizeof(struct timer_list), GFP_KERNEL); + if (!timer) { + pr_err("Allocation failed\n"); + return; + } + + timer_setup(timer, kasan_timer_function, 0); + add_timer(timer); + msleep(100); + + pr_info("use-after-free on timer\n"); + ((volatile struct timer_list *)timer)->expires; +} static int __init test_kasan_module_init(void) { @@ -102,6 +126,7 @@ static int __init test_kasan_module_init(void) copy_user_test(); kasan_rcu_uaf(); + kasan_timer_uaf(); kasan_restore_multi_shot(multishot); return -EAGAIN; -- 2.18.0
[PATCH v4 0/6] kasan: add workqueue and timer stack for generic KASAN
Syzbot reports many UAF issues for workqueue or timer, see [1] and [2]. In some of these, the access/allocation happened in process_one_work(), and we see that the free stack is useless in the KASAN report; it doesn't help programmers to solve UAF on workqueues. The same may stand for timers. This patchset improves KASAN reports by making them include the workqueue queueing stack and timer stack information. It is useful for programmers to solve use-after-free or double-free memory issues. Generic KASAN records the last two workqueue and timer stacks and prints them in the KASAN report. This is only suitable for generic KASAN. [1]https://groups.google.com/g/syzkaller-bugs/search?q=%22use-after-free%22+process_one_work [2]https://groups.google.com/g/syzkaller-bugs/search?q=%22use-after-free%22%20expire_timers [3]https://bugzilla.kernel.org/show_bug.cgi?id=198437 Walter Wu (6): timer: kasan: record timer stack workqueue: kasan: record workqueue stack kasan: print timer and workqueue stack lib/test_kasan.c: add timer test case lib/test_kasan.c: add workqueue test case kasan: update documentation for generic kasan --- Changes since v3: - testcases had merge conflicts, so they need to be rebased onto the KASAN-KUNIT series. Changes since v2: - modify the kasan document to be more readable; thanks to Marco for the suggestion. Changes since v1: - Thanks to Marco and Thomas for the suggestions. - Remove unnecessary code and fix the commit log. - Reuse kasan_record_aux_stack() and aux_stack to record timer and workqueue stacks. - Change the aux stack title to a common name. --- Documentation/dev-tools/kasan.rst | 5 +++-- kernel/time/timer.c | 3 +++ kernel/workqueue.c | 3 +++ lib/test_kasan_module.c | 55 +++ mm/kasan/report.c | 4 ++-- 5 files changed, 66 insertions(+), 4 deletions(-)
Re: [PATCH] mm: swapfile: avoid split_swap_cluster() NULL pointer dereference
Rafael Aquini writes: > The bug here is quite simple: split_swap_cluster() misses checking for > lock_cluster() returning NULL before committing to change cluster_info->flags. I don't think so. We shouldn't run into this situation in the first place. So the "fix" hides the real bug instead of fixing it. Just like we call VM_BUG_ON_PAGE(!PageLocked(head), head) in split_huge_page_to_list() instead of returning if !PageLocked(head) silently. > The fundamental problem has nothing to do with allocating, or not allocating > a swap cluster, but it has to do with the fact that the THP deferred split scan > can transiently race with swapcache insertion, and the fact that when you run > your swap area on rotational storage cluster_info is _always_ NULL. > split_swap_cluster() needs to check for lock_cluster() returning NULL because > that's one possible case, and it clearly fails to do so. If there's a race, we should fix the race. But the code path for swapcache insertion is, add_to_swap() get_swap_page() /* Return if fails to allocate */ add_to_swap_cache() SetPageSwapCache() While the code path to split THP is, split_huge_page_to_list() if PageSwapCache() split_swap_cluster() Both code paths are protected by the page lock. So there should be some other reason to trigger the bug. And again, for HDD, a THP shouldn't have PageSwapCache() set in the first place. If so, the bug is that the flag is set and we should fix the setting. > Run a workload that causes multiple THP COW, and add a memory hogger to create > memory pressure so you'll force the reclaimers to kick the registered > shrinkers. The trigger is not heavy swapping, and that's probably why > most swap test cases don't hit it. The window is tight, but you will get the > NULL pointer dereference. Do you have a script to reproduce the bug? > Regardless of whether you find further bugs or not, this patch is needed to correct a > blunt coding mistake. As above, I don't agree with that. Best Regards, Huang, Ying
Re: [PATCH 7/7] perf inject: Remove stale build-id processing
Hi Adrian, Thanks for your review! On Wed, Sep 23, 2020 at 11:36 PM Adrian Hunter wrote: > > On 23/09/20 11:05 am, Namhyung Kim wrote: > > I think we don't need to call build_id__mark_dso_hit() in the > > perf_event__repipe_sample() as it's not used by -b option. In case of > > the -b option is used, it uses perf_event__inject_buildid() instead. > > This can remove unnecessary overhead of finding thread/map for each > > sample event. > > > > Also I suspect HEADER_BUILD_ID feature bit setting since we already > > generated/injected BUILD_ID event into the output stream. So this > > header information seems redundant. I'm not 100% sure about the > > auxtrace usage, but it looks like not related to this directly. > > > > And we now have --buildid-all so users can get the same behavior if > > they want. > > For a perf.data file, don't buildids get written to the HEADER_BUILD_ID > feature section by perf_session__write_header() if the feature flag is set > and if they are hit? Right, that's what perf record does unless -B option is used. But I'm more concerned about the pipe usage which doesn't use the header. > > So, unless -b is used, anything you don't hit you lose i.e. a buildid in the > HEADER_BUILD_ID feature section of the input file, will not be written to > the output file. But perf inject generates PERF_RECORD_HEADER_BUILD_ID events and puts them into the data stream when -b option is used. Do you say perf inject should process build-id even when -b is not used? 
> > > > Cc: Adrian Hunter > > Signed-off-by: Namhyung Kim > > --- > > tools/perf/builtin-inject.c | 12 > > 1 file changed, 12 deletions(-) > > > > diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c > > index 500428aaa576..0191d72be7c4 100644 > > --- a/tools/perf/builtin-inject.c > > +++ b/tools/perf/builtin-inject.c > > @@ -277,8 +277,6 @@ static int perf_event__repipe_sample(struct perf_tool > > *tool, > > return f(tool, event, sample, evsel, machine); > > } > > > > - build_id__mark_dso_hit(tool, event, sample, evsel, machine); > > - > > I guess that chunk would prevent losing a buildid in a perf.data file? > > > if (inject->itrace_synth_opts.set && sample->aux_sample.size) > > event = perf_inject__cut_auxtrace_sample(inject, event, > > sample); > > > > @@ -767,16 +765,6 @@ static int __cmd_inject(struct perf_inject *inject) > > return ret; > > > > if (!data_out->is_pipe) { > > - if (inject->build_ids) > > - perf_header__set_feat(>header, > > - HEADER_BUILD_ID); > > That could be due to confusion with 'perf buildid-list' which will not show > any buildids from synthesized buildid events unless "with hits" is selected, > so then it looks like there are no buildids. Yeah, it's confusing.. I think we should change the behavior to handle the pipe case properly. > > It could be an advantage to have the buildids also in the HEADER_BUILD_ID > feature section, because then then build-list can list them quickly. I'm not sure it's good to have duplicate build-id info. We may add an option just to update the header section and not inject BUILD_ID events. > > > - /* > > - * Keep all buildids when there is unprocessed AUX data > > because > > - * it is not known which ones the AUX trace hits. > > - */ > > - if (perf_header__has_feat(>header, HEADER_BUILD_ID) > > && > > - inject->have_auxtrace && !inject->itrace_synth_opts.set) > > - dsos__hit_all(session); > > I expect that is definitely needed. OK. 
Thanks Namhyung > > > /* > >* The AUX areas have been removed and replaced with > >* synthesized hardware events, so clear the feature flag and > > >
Re: [PATCH V3 2/3] arm64/mm/hotplug: Enable MEM_OFFLINE event handling
On 09/23/2020 12:01 PM, Gavin Shan wrote: > Hi Anshuman, > > On 9/21/20 10:05 PM, Anshuman Khandual wrote: >> This enables MEM_OFFLINE memory event handling. It will help intercept any >> possible error condition such as if boot memory somehow still got offlined >> even after an explicit notifier failure, potentially by a future change in >> the generic hot plug framework. This would help detect such scenarios and help >> debug further. >> >> Cc: Catalin Marinas >> Cc: Will Deacon >> Cc: Mark Rutland >> Cc: Marc Zyngier >> Cc: Steve Capper >> Cc: Mark Brown >> Cc: linux-arm-ker...@lists.infradead.org >> Cc: linux-kernel@vger.kernel.org >> Signed-off-by: Anshuman Khandual >> --- > > I'm not sure if it makes sense, since MEM_OFFLINE won't be triggered > after NOTIFY_BAD is returned from MEM_GOING_OFFLINE. NOTIFY_BAD means > the whole offline process is stopped. It would be guaranteed by the generic > framework from a semantics standpoint. Right, but the intent here is to catch any deviation in generic hotplug semantics going forward. > > However, this looks good if MEM_OFFLINE is triggered without calling > into MEM_GOING_OFFLINE previously, but that would be a bug in the generic > framework. Exactly, this will just ensure that we know about any change or a bug in the generic framework. But if required, this additional check can be enabled only with DEBUG_VM.
> >> arch/arm64/mm/mmu.c | 37 - >> 1 file changed, 32 insertions(+), 5 deletions(-) >> >> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c >> index df3b7415b128..6b171bd88bcf 100644 >> --- a/arch/arm64/mm/mmu.c >> +++ b/arch/arm64/mm/mmu.c >> @@ -1482,13 +1482,40 @@ static int prevent_bootmem_remove_notifier(struct >> notifier_block *nb, >> unsigned long end_pfn = arg->start_pfn + arg->nr_pages; >> unsigned long pfn = arg->start_pfn; >> - if (action != MEM_GOING_OFFLINE) >> + if ((action != MEM_GOING_OFFLINE) && (action != MEM_OFFLINE)) >> return NOTIFY_OK; >> - for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) { >> - ms = __pfn_to_section(pfn); >> - if (early_section(ms)) >> - return NOTIFY_BAD; >> + if (action == MEM_GOING_OFFLINE) { >> + for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) { >> + ms = __pfn_to_section(pfn); >> + if (early_section(ms)) { >> + pr_warn("Boot memory offlining attempted\n"); >> + return NOTIFY_BAD; >> + } >> + } >> + } else if (action == MEM_OFFLINE) { >> + for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) { >> + ms = __pfn_to_section(pfn); >> + if (early_section(ms)) { >> + >> + /* >> + * This should have never happened. Boot memory >> + * offlining should have been prevented by this >> + * very notifier. Probably some memory removal >> + * procedure might have changed which would then >> + * require further debug. >> + */ >> + pr_err("Boot memory offlined\n"); >> + >> + /* >> + * Core memory hotplug does not process a return >> + * code from the notifier for MEM_OFFLINE event. >> + * Error condition has been reported. Report as >> + * ignored. >> + */ >> + return NOTIFY_DONE; >> + } >> + } >> } >> return NOTIFY_OK; >> } >> > > It's pretty much irrelevant comment if the patch doesn't make sense: > the logical block for MEM_GOING_OFFLINE would be reused by MEM_OFFLINE > as they looks similar except the return value and error message :) This can be reorganized in the above mentioned format as well. 
Without much additional code or iteration, it might not need DEBUG_VM as well. for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) { ms = __pfn_to_section(pfn); if (!early_section(ms)) continue; if (action == MEM_GOING_OFFLINE) { pr_warn("Boot memory offlining attempted\n"); return NOTIFY_BAD; } else if (action == MEM_OFFLINE) { pr_err("Boot memory offlined\n"); return NOTIFY_DONE; } } return NOTIFY_OK;
[PATCH v19 08/20] mm: page_idle_get_page() does not need lru_lock
From: Hugh Dickins It is necessary for page_idle_get_page() to recheck PageLRU() after get_page_unless_zero(), but holding lru_lock around that serves no useful purpose, and adds to lru_lock contention: delete it. See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the discussion that led to lru_lock there; but __page_set_anon_rmap() now uses WRITE_ONCE(), and I see no other risk in page_idle_clear_pte_refs() using rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly but not entirely prevented by page_count() check in ksm.c's write_protect_page(): that risk being shared with page_referenced() and not helped by lru_lock). Signed-off-by: Hugh Dickins Signed-off-by: Alex Shi Cc: Andrew Morton Cc: Vladimir Davydov Cc: Vlastimil Babka Cc: Minchan Kim Cc: Alex Shi Cc: linux...@kvack.org Cc: linux-kernel@vger.kernel.org --- mm/page_idle.c | 4 ---- 1 file changed, 4 deletions(-) diff --git a/mm/page_idle.c b/mm/page_idle.c index 057c61df12db..64e5344a992c 100644 --- a/mm/page_idle.c +++ b/mm/page_idle.c @@ -32,19 +32,15 @@ static struct page *page_idle_get_page(unsigned long pfn) { struct page *page = pfn_to_online_page(pfn); - pg_data_t *pgdat; if (!page || !PageLRU(page) || !get_page_unless_zero(page)) return NULL; - pgdat = page_pgdat(page); - spin_lock_irq(&pgdat->lru_lock); if (unlikely(!PageLRU(page))) { put_page(page); page = NULL; } - spin_unlock_irq(&pgdat->lru_lock); return page; } -- 1.8.3.1
[PATCH v19 14/20] mm/mlock: remove __munlock_isolate_lru_page
The func only has one caller, remove it to clean up code and simplify code. Signed-off-by: Alex Shi Acked-by: Hugh Dickins Cc: Hugh Dickins Cc: Kirill A. Shutemov Cc: Vlastimil Babka Cc: Andrew Morton Cc: linux...@kvack.org Cc: linux-kernel@vger.kernel.org --- mm/mlock.c | 31 +-- 1 file changed, 9 insertions(+), 22 deletions(-) diff --git a/mm/mlock.c b/mm/mlock.c index 796c726a0407..d487aa864e86 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -106,26 +106,6 @@ void mlock_vma_page(struct page *page) } /* - * Isolate a page from LRU with optional get_page() pin. - * Assumes lru_lock already held and page already pinned. - */ -static bool __munlock_isolate_lru_page(struct page *page, bool getpage) -{ - if (PageLRU(page)) { - struct lruvec *lruvec; - - lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); - if (getpage) - get_page(page); - ClearPageLRU(page); - del_page_from_lru_list(page, lruvec, page_lru(page)); - return true; - } - - return false; -} - -/* * Finish munlock after successful page isolation * * Page must be locked. This is a wrapper for try_to_munlock() @@ -296,9 +276,16 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone) * We already have pin from follow_page_mask() * so we can spare the get_page() here. */ - if (__munlock_isolate_lru_page(page, false)) + if (PageLRU(page)) { + struct lruvec *lruvec; + + ClearPageLRU(page); + lruvec = mem_cgroup_page_lruvec(page, + page_pgdat(page)); + del_page_from_lru_list(page, lruvec, + page_lru(page)); continue; - else + } else __munlock_isolation_failed(page); } else { delta_munlocked++; -- 1.8.3.1
[PATCH v19 04/20] mm/thp: use head for head page in lru_add_page_tail
Since the first parameter is only used by the head page, it's better to make it explicit. Signed-off-by: Alex Shi Reviewed-by: Kirill A. Shutemov Acked-by: Hugh Dickins Cc: Andrew Morton Cc: Johannes Weiner Cc: Matthew Wilcox Cc: Hugh Dickins Cc: linux...@kvack.org Cc: linux-kernel@vger.kernel.org --- mm/huge_memory.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 211d50f09785..fa0753229b52 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2318,19 +2318,19 @@ static void remap_page(struct page *page) } } -static void lru_add_page_tail(struct page *page, struct page *page_tail, +static void lru_add_page_tail(struct page *head, struct page *page_tail, struct lruvec *lruvec, struct list_head *list) { - VM_BUG_ON_PAGE(!PageHead(page), page); - VM_BUG_ON_PAGE(PageCompound(page_tail), page); - VM_BUG_ON_PAGE(PageLRU(page_tail), page); + VM_BUG_ON_PAGE(!PageHead(head), head); + VM_BUG_ON_PAGE(PageCompound(page_tail), head); + VM_BUG_ON_PAGE(PageLRU(page_tail), head); lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock); if (!list) SetPageLRU(page_tail); - if (likely(PageLRU(head))) - list_add_tail(&page_tail->lru, &head->lru); else if (list) { /* page reclaim is reclaiming a huge page */ get_page(page_tail); -- 1.8.3.1
[PATCH v19 02/20] mm/memcg: bail early from swap accounting if memcg disabled
If we disabled memcg by cgroup_disable=memory, page->memcg will be NULL and so the charge is skipped and that will trigger a warning like below. Let's return from the funcs earlier. anon flags:0x5005b48008000d(locked|uptodate|dirty|swapbacked) raw: 005005b48008000d dead0100 dead0122 8897c7c76ad1 raw: 0022 0002 page dumped because: VM_WARN_ON_ONCE_PAGE(!memcg) ... RIP: 0010:vprintk_emit+0x1f7/0x260 Code: 00 84 d2 74 72 0f b6 15 27 58 64 01 48 c7 c0 00 d4 72 82 84 d2 74 09 f3 90 0f b6 10 84 d2 75 f7 e8 de 0d 00 00 4c 89 e7 57 9d <0f> 1f 44 00 00 e9 62 ff ff ff 80 3d 88 c9 3a 01 00 0f 85 54 fe ff RSP: 0018:c9000faab358 EFLAGS: 0202 RAX: 8272d400 RBX: 005e RCX: 88afd80d0040 RDX: RSI: 0002 RDI: 0202 RBP: c9000faab3a8 R08: 8272d440 R09: 00022480 R10: 00120c77be68bfac R11: 00cd7568 R12: 0202 R13: 0057c0080005 R14: 820a0130 R15: c9000faab3e8 ? vprintk_emit+0x140/0x260 vprintk_default+0x1a/0x20 vprintk_func+0x4f/0xc4 ? vprintk_func+0x4f/0xc4 printk+0x53/0x6a ? xas_load+0xc/0x80 __dump_page.cold.6+0xff/0x4ee ? xas_init_marks+0x23/0x50 ? xas_store+0x30/0x40 ? free_swap_slot+0x43/0xd0 ? put_swap_page+0x119/0x320 ? update_load_avg+0x82/0x580 dump_page+0x9/0xb mem_cgroup_try_charge_swap+0x16e/0x1d0 get_swap_page+0x130/0x210 add_to_swap+0x41/0xc0 shrink_page_list+0x99e/0xdf0 shrink_inactive_list+0x199/0x360 shrink_lruvec+0x40d/0x650 ? _cond_resched+0x14/0x30 ? _cond_resched+0x14/0x30 shrink_node+0x226/0x6e0 do_try_to_free_pages+0xd0/0x400 try_to_free_pages+0xef/0x130 __alloc_pages_slowpath.constprop.127+0x38d/0xbd0 ? ___slab_alloc+0x31d/0x6f0 __alloc_pages_nodemask+0x27f/0x2c0 alloc_pages_vma+0x75/0x220 shmem_alloc_page+0x46/0x90 ? release_pages+0x1ae/0x410 shmem_alloc_and_acct_page+0x77/0x1c0 shmem_getpage_gfp+0x162/0x910 shmem_fault+0x74/0x210 ? filemap_map_pages+0x29c/0x410 __do_fault+0x37/0x190 handle_mm_fault+0x120a/0x1770 exc_page_fault+0x251/0x450 ? 
asm_exc_page_fault+0x8/0x30 asm_exc_page_fault+0x1e/0x30 Signed-off-by: Alex Shi Reviewed-by: Roman Gushchin Acked-by: Michal Hocko Acked-by: Hugh Dickins Cc: Johannes Weiner Cc: Michal Hocko Cc: Vladimir Davydov Cc: Andrew Morton Cc: cgro...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-kernel@vger.kernel.org --- mm/memcontrol.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index f7094a04be07..368e6ac644f0 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -7102,6 +7102,9 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry) VM_BUG_ON_PAGE(PageLRU(page), page); VM_BUG_ON_PAGE(page_count(page), page); + if (mem_cgroup_disabled()) + return; + if (cgroup_subsys_on_dfl(memory_cgrp_subsys)) return; @@ -7166,6 +7169,9 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry) struct mem_cgroup *memcg; unsigned short oldid; + if (mem_cgroup_disabled()) + return 0; + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) return 0; -- 1.8.3.1
[PATCH v19 13/20] mm/mlock: remove lru_lock on TestClearPageMlocked
In the func munlock_vma_page, the comments mentioned that lru_lock was needed for serialization with split_huge_pages. But the page must be PageLocked here, as must the pages in the split_huge_page series of funcs. Thus PageLocked is enough to serialize both funcs. Furthermore, Hugh Dickins pointed out: before splitting in split_huge_page_to_list, the page is unmap_page()'d to remove the pmd/ptes which protect the page from munlock. Thus, there is no need to guard __split_huge_page_tail for mlock clearing; just keep the lru_lock there for isolation purposes. LKP found a preempt issue on __mod_zone_page_state, which needs changing to mod_zone_page_state. Thanks! Signed-off-by: Alex Shi Acked-by: Hugh Dickins Cc: Kirill A. Shutemov Cc: Vlastimil Babka Cc: Andrew Morton Cc: linux...@kvack.org Cc: linux-kernel@vger.kernel.org --- mm/mlock.c | 26 +- 1 file changed, 5 insertions(+), 21 deletions(-) diff --git a/mm/mlock.c b/mm/mlock.c index 884b1216da6a..796c726a0407 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -187,40 +187,24 @@ static void __munlock_isolation_failed(struct page *page) unsigned int munlock_vma_page(struct page *page) { int nr_pages; - pg_data_t *pgdat = page_pgdat(page); /* For try_to_munlock() and to serialize with page migration */ BUG_ON(!PageLocked(page)); - VM_BUG_ON_PAGE(PageTail(page), page); - /* -* Serialize with any parallel __split_huge_page_refcount() which -* might otherwise copy PageMlocked to part of the tail pages before -* we clear it in the head page. It also stabilizes thp_nr_pages().
-*/ - spin_lock_irq(&pgdat->lru_lock); - if (!TestClearPageMlocked(page)) { /* Potentially, PTE-mapped THP: do not skip the rest PTEs */ - nr_pages = 1; - goto unlock_out; + return 0; } nr_pages = thp_nr_pages(page); - __mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages); + mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages); - if (__munlock_isolate_lru_page(page, true)) { - spin_unlock_irq(&pgdat->lru_lock); + if (!isolate_lru_page(page)) __munlock_isolated_page(page); - goto out; - } - __munlock_isolation_failed(page); - -unlock_out: - spin_unlock_irq(&pgdat->lru_lock); + else + __munlock_isolation_failed(page); -out: return nr_pages - 1; } -- 1.8.3.1
[PATCH v19 11/20] mm/lru: move lock into lru_note_cost
We have to move lru_lock into lru_note_cost, since it cycles up the memcg tree, in preparation for the future per-lruvec lru_lock replacement. It's a bit ugly and may cost a bit more locking, but the benefit from multiple memcg locking could cover the loss. Signed-off-by: Alex Shi Acked-by: Hugh Dickins Cc: Johannes Weiner Cc: Andrew Morton Cc: linux...@kvack.org Cc: linux-kernel@vger.kernel.org --- mm/swap.c | 3 +++ mm/vmscan.c | 4 +--- mm/workingset.c | 2 -- 3 files changed, 4 insertions(+), 5 deletions(-) diff --git a/mm/swap.c b/mm/swap.c index 4e33873975a6..6a4b00267384 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -268,7 +268,9 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages) { do { unsigned long lrusize; + struct pglist_data *pgdat = lruvec_pgdat(lruvec); + spin_lock_irq(&pgdat->lru_lock); /* Record cost event */ if (file) lruvec->file_cost += nr_pages; @@ -292,6 +294,7 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages) lruvec->file_cost /= 2; lruvec->anon_cost /= 2; } + spin_unlock_irq(&pgdat->lru_lock); } while ((lruvec = parent_lruvec(lruvec))); } diff --git a/mm/vmscan.c b/mm/vmscan.c index 32102e5d354d..656558bbf399 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1971,19 +1971,17 @@ static int current_may_throttle(void) &stat, false); spin_lock_irq(&pgdat->lru_lock); - move_pages_to_lru(lruvec, &page_list); __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken); - lru_note_cost(lruvec, file, stat.nr_pageout); item = current_is_kswapd() ?
PGSTEAL_KSWAPD : PGSTEAL_DIRECT; if (!cgroup_reclaim(sc)) __count_vm_events(item, nr_reclaimed); __count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed); __count_vm_events(PGSTEAL_ANON + file, nr_reclaimed); - spin_unlock_irq(&pgdat->lru_lock); + lru_note_cost(lruvec, file, stat.nr_pageout); mem_cgroup_uncharge_list(&page_list); free_unref_page_list(&page_list); diff --git a/mm/workingset.c b/mm/workingset.c index 92e66113a577..32e24cda1b4f 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -381,9 +381,7 @@ void workingset_refault(struct page *page, void *shadow) if (workingset) { SetPageWorkingset(page); /* XXX: Move to lru_cache_add() when it supports new vs putback */ - spin_lock_irq(&page_pgdat(page)->lru_lock); lru_note_cost_page(page); - spin_unlock_irq(&page_pgdat(page)->lru_lock); inc_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file); } out: -- 1.8.3.1
[PATCH] crypto: x86/poly1305 - Remove assignments with no effect
On Mon, Sep 21, 2020 at 04:56:52PM +0800, kernel test robot wrote: > tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git > master > head: eb5f95f1593f7c22dac681b19e815828e2af3efd > commit: d7d7b853566254648df59f7ea27ea05952a6cfa8 crypto: x86/poly1305 - wire > up faster implementations for kernel > date: 8 months ago > :: branch date: 14 hours ago > :: commit date: 8 months ago > compiler: gcc-9 (Debian 9.3.0-15) 9.3.0 > > If you fix the issue, kindly add following tag as appropriate > Reported-by: kernel test robot > > > cppcheck warnings: (new ones prefixed by >>) > >arch/x86/crypto/poly1305_glue.c:165:4: warning: Assignment of function > parameter has no effect outside the function. Did you forget dereferencing > it? [uselessAssignmentPtrArg] > inp += POLY1305_BLOCK_SIZE; > ^ > >> arch/x86/crypto/poly1305_glue.c:166:4: warning: Assignment of function > >> parameter has no effect outside the function. [uselessAssignmentArg] > len -= POLY1305_BLOCK_SIZE; > ^ This patch should fix the problem. ---8<--- This patch removes a few ineffectual assignments from the function crypto_poly1305_setdctxkey. Reported-by: kernel test robot Signed-off-by: Herbert Xu diff --git a/arch/x86/crypto/poly1305_glue.c b/arch/x86/crypto/poly1305_glue.c index f85885d6ccd8..e508dbd91813 100644 --- a/arch/x86/crypto/poly1305_glue.c +++ b/arch/x86/crypto/poly1305_glue.c @@ -158,9 +158,6 @@ static unsigned int crypto_poly1305_setdctxkey(struct poly1305_desc_ctx *dctx, dctx->s[1] = get_unaligned_le32(&inp[4]); dctx->s[2] = get_unaligned_le32(&inp[8]); dctx->s[3] = get_unaligned_le32(&inp[12]); - inp += POLY1305_BLOCK_SIZE; - len -= POLY1305_BLOCK_SIZE; - acc += POLY1305_BLOCK_SIZE; dctx->sset = true; } } -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
[PATCH v19 19/20] mm/lru: introduce the relock_page_lruvec function
From: Alexander Duyck Use this new function to replace repeated instances of the same code; no functional change. When testing for relock we can avoid the need for RCU locking if we simply compare the page pgdat and memcg pointers versus those that the lruvec is holding. By doing this we can avoid the extra pointer walks and accesses of the memory cgroup. In addition we can avoid the checks entirely if lruvec is currently NULL. Signed-off-by: Alexander Duyck Signed-off-by: Alex Shi Acked-by: Hugh Dickins Cc: Johannes Weiner Cc: Andrew Morton Cc: Thomas Gleixner Cc: Andrey Ryabinin Cc: Matthew Wilcox Cc: Mel Gorman Cc: Konstantin Khlebnikov Cc: Hugh Dickins Cc: Tejun Heo Cc: linux-kernel@vger.kernel.org Cc: cgro...@vger.kernel.org Cc: linux...@kvack.org --- include/linux/memcontrol.h | 52 ++ mm/mlock.c | 11 +- mm/swap.c | 33 +++-- mm/vmscan.c | 12 ++- 4 files changed, 62 insertions(+), 46 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 7b170e9028b5..bd8fdeccf6b5 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -488,6 +488,22 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg, struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *); +static inline bool lruvec_holds_page_lru_lock(struct page *page, + struct lruvec *lruvec) +{ + pg_data_t *pgdat = page_pgdat(page); + const struct mem_cgroup *memcg; + struct mem_cgroup_per_node *mz; + + if (mem_cgroup_disabled()) + return lruvec == &pgdat->__lruvec; + + mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec); + memcg = page->mem_cgroup ?
: root_mem_cgroup; + + return lruvec->pgdat == pgdat && mz->memcg == memcg; +} + struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p); struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm); @@ -1023,6 +1039,14 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page, return &pgdat->__lruvec; } +static inline bool lruvec_holds_page_lru_lock(struct page *page, + struct lruvec *lruvec) +{ + pg_data_t *pgdat = page_pgdat(page); + + return lruvec == &pgdat->__lruvec; +} + static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg) { return NULL; } @@ -1469,6 +1493,34 @@ static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec, spin_unlock_irqrestore(&lruvec->lru_lock, flags); } +/* Don't lock again iff page's lruvec locked */ +static inline struct lruvec *relock_page_lruvec_irq(struct page *page, + struct lruvec *locked_lruvec) +{ + if (locked_lruvec) { + if (lruvec_holds_page_lru_lock(page, locked_lruvec)) + return locked_lruvec; + + unlock_page_lruvec_irq(locked_lruvec); + } + + return lock_page_lruvec_irq(page); +} + +/* Don't lock again iff page's lruvec locked */ +static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page, + struct lruvec *locked_lruvec, unsigned long *flags) +{ + if (locked_lruvec) { + if (lruvec_holds_page_lru_lock(page, locked_lruvec)) + return locked_lruvec; + + unlock_page_lruvec_irqrestore(locked_lruvec, *flags); + } + + return lock_page_lruvec_irqsave(page, flags); +} + #ifdef CONFIG_CGROUP_WRITEBACK struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb); diff --git a/mm/mlock.c b/mm/mlock.c index ab164a675c25..55b3b3672977 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -277,16 +277,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone) * so we can spare the get_page() here.
*/ if (TestClearPageLRU(page)) { - struct lruvec *new_lruvec; - - new_lruvec = mem_cgroup_page_lruvec(page, - page_pgdat(page)); - if (new_lruvec != lruvec) { - if (lruvec) - unlock_page_lruvec_irq(lruvec); - lruvec = lock_page_lruvec_irq(page); - } - + lruvec = relock_page_lruvec_irq(page, lruvec); del_page_from_lru_list(page, lruvec, page_lru(page)); continue; diff --git a/mm/swap.c b/mm/swap.c index 622218bd18e5..375fb2e0683a 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -210,19 +210,12 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
[PATCH v19 12/20] mm/vmscan: remove lruvec reget in move_pages_to_lru
An isolated page can't be recharged to another memcg, since memcg migration is not possible while the page is isolated. So remove the unnecessary lruvec re-get. Thanks to Alexander Duyck for pointing this out. Signed-off-by: Alex Shi Acked-by: Hugh Dickins Cc: Alexander Duyck Cc: Andrew Morton Cc: Konstantin Khlebnikov Cc: Michal Hocko Cc: Hugh Dickins Cc: Johannes Weiner Cc: linux...@kvack.org Cc: linux-kernel@vger.kernel.org Cc: cgro...@vger.kernel.org --- mm/vmscan.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 656558bbf399..953a068b3fc7 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1885,7 +1885,8 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, continue; } - lruvec = mem_cgroup_page_lruvec(page, pgdat); + VM_BUG_ON_PAGE(mem_cgroup_page_lruvec(page, page_pgdat(page)) + != lruvec, page); lru = page_lru(page); nr_pages = thp_nr_pages(page); -- 1.8.3.1
[PATCH v19 15/20] mm/lru: introduce TestClearPageLRU
Currently lru_lock still guards both the lru list and the page's lru bit, which is fine. But if we want to use a per-lruvec lock for the page, we need to pin down the page's lruvec/memcg while locking. Just taking the lruvec lock first can be undermined by a concurrent memcg charge/migration of the page. To fix this, we clear the lru bit outside of the lock and use it as a pin that blocks page isolation during a memcg change. The standard steps of page isolation now become: 1, get_page(); # pin the page to keep it from being freed 2, TestClearPageLRU(); # block other isolation, e.g. a memcg change 3, spin_lock on lru_lock; # serialize lru list access 4, delete page from lru list; Step 2 can be optimized or replaced in scenarios where the page is unlikely to be accessed or moved between memcgs. This patch starts with the first part: TestClearPageLRU, which combines the PageLRU check and ClearPageLRU into one macro function, TestClearPageLRU. This function will be used as a page-isolation precondition to fence off other isolations elsewhere. Since there may then be !PageLRU pages on an lru list, the BUG() checks need to be removed accordingly. There are now two rules for the lru bit: 1, the lru bit still indicates whether a page is on an lru list; only during a temporary window (isolation) may a page be on an lru list without the lru bit set, but a page with the lru bit set must be on an lru list. 2, the lru bit must be cleared before the page is deleted from the lru list. As Andrew Morton mentioned, this change dirties the cacheline for a page that isn't on the LRU.
But the lost would be acceptable in Rong Chen report: https://lore.kernel.org/lkml/20200304090301.GB5972@shao2-debian/ Suggested-by: Johannes Weiner Signed-off-by: Alex Shi Acked-by: Hugh Dickins Cc: Hugh Dickins Cc: Johannes Weiner Cc: Michal Hocko Cc: Vladimir Davydov Cc: Andrew Morton Cc: linux-kernel@vger.kernel.org Cc: cgro...@vger.kernel.org Cc: linux...@kvack.org --- include/linux/page-flags.h | 1 + mm/mlock.c | 3 +-- mm/vmscan.c| 39 +++ 3 files changed, 21 insertions(+), 22 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 6be1aa559b1e..9554ed1387dc 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -326,6 +326,7 @@ static inline void page_init_poison(struct page *page, size_t size) PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, PF_HEAD) __CLEARPAGEFLAG(Dirty, dirty, PF_HEAD) PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD) + TESTCLEARFLAG(LRU, lru, PF_HEAD) PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD) TESTCLEARFLAG(Active, active, PF_HEAD) PAGEFLAG(Workingset, workingset, PF_HEAD) diff --git a/mm/mlock.c b/mm/mlock.c index d487aa864e86..7b0e6334be6f 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -276,10 +276,9 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone) * We already have pin from follow_page_mask() * so we can spare the get_page() here. */ - if (PageLRU(page)) { + if (TestClearPageLRU(page)) { struct lruvec *lruvec; - ClearPageLRU(page); lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); del_page_from_lru_list(page, lruvec, diff --git a/mm/vmscan.c b/mm/vmscan.c index 953a068b3fc7..9d06c609c60e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1540,7 +1540,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone, */ int __isolate_lru_page(struct page *page, isolate_mode_t mode) { - int ret = -EINVAL; + int ret = -EBUSY; /* Only take pages on the LRU. 
*/ if (!PageLRU(page)) @@ -1550,8 +1550,6 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode) if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE)) return ret; - ret = -EBUSY; - /* * To minimise LRU disruption, the caller can indicate that it only * wants to isolate pages it will be able to operate on without @@ -1598,8 +1596,10 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode) * sure the page is not being freed elsewhere -- the * page release code relies on it. */ - ClearPageLRU(page); - ret = 0; + if (TestClearPageLRU(page)) + ret = 0; + else + put_page(page); } return ret; @@ -1665,8 +1665,6 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, page = lru_to_page(src); prefetchw_prev_lru_page(page, src, flags); -
[PATCH v19 17/20] mm/swap.c: serialize memcg changes in pagevec_lru_move_fn
Hugh Dickins' found a memcg change bug on original version: If we want to change the pgdat->lru_lock to memcg's lruvec lock, we have to serialize mem_cgroup_move_account during pagevec_lru_move_fn. The possible bad scenario would like: cpu 0 cpu 1 lruvec = mem_cgroup_page_lruvec() if (!isolate_lru_page()) mem_cgroup_move_account spin_lock_irqsave(>lru_lock <== wrong lock. So we need TestClearPageLRU to block isolate_lru_page(), that serializes the memcg change. and then removing the PageLRU check in move_fn callee as the consequence. __pagevec_lru_add_fn() is different from the others, because the pages it deals with are, by definition, not yet on the lru. TestClearPageLRU is not needed and would not work, so __pagevec_lru_add() goes its own way. Reported-by: Hugh Dickins Signed-off-by: Alex Shi Acked-by: Hugh Dickins Cc: Andrew Morton Cc: linux...@kvack.org Cc: linux-kernel@vger.kernel.org --- mm/swap.c | 44 +++- 1 file changed, 35 insertions(+), 9 deletions(-) diff --git a/mm/swap.c b/mm/swap.c index 6a4b00267384..50418dcc5e24 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -222,8 +222,14 @@ static void pagevec_lru_move_fn(struct pagevec *pvec, spin_lock_irqsave(>lru_lock, flags); } + /* block memcg migration during page moving between lru */ + if (!TestClearPageLRU(page)) + continue; + lruvec = mem_cgroup_page_lruvec(page, pgdat); (*move_fn)(page, lruvec); + + SetPageLRU(page); } if (pgdat) spin_unlock_irqrestore(>lru_lock, flags); @@ -233,7 +239,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec, static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec) { - if (PageLRU(page) && !PageUnevictable(page)) { + if (!PageUnevictable(page)) { del_page_from_lru_list(page, lruvec, page_lru(page)); ClearPageActive(page); add_page_to_lru_list_tail(page, lruvec, page_lru(page)); @@ -306,7 +312,7 @@ void lru_note_cost_page(struct page *page) static void __activate_page(struct page *page, struct lruvec *lruvec) { - if (PageLRU(page) && !PageActive(page) && 
!PageUnevictable(page)) { + if (!PageActive(page) && !PageUnevictable(page)) { int lru = page_lru_base_type(page); int nr_pages = thp_nr_pages(page); @@ -362,7 +368,8 @@ void activate_page(struct page *page) page = compound_head(page); spin_lock_irq(>lru_lock); - __activate_page(page, mem_cgroup_page_lruvec(page, pgdat)); + if (PageLRU(page)) + __activate_page(page, mem_cgroup_page_lruvec(page, pgdat)); spin_unlock_irq(>lru_lock); } #endif @@ -521,9 +528,6 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec) bool active; int nr_pages = thp_nr_pages(page); - if (!PageLRU(page)) - return; - if (PageUnevictable(page)) return; @@ -564,7 +568,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec) static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec) { - if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) { + if (PageActive(page) && !PageUnevictable(page)) { int lru = page_lru_base_type(page); int nr_pages = thp_nr_pages(page); @@ -581,7 +585,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec) static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec) { - if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) && + if (PageAnon(page) && PageSwapBacked(page) && !PageSwapCache(page) && !PageUnevictable(page)) { bool active = PageActive(page); int nr_pages = thp_nr_pages(page); @@ -979,7 +983,29 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec) */ void __pagevec_lru_add(struct pagevec *pvec) { - pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn); + int i; + struct pglist_data *pgdat = NULL; + struct lruvec *lruvec; + unsigned long flags = 0; + + for (i = 0; i < pagevec_count(pvec); i++) { + struct page *page = pvec->pages[i]; + struct pglist_data *pagepgdat = page_pgdat(page); + + if (pagepgdat != pgdat) { + if (pgdat) + spin_unlock_irqrestore(>lru_lock, flags); + pgdat = pagepgdat; + spin_lock_irqsave(>lru_lock, flags); + } + 
+ lruvec = mem_cgroup_page_lruvec(page, pgdat); + __pagevec_lru_add_fn(page,
[PATCH v19 10/20] mm/swap.c: fold vm event PGROTATED into pagevec_move_tail_fn
Fold the PGROTATED event collection into pagevec_move_tail_fn call back func like other funcs does in pagevec_lru_move_fn. Thus we could save func call pagevec_move_tail(). Now all usage of pagevec_lru_move_fn are same and no needs of its 3rd parameter. It's just simply the calling. No functional change. [l...@intel.com: found a build issue in the original patch, thanks] Signed-off-by: Alex Shi Acked-by: Hugh Dickins Cc: Andrew Morton Cc: linux...@kvack.org Cc: linux-kernel@vger.kernel.org --- mm/swap.c | 65 ++- 1 file changed, 23 insertions(+), 42 deletions(-) diff --git a/mm/swap.c b/mm/swap.c index 1bc2fc5d8b7c..4e33873975a6 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -204,8 +204,7 @@ int get_kernel_page(unsigned long start, int write, struct page **pages) EXPORT_SYMBOL_GPL(get_kernel_page); static void pagevec_lru_move_fn(struct pagevec *pvec, - void (*move_fn)(struct page *page, struct lruvec *lruvec, void *arg), - void *arg) + void (*move_fn)(struct page *page, struct lruvec *lruvec)) { int i; struct pglist_data *pgdat = NULL; @@ -224,7 +223,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec, } lruvec = mem_cgroup_page_lruvec(page, pgdat); - (*move_fn)(page, lruvec, arg); + (*move_fn)(page, lruvec); } if (pgdat) spin_unlock_irqrestore(>lru_lock, flags); @@ -232,35 +231,22 @@ static void pagevec_lru_move_fn(struct pagevec *pvec, pagevec_reinit(pvec); } -static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec, -void *arg) +static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec) { - int *pgmoved = arg; - if (PageLRU(page) && !PageUnevictable(page)) { del_page_from_lru_list(page, lruvec, page_lru(page)); ClearPageActive(page); add_page_to_lru_list_tail(page, lruvec, page_lru(page)); - (*pgmoved) += thp_nr_pages(page); + __count_vm_events(PGROTATED, thp_nr_pages(page)); } } /* - * pagevec_move_tail() must be called with IRQ disabled. - * Otherwise this may cause nasty races. 
- */ -static void pagevec_move_tail(struct pagevec *pvec) -{ - int pgmoved = 0; - - pagevec_lru_move_fn(pvec, pagevec_move_tail_fn, ); - __count_vm_events(PGROTATED, pgmoved); -} - -/* * Writeback is about to end against a page which has been marked for immediate * reclaim. If it still appears to be reclaimable, move it to the tail of the * inactive list. + * + * rotate_reclaimable_page() must disable IRQs, to prevent nasty races. */ void rotate_reclaimable_page(struct page *page) { @@ -273,7 +259,7 @@ void rotate_reclaimable_page(struct page *page) local_lock_irqsave(_rotate.lock, flags); pvec = this_cpu_ptr(_rotate.pvec); if (!pagevec_add(pvec, page) || PageCompound(page)) - pagevec_move_tail(pvec); + pagevec_lru_move_fn(pvec, pagevec_move_tail_fn); local_unlock_irqrestore(_rotate.lock, flags); } } @@ -315,8 +301,7 @@ void lru_note_cost_page(struct page *page) page_is_file_lru(page), thp_nr_pages(page)); } -static void __activate_page(struct page *page, struct lruvec *lruvec, - void *arg) +static void __activate_page(struct page *page, struct lruvec *lruvec) { if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) { int lru = page_lru_base_type(page); @@ -340,7 +325,7 @@ static void activate_page_drain(int cpu) struct pagevec *pvec = _cpu(lru_pvecs.activate_page, cpu); if (pagevec_count(pvec)) - pagevec_lru_move_fn(pvec, __activate_page, NULL); + pagevec_lru_move_fn(pvec, __activate_page); } static bool need_activate_page_drain(int cpu) @@ -358,7 +343,7 @@ void activate_page(struct page *page) pvec = this_cpu_ptr(_pvecs.activate_page); get_page(page); if (!pagevec_add(pvec, page) || PageCompound(page)) - pagevec_lru_move_fn(pvec, __activate_page, NULL); + pagevec_lru_move_fn(pvec, __activate_page); local_unlock(_pvecs.lock); } } @@ -374,7 +359,7 @@ void activate_page(struct page *page) page = compound_head(page); spin_lock_irq(>lru_lock); - __activate_page(page, mem_cgroup_page_lruvec(page, pgdat), NULL); + __activate_page(page, 
mem_cgroup_page_lruvec(page, pgdat)); spin_unlock_irq(>lru_lock); } #endif @@ -527,8 +512,7 @@ void lru_cache_add_inactive_or_unevictable(struct page *page, * be write it out by flusher threads as this is much more effective * than the single-page writeout from reclaim. */ -static void
[PATCH v19 06/20] mm/thp: narrow lru locking
lru_lock and page cache xa_lock have no obvious reason to be taken one way round or the other: until now, lru_lock has been taken before page cache xa_lock, when splitting a THP; but nothing else takes them together. Reverse that ordering: let's narrow the lru locking - but leave local_irq_disable to block interrupts throughout, like before. Hugh Dickins point: split_huge_page_to_list() was already silly, to be using the _irqsave variant: it's just been taking sleeping locks, so would already be broken if entered with interrupts enabled. So we can save passing flags argument down to __split_huge_page(). Why change the lock ordering here? That was hard to decide. One reason: when this series reaches per-memcg lru locking, it relies on the THP's memcg to be stable when taking the lru_lock: that is now done after the THP's refcount has been frozen, which ensures page memcg cannot change. Another reason: previously, lock_page_memcg()'s move_lock was presumed to nest inside lru_lock; but now lru_lock must nest inside (page cache lock inside) move_lock, so it becomes possible to use lock_page_memcg() to stabilize page memcg before taking its lru_lock. That is not the mechanism used in this series, but it is an option we want to keep open. [Hugh Dickins: rewrite commit log] Signed-off-by: Alex Shi Reviewed-by: Kirill A. Shutemov Acked-by: Hugh Dickins Cc: Hugh Dickins Cc: Kirill A. 
Shutemov Cc: Andrea Arcangeli Cc: Johannes Weiner Cc: Matthew Wilcox Cc: Andrew Morton Cc: linux...@kvack.org Cc: linux-kernel@vger.kernel.org --- mm/huge_memory.c | 25 + 1 file changed, 13 insertions(+), 12 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 8b92cd197218..63af7611afaf 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2402,7 +2402,7 @@ static void __split_huge_page_tail(struct page *head, int tail, } static void __split_huge_page(struct page *page, struct list_head *list, - pgoff_t end, unsigned long flags) + pgoff_t end) { struct page *head = compound_head(page); pg_data_t *pgdat = page_pgdat(head); @@ -2411,8 +2411,6 @@ static void __split_huge_page(struct page *page, struct list_head *list, unsigned long offset = 0; int i; - lruvec = mem_cgroup_page_lruvec(head, pgdat); - /* complete memcg works before add pages to LRU */ mem_cgroup_split_huge_fixup(head); @@ -2424,6 +2422,11 @@ static void __split_huge_page(struct page *page, struct list_head *list, xa_lock(_cache->i_pages); } + /* prevent PageLRU to go away from under us, and freeze lru stats */ + spin_lock(>lru_lock); + + lruvec = mem_cgroup_page_lruvec(head, pgdat); + for (i = HPAGE_PMD_NR - 1; i >= 1; i--) { __split_huge_page_tail(head, i, lruvec, list); /* Some pages can be beyond i_size: drop them from page cache */ @@ -2443,6 +2446,8 @@ static void __split_huge_page(struct page *page, struct list_head *list, } ClearPageCompound(head); + spin_unlock(>lru_lock); + /* Caller disabled irqs, so they are still disabled here */ split_page_owner(head, HPAGE_PMD_ORDER); @@ -2460,8 +2465,7 @@ static void __split_huge_page(struct page *page, struct list_head *list, page_ref_add(head, 2); xa_unlock(>mapping->i_pages); } - - spin_unlock_irqrestore(>lru_lock, flags); + local_irq_enable(); remap_page(head); @@ -2600,12 +2604,10 @@ bool can_split_huge_page(struct page *page, int *pextra_pins) int split_huge_page_to_list(struct page *page, struct list_head *list) { struct page 
*head = compound_head(page); - struct pglist_data *pgdata = NODE_DATA(page_to_nid(head)); struct deferred_split *ds_queue = get_deferred_split_queue(head); struct anon_vma *anon_vma = NULL; struct address_space *mapping = NULL; int count, mapcount, extra_pins, ret; - unsigned long flags; pgoff_t end; VM_BUG_ON_PAGE(is_huge_zero_page(head), head); @@ -2666,9 +2668,8 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) unmap_page(head); VM_BUG_ON_PAGE(compound_mapcount(head), head); - /* prevent PageLRU to go away from under us, and freeze lru stats */ - spin_lock_irqsave(>lru_lock, flags); - + /* block interrupt reentry in xa_lock and spinlock */ + local_irq_disable(); if (mapping) { XA_STATE(xas, >i_pages, page_index(head)); @@ -2698,7 +2699,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) __dec_node_page_state(head, NR_FILE_THPS); } - __split_huge_page(page, list, end, flags); + __split_huge_page(page, list, end); if (PageSwapCache(head)) { swp_entry_t
[PATCH v19 16/20] mm/compaction: do page isolation first in compaction
Currently, compaction would get the lru_lock and then do page isolation which works fine with pgdat->lru_lock, since any page isoltion would compete for the lru_lock. If we want to change to memcg lru_lock, we have to isolate the page before getting lru_lock, thus isoltion would block page's memcg change which relay on page isoltion too. Then we could safely use per memcg lru_lock later. The new page isolation use previous introduced TestClearPageLRU() + pgdat lru locking which will be changed to memcg lru lock later. Hugh Dickins fixed following bugs in this patch's early version: Fix lots of crashes under compaction load: isolate_migratepages_block() must clean up appropriately when rejecting a page, setting PageLRU again if it had been cleared; and a put_page() after get_page_unless_zero() cannot safely be done while holding locked_lruvec - it may turn out to be the final put_page(), which will take an lruvec lock when PageLRU. And move __isolate_lru_page_prepare back after get_page_unless_zero to make trylock_page() safe: trylock_page() is not safe to use at this time: its setting PG_locked can race with the page being freed or allocated ("Bad page"), and can also erase flags being set by one of those "sole owners" of a freshly allocated page who use non-atomic __SetPageFlag(). 
Suggested-by: Johannes Weiner Signed-off-by: Alex Shi Acked-by: Hugh Dickins Cc: Andrew Morton Cc: Matthew Wilcox Cc: linux-kernel@vger.kernel.org Cc: linux...@kvack.org --- include/linux/swap.h | 2 +- mm/compaction.c | 42 +- mm/vmscan.c | 43 ++- 3 files changed, 56 insertions(+), 31 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 43e6b3458f58..550fdfdc3506 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -357,7 +357,7 @@ extern void lru_cache_add_inactive_or_unevictable(struct page *page, extern unsigned long zone_reclaimable_pages(struct zone *zone); extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, gfp_t gfp_mask, nodemask_t *mask); -extern int __isolate_lru_page(struct page *page, isolate_mode_t mode); +extern int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode); extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, unsigned long nr_pages, gfp_t gfp_mask, diff --git a/mm/compaction.c b/mm/compaction.c index 176dcded298e..16a9f024433d 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -887,6 +887,7 @@ static bool too_many_isolated(pg_data_t *pgdat) if (!valid_page && IS_ALIGNED(low_pfn, pageblock_nr_pages)) { if (!cc->ignore_skip_hint && get_pageblock_skip(page)) { low_pfn = end_pfn; + page = NULL; goto isolate_abort; } valid_page = page; @@ -968,6 +969,21 @@ static bool too_many_isolated(pg_data_t *pgdat) if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page)) goto isolate_fail; + /* +* Be careful not to clear PageLRU until after we're +* sure the page is not being freed elsewhere -- the +* page release code relies on it. 
+*/ + if (unlikely(!get_page_unless_zero(page))) + goto isolate_fail; + + if (__isolate_lru_page_prepare(page, isolate_mode) != 0) + goto isolate_fail_put; + + /* Try isolate the page */ + if (!TestClearPageLRU(page)) + goto isolate_fail_put; + /* If we already hold the lock, we can skip some rechecking */ if (!locked) { locked = compact_lock_irqsave(>lru_lock, @@ -980,10 +996,6 @@ static bool too_many_isolated(pg_data_t *pgdat) goto isolate_abort; } - /* Recheck PageLRU and PageCompound under lock */ - if (!PageLRU(page)) - goto isolate_fail; - /* * Page become compound since the non-locked check, * and it's on LRU. It can only be a THP so the order @@ -991,16 +1003,13 @@ static bool too_many_isolated(pg_data_t *pgdat) */ if (unlikely(PageCompound(page) && !cc->alloc_contig)) { low_pfn += compound_nr(page) - 1; - goto isolate_fail; + SetPageLRU(page); + goto isolate_fail_put;
[PATCH v19 07/20] mm/vmscan: remove unnecessary lruvec adding
We don't have to add a freeable page into lru and then remove from it. This change saves a couple of actions and makes the moving more clear. The SetPageLRU needs to be kept before put_page_testzero for list integrity, otherwise: #0 move_pages_to_lru #1 release_pages if !put_page_testzero if (put_page_testzero()) !PageLRU //skip lru_lock SetPageLRU() list_add(>lru,) list_add(>lru,) [a...@linux-foundation.org: coding style fixes] Signed-off-by: Alex Shi Acked-by: Hugh Dickins Cc: Andrew Morton Cc: Johannes Weiner Cc: Tejun Heo Cc: Matthew Wilcox Cc: Hugh Dickins Cc: linux...@kvack.org Cc: linux-kernel@vger.kernel.org --- mm/vmscan.c | 38 +- 1 file changed, 25 insertions(+), 13 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 466fc3144fff..32102e5d354d 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1850,26 +1850,30 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, while (!list_empty(list)) { page = lru_to_page(list); VM_BUG_ON_PAGE(PageLRU(page), page); + list_del(>lru); if (unlikely(!page_evictable(page))) { - list_del(>lru); spin_unlock_irq(>lru_lock); putback_lru_page(page); spin_lock_irq(>lru_lock); continue; } - lruvec = mem_cgroup_page_lruvec(page, pgdat); + /* +* The SetPageLRU needs to be kept here for list integrity. 
+* Otherwise: +* #0 move_pages_to_lru #1 release_pages +* if !put_page_testzero +*if (put_page_testzero()) +* !PageLRU //skip lru_lock +* SetPageLRU() +* list_add(>lru,) +*list_add(>lru,) +*/ SetPageLRU(page); - lru = page_lru(page); - nr_pages = thp_nr_pages(page); - update_lru_size(lruvec, lru, page_zonenum(page), nr_pages); - list_move(>lru, >lists[lru]); - - if (put_page_testzero(page)) { + if (unlikely(put_page_testzero(page))) { __ClearPageLRU(page); __ClearPageActive(page); - del_page_from_lru_list(page, lruvec, lru); if (unlikely(PageCompound(page))) { spin_unlock_irq(>lru_lock); @@ -1877,11 +1881,19 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, spin_lock_irq(>lru_lock); } else list_add(>lru, _to_free); - } else { - nr_moved += nr_pages; - if (PageActive(page)) - workingset_age_nonresident(lruvec, nr_pages); + + continue; } + + lruvec = mem_cgroup_page_lruvec(page, pgdat); + lru = page_lru(page); + nr_pages = thp_nr_pages(page); + + update_lru_size(lruvec, lru, page_zonenum(page), nr_pages); + list_add(>lru, >lists[lru]); + nr_moved += nr_pages; + if (PageActive(page)) + workingset_age_nonresident(lruvec, nr_pages); } /* -- 1.8.3.1
[PATCH v19 18/20] mm/lru: replace pgdat lru_lock with lruvec lock
This patch moves the per-node lru_lock into the lruvec, providing a lru_lock for each memcg on each node. So on a large machine, memcgs no longer have to suffer from contention on the per-node pgdat->lru_lock; each can go fast with its own lru_lock. After moving the memcg charge before lru insertion, page isolation serializes the page's memcg, so the per-memcg lruvec lock is stable and can replace the per-node lru lock. In isolate_migratepages_block(), compact_unlock_should_abort and lock_page_lruvec_irqsave are open coded to work with compact_control. Also add a debug function to the locking, which may give some clues if something gets out of hand. Daniel Jordan's testing shows a 62% improvement on a modified readtwice case on his 2P * 10 core * 2 HT Broadwell box. https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3lo...@ca-dmjordan1.us.oracle.com/ On a large machine with memcg enabled but not used, looking up the page's lruvec traverses a few extra pointers, which may increase lru_lock hold time and cause a slight regression. Hugh Dickins helped on patch polish, thanks!
Signed-off-by: Alex Shi Acked-by: Hugh Dickins Cc: Hugh Dickins Cc: Andrew Morton Cc: Johannes Weiner Cc: Michal Hocko Cc: Vladimir Davydov Cc: Yang Shi Cc: Matthew Wilcox Cc: Konstantin Khlebnikov Cc: Tejun Heo Cc: linux-kernel@vger.kernel.org Cc: linux...@kvack.org Cc: cgro...@vger.kernel.org --- include/linux/memcontrol.h | 58 + include/linux/mmzone.h | 3 +- mm/compaction.c| 56 +++- mm/huge_memory.c | 11 ++--- mm/memcontrol.c| 62 -- mm/mlock.c | 22 +++--- mm/mmzone.c| 1 + mm/page_alloc.c| 1 - mm/swap.c | 105 + mm/vmscan.c| 55 +++- 10 files changed, 249 insertions(+), 125 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index d0b036123c6a..7b170e9028b5 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -494,6 +494,19 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg, struct mem_cgroup *get_mem_cgroup_from_page(struct page *page); +struct lruvec *lock_page_lruvec(struct page *page); +struct lruvec *lock_page_lruvec_irq(struct page *page); +struct lruvec *lock_page_lruvec_irqsave(struct page *page, + unsigned long *flags); + +#ifdef CONFIG_DEBUG_VM +void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page); +#else +static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page) +{ +} +#endif + static inline struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){ return css ? 
container_of(css, struct mem_cgroup, css) : NULL; @@ -1035,6 +1048,31 @@ static inline void mem_cgroup_put(struct mem_cgroup *memcg) { } +static inline struct lruvec *lock_page_lruvec(struct page *page) +{ + struct pglist_data *pgdat = page_pgdat(page); + + spin_lock(>__lruvec.lru_lock); + return >__lruvec; +} + +static inline struct lruvec *lock_page_lruvec_irq(struct page *page) +{ + struct pglist_data *pgdat = page_pgdat(page); + + spin_lock_irq(>__lruvec.lru_lock); + return >__lruvec; +} + +static inline struct lruvec *lock_page_lruvec_irqsave(struct page *page, + unsigned long *flagsp) +{ + struct pglist_data *pgdat = page_pgdat(page); + + spin_lock_irqsave(>__lruvec.lru_lock, *flagsp); + return >__lruvec; +} + static inline struct mem_cgroup * mem_cgroup_iter(struct mem_cgroup *root, struct mem_cgroup *prev, @@ -1282,6 +1320,10 @@ static inline void count_memcg_page_event(struct page *page, void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx) { } + +static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page) +{ +} #endif /* CONFIG_MEMCG */ /* idx can be of type enum memcg_stat_item or node_stat_item */ @@ -1411,6 +1453,22 @@ static inline struct lruvec *parent_lruvec(struct lruvec *lruvec) return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec)); } +static inline void unlock_page_lruvec(struct lruvec *lruvec) +{ + spin_unlock(>lru_lock); +} + +static inline void unlock_page_lruvec_irq(struct lruvec *lruvec) +{ + spin_unlock_irq(>lru_lock); +} + +static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec, + unsigned long flags) +{ + spin_unlock_irqrestore(>lru_lock, flags); +} + #ifdef CONFIG_CGROUP_WRITEBACK struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb); diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 8379432f4f2f..7727f4c373f7 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -273,6 +273,8 @@ enum lruvec_flags { }; struct lruvec { + /* per lruvec lru_lock 
for memcg */ +
[PATCH v19 03/20] mm/thp: move lru_add_page_tail func to huge_memory.c
The func is only used in huge_memory.c, defining it in other file with a CONFIG_TRANSPARENT_HUGEPAGE macro restrict just looks weird. Let's move it THP. And make it static as Hugh Dickin suggested. Signed-off-by: Alex Shi Reviewed-by: Kirill A. Shutemov Acked-by: Hugh Dickins Cc: Andrew Morton Cc: Johannes Weiner Cc: Matthew Wilcox Cc: Hugh Dickins Cc: linux-kernel@vger.kernel.org Cc: linux...@kvack.org --- include/linux/swap.h | 2 -- mm/huge_memory.c | 30 ++ mm/swap.c| 33 - 3 files changed, 30 insertions(+), 35 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 661046994db4..43e6b3458f58 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -338,8 +338,6 @@ extern void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages); extern void lru_note_cost_page(struct page *); extern void lru_cache_add(struct page *); -extern void lru_add_page_tail(struct page *page, struct page *page_tail, -struct lruvec *lruvec, struct list_head *head); extern void activate_page(struct page *); extern void mark_page_accessed(struct page *); extern void lru_add_drain(void); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index faadc449cca5..211d50f09785 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2318,6 +2318,36 @@ static void remap_page(struct page *page) } } +static void lru_add_page_tail(struct page *page, struct page *page_tail, + struct lruvec *lruvec, struct list_head *list) +{ + VM_BUG_ON_PAGE(!PageHead(page), page); + VM_BUG_ON_PAGE(PageCompound(page_tail), page); + VM_BUG_ON_PAGE(PageLRU(page_tail), page); + lockdep_assert_held(_pgdat(lruvec)->lru_lock); + + if (!list) + SetPageLRU(page_tail); + + if (likely(PageLRU(page))) + list_add_tail(_tail->lru, >lru); + else if (list) { + /* page reclaim is reclaiming a huge page */ + get_page(page_tail); + list_add_tail(_tail->lru, list); + } else { + /* +* Head page has not yet been counted, as an hpage, +* so we must account for each subpage individually. 
+* +* Put page_tail on the list at the correct position +* so they all end up in order. +*/ + add_page_to_lru_list_tail(page_tail, lruvec, + page_lru(page_tail)); + } +} + static void __split_huge_page_tail(struct page *head, int tail, struct lruvec *lruvec, struct list_head *list) { diff --git a/mm/swap.c b/mm/swap.c index e7bdf094f76a..1bc2fc5d8b7c 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -935,39 +935,6 @@ void __pagevec_release(struct pagevec *pvec) } EXPORT_SYMBOL(__pagevec_release); -#ifdef CONFIG_TRANSPARENT_HUGEPAGE -/* used by __split_huge_page_refcount() */ -void lru_add_page_tail(struct page *page, struct page *page_tail, - struct lruvec *lruvec, struct list_head *list) -{ - VM_BUG_ON_PAGE(!PageHead(page), page); - VM_BUG_ON_PAGE(PageCompound(page_tail), page); - VM_BUG_ON_PAGE(PageLRU(page_tail), page); - lockdep_assert_held(_pgdat(lruvec)->lru_lock); - - if (!list) - SetPageLRU(page_tail); - - if (likely(PageLRU(page))) - list_add_tail(_tail->lru, >lru); - else if (list) { - /* page reclaim is reclaiming a huge page */ - get_page(page_tail); - list_add_tail(_tail->lru, list); - } else { - /* -* Head page has not yet been counted, as an hpage, -* so we must account for each subpage individually. -* -* Put page_tail on the list at the correct position -* so they all end up in order. -*/ - add_page_to_lru_list_tail(page_tail, lruvec, - page_lru(page_tail)); - } -} -#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ - static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec, void *arg) { -- 1.8.3.1
[PATCH v19 05/20] mm/thp: Simplify lru_add_page_tail()
Simplify lru_add_page_tail(): there are actually only two cases possible: split_huge_page_to_list(), with list supplied and head isolated from lru by its caller; or split_huge_page(), with NULL list and head on lru - because when head is racily isolated from lru, the isolator's reference will stop the split from getting any further than its page_ref_freeze(). So decide between the two cases by "list", but add VM_WARN_ON()s to verify that they match our lru expectations.

[Hugh Dickins: rewrite commit log]

Signed-off-by: Alex Shi
Reviewed-by: Kirill A. Shutemov
Acked-by: Hugh Dickins
Cc: Kirill A. Shutemov
Cc: Andrew Morton
Cc: Johannes Weiner
Cc: Matthew Wilcox
Cc: Hugh Dickins
Cc: Mika Penttilä
Cc: linux...@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/huge_memory.c | 21 ++++++---------------
 1 file changed, 6 insertions(+), 15 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fa0753229b52..8b92cd197218 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2326,25 +2326,16 @@ static void lru_add_page_tail(struct page *head, struct page *page_tail,
 	VM_BUG_ON_PAGE(PageLRU(page_tail), head);
 	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
 
-	if (!list)
-		SetPageLRU(page_tail);
-
-	if (likely(PageLRU(head)))
-		list_add_tail(&page_tail->lru, &head->lru);
-	else if (list) {
+	if (list) {
 		/* page reclaim is reclaiming a huge page */
+		VM_WARN_ON(PageLRU(head));
 		get_page(page_tail);
 		list_add_tail(&page_tail->lru, list);
 	} else {
-		/*
-		 * Head page has not yet been counted, as an hpage,
-		 * so we must account for each subpage individually.
-		 *
-		 * Put page_tail on the list at the correct position
-		 * so they all end up in order.
-		 */
-		add_page_to_lru_list_tail(page_tail, lruvec,
-					  page_lru(page_tail));
+		/* head is still on lru (and we have it frozen) */
+		VM_WARN_ON(!PageLRU(head));
+		SetPageLRU(page_tail);
+		list_add_tail(&page_tail->lru, &head->lru);
 	}
 }
-- 
1.8.3.1
[PATCH v19 20/20] mm/lru: revise the comments of lru_lock
From: Hugh Dickins Since we changed the pgdat->lru_lock to lruvec->lru_lock, it's time to fix the incorrect comments in code. Also fixed some zone->lru_lock comment error from ancient time. etc. I struggled to understand the comment above move_pages_to_lru() (surely it never calls page_referenced()), and eventually realized that most of it had got separated from shrink_active_list(): move that comment back. Signed-off-by: Hugh Dickins Signed-off-by: Alex Shi Cc: Andrew Morton Cc: Tejun Heo Cc: Andrey Ryabinin Cc: Jann Horn Cc: Mel Gorman Cc: Johannes Weiner Cc: Matthew Wilcox Cc: Hugh Dickins Cc: cgro...@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux...@kvack.org --- Documentation/admin-guide/cgroup-v1/memcg_test.rst | 15 ++-- Documentation/admin-guide/cgroup-v1/memory.rst | 21 +-- Documentation/trace/events-kmem.rst| 2 +- Documentation/vm/unevictable-lru.rst | 22 +--- include/linux/mm_types.h | 2 +- include/linux/mmzone.h | 3 +- mm/filemap.c | 4 +-- mm/rmap.c | 4 +-- mm/vmscan.c| 41 -- 9 files changed, 50 insertions(+), 64 deletions(-) diff --git a/Documentation/admin-guide/cgroup-v1/memcg_test.rst b/Documentation/admin-guide/cgroup-v1/memcg_test.rst index 3f7115e07b5d..0b9f91589d3d 100644 --- a/Documentation/admin-guide/cgroup-v1/memcg_test.rst +++ b/Documentation/admin-guide/cgroup-v1/memcg_test.rst @@ -133,18 +133,9 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. 8. LRU == -Each memcg has its own private LRU. Now, its handling is under global - VM's control (means that it's handled under global pgdat->lru_lock). - Almost all routines around memcg's LRU is called by global LRU's - list management functions under pgdat->lru_lock. - - A special function is mem_cgroup_isolate_pages(). This scans - memcg's private LRU and call __isolate_lru_page() to extract a page - from LRU. - - (By __isolate_lru_page(), the page is removed from both of global and - private LRU.) 
- + Each memcg has its own vector of LRUs (inactive anon, active anon, + inactive file, active file, unevictable) of pages from each node, + each LRU handled under a single lru_lock for that memcg and node. 9. Typical Tests. = diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst index 12757e63b26c..24450696579f 100644 --- a/Documentation/admin-guide/cgroup-v1/memory.rst +++ b/Documentation/admin-guide/cgroup-v1/memory.rst @@ -285,20 +285,17 @@ When oom event notifier is registered, event will be delivered. 2.6 Locking --- - lock_page_cgroup()/unlock_page_cgroup() should not be called under - the i_pages lock. +Lock order is as follows: - Other lock order is following: + Page lock (PG_locked bit of page->flags) +mm->page_table_lock or split pte_lock + lock_page_memcg (memcg->move_lock) +mapping->i_pages lock + lruvec->lru_lock. - PG_locked. - mm->page_table_lock - pgdat->lru_lock - lock_page_cgroup. - - In many cases, just lock_page_cgroup() is called. - - per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by - pgdat->lru_lock, it has no lock of its own. +Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by +lruvec->lru_lock; PG_lru bit of page->flags is cleared before +isolating a page from its LRU under lruvec->lru_lock. 2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM) --- diff --git a/Documentation/trace/events-kmem.rst b/Documentation/trace/events-kmem.rst index 555484110e36..68fa75247488 100644 --- a/Documentation/trace/events-kmem.rst +++ b/Documentation/trace/events-kmem.rst @@ -69,7 +69,7 @@ When pages are freed in batch, the also mm_page_free_batched is triggered. Broadly speaking, pages are taken off the LRU lock in bulk and freed in batch with a page list. Significant amounts of activity here could indicate that the system is under memory pressure and can also indicate -contention on the zone->lru_lock. +contention on the lruvec->lru_lock. 4. 
Per-CPU Allocator Activity = diff --git a/Documentation/vm/unevictable-lru.rst b/Documentation/vm/unevictable-lru.rst index 17d0861b0f1d..0e1490524f53 100644 --- a/Documentation/vm/unevictable-lru.rst +++ b/Documentation/vm/unevictable-lru.rst @@ -33,7 +33,7 @@ reclaim in Linux. The problems have been observed at customer sites on large memory x86_64 systems. To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of -main memory will have over 32 million 4k pages in a single zone. When a large +main memory will have over 32 million
[PATCH v19 09/20] mm/memcg: add debug checking in lock_page_memcg
Add debug checking in lock_page_memcg(), so that we get a warning if anything goes wrong here.

Suggested-by: Johannes Weiner
Signed-off-by: Alex Shi
Acked-by: Hugh Dickins
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Vladimir Davydov
Cc: Andrew Morton
Cc: cgro...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/memcontrol.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 368e6ac644f0..6acc6a60c52b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2097,6 +2097,12 @@ struct mem_cgroup *lock_page_memcg(struct page *page)
 	if (unlikely(!memcg))
 		return NULL;
 
+#ifdef CONFIG_PROVE_LOCKING
+	local_irq_save(flags);
+	might_lock(&memcg->move_lock);
+	local_irq_restore(flags);
+#endif
+
 	if (atomic_read(&memcg->moving_account) <= 0)
 		return memcg;
-- 
1.8.3.1
[PATCH v19 00/20] per memcg lru_lock
This new version is rebased on v5.9-rc6, with a line-by-line review by Hugh Dickins. Many thanks! The 4th part of the last version, which did post-merge optimization, has been removed; we can repost it after the main part gets merged. After about a year of coding and review, I now believe it's ready.

So this patchset now includes 3 parts:
1, some code cleanup and minimal optimization as preparation.
2, use TestClearPageLRU as the precondition for page isolation.
3, replace the per-node lru_lock with a per-memcg, per-node lru_lock.

The current lru_lock is one per node, pgdat->lru_lock, guarding the lru lists, but the lru lists were moved into memcg a long time ago. Still using a per-node lru_lock is clearly unscalable: pages on each memcg have to compete with every other memcg for the one shared lru_lock. This patchset uses a per-lruvec (per-memcg, per-node) lru_lock to replace the per-node lru_lock guarding the lru lists, making it scalable per memcg and gaining performance.

Currently lru_lock still guards both the lru list and the page's lru bit; that's fine. But if we want to take the lruvec-specific lock for a page, we need to pin down the page's lruvec/memcg while locking. Just taking the lruvec lock first can be undermined by a memcg charge/migration of the page. To fix this problem, we clear the page's lru bit first and use that as the pin-down action that blocks memcg changes. That's the reason for the new atomic function TestClearPageLRU. So isolating a page now needs both actions: TestClearPageLRU, then holding the lru_lock. The typical user of this is isolate_migratepages_block() in compaction.c: we have to take the lru bit before the lru lock, which serializes page isolation against memcg page charge/migration, since those change the page's lruvec and hence which lru_lock applies.

The above solution was suggested by Johannes Weiner and builds on his new memcg charge path, which led to this patchset. (Hugh Dickins tested and contributed much code, from compaction fixes to general code polish, thanks a lot!).
Daniel Jordan's testing show 62% improvement on modified readtwice case on his 2P * 10 core * 2 HT broadwell box. https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3lo...@ca-dmjordan1.us.oracle.com/ Thanks Hugh Dickins and Konstantin Khlebnikov, they both brought this idea 8 years ago, and others who give comments as well: Daniel Jordan, Mel Gorman, Shakeel Butt, Matthew Wilcox, Alexander Duyck etc. Thanks for Testing support from Intel 0day and Rong Chen, Fengguang Wu, and Yun Wang. Hugh Dickins also shared his kbuild-swap case. Thanks! Alex Shi (17): mm/memcg: warning on !memcg after readahead page charged mm/memcg: bail early from swap accounting if memcg disabled mm/thp: move lru_add_page_tail func to huge_memory.c mm/thp: use head for head page in lru_add_page_tail mm/thp: Simplify lru_add_page_tail() mm/thp: narrow lru locking mm/vmscan: remove unnecessary lruvec adding mm/memcg: add debug checking in lock_page_memcg mm/swap.c: fold vm event PGROTATED into pagevec_move_tail_fn mm/lru: move lock into lru_note_cost mm/vmscan: remove lruvec reget in move_pages_to_lru mm/mlock: remove lru_lock on TestClearPageMlocked mm/mlock: remove __munlock_isolate_lru_page mm/lru: introduce TestClearPageLRU mm/compaction: do page isolation first in compaction mm/swap.c: serialize memcg changes in pagevec_lru_move_fn mm/lru: replace pgdat lru_lock with lruvec lock Alexander Duyck (1): mm/lru: introduce the relock_page_lruvec function Hugh Dickins (2): mm: page_idle_get_page() does not need lru_lock mm/lru: revise the comments of lru_lock Documentation/admin-guide/cgroup-v1/memcg_test.rst | 15 +- Documentation/admin-guide/cgroup-v1/memory.rst | 21 +-- Documentation/trace/events-kmem.rst| 2 +- Documentation/vm/unevictable-lru.rst | 22 +-- include/linux/memcontrol.h | 110 +++ include/linux/mm_types.h | 2 +- include/linux/mmdebug.h| 13 ++ include/linux/mmzone.h | 6 +- include/linux/page-flags.h | 1 + include/linux/swap.h | 4 +- mm/compaction.c| 94 +++--- mm/filemap.c | 4 +- 
mm/huge_memory.c | 45 +++-- mm/memcontrol.c| 85 - mm/mlock.c | 63 ++- mm/mmzone.c| 1 + mm/page_alloc.c| 1 - mm/page_idle.c | 4 - mm/rmap.c | 4 +- mm/swap.c | 199 mm/vmscan.c| 203 +++-- mm/workingset.c| 2 - 22 files changed, 523 insertions(+), 378
[PATCH v19 01/20] mm/memcg: warning on !memcg after readahead page charged
Add VM_WARN_ON_ONCE_PAGE() macro. Since readahead page is charged on memcg too, in theory we don't have to check this exception now. Before safely remove them all, add a warning for the unexpected !memcg. Signed-off-by: Alex Shi Acked-by: Michal Hocko Acked-by: Hugh Dickins Cc: Johannes Weiner Cc: Michal Hocko Cc: Vladimir Davydov Cc: Andrew Morton Cc: cgro...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-kernel@vger.kernel.org --- include/linux/mmdebug.h | 13 + mm/memcontrol.c | 11 --- 2 files changed, 17 insertions(+), 7 deletions(-) diff --git a/include/linux/mmdebug.h b/include/linux/mmdebug.h index 2ad72d2c8cc5..4ed52879ce55 100644 --- a/include/linux/mmdebug.h +++ b/include/linux/mmdebug.h @@ -37,6 +37,18 @@ BUG(); \ } \ } while (0) +#define VM_WARN_ON_ONCE_PAGE(cond, page) ({ \ + static bool __section(.data.once) __warned; \ + int __ret_warn_once = !!(cond); \ + \ + if (unlikely(__ret_warn_once && !__warned)) { \ + dump_page(page, "VM_WARN_ON_ONCE_PAGE(" __stringify(cond)")");\ + __warned = true;\ + WARN_ON(1); \ + } \ + unlikely(__ret_warn_once); \ +}) + #define VM_WARN_ON(cond) (void)WARN_ON(cond) #define VM_WARN_ON_ONCE(cond) (void)WARN_ON_ONCE(cond) #define VM_WARN_ONCE(cond, format...) (void)WARN_ONCE(cond, format) @@ -48,6 +60,7 @@ #define VM_BUG_ON_MM(cond, mm) VM_BUG_ON(cond) #define VM_WARN_ON(cond) BUILD_BUG_ON_INVALID(cond) #define VM_WARN_ON_ONCE(cond) BUILD_BUG_ON_INVALID(cond) +#define VM_WARN_ON_ONCE_PAGE(cond, page) BUILD_BUG_ON_INVALID(cond) #define VM_WARN_ONCE(cond, format...) BUILD_BUG_ON_INVALID(cond) #define VM_WARN(cond, format...) BUILD_BUG_ON_INVALID(cond) #endif diff --git a/mm/memcontrol.c b/mm/memcontrol.c index cfa6cbad21d5..f7094a04be07 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1322,10 +1322,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd } memcg = page->mem_cgroup; - /* -* Swapcache readahead pages are added to the LRU - and -* possibly migrated - before they are charged. 
-*/ + VM_WARN_ON_ONCE_PAGE(!memcg, page); if (!memcg) memcg = root_mem_cgroup; @@ -6912,8 +6909,8 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage) if (newpage->mem_cgroup) return; - /* Swapcache readahead pages can get replaced before being charged */ memcg = oldpage->mem_cgroup; + VM_WARN_ON_ONCE_PAGE(!memcg, oldpage); if (!memcg) return; @@ -7110,7 +7107,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry) memcg = page->mem_cgroup; - /* Readahead page, never charged */ + VM_WARN_ON_ONCE_PAGE(!memcg, page); if (!memcg) return; @@ -7174,7 +7171,7 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry) memcg = page->mem_cgroup; - /* Readahead page, never charged */ + VM_WARN_ON_ONCE_PAGE(!memcg, page); if (!memcg) return 0; -- 1.8.3.1
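For readers unfamiliar with the warn-once pattern, the VM_WARN_ON_ONCE_PAGE() macro above behaves roughly like this Python model (a toy sketch, not kernel code): a per-call-site flag, playing the role of the static `__warned`, makes each site warn at most once, while the condition's value is still returned every time so callers can still branch on it.

```python
def make_warn_once():
    warned = [False]                 # plays the role of the static __warned

    def vm_warn_on_once_page(cond, page):
        if cond and not warned[0]:
            warned[0] = True
            print("WARN: condition hit for page", page)
        return cond                  # the condition is always returned

    return vm_warn_on_once_page


warn = make_warn_once()
warn(True, 0x1000)   # prints the warning
warn(True, 0x2000)   # condition still True, but now silent
```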
[RFC PATCH 23/24] vdpa_sim: filter destination mac address
Add a simple unicast filter that drops packets whose destination MAC does not match the one stored in the config.

Signed-off-by: Jason Wang
---
 drivers/vdpa/vdpa_sim/vdpa_sim.c | 49 ++++++++++++++++++++------------
 1 file changed, 31 insertions(+), 18 deletions(-)

diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index b21670e054ba..66d901fb4c57 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -171,6 +171,22 @@ static void vdpasim_complete(struct vdpasim_virtqueue *vq, size_t len)
 	local_bh_enable();
 }
 
+static bool receive_filter(struct vdpasim *vdpasim, size_t len)
+{
+	bool modern = vdpasim->features & (1ULL << VIRTIO_F_VERSION_1);
+	size_t hdr_len = modern ? sizeof(struct virtio_net_hdr_v1) :
+				  sizeof(struct virtio_net_hdr);
+
+	if (len < ETH_ALEN + hdr_len)
+		return false;
+
+	if (!strncmp(vdpasim->buffer + hdr_len,
+		     vdpasim->config.mac, ETH_ALEN))
+		return true;
+
+	return false;
+}
+
 static void vdpasim_work(struct work_struct *work)
 {
 	struct vdpasim *vdpasim = container_of(work, struct
@@ -178,7 +194,6 @@ static void vdpasim_work(struct work_struct *work)
 	struct vdpasim_virtqueue *txq = &vdpasim->vqs[1];
 	struct vdpasim_virtqueue *rxq = &vdpasim->vqs[0];
 	ssize_t read, write;
-	size_t total_write;
 	int pkts = 0;
 	int err;
 
@@ -191,36 +206,34 @@ static void vdpasim_work(struct work_struct *work)
 		goto out;
 
 	while (true) {
-		total_write = 0;
 		err = vringh_getdesc_iotlb(&txq->vring, &txq->riov, NULL,
 					   &txq->head, GFP_ATOMIC);
 		if (err <= 0)
 			break;
 
+		read = vringh_iov_pull_iotlb(&txq->vring, &txq->riov,
+					     vdpasim->buffer,
+					     PAGE_SIZE);
+
+		if (!receive_filter(vdpasim, read)) {
+			vdpasim_complete(txq, 0);
+			continue;
+		}
+
 		err = vringh_getdesc_iotlb(&rxq->vring, NULL, &rxq->wiov,
 					   &rxq->head, GFP_ATOMIC);
 		if (err <= 0) {
-			vringh_complete_iotlb(&txq->vring, txq->head, 0);
+			vdpasim_complete(txq, 0);
 			break;
 		}
 
-		while (true) {
-			read = vringh_iov_pull_iotlb(&txq->vring, &txq->riov,
-						     vdpasim->buffer,
-						     PAGE_SIZE);
-			if (read <= 0)
-				break;
-
-			write = vringh_iov_push_iotlb(&rxq->vring, &rxq->wiov,
-						      vdpasim->buffer, read);
-			if (write <= 0)
-				break;
-
-			total_write += write;
-		}
+		write = vringh_iov_push_iotlb(&rxq->vring, &rxq->wiov,
+					      vdpasim->buffer, read);
+		if (write <= 0)
+			break;
 
 		vdpasim_complete(txq, 0);
-		vdpasim_complete(rxq, total_write);
+		vdpasim_complete(rxq, write);
 
 		if (++pkts > 4) {
 			schedule_work(&vdpasim->work);
-- 
2.20.1
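The filter's logic can be sketched in a few lines of Python (a toy model, not the driver's code; the 12-byte header size is that of a modern device's struct virtio_net_hdr_v1). Note that the patch compares the MACs with strncmp(), which stops early at a 0x00 byte; a plain byte-wise compare, as modeled here, is the safer behavior since MAC bytes may legitimately be zero.

```python
ETH_ALEN = 6


def receive_filter(frame, dev_mac, hdr_len):
    # Accept the frame only if the destination MAC (the first 6 bytes
    # after the virtio-net header) equals the device's configured MAC.
    if len(frame) < hdr_len + ETH_ALEN:
        return False
    return frame[hdr_len:hdr_len + ETH_ALEN] == dev_mac


hdr_len = 12                                    # sizeof(struct virtio_net_hdr_v1)
mac = bytes.fromhex("525400123456")
ok = bytes(hdr_len) + mac + b"payload"
other = bytes(hdr_len) + bytes.fromhex("525400654321") + b"payload"
print(receive_filter(ok, mac, hdr_len),         # True
      receive_filter(other, mac, hdr_len))      # False
```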
Re: [PATCH v2 1/4] dt-bindings: arm: hisilicon: add binding for SD5203 SoC
On 2020/9/24 4:59, Rob Herring wrote: > On Sat, Sep 19, 2020 at 02:45:52PM +0800, Zhen Lei wrote: >> Add devicetree binding for Hisilicon SD5203 SoC. >> >> Signed-off-by: Zhen Lei >> --- >> Documentation/devicetree/bindings/arm/hisilicon/hisilicon.txt | 5 + >> 1 file changed, 5 insertions(+) > > Please convert this to DT schema format first. OK, I will do it. > >> >> diff --git a/Documentation/devicetree/bindings/arm/hisilicon/hisilicon.txt >> b/Documentation/devicetree/bindings/arm/hisilicon/hisilicon.txt >> index a97f643e7d1c760..5d80070bfb13fc0 100644 >> --- a/Documentation/devicetree/bindings/arm/hisilicon/hisilicon.txt >> +++ b/Documentation/devicetree/bindings/arm/hisilicon/hisilicon.txt >> @@ -56,6 +56,11 @@ HiP07 D05 Board >> Required root node properties: >> - compatible = "hisilicon,hip07-d05"; >> >> +SD5203 SoC >> +Required root node properties: >> +- compatible = "hisilicon,sd5203"; >> + >> + >> Hisilicon system controller >> >> Required properties: >> -- >> 1.8.3 >> >> > > . >
Re: [PATCH v5 5/6] dt-bindings: dw-apb-ictl: convert to json-schema
On 2020/9/24 4:49, Rob Herring wrote: > On Fri, Sep 18, 2020 at 07:22:01PM +0800, Zhen Lei wrote: >> Convert the Synopsys DesignWare APB interrupt controller (dw_apb_ictl) >> binding to DT schema format using json-schema. >> >> Signed-off-by: Zhen Lei >> --- >> .../interrupt-controller/snps,dw-apb-ictl.txt | 43 - >> .../interrupt-controller/snps,dw-apb-ictl.yaml | 75 >> ++ >> 2 files changed, 75 insertions(+), 43 deletions(-) >> delete mode 100644 >> Documentation/devicetree/bindings/interrupt-controller/snps,dw-apb-ictl.txt >> create mode 100644 >> Documentation/devicetree/bindings/interrupt-controller/snps,dw-apb-ictl.yaml >> >> diff --git >> a/Documentation/devicetree/bindings/interrupt-controller/snps,dw-apb-ictl.txt >> >> b/Documentation/devicetree/bindings/interrupt-controller/snps,dw-apb-ictl.txt >> deleted file mode 100644 >> index 2db59df9408f4c6..000 >> --- >> a/Documentation/devicetree/bindings/interrupt-controller/snps,dw-apb-ictl.txt >> +++ /dev/null >> @@ -1,43 +0,0 @@ >> -Synopsys DesignWare APB interrupt controller (dw_apb_ictl) >> - >> -Synopsys DesignWare provides interrupt controller IP for APB known as >> -dw_apb_ictl. The IP is used as secondary interrupt controller in some SoCs >> with >> -APB bus, e.g. Marvell Armada 1500. It can also be used as primary interrupt >> -controller in some SoCs, e.g. Hisilicon SD5203. >> - >> -Required properties: >> -- compatible: shall be "snps,dw-apb-ictl" >> -- reg: physical base address of the controller and length of memory mapped >> - region starting with ENABLE_LOW register >> -- interrupt-controller: identifies the node as an interrupt controller >> -- #interrupt-cells: number of cells to encode an interrupt-specifier, shall >> be 1 >> - >> -Additional required property when it's used as secondary interrupt >> controller: >> -- interrupts: interrupt reference to primary interrupt controller >> - >> -The interrupt sources map to the corresponding bits in the interrupt >> -registers, i.e. 
>> -- 0 maps to bit 0 of low interrupts, >> -- 1 maps to bit 1 of low interrupts, >> -- 32 maps to bit 0 of high interrupts, >> -- 33 maps to bit 1 of high interrupts, >> -- (optional) fast interrupts start at 64. >> - >> -Example: >> -/* dw_apb_ictl is used as secondary interrupt controller */ >> -aic: interrupt-controller@3000 { >> -compatible = "snps,dw-apb-ictl"; >> -reg = <0x3000 0xc00>; >> -interrupt-controller; >> -#interrupt-cells = <1>; >> -interrupt-parent = <>; >> -interrupts = ; >> -}; >> - >> -/* dw_apb_ictl is used as primary interrupt controller */ >> -vic: interrupt-controller@1013 { >> -compatible = "snps,dw-apb-ictl"; >> -reg = <0x1013 0x1000>; >> -interrupt-controller; >> -#interrupt-cells = <1>; >> -}; >> diff --git >> a/Documentation/devicetree/bindings/interrupt-controller/snps,dw-apb-ictl.yaml >> >> b/Documentation/devicetree/bindings/interrupt-controller/snps,dw-apb-ictl.yaml >> new file mode 100644 >> index 000..70c12979c843bf0 >> --- /dev/null >> +++ >> b/Documentation/devicetree/bindings/interrupt-controller/snps,dw-apb-ictl.yaml >> @@ -0,0 +1,75 @@ >> +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) >> +%YAML 1.2 >> +--- >> +$id: >> http://devicetree.org/schemas/interrupt-controller/snps,dw-apb-ictl.yaml# >> +$schema: http://devicetree.org/meta-schemas/core.yaml# >> + >> +title: Synopsys DesignWare APB interrupt controller (dw_apb_ictl) >> + >> +maintainers: >> + - Marc Zyngier > > Usually this would be an owner for this IP block, not the subsystem > maintainer. OK, I will change it to the author of the file "snps,dw-apb-ictl.txt". > >> + >> +description: >> + Synopsys DesignWare provides interrupt controller IP for APB known as >> + dw_apb_ictl. The IP is used as secondary interrupt controller in some SoCs >> + with APB bus, e.g. Marvell Armada 1500. It can also be used as primary >> + interrupt controller in some SoCs, e.g. Hisilicon SD5203. 
>> + >> +allOf: >> + - $ref: /schemas/interrupt-controller.yaml# > > You can drop this, it's already applied based on node name. But if we drop this, the "snps,dw-apb-ictl.yaml" can not require that the node name must match '^interrupt-controller(@[0-9a-f,]+)*$'. The problem of Patch 6/6 was discovered by this. > >> + >> +properties: >> + compatible: >> +const: snps,dw-apb-ictl >> + >> + interrupt-controller: true >> + >> + reg: >> +description: >> + Physical base address of the controller and length of memory mapped >> + region starting with ENABLE_LOW register. > > Need to define how many reg regions (maxItems: 1). OK, I will add it. > >> + >> + interrupts: >> +description: >> + Interrupt reference to primary interrupt controller. >> + >> + The interrupt sources map to the corresponding
[RFC PATCH 24/24] vdpasim: control virtqueue support
This patch introduces control virtqueue support for the vDPA simulator. This is a requirement for supporting advanced features like multiqueue.

A requirement for a control virtqueue is to isolate its memory accesses from the rx/tx virtqueues. This is because when a vDPA device is used by a VM, the control virtqueue may not be assigned to the VM directly but shadowed by the userspace VMM (Qemu). The isolation is done via the virtqueue groups and ASID support added to vDPA through vhost-vdpa.

The simulator is extended to have:

1) three virtqueues: RXVQ, TXVQ and CVQ (control virtqueue)
2) two virtqueue groups: group 0 contains RX/TX, group 1 contains CVQ
3) two address spaces; the simulator simply implements the address spaces by mapping them 1:1 to IOTLBs.

For the VM use case, userspace (Qemu) may set AS 0 to group 0 and AS 1 to group 1. So we have:

1) The IOTLB for virtqueue group 0 contains the mappings of the guest, so RX and TX can be assigned to the guest directly.
2) The IOTLB for virtqueue group 1 contains the mappings of the CVQ, i.e. buffers that are allocated and managed by userspace only, so the guest can not access the CVQ of vhost-vdpa.

For the other use cases, since AS 0 is associated with all virtqueue groups by default, all virtqueues share the same mapping by default.

To demonstrate the function, VIRTIO_NET_F_CTRL_MAC_ADDR is implemented in the simulator so the driver can set the mac address.
Signed-off-by: Jason Wang --- drivers/vdpa/vdpa_sim/vdpa_sim.c | 189 +++ 1 file changed, 166 insertions(+), 23 deletions(-) diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c index 66d901fb4c57..3459539c4460 100644 --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c @@ -56,14 +56,18 @@ struct vdpasim_virtqueue { #define VDPASIM_QUEUE_MAX 256 #define VDPASIM_DEVICE_ID 0x1 #define VDPASIM_VENDOR_ID 0 -#define VDPASIM_VQ_NUM 0x2 +#define VDPASIM_VQ_NUM 0x3 +#define VDPASIM_AS_NUM 0x2 +#define VDPASIM_GROUP_NUM 0x2 #define VDPASIM_NAME "vdpasim-netdev" static u64 vdpasim_features = (1ULL << VIRTIO_F_ANY_LAYOUT) | (1ULL << VIRTIO_F_VERSION_1) | (1ULL << VIRTIO_F_ACCESS_PLATFORM) | (1ULL << VIRTIO_NET_F_MTU) | - (1ULL << VIRTIO_NET_F_MAC); + (1ULL << VIRTIO_NET_F_MAC) | + (1ULL << VIRTIO_NET_F_CTRL_VQ) | + (1ULL << VIRTIO_NET_F_CTRL_MAC_ADDR); /* State of each vdpasim device */ struct vdpasim { @@ -143,11 +147,17 @@ static void vdpasim_reset(struct vdpasim *vdpasim) { int i; - for (i = 0; i < VDPASIM_VQ_NUM; i++) + spin_lock(>iommu_lock); + + for (i = 0; i < VDPASIM_VQ_NUM; i++) { vdpasim_vq_reset(>vqs[i]); + vringh_set_iotlb(>vqs[i].vring, +>iommu[0]); + } - spin_lock(>iommu_lock); - vhost_iotlb_reset(vdpasim->iommu); + for (i = 0; i < VDPASIM_AS_NUM; i++) { + vhost_iotlb_reset(>iommu[i]); + } spin_unlock(>iommu_lock); vdpasim->features = 0; @@ -187,6 +197,80 @@ static bool receive_filter(struct vdpasim *vdpasim, size_t len) return false; } +virtio_net_ctrl_ack vdpasim_handle_ctrl_mac(struct vdpasim *vdpasim, + u8 cmd) +{ + struct vdpasim_virtqueue *cvq = >vqs[2]; + virtio_net_ctrl_ack status = VIRTIO_NET_ERR; + size_t read; + + switch (cmd) { + case VIRTIO_NET_CTRL_MAC_ADDR_SET: + read = vringh_iov_pull_iotlb(>vring, >riov, +(void *)vdpasim->config.mac, +ETH_ALEN); + if (read == ETH_ALEN) + status = VIRTIO_NET_OK; + break; + default: + break; + } + + return status; +} + +static void vdpasim_handle_cvq(struct 
vdpasim *vdpasim) +{ + struct vdpasim_virtqueue *cvq = >vqs[2]; + virtio_net_ctrl_ack status = VIRTIO_NET_ERR; + struct virtio_net_ctrl_hdr ctrl; + size_t read, write; + int err; + + if (!(vdpasim->features & (1ULL << VIRTIO_NET_F_CTRL_VQ))) + return; + + if (!cvq->ready) + return; + + while (true) { + err = vringh_getdesc_iotlb(>vring, >riov, >wiov, + >head, GFP_ATOMIC); + if (err <= 0) + break; + + read = vringh_iov_pull_iotlb(>vring, >riov, , +sizeof(ctrl)); + if (read != sizeof(ctrl)) + break; + + switch (ctrl.class) { + case VIRTIO_NET_CTRL_MAC: + status = vdpasim_handle_ctrl_mac(vdpasim, ctrl.cmd);
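To make the group/ASID split concrete, here is a small Python model (invented for this sketch, not the simulator's code) of the mapping the commit message describes: every virtqueue group starts bound to address space 0, and userspace may rebind a group, e.g. the CVQ group, to a different ASID so its mappings are isolated from the guest-visible ones.

```python
class VdpaDevice:
    def __init__(self, ngroups, nas):
        self.nas = nas
        self.group_asid = [0] * ngroups            # default: all groups use AS 0
        self.iotlb = [dict() for _ in range(nas)]  # one mapping table per AS

    def set_group_asid(self, group, asid):
        if asid >= self.nas:
            raise ValueError("no such address space")
        self.group_asid[group] = asid

    def translate(self, group, iova):
        # a virtqueue's DMA goes through its group's current address space
        return self.iotlb[self.group_asid[group]].get(iova)


dev = VdpaDevice(ngroups=2, nas=2)
dev.iotlb[0][0x1000] = "guest page"       # guest mappings live in AS 0
dev.iotlb[1][0x1000] = "vmm cvq buffer"   # VMM-private mappings live in AS 1
dev.set_group_asid(1, 1)                  # bind the CVQ group to AS 1
print(dev.translate(0, 0x1000))           # guest page
print(dev.translate(1, 0x1000))           # vmm cvq buffer
```

The same IOVA resolves differently per group, which is exactly why the guest cannot reach the CVQ buffers.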
[RFC PATCH 22/24] vdpa_sim: factor out buffer completion logic
This patch factors out the buffer completion logic in order to support future features.

Signed-off-by: Jason Wang
---
 drivers/vdpa/vdpa_sim/vdpa_sim.c | 33 ++++++++++++++++++---------------
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index ca5c2d0db905..b21670e054ba 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -155,6 +155,22 @@ static void vdpasim_reset(struct vdpasim *vdpasim)
 	++vdpasim->generation;
 }
 
+static void vdpasim_complete(struct vdpasim_virtqueue *vq, size_t len)
+{
+	/* Make sure data is wrote before advancing index */
+	smp_wmb();
+
+	vringh_complete_iotlb(&vq->vring, vq->head, len);
+
+	/* Make sure used is visible before rasing the interrupt. */
+	smp_wmb();
+
+	local_bh_disable();
+	if (vq->cb)
+		vq->cb(vq->private);
+	local_bh_enable();
+}
+
 static void vdpasim_work(struct work_struct *work)
 {
 	struct vdpasim *vdpasim = container_of(work, struct
@@ -203,21 +219,8 @@ static void vdpasim_work(struct work_struct *work)
 			total_write += write;
 		}
 
-		/* Make sure data is wrote before advancing index */
-		smp_wmb();
-
-		vringh_complete_iotlb(&txq->vring, txq->head, 0);
-		vringh_complete_iotlb(&rxq->vring, rxq->head, total_write);
-
-		/* Make sure used is visible before rasing the interrupt. */
-		smp_wmb();
-
-		local_bh_disable();
-		if (txq->cb)
-			txq->cb(txq->private);
-		if (rxq->cb)
-			rxq->cb(rxq->private);
-		local_bh_enable();
+		vdpasim_complete(txq, 0);
+		vdpasim_complete(rxq, total_write);
 
 		if (++pkts > 4) {
 			schedule_work(&vdpasim->work);
-- 
2.20.1
[RFC PATCH 21/24] vdpa_sim: advertise VIRTIO_NET_F_MAC
We advertise mac address via config space, so let's advertise VIRTIO_NET_F_MAC. Signed-off-by: Jason Wang --- drivers/vdpa/vdpa_sim/vdpa_sim.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c index 4b2d0d3fbc87..ca5c2d0db905 100644 --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c @@ -62,7 +62,8 @@ struct vdpasim_virtqueue { static u64 vdpasim_features = (1ULL << VIRTIO_F_ANY_LAYOUT) | (1ULL << VIRTIO_F_VERSION_1) | (1ULL << VIRTIO_F_ACCESS_PLATFORM) | - (1ULL << VIRTIO_NET_F_MTU); + (1ULL << VIRTIO_NET_F_MTU) | + (1ULL << VIRTIO_NET_F_MAC); /* State of each vdpasim device */ struct vdpasim { -- 2.20.1
[RFC PATCH 20/24] vdpa_sim: advertise VIRTIO_NET_F_MTU
We've already reported maximum mtu via config space, so let's advertise the feature. Signed-off-by: Jason Wang --- drivers/vdpa/vdpa_sim/vdpa_sim.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c index d1764a64578d..4b2d0d3fbc87 100644 --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c @@ -61,7 +61,8 @@ struct vdpasim_virtqueue { static u64 vdpasim_features = (1ULL << VIRTIO_F_ANY_LAYOUT) | (1ULL << VIRTIO_F_VERSION_1) | - (1ULL << VIRTIO_F_ACCESS_PLATFORM); + (1ULL << VIRTIO_F_ACCESS_PLATFORM) | + (1ULL << VIRTIO_NET_F_MTU); /* State of each vdpasim device */ struct vdpasim { -- 2.20.1
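The two feature patches above each just OR one more bit into the device's 64-bit feature mask, which the driver then tests during negotiation. A minimal sketch (the bit numbers are the standard virtio-net values, VIRTIO_NET_F_MTU = 3 and VIRTIO_NET_F_MAC = 5):

```python
VIRTIO_NET_F_MTU = 3
VIRTIO_NET_F_MAC = 5

# device-side feature mask, built exactly as in vdpasim_features
features = (1 << VIRTIO_NET_F_MTU) | (1 << VIRTIO_NET_F_MAC)


def has_feature(features, bit):
    return bool(features & (1 << bit))


print(has_feature(features, VIRTIO_NET_F_MTU),   # True
      has_feature(features, VIRTIO_NET_F_MAC))   # True
```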
[RFC PATCH 16/24] vhost-vdpa: uAPI to get virtqueue group id
Following the support for virtqueue groups in vDPA, this patch introduces a uAPI to get the virtqueue group ID for a specific virtqueue in vhost-vdpa.

Signed-off-by: Jason Wang
---
 drivers/vhost/vdpa.c       | 8 ++++++++
 include/uapi/linux/vhost.h | 4 ++++
 2 files changed, 12 insertions(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 4d97a59824a1..a234d3783e16 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -433,6 +433,14 @@ static long vhost_vdpa_vring_ioctl(struct vhost_vdpa *v, unsigned int cmd,
 			return -EFAULT;
 		ops->set_vq_ready(vdpa, idx, s.num);
 		return 0;
+	case VHOST_VDPA_GET_VRING_GROUP:
+		s.index = idx;
+		s.num = ops->get_vq_group(vdpa, idx);
+		if (s.num >= vdpa->ngroups)
+			return -EIO;
+		else if (copy_to_user(argp, &s, sizeof s))
+			return -EFAULT;
+		return 0;
 	case VHOST_GET_VRING_BASE:
 		r = ops->get_vq_state(v->vdpa, idx, &vq_state);
 		if (r)
diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index 99bdf50efc50..d1c4b5561fee 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -147,4 +147,8 @@
 /* Get the number of address spaces. */
 #define VHOST_VDPA_GET_AS_NUM		_IOR(VHOST_VIRTIO, 0x79, unsigned int)
+
+/* Get the group for a virtqueue: read index, write group in num */
+#define VHOST_VDPA_GET_VRING_GROUP	_IOWR(VHOST_VIRTIO, 0x79, \
+					      struct vhost_vring_state)
 #endif
-- 
2.20.1
[RFC PATCH 19/24] vdpa_sim: use separated iov for reading and writing
In order to support a control virtqueue whose commands have both in and out descriptors, we need to use separate iovs for reading and writing in vdpa_sim.

Signed-off-by: Jason Wang
---
 drivers/vdpa/vdpa_sim/vdpa_sim.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index 5dc04ec271bb..d1764a64578d 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -40,7 +40,8 @@ MODULE_PARM_DESC(batch_mapping, "Batched mapping 1 -Enable; 0 - Disable");
 
 struct vdpasim_virtqueue {
 	struct vringh vring;
-	struct vringh_kiov iov;
+	struct vringh_kiov riov;
+	struct vringh_kiov wiov;
 	unsigned short head;
 	bool ready;
 	u64 desc_addr;
@@ -173,12 +174,12 @@ static void vdpasim_work(struct work_struct *work)
 
 	while (true) {
 		total_write = 0;
-		err = vringh_getdesc_iotlb(&txq->vring, &txq->iov, NULL,
+		err = vringh_getdesc_iotlb(&txq->vring, &txq->riov, NULL,
 					   &txq->head, GFP_ATOMIC);
 		if (err <= 0)
 			break;
 
-		err = vringh_getdesc_iotlb(&rxq->vring, NULL, &rxq->iov,
+		err = vringh_getdesc_iotlb(&rxq->vring, NULL, &rxq->wiov,
 					   &rxq->head, GFP_ATOMIC);
 		if (err <= 0) {
 			vringh_complete_iotlb(&txq->vring, txq->head, 0);
@@ -186,13 +187,13 @@ static void vdpasim_work(struct work_struct *work)
 		}
 
 		while (true) {
-			read = vringh_iov_pull_iotlb(&txq->vring, &txq->iov,
+			read = vringh_iov_pull_iotlb(&txq->vring, &txq->riov,
 						     vdpasim->buffer,
 						     PAGE_SIZE);
 			if (read <= 0)
 				break;
 
-			write = vringh_iov_push_iotlb(&rxq->vring, &rxq->iov,
+			write = vringh_iov_push_iotlb(&rxq->vring, &rxq->wiov,
 						      vdpasim->buffer, read);
 			if (write <= 0)
 				break;
-- 
2.20.1
[RFC PATCH 18/24] vhost-vdpa: support ASID based IOTLB API
This patch extends vhost-vdpa to support the ASID based IOTLB API. The vhost-vdpa device will allocate multiple IOTLBs for vDPA devices that support multiple address spaces. The IOTLBs and the vDPA device memory mappings are determined and maintained through the ASID. Note that we still don't support a vDPA device with more than one address space that depends on the platform IOMMU. This work will be done by moving the IOMMU logic from vhost-vDPA to the vDPA device driver. Signed-off-by: Jason Wang --- drivers/vhost/vdpa.c | 106 --- 1 file changed, 79 insertions(+), 27 deletions(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 978cf97dc03a..99ac13b2ed11 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -29,7 +29,8 @@ enum { VHOST_VDPA_BACKEND_FEATURES = (1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2) | - (1ULL << VHOST_BACKEND_F_IOTLB_BATCH), + (1ULL << VHOST_BACKEND_F_IOTLB_BATCH) | + (1ULL << VHOST_BACKEND_F_IOTLB_ASID), }; #define VHOST_VDPA_DEV_MAX (1U << MINORBITS) @@ -58,12 +59,20 @@ struct vhost_vdpa { struct eventfd_ctx *config_ctx; int in_batch; int used_as; + u32 batch_asid; }; static DEFINE_IDA(vhost_vdpa_ida); static dev_t vhost_vdpa_major; +static inline u32 iotlb_to_asid(struct vhost_iotlb *iotlb) +{ + struct vhost_vdpa_as *as = container_of(iotlb, struct + vhost_vdpa_as, iotlb); + return as->id; +} + static struct vhost_vdpa_as *asid_to_as(struct vhost_vdpa *v, u32 asid) { struct hlist_head *head = &v->as[asid % VHOST_VDPA_IOTLB_BUCKETS]; @@ -76,6 +85,16 @@ static struct vhost_vdpa_as *asid_to_as(struct vhost_vdpa *v, u32 asid) return NULL; } +static struct vhost_iotlb *asid_to_iotlb(struct vhost_vdpa *v, u32 asid) +{ + struct vhost_vdpa_as *as = asid_to_as(v, asid); + + if (!as) + return NULL; + + return &as->iotlb; +} + static struct vhost_vdpa_as *vhost_vdpa_alloc_as(struct vhost_vdpa *v, u32 asid) { struct hlist_head *head = &v->as[asid % VHOST_VDPA_IOTLB_BUCKETS]; @@ -84,6 +103,9 @@ static struct vhost_vdpa_as *vhost_vdpa_alloc_as(struct vhost_vdpa *v, 
u32 asid) if (asid_to_as(v, asid)) return NULL; + if (asid >= v->vdpa->nas) + return NULL; + as = kmalloc(sizeof(*as), GFP_KERNEL); if (!as) return NULL; @@ -96,13 +118,20 @@ static struct vhost_vdpa_as *vhost_vdpa_alloc_as(struct vhost_vdpa *v, u32 asid) return as; } -static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) +static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v, + u32 asid) { struct vhost_vdpa_as *as = asid_to_as(v, asid); - /* Remove default address space is not allowed */ - if (asid == 0) - return -EINVAL; + if (as) + return as; + + return vhost_vdpa_alloc_as(v, asid); +} + +static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) +{ + struct vhost_vdpa_as *as = asid_to_as(v, asid); if (!as) return -EINVAL; @@ -623,6 +652,7 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb, { struct vdpa_device *vdpa = v->vdpa; const struct vdpa_config_ops *ops = vdpa->config; + u32 asid = iotlb_to_asid(iotlb); int r = 0; r = vhost_iotlb_add_range(iotlb, iova, iova + size - 1, @@ -631,10 +661,10 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb, return r; if (ops->dma_map) { - r = ops->dma_map(vdpa, 0, iova, size, pa, perm); + r = ops->dma_map(vdpa, asid, iova, size, pa, perm); } else if (ops->set_map) { if (!v->in_batch) - r = ops->set_map(vdpa, 0, iotlb); + r = ops->set_map(vdpa, asid, iotlb); } else { r = iommu_map(v->domain, iova, pa, size, perm_to_iommu_flags(perm)); @@ -643,23 +673,32 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb, return r; } -static void vhost_vdpa_unmap(struct vhost_vdpa *v, -struct vhost_iotlb *iotlb, -u64 iova, u64 size) +static int vhost_vdpa_unmap(struct vhost_vdpa *v, + struct vhost_iotlb *iotlb, + u64 iova, u64 size) { struct vdpa_device *vdpa = v->vdpa; const struct vdpa_config_ops *ops = vdpa->config; + u32 asid = iotlb_to_asid(iotlb); + + if (!iotlb) + return -EINVAL; vhost_vdpa_iotlb_unmap(v, iotlb, iova, iova 
+ size - 1); if (ops->dma_map) { - ops->dma_unmap(vdpa, 0, iova, size); +
[RFC PATCH 17/24] vhost-vdpa: introduce uAPI to set group ASID
Following the vDPA support for associating an ASID with a specific virtqueue group, this patch adds a uAPI to support setting the association from userspace. Signed-off-by: Jason Wang --- drivers/vhost/vdpa.c | 8 include/uapi/linux/vhost.h | 4 2 files changed, 12 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index a234d3783e16..978cf97dc03a 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -441,6 +441,14 @@ static long vhost_vdpa_vring_ioctl(struct vhost_vdpa *v, unsigned int cmd, else if (copy_to_user(argp, &s, sizeof(s))) return -EFAULT; return 0; + case VHOST_VDPA_SET_GROUP_ASID: + if (copy_from_user(&s, argp, sizeof(s))) + return -EFAULT; + if (s.num >= vdpa->ngroups) + return -EINVAL; + if (!ops->set_group_asid) + return -ENOTSUPP; + return ops->set_group_asid(vdpa, idx, s.num); case VHOST_GET_VRING_BASE: r = ops->get_vq_state(v->vdpa, idx, &vq_state); if (r) diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h index d1c4b5561fee..f3de9e45c518 100644 --- a/include/uapi/linux/vhost.h +++ b/include/uapi/linux/vhost.h @@ -151,4 +151,8 @@ /* Get the group for a virtqueue: read index, write group in num */ #define VHOST_VDPA_GET_VRING_GROUP _IOWR(VHOST_VIRTIO, 0x79, \ struct vhost_vring_state) +/* Set the ASID for a virtqueue group. */ +#define VHOST_VDPA_SET_GROUP_ASID _IOW(VHOST_VIRTIO, 0x7A, \ +struct vhost_vring_state) + #endif -- 2.20.1
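Put together with the earlier patches in this series, the intended userspace flow is: query the group count, find the control virtqueue's group, then bind that group to its own ASID. The sketch below is an assumption-heavy illustration — the macro values mirror the RFC, `isolate_cvq()` is a hypothetical helper, and the exact meaning of `index`/`num` for SET_GROUP_ASID follows the series' "group in index, ASID in num" convention rather than any finalized ABI.

```c
#include <sys/ioctl.h>

/* Local mirrors of the RFC's proposed uAPI (assumed values; a merged
 * kernel may differ -- always use the real <linux/vhost.h>). */
struct vring_state { unsigned int index; unsigned int num; };
#define VHOST_VIRTIO 0xAF
#define VHOST_VDPA_GET_GROUP_NUM   _IOR(VHOST_VIRTIO, 0x78, unsigned int)
#define VHOST_VDPA_GET_VRING_GROUP _IOWR(VHOST_VIRTIO, 0x79, struct vring_state)
#define VHOST_VDPA_SET_GROUP_ASID  _IOW(VHOST_VIRTIO, 0x7A, struct vring_state)

/* Give the control virtqueue's group a private address space (ASID 1),
 * so its DMA mappings are isolated from the data virtqueues. */
static int isolate_cvq(int fd, unsigned int cvq_index)
{
	unsigned int ngroups;
	struct vring_state s = { .index = cvq_index };

	if (ioctl(fd, VHOST_VDPA_GET_GROUP_NUM, &ngroups) < 0 || ngroups < 2)
		return -1;		/* device cannot isolate the CVQ */
	if (ioctl(fd, VHOST_VDPA_GET_VRING_GROUP, &s) < 0)
		return -1;
	s.index = s.num;		/* index now names the group... */
	s.num = 1;			/* ...and num carries the ASID */
	return ioctl(fd, VHOST_VDPA_SET_GROUP_ASID, &s);
}
```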
Re: [PATCH V3 1/3] arm64/mm/hotplug: Register boot memory hot remove notifier earlier
On 09/23/2020 11:34 AM, Gavin Shan wrote: > Hi Anshuman, > > On 9/21/20 10:05 PM, Anshuman Khandual wrote: >> This moves memory notifier registration earlier in the boot process from >> device_initcall() to early_initcall() which will help in guarding against >> potential early boot memory offline requests. Even though there should not >> be any actual offlining requests till memory block devices are initialized >> with memory_dev_init(), the generic init sequence might just change in >> future. Hence an early registration for the memory event notifier would be >> helpful. While here, just skip the registration if CONFIG_MEMORY_HOTREMOVE >> is not enabled and also call out when memory notifier registration fails. >> >> Cc: Catalin Marinas >> Cc: Will Deacon >> Cc: Mark Rutland >> Cc: Marc Zyngier >> Cc: Steve Capper >> Cc: Mark Brown >> Cc: linux-arm-ker...@lists.infradead.org >> Cc: linux-kernel@vger.kernel.org >> Signed-off-by: Anshuman Khandual >> --- >> arch/arm64/mm/mmu.c | 14 -- >> 1 file changed, 12 insertions(+), 2 deletions(-) >> > > With the following nit-picky comments resolved: > > Reviewed-by: Gavin Shan > >> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c >> index 75df62fea1b6..df3b7415b128 100644 >> --- a/arch/arm64/mm/mmu.c >> +++ b/arch/arm64/mm/mmu.c >> @@ -1499,7 +1499,17 @@ static struct notifier_block >> prevent_bootmem_remove_nb = { >> static int __init prevent_bootmem_remove_init(void) >> { >> - return register_memory_notifier(&prevent_bootmem_remove_nb); >> + int ret = 0; >> + >> + if (!IS_ENABLED(CONFIG_MEMORY_HOTREMOVE)) >> + return ret; >> + >> + ret = register_memory_notifier(&prevent_bootmem_remove_nb); >> + if (!ret) >> + return ret; >> + >> + pr_err("Notifier registration failed - boot memory can be removed\n"); >> + return ret; >> } > > It might be cleaner if the duplicated return statements can be > avoided. Besides, it's always nice to print the errno even though Thought about it, just that the error message was too long. 
> zero is always returned from register_memory_notifier(). So I guess > you probably need something like below: > > ret = register_memory_notifier(&prevent_bootmem_remove_nb); > if (ret) > pr_err("%s: Error %d registering notifier\n", __func__, ret) > > return ret; Sure, will do. > > > register_memory_notifier # 0 is returned on > !CONFIG_MEMORY_HOTPLUG_SPARSE > blocking_notifier_chain_register > notifier_chain_register # 0 is always returned > >> -device_initcall(prevent_bootmem_remove_init); >> +early_initcall(prevent_bootmem_remove_init); >> #endif >> > > Cheers, > Gavin > >
[RFC PATCH 15/24] vhost-vdpa: introduce uAPI to get the number of address spaces
A new uAPI is introduced for userspace to query the number of address spaces supported by a specific device. Signed-off-by: Jason Wang --- drivers/vhost/vdpa.c | 3 +++ include/uapi/linux/vhost.h | 2 ++ 2 files changed, 5 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 4b8882f55bc9..4d97a59824a1 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -532,6 +532,9 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, r = copy_to_user(argp, &v->vdpa->ngroups, sizeof(v->vdpa->ngroups)); break; + case VHOST_VDPA_GET_AS_NUM: + r = copy_to_user(argp, &v->vdpa->nas, sizeof(v->vdpa->nas)); + break; case VHOST_SET_LOG_BASE: case VHOST_SET_LOG_FD: r = -ENOIOCTLCMD; diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h index 19f1acdfe3ea..99bdf50efc50 100644 --- a/include/uapi/linux/vhost.h +++ b/include/uapi/linux/vhost.h @@ -145,4 +145,6 @@ /* Get the number of virtqueue groups. */ #define VHOST_VDPA_GET_GROUP_NUM _IOR(VHOST_VIRTIO, 0x78, unsigned int) +/* Get the number of address spaces. */ +#define VHOST_VDPA_GET_AS_NUM _IOR(VHOST_VIRTIO, 0x79, unsigned int) #endif -- 2.20.1
[RFC PATCH 10/24] vdpa: introduce config operations for associating ASID to a virtqueue group
This patch introduces a new bus operation to allow the vDPA bus driver to associate an ASID to a virtqueue group. Signed-off-by: Jason Wang --- include/linux/vdpa.h | 10 ++ 1 file changed, 10 insertions(+) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index 1e1163daa352..e2394995a3cd 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -160,6 +160,12 @@ struct vdpa_device { * @get_generation:Get device config generation (optional) * @vdev: vdpa device * Returns u32: device generation + * @set_group_asid:Set address space identifier for a + * virtqueue group + * @vdev: vdpa device + * @group: virtqueue group + * @asid: address space id for this group + * Returns integer: success (0) or error (< 0) * @set_map: Set device memory mapping (optional) * Needed for device that using device * specific DMA translation (on-chip IOMMU) @@ -237,6 +243,10 @@ struct vdpa_config_ops { u64 iova, u64 size, u64 pa, u32 perm); int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid, u64 iova, u64 size); + int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group, + unsigned int asid); + + /* Free device resources */ void (*free)(struct vdpa_device *vdev); -- 2.20.1
Re: [PATCH v9 10/20] gpiolib: cdev: support GPIO_V2_LINE_SET_CONFIG_IOCTL
On Wed, Sep 23, 2020 at 07:15:46PM +0300, Andy Shevchenko wrote: > On Wed, Sep 23, 2020 at 7:14 PM Andy Shevchenko > wrote: > > > > On Tue, Sep 22, 2020 at 5:35 AM Kent Gibson wrote: > > > > > > Add support for GPIO_V2_LINE_SET_CONFIG_IOCTL, the uAPI v2 > > > line set config ioctl. > > > > +static long linereq_set_config_unlocked(struct linereq *lr, > > > + struct gpio_v2_line_config *lc) > > > +{ > > > + struct gpio_desc *desc; > > > + unsigned int i; > > > + u64 flags; > > > + bool polarity_change; > > > + int ret; > > > + > > > + for (i = 0; i < lr->num_lines; i++) { > > > + desc = lr->lines[i].desc; > > > + flags = gpio_v2_line_config_flags(lc, i); > > > > > + polarity_change = > > > + (test_bit(FLAG_ACTIVE_LOW, &desc->flags) != > > > +((flags & GPIO_V2_LINE_FLAG_ACTIVE_LOW) != 0)); > > > > Comparison > > Comparison between int / long (not all archs are agreed on this) and > boolean is not the best we can do. > There is no bool to int comparison here. There are two comparisons - the inner int vs int => bool and the outer bool vs bool. The "!= 0" is effectively an implicit cast to bool, as is your new_polarity initialisation below. > What about > > bool old_polarity = test_bit(FLAG_ACTIVE_LOW, &desc->flags); > bool new_polarity = flags & GPIO_V2_LINE_FLAG_ACTIVE_LOW; > > old_polarity ^ new_polarity > So using bitwise operators on bools is ok?? > and move this under INPUT conditional? > It has to be before the gpio_v2_line_config_flags_to_desc_flags() call, as that modifies the desc flags, including the new polarity, so polarity_change would then always be false :-). Cheers, Kent. > > > + > > > + gpio_v2_line_config_flags_to_desc_flags(flags, > > > &desc->flags); > > > + /* > > > +* Lines have to be requested explicitly for input > > > +* or output, else the line will be treated "as is". 
> > > +*/ > > > + if (flags & GPIO_V2_LINE_FLAG_OUTPUT) { > > > + int val = gpio_v2_line_config_output_value(lc, i); > > > + > > > + edge_detector_stop(&lr->lines[i]); > > > + ret = gpiod_direction_output(desc, val); > > > + if (ret) > > > + return ret; > > > + } else if (flags & GPIO_V2_LINE_FLAG_INPUT) { > > > + ret = gpiod_direction_input(desc); > > > + if (ret) > > > + return ret; > > > + > > > + ret = edge_detector_update(&lr->lines[i], > > > + flags & GPIO_V2_LINE_EDGE_FLAGS, > > > + polarity_change); > > > + if (ret) > > > + return ret; > > > + } > > > + > > > + blocking_notifier_call_chain(&lr->gdev->notifier, > > > +GPIO_V2_LINE_CHANGED_CONFIG, > > > +desc); > > > + } > > > + return 0; > > > +} > > -- > With Best Regards, > Andy Shevchenko
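The point under discussion — why both sides must be normalized to 0/1 before comparing, and why XOR on normalized bools is safe — can be demonstrated in isolation. This is a standalone sketch with made-up masks, not the gpiolib code: the actual kernel code is correct because `test_bit()` yields 0/1 and the `!= 0` normalizes the other side; the "buggy" variant shows what would go wrong without that normalization.

```c
#include <stdbool.h>

#define DESC_ACTIVE_LOW (1u << 5)  /* made-up stand-in for FLAG_ACTIVE_LOW */
#define UAPI_ACTIVE_LOW (1u << 1)  /* stand-in for GPIO_V2_LINE_FLAG_ACTIVE_LOW */

/* A bit test that may return any nonzero value, not only 1. */
static int raw_test(unsigned int mask, unsigned int flags)
{
	return flags & mask;
}

/* Wrong: compares an unnormalized int (e.g. 32) against a 0/1 value,
 * so two "true" results still compare unequal. */
static bool polarity_change_unnormalized(unsigned int desc_flags,
					 unsigned int uapi_flags)
{
	return raw_test(DESC_ACTIVE_LOW, desc_flags) !=
	       ((uapi_flags & UAPI_ACTIVE_LOW) != 0);
}

/* Correct: both sides normalized to bool first; XOR of two bools is
 * then a well-defined inequality test. */
static bool polarity_change(unsigned int desc_flags, unsigned int uapi_flags)
{
	bool old_low = raw_test(DESC_ACTIVE_LOW, desc_flags);
	bool new_low = uapi_flags & UAPI_ACTIVE_LOW;

	return old_low ^ new_low;
}
```

The bool initializers act as the implicit cast Kent describes; once both operands are genuinely 0 or 1, `!=` and `^` are interchangeable.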
[RFC PATCH 14/24] vhost-vdpa: introduce uAPI to get the number of virtqueue groups
Following the vDPA support for multiple address spaces, this patch introduces a uAPI for userspace to query the number of virtqueue groups supported by the vDPA device. Signed-off-by: Jason Wang --- drivers/vhost/vdpa.c | 4 include/uapi/linux/vhost.h | 4 2 files changed, 8 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 1ba7e95619b5..4b8882f55bc9 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -528,6 +528,10 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, case VHOST_VDPA_GET_VRING_NUM: r = vhost_vdpa_get_vring_num(v, argp); break; + case VHOST_VDPA_GET_GROUP_NUM: + r = copy_to_user(argp, &v->vdpa->ngroups, +sizeof(v->vdpa->ngroups)); + break; case VHOST_SET_LOG_BASE: case VHOST_SET_LOG_FD: r = -ENOIOCTLCMD; diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h index c26452782882..19f1acdfe3ea 100644 --- a/include/uapi/linux/vhost.h +++ b/include/uapi/linux/vhost.h @@ -141,4 +141,8 @@ /* Set event fd for config interrupt*/ #define VHOST_VDPA_SET_CONFIG_CALL _IOW(VHOST_VIRTIO, 0x77, int) + +/* Get the number of virtqueue groups. */ +#define VHOST_VDPA_GET_GROUP_NUM _IOR(VHOST_VIRTIO, 0x78, unsigned int) + #endif -- 2.20.1
[RFC PATCH 13/24] vhost-vdpa: introduce ASID based IOTLB
This patch introduces support for an ASID based IOTLB by tagging each IOTLB with a unique ASID. This is a must for supporting the ASID based vhost IOTLB API added by the following patches. IOTLBs are stored in a hlist; a new IOTLB is allocated when a new ASID is seen via the IOTLB API and destroyed when there's no mapping associated with the ASID. Signed-off-by: Jason Wang --- drivers/vhost/vdpa.c | 94 +--- 1 file changed, 72 insertions(+), 22 deletions(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 6552987544d7..1ba7e95619b5 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -34,13 +34,21 @@ enum { #define VHOST_VDPA_DEV_MAX (1U << MINORBITS) +#define VHOST_VDPA_IOTLB_BUCKETS 16 + +struct vhost_vdpa_as { + struct hlist_node hash_link; + struct vhost_iotlb iotlb; + u32 id; +}; + struct vhost_vdpa { struct vhost_dev vdev; struct iommu_domain *domain; struct vhost_virtqueue *vqs; struct completion completion; struct vdpa_device *vdpa; - struct vhost_iotlb *iotlb; + struct hlist_head as[VHOST_VDPA_IOTLB_BUCKETS]; struct device dev; struct cdev cdev; atomic_t opened; @@ -49,12 +57,64 @@ struct vhost_vdpa { int minor; struct eventfd_ctx *config_ctx; int in_batch; + int used_as; }; static DEFINE_IDA(vhost_vdpa_ida); static dev_t vhost_vdpa_major; +static struct vhost_vdpa_as *asid_to_as(struct vhost_vdpa *v, u32 asid) +{ + struct hlist_head *head = &v->as[asid % VHOST_VDPA_IOTLB_BUCKETS]; + struct vhost_vdpa_as *as; + + hlist_for_each_entry(as, head, hash_link) + if (as->id == asid) + return as; + + return NULL; +} + +static struct vhost_vdpa_as *vhost_vdpa_alloc_as(struct vhost_vdpa *v, u32 asid) +{ + struct hlist_head *head = &v->as[asid % VHOST_VDPA_IOTLB_BUCKETS]; + struct vhost_vdpa_as *as; + + if (asid_to_as(v, asid)) + return NULL; + + as = kmalloc(sizeof(*as), GFP_KERNEL); + if (!as) + return NULL; + + vhost_iotlb_init(&as->iotlb, 0, 0); + as->id = asid; + hlist_add_head(&as->hash_link, head); + ++v->used_as; + + return as; +} + +static int 
vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) +{ + struct vhost_vdpa_as *as = asid_to_as(v, asid); + + /* Remove default address space is not allowed */ + if (asid == 0) + return -EINVAL; + + if (!as) + return -EINVAL; + + hlist_del(>hash_link); + vhost_iotlb_reset(>iotlb); + kfree(as); + --v->used_as; + + return 0; +} + static void handle_vq_kick(struct vhost_work *work) { struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue, @@ -513,15 +573,6 @@ static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, } } -static void vhost_vdpa_iotlb_free(struct vhost_vdpa *v) -{ - struct vhost_iotlb *iotlb = v->iotlb; - - vhost_vdpa_iotlb_unmap(v, iotlb, 0ULL, 0ULL - 1); - kfree(v->iotlb); - v->iotlb = NULL; -} - static int perm_to_iommu_flags(u32 perm) { int flags = 0; @@ -681,7 +732,8 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, u32 asid, struct vhost_vdpa *v = container_of(dev, struct vhost_vdpa, vdev); struct vdpa_device *vdpa = v->vdpa; const struct vdpa_config_ops *ops = vdpa->config; - struct vhost_iotlb *iotlb = v->iotlb; + struct vhost_vdpa_as *as = asid_to_as(v, 0); + struct vhost_iotlb *iotlb = >iotlb; int r = 0; if (asid != 0) @@ -775,6 +827,7 @@ static void vhost_vdpa_cleanup(struct vhost_vdpa *v) { vhost_dev_cleanup(>vdev); kfree(v->vdev.vqs); + vhost_vdpa_remove_as(v, 0); } static int vhost_vdpa_open(struct inode *inode, struct file *filep) @@ -807,23 +860,18 @@ static int vhost_vdpa_open(struct inode *inode, struct file *filep) vhost_dev_init(dev, vqs, nvqs, 0, 0, 0, false, vhost_vdpa_process_iotlb_msg); - dev->iotlb = vhost_iotlb_alloc(0, 0); - if (!dev->iotlb) { - r = -ENOMEM; - goto err_init_iotlb; - } + if (!vhost_vdpa_alloc_as(v, 0)) + goto err_alloc_as; r = vhost_vdpa_alloc_domain(v); if (r) - goto err_alloc_domain; + goto err_alloc_as; filep->private_data = v; return 0; -err_alloc_domain: - vhost_vdpa_iotlb_free(v); -err_init_iotlb: +err_alloc_as: vhost_vdpa_cleanup(v); err: atomic_dec(>opened); @@ -851,7 
+899,6 @@ static int vhost_vdpa_release(struct inode *inode, struct file *filep) filep->private_data = NULL; vhost_vdpa_reset(v); vhost_dev_stop(&v->vdev); - vhost_vdpa_iotlb_free(v); vhost_vdpa_free_domain(v);
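The bucket scheme in this patch (`asid % VHOST_VDPA_IOTLB_BUCKETS` indexing an hlist) is a plain fixed-size chained hash with a find/alloc-on-first-use/remove lifecycle. A userspace sketch of the same idea, simplified to singly linked lists with no locking and illustrative names:

```c
#include <stdlib.h>

#define BUCKETS 16

struct as_entry {
	struct as_entry *next;
	unsigned int id;	/* the ASID */
};

static struct as_entry *table[BUCKETS];

/* Look an ASID up in its bucket (cf. asid_to_as). */
static struct as_entry *find_as(unsigned int asid)
{
	struct as_entry *as;

	for (as = table[asid % BUCKETS]; as; as = as->next)
		if (as->id == asid)
			return as;
	return NULL;
}

/* Allocate on first sight of an ASID (cf. vhost_vdpa_alloc_as). */
static struct as_entry *find_or_alloc_as(unsigned int asid)
{
	struct as_entry *as = find_as(asid);

	if (as)
		return as;
	as = malloc(sizeof(*as));
	if (!as)
		return NULL;
	as->id = asid;
	as->next = table[asid % BUCKETS];
	table[asid % BUCKETS] = as;
	return as;
}

/* Destroy once nothing references the ASID (cf. vhost_vdpa_remove_as). */
static int remove_as(unsigned int asid)
{
	struct as_entry **pp = &table[asid % BUCKETS];

	for (; *pp; pp = &(*pp)->next) {
		if ((*pp)->id == asid) {
			struct as_entry *dead = *pp;

			*pp = dead->next;
			free(dead);
			return 0;
		}
	}
	return -1;
}
```

With 16 buckets, ASIDs 5 and 21 collide and chain in the same bucket, which is exactly the case the hlist walk in `asid_to_as()` handles.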
[RFC PATCH 11/24] vhost_iotlb: split out IOTLB initialization
This patch splits out the IOTLB initialization logic into a new helper. This allows vhost to implement device specific IOTLB allocation logic. Signed-off-by: Jason Wang --- drivers/vhost/iotlb.c | 23 ++- include/linux/vhost_iotlb.h | 2 ++ 2 files changed, 20 insertions(+), 5 deletions(-) diff --git a/drivers/vhost/iotlb.c b/drivers/vhost/iotlb.c index 1f0ca6e44410..36d785efd038 100644 --- a/drivers/vhost/iotlb.c +++ b/drivers/vhost/iotlb.c @@ -98,6 +98,23 @@ void vhost_iotlb_del_range(struct vhost_iotlb *iotlb, u64 start, u64 last) } EXPORT_SYMBOL_GPL(vhost_iotlb_del_range); +/** + * vhost_iotlb_init - initialize a vhost IOTLB + * @iotlb: the IOTLB that needs to be initialized + * @limit: maximum number of IOTLB entries + * @flags: VHOST_IOTLB_FLAG_XXX + */ +void vhost_iotlb_init(struct vhost_iotlb *iotlb, unsigned int limit, + unsigned int flags) +{ + iotlb->root = RB_ROOT_CACHED; + iotlb->limit = limit; + iotlb->nmaps = 0; + iotlb->flags = flags; + INIT_LIST_HEAD(&iotlb->list); +} +EXPORT_SYMBOL_GPL(vhost_iotlb_init); + /** * vhost_iotlb_alloc - add a new vhost IOTLB * @limit: maximum number of IOTLB entries @@ -112,11 +129,7 @@ struct vhost_iotlb *vhost_iotlb_alloc(unsigned int limit, unsigned int flags) if (!iotlb) return NULL; - iotlb->root = RB_ROOT_CACHED; - iotlb->limit = limit; - iotlb->nmaps = 0; - iotlb->flags = flags; - INIT_LIST_HEAD(&iotlb->list); + vhost_iotlb_init(iotlb, limit, flags); return iotlb; } diff --git a/include/linux/vhost_iotlb.h b/include/linux/vhost_iotlb.h index 6b09b786a762..c0df193ec3e1 100644 --- a/include/linux/vhost_iotlb.h +++ b/include/linux/vhost_iotlb.h @@ -33,6 +33,8 @@ int vhost_iotlb_add_range(struct vhost_iotlb *iotlb, u64 start, u64 last, u64 addr, unsigned int perm); void vhost_iotlb_del_range(struct vhost_iotlb *iotlb, u64 start, u64 last); +void vhost_iotlb_init(struct vhost_iotlb *iotlb, unsigned int limit, + unsigned int flags); struct vhost_iotlb *vhost_iotlb_alloc(unsigned int limit, unsigned int flags); void vhost_iotlb_free(struct 
vhost_iotlb *iotlb); void vhost_iotlb_reset(struct vhost_iotlb *iotlb); -- 2.20.1
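The refactor follows a common C pattern: keep an `_init()` that works on caller-provided storage, and make `_alloc()` a thin heap wrapper around it, so an owner can embed the struct (as `vhost_vdpa_as` embeds its `iotlb` in the next patch) instead of paying for a second allocation. A generic sketch of the pattern, with illustrative names rather than the vhost API:

```c
#include <stdlib.h>

struct tlb {
	unsigned int limit;
	unsigned int nmaps;
	unsigned int flags;
};

/* Initialize caller-provided storage (cf. vhost_iotlb_init). */
static void tlb_init(struct tlb *t, unsigned int limit, unsigned int flags)
{
	t->limit = limit;
	t->nmaps = 0;
	t->flags = flags;
}

/* Heap-allocating convenience wrapper (cf. vhost_iotlb_alloc). */
static struct tlb *tlb_alloc(unsigned int limit, unsigned int flags)
{
	struct tlb *t = malloc(sizeof(*t));

	if (t)
		tlb_init(t, limit, flags);
	return t;
}

/* An owner can now embed the struct and initialize it in place. */
struct address_space {
	struct tlb iotlb;	/* embedded, like vhost_vdpa_as's iotlb */
	unsigned int id;
};
```

Both call sites share one initializer, so a future change to the init logic cannot drift between the embedded and heap-allocated paths.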
[RFC PATCH 12/24] vhost: support ASID in IOTLB API
This patch allows userspace to send ASID based IOTLB messages to vhost. The idea is to use the reserved u32 field in the existing V2 IOTLB message. A vhost device should advertise this capability via the VHOST_BACKEND_F_IOTLB_ASID backend feature. Signed-off-by: Jason Wang --- drivers/vhost/vdpa.c | 5 - drivers/vhost/vhost.c| 23 ++- drivers/vhost/vhost.h| 4 ++-- include/uapi/linux/vhost_types.h | 5 - 4 files changed, 28 insertions(+), 9 deletions(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index eeefcd971e3f..6552987544d7 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -675,7 +675,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v, return ret; } -static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, +static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, u32 asid, struct vhost_iotlb_msg *msg) { struct vhost_vdpa *v = container_of(dev, struct vhost_vdpa, vdev); @@ -684,6 +684,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, struct vhost_iotlb *iotlb = v->iotlb; int r = 0; + if (asid != 0) + return -EINVAL; + r = vhost_dev_check_owner(dev); if (r) return r; diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index b45519ca66a7..060662b12de3 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -463,7 +463,7 @@ void vhost_dev_init(struct vhost_dev *dev, struct vhost_virtqueue **vqs, int nvqs, int iov_limit, int weight, int byte_weight, bool use_worker, - int (*msg_handler)(struct vhost_dev *dev, + int (*msg_handler)(struct vhost_dev *dev, u32 asid, struct vhost_iotlb_msg *msg)) { struct vhost_virtqueue *vq; @@ -1079,11 +1079,14 @@ static bool umem_access_ok(u64 uaddr, u64 size, int access) return true; } -static int vhost_process_iotlb_msg(struct vhost_dev *dev, +static int vhost_process_iotlb_msg(struct vhost_dev *dev, u16 asid, struct vhost_iotlb_msg *msg) { int ret = 0; + if (asid != 0) + return -EINVAL; + mutex_lock(&dev->mutex); vhost_dev_lock_vqs(dev); switch 
(msg->type) { @@ -1130,6 +1133,7 @@ ssize_t vhost_chr_write_iter(struct vhost_dev *dev, struct vhost_iotlb_msg msg; size_t offset; int type, ret; + u16 asid = 0; ret = copy_from_iter(, sizeof(type), from); if (ret != sizeof(type)) { @@ -1145,7 +1149,16 @@ ssize_t vhost_chr_write_iter(struct vhost_dev *dev, offset = offsetof(struct vhost_msg, iotlb) - sizeof(int); break; case VHOST_IOTLB_MSG_V2: - offset = sizeof(__u32); + if (vhost_backend_has_feature(dev->vqs[0], + VHOST_BACKEND_F_IOTLB_ASID)) { + ret = copy_from_iter(, sizeof(asid), from); + if (ret != sizeof(asid)) { + ret = -EINVAL; + goto done; + } + offset = sizeof(__u16); + } else + offset = sizeof(__u32); break; default: ret = -EINVAL; @@ -1160,9 +1173,9 @@ ssize_t vhost_chr_write_iter(struct vhost_dev *dev, } if (dev->msg_handler) - ret = dev->msg_handler(dev, ); + ret = dev->msg_handler(dev, asid, ); else - ret = vhost_process_iotlb_msg(dev, ); + ret = vhost_process_iotlb_msg(dev, asid, ); if (ret) { ret = -EFAULT; goto done; diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index 9032d3c2a9f4..05e7aaf6071b 100644 --- a/drivers/vhost/vhost.h +++ b/drivers/vhost/vhost.h @@ -162,7 +162,7 @@ struct vhost_dev { int byte_weight; u64 kcov_handle; bool use_worker; - int (*msg_handler)(struct vhost_dev *dev, + int (*msg_handler)(struct vhost_dev *dev, u32 asid, struct vhost_iotlb_msg *msg); }; @@ -170,7 +170,7 @@ bool vhost_exceeds_weight(struct vhost_virtqueue *vq, int pkts, int total_len); void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int nvqs, int iov_limit, int weight, int byte_weight, bool use_worker, - int (*msg_handler)(struct vhost_dev *dev, + int (*msg_handler)(struct vhost_dev *dev, u32 asid, struct vhost_iotlb_msg *msg)); long vhost_dev_set_owner(struct vhost_dev *dev); bool vhost_dev_has_owner(struct vhost_dev *dev); diff --git
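The "reserved u32" trick can be checked with a struct sketch: carving the reserved field into a u16 ASID plus a u16 reserve leaves the iotlb payload at the same offset, which is why old and new userspace stay wire-compatible, and why the reader above consumes `sizeof(__u16)` for the ASID and then skips `sizeof(__u16)` more. The types below are simplified stand-ins for the uapi structs, not the real `vhost_types.h` definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct vhost_iotlb_msg. */
struct iotlb_msg {
	uint64_t iova, size, uaddr;
	uint8_t perm, type;
};

/* V2 layout before this patch: a fully reserved 32-bit hole. */
struct msg_v2_classic {
	uint32_t type;
	uint32_t reserved;
	struct iotlb_msg iotlb;
};

/* V2 layout with VHOST_BACKEND_F_IOTLB_ASID negotiated: the same hole
 * now carries a 16-bit ASID followed by 16 reserved bits. */
struct msg_v2_asid {
	uint32_t type;
	uint16_t asid;
	uint16_t reserved;
	struct iotlb_msg iotlb;
};
```

Because only the interpretation of the hole changes, a device that does not advertise the ASID feature keeps parsing the classic layout and sees ASID 0 semantics, matching the `asid != 0 -> -EINVAL` checks in the patch.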
[RFC PATCH 09/24] vdpa: multiple address spaces support
This patch introduces multiple address spaces support for vDPA devices. The idea is to identify a specific address space via a dedicated identifier - ASID. During vDPA device allocation, the vDPA device driver needs to report the number of address spaces supported by the device; the DMA mapping ops of the vDPA device are then extended to take an ASID. This helps to isolate the DMA among the virtqueues, e.g. in the case of virtio-net, the control virtqueue will not be assigned directly to the guest. This RFC patch only converts devices that want their own IOMMU/DMA translation logic, so it rejects a device with more than one address space that depends on the platform IOMMU. The plan is to move all the DMA mapping logic to the vDPA device driver instead of doing it in vhost-vDPA (otherwise it could result in a very complicated API, and vhost-vDPA actually doesn't care about how the composition/emulation is done in the device driver). Signed-off-by: Jason Wang --- drivers/vdpa/ifcvf/ifcvf_main.c | 2 +- drivers/vdpa/mlx5/net/mlx5_vnet.c | 5 +++-- drivers/vdpa/vdpa.c | 4 +++- drivers/vdpa/vdpa_sim/vdpa_sim.c | 10 ++ drivers/vhost/vdpa.c | 14 +- include/linux/vdpa.h | 23 --- 6 files changed, 38 insertions(+), 20 deletions(-) diff --git a/drivers/vdpa/ifcvf/ifcvf_main.c b/drivers/vdpa/ifcvf/ifcvf_main.c index e6a0be374e51..86cdf5f8bcae 100644 --- a/drivers/vdpa/ifcvf/ifcvf_main.c +++ b/drivers/vdpa/ifcvf/ifcvf_main.c @@ -440,7 +440,7 @@ static int ifcvf_probe(struct pci_dev *pdev, const struct pci_device_id *id) adapter = vdpa_alloc_device(struct ifcvf_adapter, vdpa, dev, &ifc_vdpa_ops, - IFCVF_MAX_QUEUE_PAIRS * 2, 1); + IFCVF_MAX_QUEUE_PAIRS * 2, 1, 1); if (adapter == NULL) { IFCVF_ERR(pdev, "Failed to allocate vDPA structure"); diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index 4e480f4f754e..db7404e121bf 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -1788,7 +1788,8 @@ static
u32 mlx5_vdpa_get_generation(struct vdpa_device *vdev) return mvdev->generation; } -static int mlx5_vdpa_set_map(struct vdpa_device *vdev, struct vhost_iotlb *iotlb) +static int mlx5_vdpa_set_map(struct vdpa_device *vdev, unsigned int asid, +struct vhost_iotlb *iotlb) { struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev); struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev); @@ -1931,7 +1932,7 @@ void *mlx5_vdpa_add_dev(struct mlx5_core_dev *mdev) max_vqs = min_t(u32, max_vqs, MLX5_MAX_SUPPORTED_VQS); ndev = vdpa_alloc_device(struct mlx5_vdpa_net, mvdev.vdev, mdev->device, _vdpa_ops, -2 * mlx5_vdpa_max_qps(max_vqs), 1); +2 * mlx5_vdpa_max_qps(max_vqs), 1, 1); if (IS_ERR(ndev)) return ndev; diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c index 46399746ec7c..05195fa7865d 100644 --- a/drivers/vdpa/vdpa.c +++ b/drivers/vdpa/vdpa.c @@ -63,6 +63,7 @@ static void vdpa_release_dev(struct device *d) * @config: the bus operations that is supported by this device * @nvqs: number of virtqueues supported by this device * @ngroups: number of groups supported by this device + * @nas: number of address spaces supported by this device * @size: size of the parent structure that contains private data * * Driver should use vdpa_alloc_device() wrapper macro instead of @@ -74,7 +75,7 @@ static void vdpa_release_dev(struct device *d) struct vdpa_device *__vdpa_alloc_device(struct device *parent, const struct vdpa_config_ops *config, int nvqs, unsigned int ngroups, - size_t size) + unsigned int nas, size_t size) { struct vdpa_device *vdev; int err = -EINVAL; @@ -102,6 +103,7 @@ struct vdpa_device *__vdpa_alloc_device(struct device *parent, vdev->features_valid = false; vdev->nvqs = nvqs; vdev->ngroups = ngroups; + vdev->nas = nas; err = dev_set_name(>dev, "vdpa%u", vdev->index); if (err) diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c index 6669c561bc6e..5dc04ec271bb 100644 --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c @@ 
-354,7 +354,7 @@ static struct vdpasim *vdpasim_create(void) ops = _net_config_ops; vdpasim = vdpa_alloc_device(struct vdpasim, vdpa, NULL, ops, - VDPASIM_VQ_NUM, 1); + VDPASIM_VQ_NUM, 1, 1); if (!vdpasim) goto err_alloc; @@ -581,7 +581,7
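The `vdpa_alloc_device(struct ifcvf_adapter, vdpa, ...)` calls in this patch rely on a container-allocation idiom: allocate the driver's wrapper struct in one shot, initialize the embedded generic device (now including `ngroups` and `nas`), and recover the wrapper later with `container_of()`. A self-contained sketch of the idiom follows; the names are illustrative stand-ins, not the vdpa API, and it assumes the core is the wrapper's first member purely to keep the allocator short.

```c
#include <stddef.h>
#include <stdlib.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Generic core object, like struct vdpa_device, carrying the number of
 * virtqueues, vq groups, and address spaces the backend reported. */
struct core_dev {
	unsigned int nvqs;
	unsigned int ngroups;
	unsigned int nas;
};

/* Driver wrapper embedding the core, like struct ifcvf_adapter. */
struct adapter {
	struct core_dev core;	/* must be first for this simple allocator */
	int private_state;
};

/* Allocate the wrapper but hand back the embedded core, mirroring
 * __vdpa_alloc_device(parent, config, nvqs, ngroups, nas, size). */
static struct core_dev *core_alloc(unsigned int nvqs, unsigned int ngroups,
				   unsigned int nas, size_t wrapper_size)
{
	struct adapter *a = calloc(1, wrapper_size);

	if (!a)
		return NULL;
	a->core.nvqs = nvqs;
	a->core.ngroups = ngroups;
	a->core.nas = nas;
	return &a->core;
}
```

The extra `1, 1` arguments the patch threads through each driver correspond to `ngroups` and `nas` here: existing devices declare one virtqueue group and one address space, preserving today's behavior.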