Re: [PATCH] Revert "net: linkwatch: add check for netdevice being present to linkwatch_do_dev"
On Wed, 2020-09-23 at 17:23 -0700, David Miller wrote:
> From: David Miller
> Date: Wed, 23 Sep 2020 17:21:25 -0700 (PDT)
>
> > If an async code path tests 'present', gets true, and then the RTNL
> > holding synchronous code path puts the device into D3hot immediately
> > afterwards, the async code path will still continue and access the
> > chip's registers and fault.
>
> Wait, is the sequence:
>
>   ->ndo_stop()
>      mark device not present and put into D3hot
>      triggers linkwatch event
>   ...
>   ->ndo_get_stats64()
>
> ???

I assume it is, since normally device drivers do carrier_off() on ndo_stop().

1) One problematic sequence would be (for drivers doing D3hot on ndo_stop()):

   __dev_close_many()
     ->ndo_stop()
        netif_device_detach()  // Mark !present
        ...                    // D3hot
        carrier_off() -> linkwatch_event()
        ...                    // !present && IFF_UP

2) Another problematic scenario which I see repeated in many drivers:

   shutdown/suspend()
     rtnl_lock()
     netif_device_detach()  // Mark !present
     stop() -> carrier_off() -> linkwatch_event()
     // At this point the device is still IFF_UP and !present
     // due to the early detach above.
     rtnl_unlock()

For scenario 1) we can fix this by marking IFF_UP at the beginning, but for
2), I think we need to fix the drivers to detach only after stop :(

> Then yeah we might have to clear IFF_UP at the beginning of taking
> a netdev down.
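For scenario 2), the detach-after-stop ordering being suggested would look roughly like this in a driver's suspend path. This is a minimal sketch with hypothetical names (`foo_suspend`, `foo_stop`), not any specific in-tree driver:

```c
/*
 * Hypothetical driver suspend path, sketching the ordering fix for
 * scenario 2): run the stop path (and its carrier_off() -> linkwatch
 * event) while the device is still marked present, and only mark the
 * netdevice !present afterwards.
 */
static int foo_suspend(struct device *dev)
{
	struct net_device *ndev = dev_get_drvdata(dev);

	rtnl_lock();
	if (netif_running(ndev))
		foo_stop(ndev);		/* carrier_off() happens here */
	netif_device_detach(ndev);	/* mark !present only afterwards */
	rtnl_unlock();
	return 0;
}
```

With this ordering, the linkwatch event triggered by carrier_off() runs while the device is still present, so linkwatch_do_dev() never sees the !present && IFF_UP combination from scenario 2).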
Re: [PATCH] ARM: dts: document pinctrl-single,pins when #pinctrl-cells = 2
On Wed, Sep 23, 2020 at 10:43 PM Tony Lindgren wrote:
> * Trent Piepho [200924 01:34]:
> > On Tue, Sep 22, 2020 at 11:57 PM Tony Lindgren wrote:
> > >
> > > Also FYI, folks have also complained for a long time that the pinctrl-single
> > > binding mixes mux and conf values while they should be handled separately.
> >
> > Instead of combining two fields when the dts is generated they are now
> > combined when the pinctrl-single driver reads the dts. Other than
> > this detail, the result is the same. The board dts source is the
> > same. The value programmed into the pinctrl register is the same.
> > There is no mechanism currently that can alter that value in any way.
> >
> > What does combining them later allow that is not possible now?
>
> It now allows further driver changes to manage conf and mux separately :)

The pinctrl-single driver? How will that work with boards that are not
am335x and don't use conf and mux fields in the same manner as am335x?
Re: [Linux-stm32] [PATCH 3/3] ARM: dts: stm32: update stm32mp151 for remote proc synchronisation support
Hello Arnaud,

On 8/27/20 9:21 AM, Arnaud Pouliquen wrote:
> Two backup registers are used to store the Cortex-M4 state and the resource
> table address.
> Declare the tamp node and add associated properties in m4_rproc node
> to allow Linux to attach to a firmware loaded by the first boot stages.
>
> Associated driver implementation is available in commit 9276536f455b3
> ("remoteproc: stm32: Parse syscon that will manage M4 synchronisation").
>
> Signed-off-by: Arnaud Pouliquen
> ---
>  arch/arm/boot/dts/stm32mp151.dtsi | 7 +++
>  1 file changed, 7 insertions(+)
>
> diff --git a/arch/arm/boot/dts/stm32mp151.dtsi b/arch/arm/boot/dts/stm32mp151.dtsi
> index bfe29023fbd5..842ecffae73a 100644
> --- a/arch/arm/boot/dts/stm32mp151.dtsi
> +++ b/arch/arm/boot/dts/stm32mp151.dtsi
> @@ -1541,6 +1541,11 @@
>  			status = "disabled";
>  		};
>
> +		tamp: tamp@5c00a000 {
> +			compatible = "st,stm32-tamp", "syscon";
> +			reg = <0x5c00a000 0x400>;
> +		};
> +

Just saw this now. I have a pending patch adding this node as well:
https://lore.kernel.org/patchwork/patch/1306971/

For my use case, I need a "simple-mfd" compatible to allow child nodes to
be probed. Could you CC me when you send out your v2, so I can rebase?
(Or if you don't mind, just add the "simple-mfd" into the compatible list
yourself :-)

Cheers
Ahmad

>  		/*
>  		 * Break node order to solve dependency probe issue between
>  		 * pinctrl and exti.
> @@ -1717,6 +1722,8 @@
>  			st,syscfg-holdboot = < 0x10C 0x1>;
>  			st,syscfg-tz = < 0x000 0x1>;
>  			st,syscfg-pdds = <_mcu 0x0 0x1>;
> +			st,syscfg-rsc-tbl = < 0x144 0x>;
> +			st,syscfg-m4-state = < 0x148 0x>;
>  			status = "disabled";
>  		};
>  	};

-- 
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-     |
Re: [PATCH] ARM: dts: document pinctrl-single,pins when #pinctrl-cells = 2
* Trent Piepho [200924 01:34]:
> On Tue, Sep 22, 2020 at 11:57 PM Tony Lindgren wrote:
> >
> > Also FYI, folks have also complained for a long time that the pinctrl-single
> > binding mixes mux and conf values while they should be handled separately.
>
> Instead of combining two fields when the dts is generated they are now
> combined when the pinctrl-single driver reads the dts. Other than
> this detail, the result is the same. The board dts source is the
> same. The value programmed into the pinctrl register is the same.
> There is no mechanism currently that can alter that value in any way.
>
> What does combining them later allow that is not possible now?

It now allows further driver changes to manage conf and mux separately :)

Regards,

Tony
Re: [PATCH v2 0/6] clk: axi-clk-gen: misc updates to the driver
On Thu, Sep 24, 2020 at 7:53 AM Moritz Fischer wrote:
> Hi Stephen,
>
> On Wed, Sep 23, 2020 at 04:58:33PM -0700, Stephen Boyd wrote:
> > Quoting Alexandru Ardelean (2020-09-22 23:22:33)
> > > On Tue, Sep 22, 2020 at 10:42 PM Stephen Boyd wrote:
> > > >
> > > > Quoting Moritz Fischer (2020-09-14 19:41:38)
> > > > > On Mon, Sep 14, 2020 at 11:11:05AM +0300, Alexandru Ardelean wrote:
> > > > > > On Mon, Aug 10, 2020 at 4:41 PM Alexandru Ardelean wrote:
> > > > > > >
> > > > > > > These patches synchronize the driver with the current state in the
> > > > > > > Analog Devices Linux tree:
> > > > > > > https://github.com/analogdevicesinc/linux/
> > > > > > >
> > > > > > > They have been in the tree for about 2-3, so they did receive some
> > > > > > > testing.
> > > > > >
> > > > > > Ping on this series.
> > > > > > Do I need to do a re-send?
> > > >
> > > > I got this patch series twice. Not sure why.
> > >
> > > My fault here.
> > > Some Ctrl + R usage and not being attentive with the arguments.
> > > I think I added "*.patch" twice on the send-mail command.
> > > I did something similar [by accident] for some DMA patches.
> > > Apologies.
> > >
> > > I can do a re-send for this, if it helps.
> >
> > Sure. Please resend it.
> >
> > > > > I've applied the FPGA one, the other ones should go through the clock
> > > > > tree I think?
> > > >
> > > > Doesn't patch 6 rely on the FPGA patch? How can that driver build
> > > > without the header file?
> > >
> > > Yes it does depend on the FPGA patch.
> > > We can drop patch 6 for now, pending a merge to Linus' tree and then
> > > wait for the trickle-down.
> > > I don't mind waiting for these patches.
> > > I have plenty of backlog that I want to run through, and cleanup and
> > > then upstream.
> > > So, there is no hurry.
> >
> > Can you send me a signed tag with that patch? I can base this patch
> > series on top of that. Or I can just apply it to the clk tree and if
> > nobody changes it in the meantime the merge should work out in
> > linux-next and Linus' tree upstream.
>
> Long story short I messed up my pull-request to Greg and had to back out
> the patch anyways. In retrospect I think the patch should have gone
> through your tree anyways, so here's our chance to get it right.
>
> Feel free to take it with the rest of the changes through your tree.
>
> Note: When I applied the patch I fixed up the whitespace that checkpatch
> complained about so you might want to do that (or ask Alexandru to
> resend the patch).

I'll fix up the checkpatch stuff, re-send as a V3, and add your Acked-by.
Thanks & apologies for the mess-up on my part.

> Acked-by: Moritz Fischer
>
> Sorry for the confusion and let me know if you still prefer a signed
> tag.
>
> - Moritz
Re: [PATCH printk 3/5] printk: use buffer pool for sprint buffers
On (20/09/23 17:11), Petr Mladek wrote:
> AFAIK, there is one catch. We need to use va_copy() around
> the 1st call because va_format can be processed only once.

Current printk() should be good enough for reporting, say, "Kernel stack
overflow" errors. Is the extra pressure that va_copy() adds something that
we need to consider?

-ss
[PATCH] Input: trackpoint - enable Synaptics trackpoints
Add Synaptics IDs in trackpoint_start_protocol() to mark them as valid.

Signed-off-by: Vincent Huang
---
 drivers/input/mouse/trackpoint.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/input/mouse/trackpoint.c b/drivers/input/mouse/trackpoint.c
index 854d5e758724..ef2fa0905208 100644
--- a/drivers/input/mouse/trackpoint.c
+++ b/drivers/input/mouse/trackpoint.c
@@ -282,6 +282,8 @@ static int trackpoint_start_protocol(struct psmouse *psmouse,
 	case TP_VARIANT_ALPS:
 	case TP_VARIANT_ELAN:
 	case TP_VARIANT_NXP:
+	case TP_VARIANT_JYT_SYNAPTICS:
+	case TP_VARIANT_SYNAPTICS:
 		if (variant_id)
 			*variant_id = param[0];
 		if (firmware_id)
-- 
2.25.1
Re: [PATCH] KVM: Enable hardware before doing arch VM initialization
On 23.09.20 20:57, Sean Christopherson wrote:
> Swap the order of hardware_enable_all() and kvm_arch_init_vm() to
> accommodate Intel's Trust Domain Extension (TDX), which needs VMX to be
> fully enabled during VM init in order to make SEAMCALLs.
>
> This also provides consistent ordering between kvm_create_vm() and
> kvm_destroy_vm() with respect to calling kvm_arch_destroy_vm() and
> hardware_disable_all().
>
> Cc: Marc Zyngier
> Cc: James Morse
> Cc: Julien Thierry
> Cc: Suzuki K Poulose
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: Huacai Chen
> Cc: Aleksandar Markovic
> Cc: linux-m...@vger.kernel.org
> Cc: Paul Mackerras
> Cc: kvm-...@vger.kernel.org
> Cc: Christian Borntraeger
> Cc: Janosch Frank
> Cc: David Hildenbrand
> Cc: Cornelia Huck
> Cc: Claudio Imbrenda
> Cc: Vitaly Kuznetsov
> Cc: Wanpeng Li
> Cc: Jim Mattson
> Cc: Joerg Roedel
> Signed-off-by: Sean Christopherson
> ---
>
> Obviously not required until the TDX series comes along, but IMO KVM
> should be consistent with respect to enabling and disabling virt support
> in hardware.
>
> Tested only on Intel hardware. Unless I missed something, this only
> affects x86, Arm and MIPS as hardware enabling is a nop for s390 and PPC.

Yes, looks fine from an s390 perspective.

Reviewed-by: Christian Borntraeger
Re: [PATCH] doc: zh_CN: add translatation for btrfs
Hi Qing,

It looks like all the patches from vivo folks have a 'charset=y' problem
and garbled characters, which breaks 'git am'. I have to repeat the same
reminder again and again... Could you double-check your patches before
sending them out, and make sure the docs look good on the webpage, like:

https://www.kernel.org/doc/html/v5.9-rc3/translations/zh_CN/filesystems/debugfs.html

Thanks
Alex

On 2020/9/22 8:03 PM, Wang Qing wrote:
> Translate Documentation/filesystems/btrfs.rst into Chinese.
>
> Signed-off-by: Wang Qing
> ---
[PATCH v3] mm: cma: indefinitely retry allocations in cma_alloc
V1: Introduces a retry loop that attempts a CMA allocation a finite number
of times before giving up:

https://lkml.org/lkml/2020/8/5/1097
https://lkml.org/lkml/2020/8/11/893

V2: Introduces an indefinite retry for CMA allocations. David Hildenbrand
raised a page-pinning example which precludes doing this infinite retrying
for all CMA users:

https://lkml.org/lkml/2020/9/17/984

V3: Re-introduce a GFP mask argument for cma_alloc(), that can take in
__GFP_NOFAIL as an argument to indicate that a CMA allocation should be
retried indefinitely. This lets callers of cma_alloc() decide if they want
to perform indefinite retries. Also introduces a config option for
controlling the duration of the sleep between retries.

Chris Goldsworthy (1):
  mm: cma: indefinitely retry allocations in cma_alloc

 arch/powerpc/kvm/book3s_hv_builtin.c       |  2 +-
 drivers/dma-buf/heaps/cma_heap.c           |  2 +-
 drivers/s390/char/vmcp.c                   |  2 +-
 drivers/staging/android/ion/ion_cma_heap.c |  2 +-
 include/linux/cma.h                        |  2 +-
 kernel/dma/contiguous.c                    |  4 ++--
 mm/Kconfig                                 | 11 ++
 mm/cma.c                                   | 35 +-
 mm/cma_debug.c                             |  2 +-
 mm/hugetlb.c                               |  4 ++--
 10 files changed, 50 insertions(+), 16 deletions(-)

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
[PATCH v3] mm: cma: indefinitely retry allocations in cma_alloc
CMA allocations will fail if 'pinned' pages are in a CMA area, since we
cannot migrate pinned pages. The _refcount of a struct page being greater
than _mapcount for that page can cause pinning for anonymous pages. This
is because try_to_unmap(), which (1) is called in the CMA allocation path,
and (2) decrements both _refcount and _mapcount for a page, will stop
unmapping a page from VMAs once the _mapcount for a page reaches 0. This
implies that after try_to_unmap() has finished successfully for a page
where _refcount > _mapcount, _refcount will be greater than 0. Later in
the CMA allocation path, in migrate_page_move_mapping(), we will have one
more reference count than intended for anonymous pages, meaning the
allocation will fail for that page.

If a process ends up causing _refcount > _mapcount for a page (by either
incrementing _refcount or decrementing _mapcount), such that the process
is context switched out after modifying one refcount but before modifying
the other, the page will be temporarily pinned.

One example of where _refcount can be greater than _mapcount is inside of
zap_pte_range(), which is called for all the entries of a PMD when a
process is exiting, to unmap the process's memory. Inside of
zap_pte_range(), after unmapping a page with page_remove_rmap(), we have
that _refcount > _mapcount. _refcount can only be decremented after a TLB
flush is performed for the page - this doesn't occur until enough pages
have been batched together for flushing. The flush can either occur inside
of zap_pte_range() (during the same invocation or a later one), or if
there aren't enough pages collected by the time we unmap all of the pages
in a process, the flush will occur in tlb_finish_mmu() in exit_mmap().
After the flush has occurred, tlb_batch_pages_flush() will decrement the
references on the flushed pages.

Another such example like the above is inside of copy_one_pte(), which is
called during a fork. For PTEs for which pte_present(pte) == true,
copy_one_pte() will increment the _refcount field followed by the
_mapcount field of a page.

So, inside of cma_alloc(), add the option of letting users pass in
__GFP_NOFAIL to indicate that we should retry CMA allocations
indefinitely, in the event that alloc_contig_range() returns -EBUSY after
having scanned a whole CMA-region bitmap.

Signed-off-by: Chris Goldsworthy
Co-developed-by: Vinayak Menon
Signed-off-by: Vinayak Menon
---
 arch/powerpc/kvm/book3s_hv_builtin.c       |  2 +-
 drivers/dma-buf/heaps/cma_heap.c           |  2 +-
 drivers/s390/char/vmcp.c                   |  2 +-
 drivers/staging/android/ion/ion_cma_heap.c |  2 +-
 include/linux/cma.h                        |  2 +-
 kernel/dma/contiguous.c                    |  4 ++--
 mm/Kconfig                                 | 11 ++
 mm/cma.c                                   | 35 +-
 mm/cma_debug.c                             |  2 +-
 mm/hugetlb.c                               |  4 ++--
 10 files changed, 50 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c b/arch/powerpc/kvm/book3s_hv_builtin.c
index 073617c..21c3f6a 100644
--- a/arch/powerpc/kvm/book3s_hv_builtin.c
+++ b/arch/powerpc/kvm/book3s_hv_builtin.c
@@ -74,7 +74,7 @@ struct page *kvm_alloc_hpt_cma(unsigned long nr_pages)
 	VM_BUG_ON(order_base_2(nr_pages) < KVM_CMA_CHUNK_ORDER - PAGE_SHIFT);

 	return cma_alloc(kvm_cma, nr_pages, order_base_2(HPT_ALIGN_PAGES),
-			 false);
+			 0);
 }
 EXPORT_SYMBOL_GPL(kvm_alloc_hpt_cma);
diff --git a/drivers/dma-buf/heaps/cma_heap.c b/drivers/dma-buf/heaps/cma_heap.c
index 626cf7f..7657359 100644
--- a/drivers/dma-buf/heaps/cma_heap.c
+++ b/drivers/dma-buf/heaps/cma_heap.c
@@ -66,7 +66,7 @@ static int cma_heap_allocate(struct dma_heap *heap,
 	helper_buffer->heap = heap;
 	helper_buffer->size = len;

-	cma_pages = cma_alloc(cma_heap->cma, nr_pages, align, false);
+	cma_pages = cma_alloc(cma_heap->cma, nr_pages, align, 0);
 	if (!cma_pages)
 		goto free_buf;
diff --git a/drivers/s390/char/vmcp.c b/drivers/s390/char/vmcp.c
index 9e06628..11c4e3b 100644
--- a/drivers/s390/char/vmcp.c
+++ b/drivers/s390/char/vmcp.c
@@ -70,7 +70,7 @@ static void vmcp_response_alloc(struct vmcp_session *session)
 	 * anymore the system won't work anyway.
 	 */
 	if (order > 2)
-		page = cma_alloc(vmcp_cma, nr_pages, 0, false);
+		page = cma_alloc(vmcp_cma, nr_pages, 0, 0);
 	if (page) {
 		session->response = (char *)page_to_phys(page);
 		session->cma_alloc = 1;
diff --git a/drivers/staging/android/ion/ion_cma_heap.c b/drivers/staging/android/ion/ion_cma_heap.c
index bf65e67..128d3a5 100644
--- a/drivers/staging/android/ion/ion_cma_heap.c
+++ b/drivers/staging/android/ion/ion_cma_heap.c
@@ -39,7 +39,7 @@ static int ion_cma_allocate(struct ion_heap *heap, struct
[PATCH] rpadlpar_io:Add MODULE_DESCRIPTION entries to kernel modules
This patch adds a brief MODULE_DESCRIPTION to the rpadlpar_io kernel
module (description taken from the Kconfig file).

Signed-off-by: Mamatha Inamdar
---
 drivers/pci/hotplug/rpadlpar_core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/pci/hotplug/rpadlpar_core.c b/drivers/pci/hotplug/rpadlpar_core.c
index f979b70..bac65ed 100644
--- a/drivers/pci/hotplug/rpadlpar_core.c
+++ b/drivers/pci/hotplug/rpadlpar_core.c
@@ -478,3 +478,4 @@ static void __exit rpadlpar_io_exit(void)
 module_init(rpadlpar_io_init);
 module_exit(rpadlpar_io_exit);
 MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("RPA Dynamic Logical Partitioning driver for I/O slots");
Re: [PATCH v2] mm: cma: indefinitely retry allocations in cma_alloc
On 2020-09-17 10:54, Chris Goldsworthy wrote:
On 2020-09-15 00:53, David Hildenbrand wrote:
On 14.09.20 20:33, Chris Goldsworthy wrote:
On 2020-09-14 02:31, David Hildenbrand wrote:
On 11.09.20 21:17, Chris Goldsworthy wrote:

So, inside of cma_alloc(), instead of giving up when alloc_contig_range()
returns -EBUSY after having scanned a whole CMA-region bitmap, perform
retries indefinitely, with sleeps, to give the system an opportunity to
unpin any pinned pages.

Signed-off-by: Chris Goldsworthy
Co-developed-by: Vinayak Menon
Signed-off-by: Vinayak Menon
---
 mm/cma.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/mm/cma.c b/mm/cma.c
index 7f415d7..90bb505 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -442,8 +443,28 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
 				bitmap_maxno, start, bitmap_count, mask, offset);
 		if (bitmap_no >= bitmap_maxno) {
-			mutex_unlock(>lock);
-			break;
+			if (ret == -EBUSY) {
+				mutex_unlock(>lock);
+
+				/*
+				 * Page may be momentarily pinned by some other
+				 * process which has been scheduled out, e.g.
+				 * in exit path, during unmap call, or process
+				 * fork and so cannot be freed there. Sleep
+				 * for 100ms and retry the allocation.
+				 */
+				start = 0;
+				ret = -ENOMEM;
+				msleep(100);
+				continue;
+			} else {
+				/*
+				 * ret == -ENOMEM - all bits in cma->bitmap are
+				 * set, so we break accordingly.
+				 */
+				mutex_unlock(>lock);
+				break;
+			}
 		}
 		bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
 		/*

What about long-term pinnings? IIRC, that can happen easily e.g., with
vfio (and I remember there is a way via vmsplice). Not convinced trying
forever is a sane approach in the general case ...

V1:
[1] https://lkml.org/lkml/2020/8/5/1097
[2] https://lkml.org/lkml/2020/8/6/1040
[3] https://lkml.org/lkml/2020/8/11/893
[4] https://lkml.org/lkml/2020/8/21/1490
[5] https://lkml.org/lkml/2020/9/11/1072

We're fine with doing indefinite retries, on the grounds that if there is
some long-term pinning that occurs when alloc_contig_range returns -EBUSY,
that it should be debugged and fixed. Would it be possible to make this
infinite-retrying something that could be enabled or disabled by a
defconfig option?

Two thoughts: This means I strongly prefer something like [3] if feasible.

_Resending so that this ends up on LKML_

I can give [3] some further thought then. Also, I realized [3] will not
completely solve the problem, it just reduces the window in which
_refcount > _mapcount (as mentioned in earlier threads, we encountered the
pinning when a task in copy_one_pte() or in the exit_mmap() path gets
context switched out). If we were to try a sleeping-lock based solution,
do you think it would be permissible to add another lock to struct page?

I have not been able to think of a clean way of introducing calls to
preempt_disable() in exit_mmap(), which is the more problematic case. We
would need to track state across multiple invocations of zap_pte_range()
(which is called for each entry in a PMD when a process's memory is being
unmapped), and would also need to extend this to tlb_finish_mmu(), which
is called after all the process's memory has been unmapped:
https://elixir.bootlin.com/linux/v5.8.10/source/mm/mmap.c#L3164

As a follow-up to this patch, I'm submitting a patch that re-introduces
the GFP mask for cma_alloc(), that will perform indefinite retries if
__GFP_NOFAIL is passed to the function.

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
Re: [PATCH v2] powerpc/pci: unmap legacy INTx interrupts when a PHB is removed
On 23/09/2020 17:06, Cédric Le Goater wrote:
> On 9/23/20 2:33 AM, Qian Cai wrote:
>> On Fri, 2020-08-07 at 12:18 +0200, Cédric Le Goater wrote:
>>> When a passthrough IO adapter is removed from a pseries machine using
>>> hash MMU and the XIVE interrupt mode, the POWER hypervisor expects the
>>> guest OS to clear all page table entries related to the adapter. If
>>> some are still present, the RTAS call which isolates the PCI slot
>>> returns error 9001 "valid outstanding translations" and the removal of
>>> the IO adapter fails. This is because when the PHBs are scanned, Linux
>>> maps automatically the INTx interrupts in the Linux interrupt number
>>> space but these are never removed.
>>>
>>> To solve this problem, we introduce a PPC platform specific
>>> pcibios_remove_bus() routine which clears all interrupt mappings when
>>> the bus is removed. This also clears the associated page table entries
>>> of the ESB pages when using XIVE.
>>>
>>> For this purpose, we record the logical interrupt numbers of the
>>> mapped interrupts under the PHB structure and let pcibios_remove_bus()
>>> do the clean up.
>>>
>>> Since some PCI adapters, like GPUs, use the "interrupt-map" property
>>> to describe interrupt mappings other than the legacy INTx interrupts,
>>> we can not restrict the size of the mapping array to PCI_NUM_INTX. The
>>> number of interrupt mappings is computed from the "interrupt-map"
>>> property and the mapping array is allocated accordingly.
>>>
>>> Cc: "Oliver O'Halloran"
>>> Cc: Alexey Kardashevskiy
>>> Signed-off-by: Cédric Le Goater
>>
>> Some syscall fuzzing will trigger this on POWER9 NV where the traces
>> pointed to this patch.
>>
>> .config: https://gitlab.com/cailca/linux-mm/-/blob/master/powerpc.config
>
> OK. The patch is missing a NULL assignment after kfree() and that
> might be the issue.
>
> I did try PHB removal under PowerNV, so I would like to understand
> how we managed to remove twice the PCI bus and possibly reproduce.
> Any chance we could grab what the syscall fuzzer (syzkaller) did?

My guess would be it is doing this in parallel to provoke races.

-- 
Alexey
Re: [PATCH v3 2/3] misc: bcm-vk: add Broadcom VK driver
On Wed, Sep 23, 2020 at 09:43:55PM -0700, Scott Branden wrote:
> >> +struct bcm_vk_tty {
> >> +	struct tty_port port;
> >> +	uint32_t to_offset;	/* bar offset to use */
> >> +	uint32_t to_size;	/* to VK buffer size */
> >> +	uint32_t wr;		/* write offset shadow */
> >> +	uint32_t from_offset;	/* bar offset to use */
> >> +	uint32_t from_size;	/* from VK buffer size */
> >> +	uint32_t rd;		/* read offset shadow */
> > nit, these "uint32_t" stuff really doesn't matter in the kernel, 'u32'
> > is a better choice overall. Same for u8 and others, for this whole
> > driver.
> Other than personal preference, I don't understand how 'u32' is better.
> uint32_t follows the ANSI stdint.h. It allows for portable code without
> the need to define custom u32 types.

The ANSI namespace does not work in the kernel, which is why we have our
own types that pre-date those, and work properly everywhere in the kernel.

> stdint types are used in many drivers in the linux kernel already.
> We would prefer to keep our code as portable as possible and use
> stdint types in the driver.

You aren't porting this code to other operating systems easily, please
use the kernel types :)

And yes, these types are used in other parts, but when you have 25
million lines of code, some crud does slip in at times...

> >> +	pid_t pid;
> >> +	bool irq_enabled;
> >> +	bool is_opened;		/* tracks tty open/close */
> > Why do you need to track this? Doesn't the tty core handle this for
> > you?
> I have tried using tty_port_kopened() and it doesn't seem to work.
> Will need to debug some more unless you have another suggested function
> to use.

You didn't answer _why_ you need to track this. A tty driver shouldn't
care about this type of thing.

> >> +	struct workqueue_struct *tty_wq_thread;
> >> +	struct work_struct tty_wq_work;
> >> +
> >> +	/* Reference-counting to handle file operations */
> >> +	struct kref kref;
> > And a kref?
> >
> > What is controlling the lifetime rules of your structure?
> >
> > Why a kref?
> >
> > Why the tty ports?
> >
> > Why the misc device?
> >
> > This feels really crazy to me...
> Comments mostly from Desmond here:
>
> Yes, we have created a PCIe centric driver that combines with both a misc
> device on top (for the read/write/ioctl), and also ttys.
> The device sits on PCIe but we are using the misc device for accessing it.
> tty is just another on top. I don't think this is that uncommon to have a
> hybrid driver.

Ugh, yes, it is uncommon because those are two different things. Why do
you need/want a misc driver to control a tty device? Why do you need a
tty device? What really is this beast?

We got rid of the old "control path" device nodes for tty devices a long
time ago, this feels like a return to that old model, is that why you are
doing this?

But again, I really don't understand what this driver is trying to
control/manage, so it's hard to review it without that knowledge.

> Since we have a hybrid of PCIe + misc + tty, it means that we could
> simultaneously have opening dev/node to read/write (multiple) + tty going.

That's almost always a bad idea.

> Since the struct is embedded inside the primary PCIe structure, we need
> a way to know when all the references are done, and then at that point
> we could free the primary structure.
> That is the reason for the kref. On PCIe device removal, we signal the
> user space process to stop first, but the data structure can not be
> freed until the ref goes to 0.

Again, you can not have multiple reference count objects controlling a
single object. That way is madness and buggy and will never work
properly.

You can have different objects with different lifespans, which, if you
really really want to do this, is the correct way. Otherwise, stick with
one object and one reference count please.

thanks,

greg k-h
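The single-reference-count pattern being asked for here is the usual embedded-kref idiom: one kref in the top-level structure, every opener takes a reference, and the release callback frees the whole thing. A rough sketch with illustrative names (`file_to_vk` is a hypothetical helper, not the driver's actual code):

```c
/* One kref, embedded in the device structure, owns its lifetime. */
struct bcm_vk {
	struct kref kref;
	/* ... PCIe, misc-device and tty state ... */
};

static void bcm_vk_release(struct kref *kref)
{
	struct bcm_vk *vk = container_of(kref, struct bcm_vk, kref);

	kfree(vk);	/* last reference dropped: free everything */
}

static int bcm_vk_open(struct inode *inode, struct file *file)
{
	struct bcm_vk *vk = file_to_vk(file);	/* hypothetical helper */

	kref_get(&vk->kref);	/* each open pins the structure */
	return 0;
}

static int bcm_vk_file_release(struct inode *inode, struct file *file)
{
	struct bcm_vk *vk = file_to_vk(file);

	kref_put(&vk->kref, bcm_vk_release);
	return 0;
}
```

With this shape, PCIe removal just drops its own reference; whichever of the misc or tty paths closes last triggers bcm_vk_release(), so no second counter is needed.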
Re: [PATCH -next] crypto: qat - remove unnecessary mutex_init()
On Wed, Sep 16, 2020 at 07:21:21AM +0100, Qinglang Miao wrote:
> The mutex adf_ctl_lock is initialized statically. It is
> unnecessary to initialize by mutex_init().
>
> Signed-off-by: Qinglang Miao

Acked-by: Giovanni Cabiddu

> ---
>  drivers/crypto/qat/qat_common/adf_ctl_drv.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/drivers/crypto/qat/qat_common/adf_ctl_drv.c b/drivers/crypto/qat/qat_common/adf_ctl_drv.c
> index 71d0c44aa..eb9b3be9d 100644
> --- a/drivers/crypto/qat/qat_common/adf_ctl_drv.c
> +++ b/drivers/crypto/qat/qat_common/adf_ctl_drv.c
> @@ -416,8 +416,6 @@ static long adf_ctl_ioctl(struct file *fp, unsigned int cmd, unsigned long arg)
>
>  static int __init adf_register_ctl_device_driver(void)
>  {
> -	mutex_init(_ctl_lock);
> -
>  	if (adf_chr_drv_create())
>  		goto err_chr_dev;
>
> --
> 2.23.0
>
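For context, "initialized statically" here presumably means the lock is defined with DEFINE_MUTEX(), whose initializer already does everything mutex_init() would, making the runtime call redundant. A short sketch of the two idioms:

```c
/*
 * Static definition: fully initialized at compile time, no runtime
 * mutex_init() needed.
 */
static DEFINE_MUTEX(adf_ctl_lock);

/*
 * mutex_init() is only required for mutexes embedded in dynamically
 * allocated structures, e.g. (hypothetical example):
 */
struct adf_thing {
	struct mutex lock;
};

static struct adf_thing *adf_thing_alloc(void)
{
	struct adf_thing *t = kzalloc(sizeof(*t), GFP_KERNEL);

	if (t)
		mutex_init(&t->lock);
	return t;
}
```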
Re: [PATCH -next] crypto: qat - convert to use DEFINE_SEQ_ATTRIBUTE macro
On Wed, Sep 16, 2020 at 03:50:17AM +0100, Liu Shixin wrote:
> Use DEFINE_SEQ_ATTRIBUTE macro to simplify the code.
>
> Signed-off-by: Liu Shixin

Acked-by: Giovanni Cabiddu

> ---
>  drivers/crypto/qat/qat_common/adf_cfg.c       | 19 +
>  .../qat/qat_common/adf_transport_debug.c      | 42 ++-
>  2 files changed, 5 insertions(+), 56 deletions(-)
>
> diff --git a/drivers/crypto/qat/qat_common/adf_cfg.c b/drivers/crypto/qat/qat_common/adf_cfg.c
> index ac462796cefc..22ae32838113 100644
> --- a/drivers/crypto/qat/qat_common/adf_cfg.c
> +++ b/drivers/crypto/qat/qat_common/adf_cfg.c
> @@ -52,24 +52,7 @@ static const struct seq_operations qat_dev_cfg_sops = {
>  	.show = qat_dev_cfg_show
>  };
>
> -static int qat_dev_cfg_open(struct inode *inode, struct file *file)
> -{
> -	int ret = seq_open(file, _dev_cfg_sops);
> -
> -	if (!ret) {
> -		struct seq_file *seq_f = file->private_data;
> -
> -		seq_f->private = inode->i_private;
> -	}
> -	return ret;
> -}
> -
> -static const struct file_operations qat_dev_cfg_fops = {
> -	.open = qat_dev_cfg_open,
> -	.read = seq_read,
> -	.llseek = seq_lseek,
> -	.release = seq_release
> -};
> +DEFINE_SEQ_ATTRIBUTE(qat_dev_cfg);
>
>  /**
>   * adf_cfg_dev_add() - Create an acceleration device configuration table.
> diff --git a/drivers/crypto/qat/qat_common/adf_transport_debug.c b/drivers/crypto/qat/qat_common/adf_transport_debug.c
> index 2a2eccbf56ec..dac25ba47260 100644
> --- a/drivers/crypto/qat/qat_common/adf_transport_debug.c
> +++ b/drivers/crypto/qat/qat_common/adf_transport_debug.c
> @@ -77,31 +77,14 @@ static void adf_ring_stop(struct seq_file *sfile, void *v)
>  	mutex_unlock(_read_lock);
>  }
>
> -static const struct seq_operations adf_ring_sops = {
> +static const struct seq_operations adf_ring_debug_sops = {
>  	.start = adf_ring_start,
>  	.next = adf_ring_next,
>  	.stop = adf_ring_stop,
>  	.show = adf_ring_show
>  };
>
> -static int adf_ring_open(struct inode *inode, struct file *file)
> -{
> -	int ret = seq_open(file, _ring_sops);
> -
> -	if (!ret) {
> -		struct seq_file *seq_f = file->private_data;
> -
> -		seq_f->private = inode->i_private;
> -	}
> -	return ret;
> -}
> -
> -static const struct file_operations adf_ring_debug_fops = {
> -	.open = adf_ring_open,
> -	.read = seq_read,
> -	.llseek = seq_lseek,
> -	.release = seq_release
> -};
> +DEFINE_SEQ_ATTRIBUTE(adf_ring_debug);
>
>  int adf_ring_debugfs_add(struct adf_etr_ring_data *ring, const char *name)
>  {
> @@ -188,31 +171,14 @@ static void adf_bank_stop(struct seq_file *sfile, void *v)
>  	mutex_unlock(_read_lock);
>  }
>
> -static const struct seq_operations adf_bank_sops = {
> +static const struct seq_operations adf_bank_debug_sops = {
>  	.start = adf_bank_start,
>  	.next = adf_bank_next,
>  	.stop = adf_bank_stop,
>  	.show = adf_bank_show
>  };
>
> -static int adf_bank_open(struct inode *inode, struct file *file)
> -{
> -	int ret = seq_open(file, _bank_sops);
> -
> -	if (!ret) {
> -		struct seq_file *seq_f = file->private_data;
> -
> -		seq_f->private = inode->i_private;
> -	}
> -	return ret;
> -}
> -
> -static const struct file_operations adf_bank_debug_fops = {
> -	.open = adf_bank_open,
> -	.read = seq_read,
> -	.llseek = seq_lseek,
> -	.release = seq_release
> -};
> +DEFINE_SEQ_ATTRIBUTE(adf_bank_debug);
>
>  int adf_bank_debugfs_add(struct adf_etr_bank_data *bank)
>  {
> --
> 2.25.1
>
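For reference, DEFINE_SEQ_ATTRIBUTE(name) from <linux/seq_file.h> expects a `name##_sops` seq_operations to already exist (which is why the patch renames adf_ring_sops to adf_ring_debug_sops) and generates roughly the boilerplate being deleted above:

```c
/* Rough sketch of what DEFINE_SEQ_ATTRIBUTE(adf_ring_debug) expands to: */
static int adf_ring_debug_open(struct inode *inode, struct file *file)
{
	int ret = seq_open(file, &adf_ring_debug_sops);

	if (!ret && inode->i_private) {
		struct seq_file *seq_f = file->private_data;

		seq_f->private = inode->i_private;
	}
	return ret;
}

static const struct file_operations adf_ring_debug_fops = {
	.owner		= THIS_MODULE,
	.open		= adf_ring_debug_open,
	.read		= seq_read,
	.llseek		= seq_lseek,
	.release	= seq_release,
};
```

One behavioral nuance of the conversion: the generated open() only copies inode->i_private into seq_file::private when it is non-NULL, and the generated fops set .owner, which the hand-rolled versions did not.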
Re: [PATCH V2] rcu/tree: Correctly handle single cpu check in rcu_blocking_is_gp
Hi Paul, On 9/24/2020 2:33 AM, Paul E. McKenney wrote: On Wed, Sep 23, 2020 at 12:59:33PM +0530, Neeraj Upadhyay wrote: Currently, for non-preempt kernels (with CONFIG_PREEMPTION=n), rcu_blocking_is_gp() checks (with preemption disabled), whether there is only one cpu online. It uses num_online_cpus() to decide whether only one cpu is online. If there is only single cpu online, synchronize_rcu() is optimized to return without doing all the work to wait for grace period. However, there are few issues with the num_online_cpus() check used, for transition of __num_online_cpus from 2 -> 1 for cpu down path and 1 -> 2 for cpu up path. Again, good catch! As usual, I could not resist editing the commit log and comments, so could you please look the following over to make sure that I did not mess anything up? The commit log and comments look very good! Thanks! Thanks Neeraj Thanx, Paul commit 7af8c1c8d13c6c9c7013fbfef77f843dbc430c4b Author: Neeraj Upadhyay Date: Wed Sep 23 12:59:33 2020 +0530 rcu: Fix single-CPU check in rcu_blocking_is_gp() Currently, for CONFIG_PREEMPTION=n kernels, rcu_blocking_is_gp() uses num_online_cpus() to determine whether there is only one CPU online. When there is only single CPU online, the simple fact that synchronize_rcu() could be legally called implies that a full grace period has elapsed. Therefore, in the single-CPU case, synchronize_rcu() simply returns immediately. Unfortunately, num_online_cpus() is unreliable while a CPU-hotplug operation is transitioning to or from single-CPU operation because: 1. num_online_cpus() uses atomic_read(&__num_online_cpus) to locklessly sample the number of online CPUs. The hotplug locks are not held, which means that an incoming CPU can concurrently update this count. This in turn means that an RCU read-side critical section on the incoming CPU might observe updates prior to the grace period, but also that this critical section might extend beyond the end of the optimized synchronize_rcu(). 
    This breaks RCU's fundamental guarantee.

    2. In addition, num_online_cpus() does no ordering, thus providing another way that RCU's fundamental guarantee can be broken by the current code.

    3. The most probable failure mode happens on outgoing CPUs. The outgoing CPU updates the count of online CPUs in the CPUHP_TEARDOWN_CPU stop-machine handler, which is fine in and of itself due to preemption being disabled at the call to num_online_cpus(). Unfortunately, after that stop-machine handler returns, the CPU takes one last trip through the scheduler (which has RCU readers) and, after the resulting context switch, one final dive into the idle loop. During this time, RCU needs to keep track of two CPUs, but num_online_cpus() will say that there is only one, which in turn means that the surviving CPU will incorrectly ignore the outgoing CPU's RCU read-side critical sections.

    This problem is illustrated by the following litmus test in which P0() corresponds to synchronize_rcu() and P1() corresponds to the incoming CPU. The herd7 tool confirms that the "exists" clause can be satisfied, thus demonstrating that this breakage can happen according to the Linux kernel memory model.

    {
        int x = 0;
        atomic_t numonline = ATOMIC_INIT(1);
    }

    P0(int *x, atomic_t *numonline)
    {
        int r0;

        WRITE_ONCE(*x, 1);
        r0 = atomic_read(numonline);
        if (r0 == 1) {
            smp_mb();
        } else {
            synchronize_rcu();
        }
        WRITE_ONCE(*x, 2);
    }

    P1(int *x, atomic_t *numonline)
    {
        int r0;
        int r1;

        atomic_inc(numonline);
        smp_mb();
        rcu_read_lock();
        r0 = READ_ONCE(*x);
        smp_rmb();
        r1 = READ_ONCE(*x);
        rcu_read_unlock();
    }

    locations [x;numonline;]
    exists (1:r0=0 /\ 1:r1=2)

    It is important to note that these problems arise only when the system is transitioning to or from single-CPU operation.

    One solution would be to hold the CPU-hotplug locks while sampling num_online_cpus(), which was in fact the intent of the (redundant) preempt_disable() and preempt_enable() surrounding this call to num_online_cpus().
    Actually blocking CPU hotplug would not only result in excessive overhead, but would also unnecessarily impede CPU-hotplug operations.
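The direction of the fix follows from point 3 above: instead of the raw __num_online_cpus, sample a CPU count that RCU itself maintains at hotplug points it already participates in, so the count changes only at RCU-meaningful times. The following is a non-runnable kernel-context sketch based on this commit's description; the counter name `rcu_state.n_online_cpus` and its update sites are assumptions, not a verbatim copy of the merged patch:

```c
/*
 * Sketch (assumption-based): RCU maintains its own online-CPU count,
 * updated in hooks RCU already owns (e.g. rcutree_prepare_cpu() on the
 * way up, rcutree_dead_cpu() on the way down), so a reading of 1 here
 * really does mean single-CPU operation from RCU's point of view.
 */
static int rcu_blocking_is_gp(void)
{
	int ret;

	if (IS_ENABLED(CONFIG_PREEMPTION))
		return rcu_scheduler_active == RCU_SCHEDULER_INACTIVE;
	might_sleep();  /* Check for RCU read-side critical section. */
	preempt_disable();
	/*
	 * With only one CPU from RCU's perspective, any synchronize_rcu()
	 * caller has trivially waited out all readers: no concurrently
	 * incoming or outgoing CPU can hold an unseen read-side critical
	 * section.
	 */
	ret = READ_ONCE(rcu_state.n_online_cpus) <= 1;
	preempt_enable();
	return ret;
}
```

The key property is that the counter moves only at points where RCU is already tracking the affected CPU, which closes the window described in point 3 where num_online_cpus() said "one" while RCU still had to watch two CPUs.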
[PATCH v2 1/2] dt-bindings: arm: fsl: add Protonic WD3 board
Add Protonic Holland WD3 iMX6qp based board

Signed-off-by: Oleksij Rempel
---
 Documentation/devicetree/bindings/arm/fsl.yaml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Documentation/devicetree/bindings/arm/fsl.yaml b/Documentation/devicetree/bindings/arm/fsl.yaml
index e94a455eeab9..57dcd061d4c9 100644
--- a/Documentation/devicetree/bindings/arm/fsl.yaml
+++ b/Documentation/devicetree/bindings/arm/fsl.yaml
@@ -161,6 +161,7 @@ properties:
       - enum:
           - fsl,imx6qp-sabreauto    # i.MX6 Quad Plus SABRE Automotive Board
           - fsl,imx6qp-sabresd      # i.MX6 Quad Plus SABRE Smart Device Board
+          - prt,prtwd3              # Protonic WD3 board
       - const: fsl,imx6qp

   - description: i.MX6DL based Boards
--
2.28.0
[PATCH v2 0/2] mainline Protonic WD3 board
changes v2:
- fix comment: WD2 -> WD3

Oleksij Rempel (2):
  dt-bindings: arm: fsl: add Protonic WD3 board
  ARM: dts: add Protonic WD3 board

 .../devicetree/bindings/arm/fsl.yaml |   1 +
 arch/arm/boot/dts/Makefile           |   1 +
 arch/arm/boot/dts/imx6qp-prtwd3.dts  | 553 ++
 3 files changed, 555 insertions(+)
 create mode 100644 arch/arm/boot/dts/imx6qp-prtwd3.dts

--
2.28.0
[PATCH v2 2/2] ARM: dts: add Protonic WD3 board
Protonic WD3 is a proof of concept platform for tractor e-cockpit applications Signed-off-by: Oleksij Rempel --- arch/arm/boot/dts/Makefile | 1 + arch/arm/boot/dts/imx6qp-prtwd3.dts | 553 2 files changed, 554 insertions(+) create mode 100644 arch/arm/boot/dts/imx6qp-prtwd3.dts diff --git a/arch/arm/boot/dts/Makefile b/arch/arm/boot/dts/Makefile index 0730cbfe13db..2355b43a0f4b 100644 --- a/arch/arm/boot/dts/Makefile +++ b/arch/arm/boot/dts/Makefile @@ -578,6 +578,7 @@ dtb-$(CONFIG_SOC_IMX6Q) += \ imx6qp-nitrogen6_max.dtb \ imx6qp-nitrogen6_som2.dtb \ imx6qp-phytec-mira-rdk-nand.dtb \ + imx6qp-prtwd3.dtb \ imx6qp-sabreauto.dtb \ imx6qp-sabresd.dtb \ imx6qp-tx6qp-8037.dtb \ diff --git a/arch/arm/boot/dts/imx6qp-prtwd3.dts b/arch/arm/boot/dts/imx6qp-prtwd3.dts new file mode 100644 index ..1957d5686dee --- /dev/null +++ b/arch/arm/boot/dts/imx6qp-prtwd3.dts @@ -0,0 +1,553 @@ +// SPDX-License-Identifier: GPL-2.0-or-later OR MIT +/* + * Copyright (c) 2018 Protonic Holland + */ + +/dts-v1/; +#include +#include "imx6qp.dtsi" + +/ { + model = "Protonic WD3 board"; + compatible = "prt,prtwd3", "fsl,imx6qp"; + + chosen { + stdout-path = + }; + + memory@1000 { + device_type = "memory"; + reg = <0x1000 0x2000>; + }; + + memory@8000 { + device_type = "memory"; + reg = <0x8000 0x2000>; + }; + + clock_ksz8081: clock-ksz8081 { + compatible = "fixed-clock"; + #clock-cells = <0>; + clock-frequency = <5000>; + }; + + clock_ksz9031: clock-ksz9031 { + compatible = "fixed-clock"; + #clock-cells = <0>; + clock-frequency = <2500>; + }; + + clock_mcp25xx: clock-mcp25xx { + compatible = "fixed-clock"; + #clock-cells = <0>; + clock-frequency = <2000>; + }; + + clock_sja1105: clock-sja1105 { + compatible = "fixed-clock"; + #clock-cells = <0>; + clock-frequency = <2500>; + }; + + mdio { + compatible = "virtual,mdio-gpio"; + pinctrl-names = "default"; + pinctrl-0 = <_mdio>; + + #address-cells = <1>; + #size-cells = <0>; + gpios = < 6 GPIO_ACTIVE_HIGH + 7 GPIO_ACTIVE_HIGH>; + + /* Microchip 
KSZ8081 */ + usbeth_phy: ethernet-phy@3 { + reg = <0x3>; + + interrupts-extended = < 12 IRQ_TYPE_LEVEL_LOW>; + reset-gpios = < 11 GPIO_ACTIVE_LOW>; + reset-assert-us = <500>; + reset-deassert-us = <1000>; + clocks = <_ksz8081>; + clock-names = "rmii-ref"; + micrel,led-mode = <0>; + }; + + tja1102_phy0: ethernet-phy@4 { + compatible = "ethernet-phy-id0180.dc80"; + reg = <0x4>; + + interrupts-extended = < 8 IRQ_TYPE_LEVEL_LOW>; + reset-gpios = < 9 GPIO_ACTIVE_LOW>; + reset-assert-us = <20>; + reset-deassert-us = <2000>; + #address-cells = <1>; + #size-cells = <0>; + + tja1102_phy1: ethernet-phy@5 { + reg = <0x5>; + + interrupts-extended = < 8 IRQ_TYPE_LEVEL_LOW>; + }; + }; + }; + + reg_5v0: regulator-5v0 { + compatible = "regulator-fixed"; + regulator-name = "5v0"; + regulator-min-microvolt = <500>; + regulator-max-microvolt = <500>; + }; + + reg_otg_vbus: regulator-otg-vbus { + compatible = "regulator-fixed"; + regulator-name = "otg-vbus"; + regulator-min-microvolt = <500>; + regulator-max-microvolt = <500>; + gpio = < 22 GPIO_ACTIVE_HIGH>; + enable-active-high; + }; + + usdhc2_wifi_pwrseq: usdhc2-wifi-pwrseq { + compatible = "mmc-pwrseq-simple"; + pinctrl-names = "default"; + pinctrl-0 = <_wifi_npd>; + reset-gpios = < 10 GPIO_ACTIVE_LOW>; + }; +}; + + { + pinctrl-names = "default"; + pinctrl-0 = <_can1>; + xceiver-supply = <_5v0>; + status = "okay"; +}; + + { + cs-gpios = < 26 GPIO_ACTIVE_HIGH>; + pinctrl-names = "default"; + pinctrl-0 = <_ecspi2>; + status = "okay"; + + switch@0 { + compatible = "nxp,sja1105q"; +
[PATCH 2/2] Register IPI_CPU_CRASH_STOP IPI as pseudo-NMI
Register IPI_CPU_CRASH_STOP IPI as pseudo-NMI. For systems that do not support pseudo-NMI, register as a normal IRQ.

Signed-off-by: Yuichi Ito
---
 arch/arm64/kernel/smp.c | 39 +++
 1 file changed, 31 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index b6bde2675ccc..d929dd7221ff 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -79,6 +79,8 @@ enum ipi_msg_type {
 static int ipi_irq_base __read_mostly;
 static int nr_ipi __read_mostly = NR_IPI;
 static struct irq_desc *ipi_desc[NR_IPI] __read_mostly;
+static int ipi_crash_stop = -1;
+static int ipi_crash_stop_enable_nmi;

 static void ipi_setup(int cpu);
 static void ipi_teardown(int cpu);
@@ -954,8 +956,16 @@ static void ipi_setup(int cpu)
 	if (WARN_ON_ONCE(!ipi_irq_base))
 		return;

-	for (i = 0; i < nr_ipi; i++)
-		enable_percpu_irq(ipi_irq_base + i, 0);
+	for (i = 0; i < nr_ipi; i++) {
+		if (ipi_irq_base + i == ipi_crash_stop) {
+			if (!prepare_percpu_nmi(ipi_irq_base + i)) {
+				enable_percpu_nmi(ipi_irq_base + i, 0);
+				ipi_crash_stop_enable_nmi = 1;
+			} else
+				pr_crit("CPU%u: IPI_CPU_CRASH_STOP cannot be enabled NMI.\n", cpu);
+		} else
+			enable_percpu_irq(ipi_irq_base + i, 0);
+	}
 }

 static void ipi_teardown(int cpu)
@@ -965,23 +975,37 @@ static void ipi_teardown(int cpu)
 	if (WARN_ON_ONCE(!ipi_irq_base))
 		return;

-	for (i = 0; i < nr_ipi; i++)
-		disable_percpu_irq(ipi_irq_base + i);
+	for (i = 0; i < nr_ipi; i++) {
+		if (ipi_irq_base + i == ipi_crash_stop) {
+			if (ipi_crash_stop_enable_nmi) {
+				disable_percpu_nmi(ipi_irq_base + i);
+				teardown_percpu_nmi(ipi_irq_base + i);
+			}
+		} else
+			disable_percpu_irq(ipi_irq_base + i);
+	}
 }

 void __init set_smp_ipi_range(int ipi_base, int n)
 {
-	int i;
+	int i, ret;

 	WARN_ON(n < NR_IPI);
 	nr_ipi = min(n, NR_IPI);

+	ret = request_percpu_nmi(ipi_base + IPI_CPU_CRASH_STOP,
+				 ipi_handler, "IPI", _number);
+	if (!ret)
+		ipi_crash_stop = ipi_base + IPI_CPU_CRASH_STOP;
+
 	for (i = 0; i < nr_ipi; i++) {
 		int err;

-		err = request_percpu_irq(ipi_base + i, ipi_handler,
-					 "IPI", _number);
-		WARN_ON(err);
+		if (ipi_base + i != ipi_crash_stop) {
+			err = request_percpu_irq(ipi_base + i, ipi_handler,
+						 "IPI", _number);
+			WARN_ON(err);
+		}

 		ipi_desc[i] = irq_to_desc(ipi_base + i);
 		irq_set_status_flags(ipi_base + i, IRQ_HIDDEN);
--
2.25.1
[PATCH 0/2] Enable support IPI_CPU_CRASH_STOP to be pseudo-NMI
Enable support IPI_CPU_CRASH_STOP to be pseudo-NMI

This patchset enables the IPI_CPU_CRASH_STOP IPI to be a pseudo-NMI. This allows kdump to collect system information even when the CPU is in a HARDLOCKUP state. Only IPI_CPU_CRASH_STOP uses NMI; the other IPIs remain normal IRQs.

The patch has been tested on ThunderX.

This patch assumes Marc's latest IPIs patch-set. [1] It also uses some of Sumit's IPI patch set for NMI. [2]

[1] https://lore.kernel.org/linux-arm-kernel/20200901144324.1071694-1-...@kernel.org/
[2] https://lore.kernel.org/linux-arm-kernel/1599830924-13990-3-git-send-email-sumit.g...@linaro.org/

$ echo 1 > /proc/sys/kernel/panic_on_rcu_stall
$ echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
 : kernel panics and crash kernel boots
 : makedumpfile saves the system state at HARDLOCKUP in vmcore.

crash utility:
crash> bt
PID: 3213 TASK: fd001adc5940 CPU: 8 COMMAND: "bash"
 #0 [fe0022fefcf0] lkdtm_HARDLOCKUP at fe0010888ab4
 #1 [fe0022fefd10] lkdtm_do_action at fe00108882bc
 #2 [fe0022fefd20] direct_entry at fe0010888720
 #3 [fe0022fefd70] full_proxy_write at fe001058cfe4
 #4 [fe0022fefdb0] vfs_write at fe00104a4c2c
 #5 [fe0022fefdf0] ksys_write at fe00104a4f0c
 #6 [fe0022fefe40] __arm64_sys_write at fe00104a4fbc
 #7 [fe0022fefe50] el0_svc_common.constprop.0 at fe0010159e38
 #8 [fe0022fefe80] do_el0_svc at fe0010159fa0
 #9 [fe0022fefe90] el0_svc at fe00101481d0
#10 [fe0022fefea0] el0_sync_handler at fe00101484b4
#11 [fe0022fefff0] el0_sync at fe0010142b7c

Sumit Garg (1):
  irqchip/gic-v3: Enable support for SGIs to act as NMIs

Yuichi Ito (1):
  Register IPI_CPU_CRASH_STOP IPI as pseudo-NMI

 arch/arm64/kernel/smp.c      | 39 
 drivers/irqchip/irq-gic-v3.c | 13 ++--
 2 files changed, 42 insertions(+), 10 deletions(-)

--
2.25.1
[PATCH 1/2] irqchip/gic-v3: Enable support for SGIs to act as NMIs
From: Sumit Garg

Add support to handle SGIs as regular NMIs. As SGIs (IPIs) default to a special flow handler, handle_percpu_devid_fasteoi_ipi(), skip the NMI handler update in the case of SGIs. Also, enable NMI support prior to gic_smp_init(), as allocation of SGIs as IRQs/NMIs happens as part of this routine.

Signed-off-by: Sumit Garg
---
 drivers/irqchip/irq-gic-v3.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index 262bc7377abd..2eda18d1df59 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -477,6 +477,11 @@ static int gic_irq_nmi_setup(struct irq_data *d)
 	if (WARN_ON(gic_irq(d) >= 8192))
 		return -EINVAL;

+	if (get_intid_range(d) == SGI_RANGE) {
+		gic_irq_set_prio(d, GICD_INT_NMI_PRI);
+		return 0;
+	}
+
 	/* desc lock should already be held */
 	if (gic_irq_in_rdist(d)) {
 		u32 idx = gic_get_ppi_index(d);
@@ -514,6 +519,11 @@ static void gic_irq_nmi_teardown(struct irq_data *d)
 	if (WARN_ON(gic_irq(d) >= 8192))
 		return;

+	if (get_intid_range(d) == SGI_RANGE) {
+		gic_irq_set_prio(d, GICD_INT_DEF_PRI);
+		return;
+	}
+
 	/* desc lock should already be held */
 	if (gic_irq_in_rdist(d)) {
 		u32 idx = gic_get_ppi_index(d);
@@ -1708,6 +1718,7 @@ static int __init gic_init_bases(void __iomem *dist_base,
 	gic_dist_init();
 	gic_cpu_init();
+	gic_enable_nmi_support();
 	gic_smp_init();
 	gic_cpu_pm_init();

@@ -1719,8 +1730,6 @@ static int __init gic_init_bases(void __iomem *dist_base,
 		gicv2m_init(handle, gic_data.domain);
 	}

-	gic_enable_nmi_support();
-
 	return 0;

 out_free:
--
2.25.1
Re: [PATCH v2 0/6] clk: axi-clk-gen: misc updates to the driver
Hi Stephen,

On Wed, Sep 23, 2020 at 04:58:33PM -0700, Stephen Boyd wrote:
> Quoting Alexandru Ardelean (2020-09-22 23:22:33)
> > On Tue, Sep 22, 2020 at 10:42 PM Stephen Boyd wrote:
> > >
> > > Quoting Moritz Fischer (2020-09-14 19:41:38)
> > > > On Mon, Sep 14, 2020 at 11:11:05AM +0300, Alexandru Ardelean wrote:
> > > > > On Mon, Aug 10, 2020 at 4:41 PM Alexandru Ardelean wrote:
> > > > > >
> > > > > > These patches synchronize the driver with the current state in the
> > > > > > Analog Devices Linux tree:
> > > > > > https://github.com/analogdevicesinc/linux/
> > > > > >
> > > > > > They have been in the tree for about 2-3, so they did receive some
> > > > > > testing.
> > > > >
> > > > > Ping on this series.
> > > > > Do I need to do a re-send?
> > >
> > > I got this patch series twice. Not sure why.
> >
> > My fault here.
> > Some Ctrl + R usage and not being attentive with the arguments.
> > I think I added "*.patch" twice on the send-mail command.
> > I did something similar [by accident] for some DMA patches.
> > Apologies.
> >
> > I can do a re-send for this, if it helps.
>
> Sure. Please resend it.
>
> > > > I've applied the FPGA one, the other ones should go through the clock
> > > > tree I think?
> > >
> > > Doesn't patch 6 rely on the FPGA patch? How can that driver build
> > > without the header file?
> >
> > Yes it does depend on the FPGA patch.
> > We can drop patch 6 for now, pending a merge to Linus' tree and then
> > wait for the trickle-down.
> > I don't mind waiting for these patches.
> > I have plenty of backlog that I want to run through, and cleanup and
> > then upstream.
> > So, there is no hurry.
>
> Can you send me a signed tag with that patch? I can base this patch
> series on top of that. Or I can just apply it to clk tree and if nobody
> changes it in the meantime merge should work out in linux-next and
> linus' tree upstream.
Long story short I messed up my pull-request to Greg and had to back out the patch anyways. In retrospect I think the patch should have gone through your tree anyways, so here's our chance to get it right. Feel free to take it with the rest of the changes through your tree. Note: When I applied the patch I fixed up the whitespace that checkpatch complained about so you might want to do that (or ask Alexandru to resend the patch). Acked-by: Moritz Fischer Sorry for the confusion and let me know if you still prefer a signed tag. - Moritz
Re: [PATCH] iommu: of: skip iommu_device_list traversal in of_iommu_xlate()
On 9/23/2020 9:54 PM, Robin Murphy wrote:
> On 2020-09-23 15:53, Charan Teja Reddy wrote:
>> In of_iommu_xlate(), check whether the iommu device is enabled before
>> traversing the iommu_device_list through iommu_ops_from_fwnode(). There
>> is no use in traversing the iommu_device_list only to return NO_IOMMU
>> because the iommu device node is disabled.
>
> Well, the "use" is that it keeps the code that much smaller and simpler
> to have a single path for returning this condition. This whole callstack
> isn't exactly a high-performance code path to begin with, and we've
> always assumed that IOMMUs present but disabled in DT would be a pretty
> rare exception.

Fine.. I thought that it is logical to return when the IOMMU DT node is disabled, over code simplicity. And agreed that it is not a high-performance path.

> Do you have a system that challenges those assumptions
> and shows any benefit from this change?

No, I don't have a system that challenges these assumptions.

>
> Robin.
>
>> Signed-off-by: Charan Teja Reddy
>> ---
>>   drivers/iommu/of_iommu.c | 5 +++--
>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/iommu/of_iommu.c b/drivers/iommu/of_iommu.c
>> index e505b91..225598c 100644
>> --- a/drivers/iommu/of_iommu.c
>> +++ b/drivers/iommu/of_iommu.c
>> @@ -94,9 +94,10 @@ static int of_iommu_xlate(struct device *dev,
>>       struct fwnode_handle *fwnode = _spec->np->fwnode;
>>       int ret;
>>
>> +     if (!of_device_is_available(iommu_spec->np))
>> +             return NO_IOMMU;
>>       ops = iommu_ops_from_fwnode(fwnode);
>> -     if ((ops && !ops->of_xlate) ||
>> -         !of_device_is_available(iommu_spec->np))
>> +     if (ops && !ops->of_xlate)
>>               return NO_IOMMU;
>>       ret = iommu_fwspec_init(dev, _spec->np->fwnode, ops);

--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
drivers/gpu/drm/ingenic/ingenic-drm-drv.c:(.text.ingenic_drm_bind+0x790): undefined reference to `clk_get_parent'
Hi Paul,

FYI, the error/warning still remains.

tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
head: c9c9e6a49f8998e9334507378c08cc16cb3ec0e5
commit: fc1acf317b01083d47228c0d21cfc0764f37a04e drm/ingenic: Add support for the IPU
date: 10 weeks ago
config: mips-randconfig-c004-20200922 (attached as .config)
compiler: mips-linux-gcc (GCC) 9.3.0

reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        git checkout fc1acf317b01083d47228c0d21cfc0764f37a04e
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=mips

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot

All errors (new ones prefixed by >>):

   mips-linux-ld: drivers/gpu/drm/ingenic/ingenic-drm-drv.o: in function `ingenic_drm_bind':
>> drivers/gpu/drm/ingenic/ingenic-drm-drv.c:(.text.ingenic_drm_bind+0x790): undefined reference to `clk_get_parent'
   mips-linux-ld: drivers/mmc/host/mtk-sd.o: in function `msdc_ops_set_ios':
   arch/mips/include/asm/barrier.h:(.text.msdc_ops_set_ios+0x79c): undefined reference to `clk_get_parent'
   mips-linux-ld: arch/mips/include/asm/barrier.h:(.text.msdc_ops_set_ios+0x7cc): undefined reference to `clk_get_parent'
   mips-linux-ld: drivers/iio/adc/ingenic-adc.o: in function `jz4770_adc_init_clk_div':
   arch/mips/include/asm/barrier.h:(.text.jz4770_adc_init_clk_div+0x2c): undefined reference to `clk_get_parent'
   mips-linux-ld: drivers/iio/adc/ingenic-adc.o: in function `jz4725b_adc_init_clk_div':
   arch/mips/include/asm/barrier.h:(.text.jz4725b_adc_init_clk_div+0x2c): undefined reference to `clk_get_parent'

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org
Re: possible deadlock in xfrm_policy_delete
On Thu, Sep 24, 2020 at 06:44:12AM +0200, Dmitry Vyukov wrote:
>
> FWIW one of the dups of this issue was bisected to:
>
> commit 1909760f5fc3f123e47b4e24e0ccdc0fc8f3f106
> Author: Ahmed S. Darwish
> Date: Fri Sep 4 15:32:31 2020 +
>
>     seqlock: PREEMPT_RT: Do not starve seqlock_t writers
>
> Can it be related?

Whether there is a genuine problem or not, the lockdep report that I quoted is simply nonsensical because it's considering two distinct seqlocks to be the same lock.

I don't think the lockdep issue has anything to do with the commit in question, because this commit is specific to seqlocks. There is another syzbot report in this pile that mixed the SCTP socket lock with the TCP socket lock, and those are not seqlocks.

It's almost as if, when a spinlock is freed and reallocated, lockdep is not clearing the existing state.

Cheers,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: possible deadlock in xfrm_policy_delete
On Thu, Sep 24, 2020 at 6:36 AM Herbert Xu wrote: > > On Sun, Sep 20, 2020 at 01:22:14PM -0700, syzbot wrote: > > Hello, > > > > syzbot found the following issue on: > > > > HEAD commit:5fa35f24 Add linux-next specific files for 20200916 > > git tree: linux-next > > console output: https://syzkaller.appspot.com/x/log.txt?x=1122e2d990 > > kernel config: https://syzkaller.appspot.com/x/.config?x=6bdb7e39caf48f53 > > dashboard link: https://syzkaller.appspot.com/bug?extid=c32502fd255cb3a44048 > > compiler: gcc (GCC) 10.1.0-syz 20200507 > > > > Unfortunately, I don't have any reproducer for this issue yet. > > > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > > Reported-by: syzbot+c32502fd255cb3a44...@syzkaller.appspotmail.com > > > > = > > WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected > > 5.9.0-rc5-next-20200916-syzkaller #0 Not tainted > > - > > syz-executor.1/13775 [HC0[0]:SC0[4]:HE1:SE0] is trying to acquire: > > 88805ee15a58 (>xfrm.xfrm_policy_lock){+...}-{2:2}, at: > > spin_lock_bh include/linux/spinlock.h:359 [inline] > > 88805ee15a58 (>xfrm.xfrm_policy_lock){+...}-{2:2}, at: > > xfrm_policy_delete+0x3a/0x90 net/xfrm/xfrm_policy.c:2236 > > > > and this task is already holding: > > 8880a811b1e0 (k-slock-AF_INET6){+.-.}-{2:2}, at: spin_lock > > include/linux/spinlock.h:354 [inline] > > 8880a811b1e0 (k-slock-AF_INET6){+.-.}-{2:2}, at: tcp_close+0x6e3/0x1200 > > net/ipv4/tcp.c:2503 > > which would create a new lock dependency: > > (k-slock-AF_INET6){+.-.}-{2:2} -> (>xfrm.xfrm_policy_lock){+...}-{2:2} > > > > but this new dependency connects a SOFTIRQ-irq-safe lock: > > (k-slock-AF_INET6){+.-.}-{2:2} > > > > ... 
which became SOFTIRQ-irq-safe at:
> >   lock_acquire+0x1f2/0xaa0 kernel/locking/lockdep.c:5398
> >   __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
> >   _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
> >   spin_lock include/linux/spinlock.h:354 [inline]
> >   sctp_rcv+0xd96/0x2d50 net/sctp/input.c:231
>
> What's going on with all these bogus lockdep reports?
>
> These are two completely different locks, one is for TCP and the
> other is for SCTP. Why is lockdep suddenly becoming confused about
> this?
>
> FWIW this flood of bogus reports started on 16/Sep.

FWIW one of the dups of this issue was bisected to:

commit 1909760f5fc3f123e47b4e24e0ccdc0fc8f3f106
Author: Ahmed S. Darwish
Date: Fri Sep 4 15:32:31 2020 +

    seqlock: PREEMPT_RT: Do not starve seqlock_t writers

Can it be related?

A number of other new lockdep reports were bisected to the following one, which was the true, intentional root cause of those, but it looks a bit too old to cause the xfrm reports:

commit f08e3888574d490b31481eef6d84c61bedba7a47
Author: Boqun Feng
Date: Fri Aug 7 07:42:30 2020 +

    lockdep: Fix recursive read lock related safe->unsafe detection
Re: inconsistent lock state in xfrm_policy_lookup_inexact_addr
On Wed, Sep 16, 2020 at 01:51:14AM -0700, syzbot wrote: > Hello, > > syzbot found the following issue on: > > HEAD commit:6b02addb Add linux-next specific files for 20200915 > git tree: linux-next > console output: https://syzkaller.appspot.com/x/log.txt?x=15888efd90 > kernel config: https://syzkaller.appspot.com/x/.config?x=7086d0e9e44d4a14 > dashboard link: https://syzkaller.appspot.com/bug?extid=f4d0f0f7c860608404c4 > compiler: gcc (GCC) 10.1.0-syz 20200507 > > Unfortunately, I don't have any reproducer for this issue yet. > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > Reported-by: syzbot+f4d0f0f7c86060840...@syzkaller.appspotmail.com > > > WARNING: inconsistent lock state > 5.9.0-rc5-next-20200915-syzkaller #0 Not tainted > > inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-R} usage. > kworker/1:1/23 [HC0[0]:SC1[1]:HE0:SE0] takes: > 8880a6ff5f28 (>seqcount#12){+.+-}-{0:0}, at: > xfrm_policy_lookup_inexact_addr+0x57/0x200 net/xfrm/xfrm_policy.c:1909 > {SOFTIRQ-ON-W} state was registered at: > lock_acquire+0x1f2/0xaa0 kernel/locking/lockdep.c:5398 > write_seqcount_t_begin_nested include/linux/seqlock.h:509 [inline] > write_seqcount_t_begin include/linux/seqlock.h:535 [inline] > write_seqlock include/linux/seqlock.h:883 [inline] > xfrm_set_spdinfo+0x302/0x660 net/xfrm/xfrm_user.c:1185 This is >xfrm.policy_hthresh.lock. ... 
> stack backtrace: > CPU: 1 PID: 23 Comm: kworker/1:1 Not tainted > 5.9.0-rc5-next-20200915-syzkaller #0 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > Workqueue: wg-crypt-wg0 wg_packet_tx_worker > Call Trace: > > __dump_stack lib/dump_stack.c:77 [inline] > dump_stack+0x198/0x1fb lib/dump_stack.c:118 > print_usage_bug kernel/locking/lockdep.c:3694 [inline] > valid_state kernel/locking/lockdep.c:3705 [inline] > mark_lock_irq kernel/locking/lockdep.c:3908 [inline] > mark_lock.cold+0x13/0x10d kernel/locking/lockdep.c:4375 > mark_usage kernel/locking/lockdep.c:4252 [inline] > __lock_acquire+0x1402/0x55d0 kernel/locking/lockdep.c:4750 > lock_acquire+0x1f2/0xaa0 kernel/locking/lockdep.c:5398 > seqcount_lockdep_reader_access+0x139/0x1a0 include/linux/seqlock.h:103 > xfrm_policy_lookup_inexact_addr+0x57/0x200 net/xfrm/xfrm_policy.c:1909 And this is a completely different seqlock. Again lockdep is creating a bogus report by lumping two unrelated locks (but of the same type, in this case seqlock) together. Cheers, -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: possible deadlock in xfrm_user_rcv_msg
#syz dup: inconsistent lock state in xfrm_policy_lookup_inexact_addr -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: possible deadlock in xfrm_policy_delete
On Sun, Sep 20, 2020 at 01:22:14PM -0700, syzbot wrote:
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit:    5fa35f24 Add linux-next specific files for 20200916
> git tree:       linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=1122e2d990
> kernel config:  https://syzkaller.appspot.com/x/.config?x=6bdb7e39caf48f53
> dashboard link: https://syzkaller.appspot.com/bug?extid=c32502fd255cb3a44048
> compiler:       gcc (GCC) 10.1.0-syz 20200507
>
> Unfortunately, I don't have any reproducer for this issue yet.
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+c32502fd255cb3a44...@syzkaller.appspotmail.com
>
> =
> WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
> 5.9.0-rc5-next-20200916-syzkaller #0 Not tainted
> -
> syz-executor.1/13775 [HC0[0]:SC0[4]:HE1:SE0] is trying to acquire:
> 88805ee15a58 (>xfrm.xfrm_policy_lock){+...}-{2:2}, at: spin_lock_bh include/linux/spinlock.h:359 [inline]
> 88805ee15a58 (>xfrm.xfrm_policy_lock){+...}-{2:2}, at: xfrm_policy_delete+0x3a/0x90 net/xfrm/xfrm_policy.c:2236
>
> and this task is already holding:
> 8880a811b1e0 (k-slock-AF_INET6){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:354 [inline]
> 8880a811b1e0 (k-slock-AF_INET6){+.-.}-{2:2}, at: tcp_close+0x6e3/0x1200 net/ipv4/tcp.c:2503
> which would create a new lock dependency:
> (k-slock-AF_INET6){+.-.}-{2:2} -> (>xfrm.xfrm_policy_lock){+...}-{2:2}
>
> but this new dependency connects a SOFTIRQ-irq-safe lock:
> (k-slock-AF_INET6){+.-.}-{2:2}
>
> ... which became SOFTIRQ-irq-safe at:
>   lock_acquire+0x1f2/0xaa0 kernel/locking/lockdep.c:5398
>   __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
>   _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
>   spin_lock include/linux/spinlock.h:354 [inline]
>   sctp_rcv+0xd96/0x2d50 net/sctp/input.c:231

What's going on with all these bogus lockdep reports?

These are two completely different locks, one is for TCP and the other is for SCTP. Why is lockdep suddenly becoming confused about this?

FWIW this flood of bogus reports started on 16/Sep.

Cheers,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH] mmap_lock: add tracepoints around lock acquisition
On Thu, Sep 24, 2020 at 12:09 AM Steven Rostedt wrote:
>
> On Wed, 23 Sep 2020 18:04:17 +0800
> Yafang Shao wrote:
>
> > > What you can do, and what we have done is the following:
> > >
> > > (see include/linux/page_ref.h)
> > >
> > > #ifdef CONFIG_TRACING
> > > extern struct tracepoint __tracepoint_mmap_lock_start_locking;
> > > extern struct tracepoint __tracepoint_mmap_lock_acquire_returned;
> > >
> > > #define mmap_lock_tracepoint_active(t) static_key_false(&(__tracepoint_mmap_lock_##t).key)
> > >
> > > #else
> > > #define mmap_lock_tracepoint_active(t) false
> > > #endif
> > >
> > > static inline void mmap_write_lock(struct mm_struct *mm)
> > > {
> > >         if (mmap_lock_tracepoint_active(start_locking))
> > >                 mmap_lock_start_trace_wrapper();
> > >         down_write(>mmap_lock);
> > >         if (mmap_lock_tracepoint_active(acquire_returned))
> > >                 mmap_lock_acquire_trace_wrapper();
> > > }
> > >
> > > -- Steve
> >
> > Great!
> >
> > Thanks Steve.
>
> If the above becomes useful, I may just add helper functions into a
> header file that you can include. Perhaps call it: tracepoint_active()
> and you need to pass in as an argument the name of the tracepoint.
>

Yes, please. The new helper would be useful.

--
Thanks
Yafang
Re: [GIT PULL rcu-tasks-trace] 50x speedup for synchronize_rcu_tasks_trace()
On Wed, Sep 23, 2020 at 07:46:15PM -0700, Alexei Starovoitov wrote:
> On Tue, Sep 22, 2020 at 9:25 AM Paul E. McKenney wrote:
> >
> > Hello, Alexei,
> >
> > This pull request contains eight commits that speed up RCU Tasks Trace
> > grace periods by a factor of 50, fix a few race conditions exposed
> > by this speedup, and clean up a couple of minor issues. These have
> > been exposed to 0day and -next testing, and have passed well over 1,000
> > hours of rcutorture testing, some of which has contained ad-hoc changes
> > to further increase race probabilities. So they should be solid!
> > (Famous last words...)
> >
> > I would normally have sent this series up through -tip, but as we
> > discussed, going up through the BPF and networking trees provides the
> > needed exposure to real-world testing of these changes. Please note
> > that the first patch is already in mainline, but given identical SHA-1
> > commit IDs, git should have no problem figuring this out. I will also
> > be retaining these commits in -rcu in order to continue exposing them
> > to rcutorture testing, but again the identical SHA-1 commit IDs will
> > make everything work out.
>
> Pulled into bpf-next. Thanks a lot.
>
> Also confirming 50x speedup.
> Really nice to see that selftests/bpf are now fast again.
>
> Not only all bpf developers will be running these patches now,
> but the bpf CI system will be exercising them as well.

Sounds very good, thank you! If problems arise, you know where to find me. As do most of the people running these patches, I would guess. ;-)

Thanx, Paul
Re: [PATCH net-next v3 1/2] net: dsa: untag the bridge pvid from rx skbs
On 9/23/2020 4:08 PM, Vladimir Oltean wrote: On Wed, Sep 23, 2020 at 03:59:46PM -0700, Florian Fainelli wrote: On 9/23/20 3:58 PM, Vladimir Oltean wrote: On Wed, Sep 23, 2020 at 03:54:59PM -0700, Florian Fainelli wrote: Not having much luck with using __vlan_find_dev_deep_rcu() for a reason I don't understand: we trip over the proto value being neither of the two supported Ethertypes and hit the BUG(). + upper_dev = __vlan_find_dev_deep_rcu(br, htons(proto), vid); + if (upper_dev) + return skb; Any ideas? Damn... Yes, of course, the skb->protocol is still ETH_P_XDSA which is where eth_type_trans() on the master left it. proto was obtained from br_vlan_get_proto() a few lines above, and br_vlan_get_proto() just returns br->vlan_proto which defaults to htons(ETH_P_8021Q) from br_vlan_init(). This is not skb->protocol that we are looking at AFAICT. Ok, my mistake. So what is the value of proto in vlan_proto_idx when it fails? To me, the call path looks pretty pass-through for vlan_proto. At the time we crash the proto value is indeed ETH_P_XDSA, but it is not because of the __vlan_find_dev_deep_rcu() call as I was misled by the traces I was looking at (on ARMv7 the LR was not pointing where I was expecting it to), it is because of the following call trace: netif_receive_skb_list_internal -> __netif_receive_skb_list_core -> __netif_receive_skb_core -> vlan_do_receive() That function does use skb->vlan_proto to determine the VLAN group, and at that point we have not set it but we did inherit skb->protocol instead, which is ETH_P_XDSA. The following does work though, tested with both br0 and a br0.1 upper: + upper_dev = __vlan_find_dev_deep_rcu(br, htons(proto), vid); + if (upper_dev) { + skb->vlan_proto = vlan_dev_vlan_proto(upper_dev); + return skb; + } I should have re-tested v2 and v3 with a bridge upper but I did not, otherwise I would have caught that. If that sounds acceptable to you as well, I will submit that tomorrow.
Let me know what you think about the 802.1Q upper of a physical switch port in the other email. -- Florian
linux-next: manual merge of the vfio tree with the s390 tree
Hi all, Today's linux-next merge of the vfio tree got a conflict in: arch/s390/pci/pci_bus.c between commit: abb95b7550f8 ("s390/pci: consolidate SR-IOV specific code") from the s390 tree and commit: 08b6e22b850c ("s390/pci: Mark all VFs as not implementing PCI_COMMAND_MEMORY") from the vfio tree. I fixed it up (see below) and can carry the fix as necessary. This is now fixed as far as linux-next is concerned, but any non trivial conflicts should be mentioned to your upstream maintainer when your tree is submitted for merging. You may also want to consider cooperating with the maintainer of the conflicting tree to minimise any particularly complex conflicts. -- Cheers, Stephen Rothwell diff --cc arch/s390/pci/pci_bus.c index 0c0db7c3a404,c93486a9989b.. --- a/arch/s390/pci/pci_bus.c +++ b/arch/s390/pci/pci_bus.c @@@ -135,9 -197,10 +135,10 @@@ void pcibios_bus_add_device(struct pci_ * With pdev->no_vf_scan the common PCI probing code does not * perform PF/VF linking. */ - if (zdev->vfn) + if (zdev->vfn) { - zpci_bus_setup_virtfn(zdev->zbus, pdev, zdev->vfn); + zpci_iov_setup_virtfn(zdev->zbus, pdev, zdev->vfn); - + pdev->no_command_memory = 1; + } } static int zpci_bus_add_device(struct zpci_bus *zbus, struct zpci_dev *zdev)
Re: possible deadlock in xfrm_policy_lookup_bytype
#syz dup: inconsistent lock state in xfrm_policy_lookup_inexact_addr -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
RE: [PATCH] PCI: dwc: Added link up check in map_bus of dw_child_pcie_ops
Hi Rob, Thanks a lot for your comments! > -Original Message- > From: Rob Herring > Sent: 2020年9月18日 23:28 > To: Z.q. Hou > Cc: linux-kernel@vger.kernel.org; PCI ; Lorenzo > Pieralisi ; Bjorn Helgaas > ; Gustavo Pimentel > ; Michael Walle ; > Ard Biesheuvel > Subject: Re: [PATCH] PCI: dwc: Added link up check in map_bus of > dw_child_pcie_ops > > On Fri, Sep 18, 2020 at 5:02 AM Z.q. Hou wrote: > > > > Hi Rob, > > > > Thanks a lot for your comments! > > > > > -Original Message- > > > From: Rob Herring > > > Sent: 2020年9月17日 4:29 > > > To: Z.q. Hou > > > Cc: linux-kernel@vger.kernel.org; PCI ; > > > Lorenzo Pieralisi ; Bjorn Helgaas > > > ; Gustavo Pimentel > > > ; Michael Walle > ; > > > Ard Biesheuvel > > > Subject: Re: [PATCH] PCI: dwc: Added link up check in map_bus of > > > dw_child_pcie_ops > > > > > > On Tue, Sep 15, 2020 at 11:49 PM Zhiqiang Hou > > > > wrote: > > > > > > > > From: Hou Zhiqiang > > > > > > > > On NXP Layerscape platforms, it results in SError in the > > > > enumeration of the PCIe controller, which is not connecting with > > > > an Endpoint device. And it doesn't make sense to enumerate the > > > > Endpoints when the PCIe link is down. So this patch added the link > > > > up check to avoid to fire configuration transactions on link down bus. > > > > > > Michael reported the same issue as well. > > > > > > What happens if the link goes down between the check and the access? > > > > This patch cannot cover this case, and will get the SError. > > But I think it makes sense to avoid firing transactions on link down bus. > > That's impossible to do without a race even in h/w. Agree. > > > > It's a racy check. I'd like to find an alternative solution. It's > > > even worse if Layerscape is used in ECAM mode. I looked at the EDK2 > > > setup for layerscape[1] and it looks like root ports are just skipped if > > > link > is down. > > > Maybe a link down just never happens once up, but if so, then we > > > only need to check it once and fail probe. 
> > > > Many customers connect the FPGA Endpoint, which may establish PCIe > > link after the PCIe enumeration and then rescan the PCIe bus, so I > > think it should not exit the probe of root port even if there is not link up > during enumeration. > > That's a good reason. I want to unify the behavior here as it varies per > platform currently and wasn't sure which way to go. > > > > > I've dug into this a bit more and am curious about the PCIE_ABSERR > > > register setting which is set to: > > > > > > #define PCIE_ABSERR_SETTING 0x9401 /* Forward error of non-posted > > > request */ > > > > > > It seems to me this is not what we want at least for config > > > accesses, but commit 84d897d6993 where this was added seems to say > > > otherwise. Is it not possible to configure the response per access type? > > > > Thanks a lot for your investigation! > > The story is like this: Some customers worry about these silent error > > (DWC PCIe IP won't forward the error of outbound non-post request by > > default), so we were pushed to enable the error forwarding to AXI in > > the commit > > 84d897d6993 as you saw. But it cannot differentiate the config > > transactions from the MEM_rd, except the Vendor ID access, which is > > controlled by a separate bit and it was set to not forward error of access > of Vendor ID. > > So we think it's okay to enable the error forwarding, the SError > > should not occur, because after the enumeration it won't access the > non-existent functions. > > We've rejected upstream support for platforms aborting on config > accesses[1]. I think there's clear consensus that aborting is the wrong > behavior. > > Do MEM_wr errors get forwarded? Seems like that would be enough. Also, > wouldn't page faults catch most OOB accesses anyways? You need things > page aligned anyways with an IOMMU and doing userspace access or guest > assignment. Yes, errors of MEM_wr can be forwarded. 
> > Here's another idea, how about only enabling forwarding errors if the link is > up? It really would need to be configured any time the link state changes > rather than just at probe. I'm not sure if you have a way to disable it on > link > down though. Dug deeper into this issue and found that the setting of not forwarding the error of a non-existent Vendor ID access depends on the link partner: 1. When there is a link partner (namely link up), it will return 0xFFFFFFFF when reading a non-existent function's Vendor ID and won't forward an error to AXI. 2. When there is no link partner (link down), it will forward the error of reading a non-existent function's Vendor ID to AXI and result in SError. I think this is a DWC PCIe IP specific issue but I have not got feedback from the design team. I'm thinking of disabling this error forwarding just like other platforms, since when these errors (UR, CA and CT) are detected, the AER driver can also report the error and try to recover. Thanks, Zhiqiang > > Rob
Re: [RFC 2/2] printk: Add more information about the printk caller
On Wed, Sep 23, 2020 at 03:56:17PM +0200, Petr Mladek wrote: ... > > -static inline u32 printk_caller_id(void) > +static enum printk_caller_ctx get_printk_caller_ctx(void) > +{ > + if (in_nmi()) > + return printk_ctx_nmi; > + > + if (in_irq()) > + return printk_ctx_hardirq; > + > + if (in_softirq()) > + return printk_ctx_softirq; > + > + return printk_ctx_task; > +} > + in_softirq() here will be true for both softirq contexts *and* BH-disabled regions. Did you mean in_serving_softirq() instead? Thanks, -- Ahmed S. Darwish Linutronix GmbH
ARC: allmodconfig: Error: inappropriate arguments for opcode 'mpyd'
arc:allmodconfig build failed with gcc-8, gcc-9 and gcc-10 on Linus's mainline tree. Build log: make -sk KBUILD_BUILD_USER=TuxBuild -C/linux ARCH=arc CROSS_COMPILE=arc-elf32- HOSTCC=gcc CC="sccache arc-elf32-gcc" O=build allmodconfig make -sk KBUILD_BUILD_USER=TuxBuild -C/linux -j16 ARCH=arc CROSS_COMPILE=arc-elf32- HOSTCC=gcc CC="sccache arc-elf32-gcc" O=build uImage In file included from :31: ./usr/include/sound/hdspm.h:48:2: error: unknown type name ‘__u32’ __u32 input_peaks[64]; ^ ./usr/include/sound/hdspm.h:49:2: error: unknown type name ‘__u32’ __u32 playback_peaks[64]; ^ ./usr/include/sound/hdspm.h:50:2: error: unknown type name ‘__u32’ __u32 output_peaks[64]; ^ ./usr/include/sound/hdspm.h:52:2: error: unknown type name ‘__u64’ __u64 input_rms[64]; ^ ./usr/include/sound/hdspm.h:53:2: error: unknown type name ‘__u64’ __u64 playback_rms[64]; ^ ./usr/include/sound/hdspm.h:54:2: error: unknown type name ‘__u64’ __u64 output_rms[64]; ^ ./usr/include/sound/hdspm.h:56:2: error: unknown type name ‘__u8’ __u8 speed; /* enum {ss, ds, qs} */ ^~~~ ./usr/include/sound/hdspm.h:157:2: error: unknown type name ‘__u8’ __u8 card_type; /* enum hdspm_io_type */ ^~~~ ./usr/include/sound/hdspm.h:160:2: error: unknown type name ‘__u64’ __u64 card_clock; ^ ./usr/include/sound/hdspm.h:161:2: error: unknown type name ‘__u32’ __u32 master_period; ^ ./usr/include/sound/hdspm.h:165:4: error: unknown type name ‘__u8’ __u8 sync_wc; /* enum hdspm_sync */ ^~~~ ./usr/include/sound/hdspm.h:166:4: error: unknown type name ‘__u8’ __u8 sync_madi; /* enum hdspm_sync */ ^~~~ ./usr/include/sound/hdspm.h:167:4: error: unknown type name ‘__u8’ __u8 sync_tco; /* enum hdspm_sync */ ^~~~ ./usr/include/sound/hdspm.h:168:4: error: unknown type name ‘__u8’ __u8 sync_in; /* enum hdspm_sync */ ^~~~ ./usr/include/sound/hdspm.h:169:4: error: unknown type name ‘__u8’ __u8 madi_input; /* enum hdspm_madi_input */ ^~~~ ./usr/include/sound/hdspm.h:170:4: error: unknown type name ‘__u8’ __u8 channel_format; /* enum 
hdspm_madi_channel_format */ ^~~~ ./usr/include/sound/hdspm.h:171:4: error: unknown type name ‘__u8’ __u8 frame_format; /* enum hdspm_madi_frame_format */ ^~~~ ./usr/include/sound/hdspm.h:186:2: error: unknown type name ‘__u8’ __u8 card_type; /* enum hdspm_io_type */ ^~~~ make[3]: *** [../usr/include/Makefile:108: usr/include/sound/hdspm.hdrtest] Error 1 In file included from <command-line>:31: ./usr/include/sound/hdsp.h:39:2: error: unknown type name ‘__u32’ __u32 input_peaks[26]; ^ ./usr/include/sound/hdsp.h:40:2: error: unknown type name ‘__u32’ __u32 playback_peaks[26]; ^ ./usr/include/sound/hdsp.h:41:2: error: unknown type name ‘__u32’ __u32 output_peaks[28]; ^ ./usr/include/sound/hdsp.h:42:2: error: unknown type name ‘__u64’ __u64 input_rms[26]; ^ ./usr/include/sound/hdsp.h:43:2: error: unknown type name ‘__u64’ __u64 playback_rms[26]; ^ ./usr/include/sound/hdsp.h:45:2: error: unknown type name ‘__u64’ __u64 output_rms[26]; ^ make[3]: *** [../usr/include/Makefile:108: usr/include/sound/hdsp.hdrtest] Error 1 {standard input}: Assembler messages: {standard input}:9360: Error: inappropriate arguments for opcode 'mpyd' make[3]: *** [../scripts/Makefile.build:283: kernel/sched/core.o] Error 1 ../arch/arc/kernel/kgdb.c: In function ‘kgdb_arch_handle_exception’: ../arch/arc/kernel/kgdb.c:141:6: warning: this statement may fall through [-Wimplicit-fallthrough=] if (kgdb_hex2long(&ptr, &addr)) ^ ../arch/arc/kernel/kgdb.c:144:2: note: here case 'D': ^~~~ In file included from ./usr/include/linux/netfilter_ipv6/ip6_tables.h:21, from <command-line>:31: ./usr/include/linux/if.h:28:10: fatal error: sys/socket.h: No such file or directory #include <sys/socket.h> /* for struct sockaddr. */ ^~ compilation terminated.
make[3]: *** [../usr/include/Makefile:108: usr/include/linux/netfilter_ipv6/ip6_tables.hdrtest] Error 1 ../arch/arc/kernel/perf_event.c: In function ‘arc_pmu_device_probe’: ../arch/arc/kernel/perf_event.c:645:3: warning: ignoring return value of ‘request_percpu_irq’, declared with attribute warn_unused_result [-Wunused-result] request_percpu_irq(irq, arc_pmu_intr, "ARC perf counters", ^~ this_cpu_ptr(&arc_pmu_cpu)); ~~~ In file included from ./usr/include/linux/if_bonding.h:47, from <command-line>:31: ./usr/include/linux/if.h:28:10: fatal error: sys/socket.h: No such file or directory #include <sys/socket.h> /* for struct sockaddr. */ ^~ compilation terminated. make[3]: *** [../usr/include/Makefile:108: usr/include/linux/if_bonding.hdrtest] Error 1 In file included from ./usr/include/linux/netfilter_ipv4/ip_tables.h:21, from <command-line>:31: ./usr/include/linux/if.h:28:10: fatal error: sys/socket.h: No such file or directory
[PATCH] alpha: switch defconfig from the legacy ide driver to libata
Switch from the soon to be removed legacy ide driver to libata. Signed-off-by: Christoph Hellwig --- arch/alpha/configs/defconfig | 12 +--- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/arch/alpha/configs/defconfig b/arch/alpha/configs/defconfig index 6293675db1644a..387174e2422f6b 100644 --- a/arch/alpha/configs/defconfig +++ b/arch/alpha/configs/defconfig @@ -26,13 +26,11 @@ CONFIG_PNP=y CONFIG_ISAPNP=y CONFIG_BLK_DEV_FD=y CONFIG_BLK_DEV_LOOP=m -CONFIG_IDE=y -CONFIG_BLK_DEV_IDECD=y -CONFIG_IDE_GENERIC=y -CONFIG_BLK_DEV_GENERIC=y -CONFIG_BLK_DEV_ALI15X3=y -CONFIG_BLK_DEV_CMD64X=y -CONFIG_BLK_DEV_CY82C693=y +CONFIG_ATA=y +CONFIG_ATA_GENERIC=y +CONFIG_PATA_ALI=y +CONFIG_PATA_CMD64X=y +CONFIG_PATA_CYPRESS=y CONFIG_SCSI=y CONFIG_BLK_DEV_SD=y CONFIG_BLK_DEV_SR=y -- 2.28.0
Re: [PATCH v5 5/6] dt-bindings: dw-apb-ictl: convert to json-schema
On 2020/9/24 11:26, Leizhen (ThunderTown) wrote: > > > On 2020/9/24 4:49, Rob Herring wrote: >> On Fri, Sep 18, 2020 at 07:22:01PM +0800, Zhen Lei wrote: >>> Convert the Synopsys DesignWare APB interrupt controller (dw_apb_ictl) >>> binding to DT schema format using json-schema. >>> >>> Signed-off-by: Zhen Lei >>> --- >>> .../interrupt-controller/snps,dw-apb-ictl.txt | 43 - >>> .../interrupt-controller/snps,dw-apb-ictl.yaml | 75 >>> ++ >>> 2 files changed, 75 insertions(+), 43 deletions(-) >>> delete mode 100644 >>> Documentation/devicetree/bindings/interrupt-controller/snps,dw-apb-ictl.txt >>> create mode 100644 >>> Documentation/devicetree/bindings/interrupt-controller/snps,dw-apb-ictl.yaml >>> >>> diff --git >>> a/Documentation/devicetree/bindings/interrupt-controller/snps,dw-apb-ictl.txt >>> >>> b/Documentation/devicetree/bindings/interrupt-controller/snps,dw-apb-ictl.txt >>> deleted file mode 100644 >>> index 2db59df9408f4c6..000 >>> --- >>> a/Documentation/devicetree/bindings/interrupt-controller/snps,dw-apb-ictl.txt >>> +++ /dev/null >>> @@ -1,43 +0,0 @@ >>> -Synopsys DesignWare APB interrupt controller (dw_apb_ictl) >>> - >>> -Synopsys DesignWare provides interrupt controller IP for APB known as >>> -dw_apb_ictl. The IP is used as secondary interrupt controller in some SoCs >>> with >>> -APB bus, e.g. Marvell Armada 1500. It can also be used as primary interrupt >>> -controller in some SoCs, e.g. Hisilicon SD5203. 
>>> - >>> -Required properties: >>> -- compatible: shall be "snps,dw-apb-ictl" >>> -- reg: physical base address of the controller and length of memory mapped >>> - region starting with ENABLE_LOW register >>> -- interrupt-controller: identifies the node as an interrupt controller >>> -- #interrupt-cells: number of cells to encode an interrupt-specifier, >>> shall be 1 >>> - >>> -Additional required property when it's used as secondary interrupt >>> controller: >>> -- interrupts: interrupt reference to primary interrupt controller >>> - >>> -The interrupt sources map to the corresponding bits in the interrupt >>> -registers, i.e. >>> -- 0 maps to bit 0 of low interrupts, >>> -- 1 maps to bit 1 of low interrupts, >>> -- 32 maps to bit 0 of high interrupts, >>> -- 33 maps to bit 1 of high interrupts, >>> -- (optional) fast interrupts start at 64. >>> - >>> -Example: >>> - /* dw_apb_ictl is used as secondary interrupt controller */ >>> - aic: interrupt-controller@3000 { >>> - compatible = "snps,dw-apb-ictl"; >>> - reg = <0x3000 0xc00>; >>> - interrupt-controller; >>> - #interrupt-cells = <1>; >>> - interrupt-parent = <>; >>> - interrupts = ; >>> - }; >>> - >>> - /* dw_apb_ictl is used as primary interrupt controller */ >>> - vic: interrupt-controller@1013 { >>> - compatible = "snps,dw-apb-ictl"; >>> - reg = <0x1013 0x1000>; >>> - interrupt-controller; >>> - #interrupt-cells = <1>; >>> - }; >>> diff --git >>> a/Documentation/devicetree/bindings/interrupt-controller/snps,dw-apb-ictl.yaml >>> >>> b/Documentation/devicetree/bindings/interrupt-controller/snps,dw-apb-ictl.yaml >>> new file mode 100644 >>> index 000..70c12979c843bf0 >>> --- /dev/null >>> +++ >>> b/Documentation/devicetree/bindings/interrupt-controller/snps,dw-apb-ictl.yaml >>> @@ -0,0 +1,75 @@ >>> +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) >>> +%YAML 1.2 >>> +--- >>> +$id: >>> http://devicetree.org/schemas/interrupt-controller/snps,dw-apb-ictl.yaml# >>> +$schema: 
http://devicetree.org/meta-schemas/core.yaml# >>> + >>> +title: Synopsys DesignWare APB interrupt controller (dw_apb_ictl) >>> + >>> +maintainers: >>> + - Marc Zyngier >> >> Usually this would be an owner for this IP block, not the subsystem >> maintainer. > > OK, I will change it to the author of the file "snps,dw-apb-ictl.txt". > > >> >>> + >>> +description: >>> + Synopsys DesignWare provides interrupt controller IP for APB known as >>> + dw_apb_ictl. The IP is used as secondary interrupt controller in some >>> SoCs >>> + with APB bus, e.g. Marvell Armada 1500. It can also be used as primary >>> + interrupt controller in some SoCs, e.g. Hisilicon SD5203. >>> + >>> +allOf: >>> + - $ref: /schemas/interrupt-controller.yaml# >> >> You can drop this, it's already applied based on node name. > But if we drop this, the "snps,dw-apb-ictl.yaml" can not require that the > node name > must match '^interrupt-controller(@[0-9a-f,]+)*$'. The problem of Patch 6/6 > was > discovered by this. > >> >>> + >>> +properties: >>> + compatible: >>> +const: snps,dw-apb-ictl >>> + >>> + interrupt-controller: true >>> + >>> + reg: >>> +description: >>> + Physical base address of the controller and length of memory mapped >>> + region starting with ENABLE_LOW register. >> >> Need to define how many reg regions (maxItems: 1). > > OK, I
Re: WARNING: SOFTIRQ-READ-safe -> SOFTIRQ-READ-unsafe lock order detected
#syz dup: inconsistent lock state in xfrm_policy_lookup_inexact_addr
Re: [PATCH v2] PCI: Cadence: Add quirk for Gen2 controller to do autonomous speed change.
Hi Nadeem, On 24/09/20 12:04 am, Nadeem Athani wrote: > Cadence controller will not initiate autonomous speed change if > strapped as Gen2. The Retrain bit is set as a quirk to trigger > this speed change. > > Signed-off-by: Nadeem Athani > --- > drivers/pci/controller/cadence/pcie-cadence-host.c | 14 ++ > drivers/pci/controller/cadence/pcie-cadence.h | 15 +++ > 2 files changed, 29 insertions(+) > > diff --git a/drivers/pci/controller/cadence/pcie-cadence-host.c > b/drivers/pci/controller/cadence/pcie-cadence-host.c > index 4550e0d469ca..a2317614268d 100644 > --- a/drivers/pci/controller/cadence/pcie-cadence-host.c > +++ b/drivers/pci/controller/cadence/pcie-cadence-host.c > @@ -83,6 +83,9 @@ static int cdns_pcie_host_init_root_port(struct > cdns_pcie_rc *rc) > struct cdns_pcie *pcie = >pcie; > u32 value, ctrl; > u32 id; > + u32 link_cap = CDNS_PCIE_LINK_CAP_OFFSET; > + u8 sls; > + u16 lnk_ctl; > > /* >* Set the root complex BAR configuration register: > @@ -111,6 +114,17 @@ static int cdns_pcie_host_init_root_port(struct > cdns_pcie_rc *rc) > if (rc->device_id != 0x) > cdns_pcie_rp_writew(pcie, PCI_DEVICE_ID, rc->device_id); > > + /* Quirk to enable autonomous speed change for GEN2 controller */ > + /* Reading Supported Link Speed value */ > + sls = PCI_EXP_LNKCAP_SLS & > + cdns_pcie_rp_readb(pcie, link_cap + PCI_EXP_LNKCAP); > + if (sls == PCI_EXP_LNKCAP_SLS_5_0GB) { > + /* Since this a Gen2 controller, set Retrain Link(RL) bit */ > + lnk_ctl = cdns_pcie_rp_readw(pcie, link_cap + PCI_EXP_LNKCTL); > + lnk_ctl |= PCI_EXP_LNKCTL_RL; > + cdns_pcie_rp_writew(pcie, link_cap + PCI_EXP_LNKCTL, lnk_ctl); > + } Is this workaround required for all Cadence controller? If not, enable this workaround only for versions which doesn't do autonomous speed change. I think this workaround should also be applied only after checking for link status (cdns_pcie_link_up()). And this is also applicable for GEN3/GEN4 controller. 
So the check should be to see the capability of the connected PCIe device and not the controller itself. Thanks Kishon
Re: inconsistent lock state in xfrm_user_rcv_msg
#syz dup: inconsistent lock state in xfrm_policy_lookup_inexact_addr
Re: possible deadlock in xfrm_policy_lookup_inexact_addr
#syz dup: inconsistent lock state in xfrm_policy_lookup_inexact_addr
[PATCH v4 6/6] kasan: update documentation for generic kasan
Generic KASAN now also records the last two timer and workqueue call stacks and prints them in the KASAN report, so the documentation needs to be updated accordingly. Signed-off-by: Walter Wu Suggested-by: Marco Elver Acked-by: Marco Elver Reviewed-by: Dmitry Vyukov Reviewed-by: Andrey Konovalov Cc: Andrey Ryabinin Cc: Alexander Potapenko Cc: Jonathan Corbet --- v3: - Thanks for Marco suggestion --- Documentation/dev-tools/kasan.rst | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/Documentation/dev-tools/kasan.rst b/Documentation/dev-tools/kasan.rst index 38fd5681fade..698ccb65e634 100644 --- a/Documentation/dev-tools/kasan.rst +++ b/Documentation/dev-tools/kasan.rst @@ -190,8 +190,9 @@ function calls GCC directly inserts the code to check the shadow memory. This option significantly enlarges kernel but it gives x1.1-x2 performance boost over outline instrumented kernel. -Generic KASAN prints up to 2 call_rcu() call stacks in reports, the last one -and the second to last. +Generic KASAN also reports the last 2 call stacks to creation of work that +potentially has access to an object. Call stacks for the following are shown: +call_rcu(), timer and workqueue queuing. Software tag-based KASAN -- 2.18.0
[PATCH v4 3/6] kasan: print timer and workqueue stack
The two aux_stack[] slots are reused to record the call_rcu() call stack, the timer init call stack, and the work-enqueue call stacks, so change the auxiliary stack titles printed in the KASAN report to a common wording. Signed-off-by: Walter Wu Suggested-by: Marco Elver Acked-by: Marco Elver Reviewed-by: Dmitry Vyukov Reviewed-by: Andrey Konovalov Cc: Andrey Ryabinin Cc: Alexander Potapenko --- v2: - Thanks for Marco suggestion. - We modify aux stack title name in KASAN report in order to print call_rcu()/timer/workqueue stack. --- mm/kasan/report.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/kasan/report.c b/mm/kasan/report.c index 4f49fa6cd1aa..886809d0a8dd 100644 --- a/mm/kasan/report.c +++ b/mm/kasan/report.c @@ -183,12 +183,12 @@ static void describe_object(struct kmem_cache *cache, void *object, #ifdef CONFIG_KASAN_GENERIC if (alloc_info->aux_stack[0]) { - pr_err("Last call_rcu():\n"); + pr_err("Last potentially related work creation:\n"); print_stack(alloc_info->aux_stack[0]); pr_err("\n"); } if (alloc_info->aux_stack[1]) { - pr_err("Second to last call_rcu():\n"); + pr_err("Second to last potentially related work creation:\n"); print_stack(alloc_info->aux_stack[1]); pr_err("\n"); } -- 2.18.0
[PATCH v4 2/6] workqueue: kasan: record workqueue stack
Record the last two work-enqueue call stacks so they can be printed in the KASAN report. This helps programmers track down use-after-free and double-free issues involving workqueues: from the report alone, users can see where the affected work was queued and judge whether that is the root cause, without having to enable debugobjects. Signed-off-by: Walter Wu Suggested-by: Marco Elver Acked-by: Marco Elver Acked-by: Tejun Heo Reviewed-by: Dmitry Vyukov Reviewed-by: Andrey Konovalov Cc: Andrey Ryabinin Cc: Alexander Potapenko Cc: Lai Jiangshan --- v2: - Thanks for Marco suggestion. - Remove unnecessary code - reuse kasan_record_aux_stack() and aux_stack to record timer and workqueue stack --- kernel/workqueue.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index c41c3c17b86a..5fea7dc9180f 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -1324,6 +1324,9 @@ static void insert_work(struct pool_workqueue *pwq, struct work_struct *work, { struct worker_pool *pool = pwq->pool; + /* record the work call stack in order to print it in KASAN reports */ + kasan_record_aux_stack(work); + /* we own @work, set data and link */ set_work_pwq(work, pwq, extra_flags); list_add_tail(&work->entry, head); -- 2.18.0
[PATCH v4 1/6] timer: kasan: record timer stack
When analyzing a use-after-free or double-free issue, recorded timer stacks help preserve usage history and give a hint about the affected code. Record the most recent two timer init calls in KASAN and print them on failure in the KASAN report. For timers it has turned out to be useful to record the stack trace of the timer init call: if the UAF root cause is in timer init, the KASAN report shows where the timer was registered, pointing at the root cause. Users don't need to enable DEBUG_OBJECTS_TIMERS, yet still have a chance to find the root cause. Signed-off-by: Walter Wu Suggested-by: Marco Elver Suggested-by: Thomas Gleixner Acked-by: Marco Elver Reviewed-by: Dmitry Vyukov Reviewed-by: Andrey Konovalov Cc: Andrey Ryabinin Cc: Alexander Potapenko Cc: John Stultz Cc: Thomas Gleixner Cc: Stephen Boyd --- v2: - Thanks for Marco and Thomas suggestion. - Remove unnecessary code and fix commit log - reuse kasan_record_aux_stack() and aux_stack to record timer and workqueue stack. --- kernel/time/timer.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/kernel/time/timer.c b/kernel/time/timer.c index a16764b0116e..1ed8f8aca7f5 100644 --- a/kernel/time/timer.c +++ b/kernel/time/timer.c @@ -796,6 +796,9 @@ static void do_init_timer(struct timer_list *timer, timer->function = func; timer->flags = flags | raw_smp_processor_id(); lockdep_init_map(&timer->lockdep_map, name, key, 0); + + /* record the timer stack in order to print it in KASAN report */ + kasan_record_aux_stack(timer); } /** -- 2.18.0
[PATCH v4 5/6] kasan: add tests for workqueue stack recording
Adds a test to verify workqueue stack recording and print it in KASAN report. The KASAN report was as follows(cleaned up slightly): BUG: KASAN: use-after-free in kasan_workqueue_uaf Freed by task 54: kasan_save_stack+0x24/0x50 kasan_set_track+0x24/0x38 kasan_set_free_info+0x20/0x40 __kasan_slab_free+0x10c/0x170 kasan_slab_free+0x10/0x18 kfree+0x98/0x270 kasan_workqueue_work+0xc/0x18 Last potentially related work creation: kasan_save_stack+0x24/0x50 kasan_record_wq_stack+0xa8/0xb8 insert_work+0x48/0x288 __queue_work+0x3e8/0xc40 queue_work_on+0xf4/0x118 kasan_workqueue_uaf+0xfc/0x190 Signed-off-by: Walter Wu Acked-by: Marco Elver Reviewed-by: Dmitry Vyukov Reviewed-by: Andrey Konovalov Cc: Andrey Ryabinin Cc: Alexander Potapenko Cc: Matthias Brugger --- v4: - testcase has merge conflict, so that rebase onto the KASAN-KUNIT --- lib/test_kasan_module.c | 30 ++ 1 file changed, 30 insertions(+) diff --git a/lib/test_kasan_module.c b/lib/test_kasan_module.c index 2e5e7be96955..c3a2d113e757 100644 --- a/lib/test_kasan_module.c +++ b/lib/test_kasan_module.c @@ -115,6 +115,35 @@ static noinline void __init kasan_timer_uaf(void) ((volatile struct timer_list *)timer)->expires; } +static noinline void __init kasan_workqueue_work(struct work_struct *work) +{ + kfree(work); +} + +static noinline void __init kasan_workqueue_uaf(void) +{ + struct workqueue_struct *workqueue; + struct work_struct *work; + + workqueue = create_workqueue("kasan_wq_test"); + if (!workqueue) { + pr_err("Allocation failed\n"); + return; + } + work = kmalloc(sizeof(struct work_struct), GFP_KERNEL); + if (!work) { + pr_err("Allocation failed\n"); + return; + } + + INIT_WORK(work, kasan_workqueue_work); + queue_work(workqueue, work); + destroy_workqueue(workqueue); + + pr_info("use-after-free on workqueue\n"); + ((volatile struct work_struct *)work)->data; +} + static int __init test_kasan_module_init(void) { /* @@ -126,6 +155,7 @@ static int __init test_kasan_module_init(void) copy_user_test(); 
kasan_rcu_uaf(); kasan_timer_uaf(); + kasan_workqueue_uaf(); kasan_restore_multi_shot(multishot); return -EAGAIN; -- 2.18.0
Re: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call
On 9/23/20 8:13 PM, Sinan Kaya wrote: On 9/23/2020 10:51 PM, Kuppuswamy, Sathyanarayanan wrote: I see. Can I assume that your system supports DPC? DPC is supposed to recover the link via dpc_reset_link(). Yes. But the affected device/drivers cleanup during error recovery is handled by hotplug handler. So we are facing issue when dealing with non hotplug capable ports. This is confusing. Why would hotplug driver be involved unless port supports hotplug and the link goes down? You said that DLLSC is only supported on hotplug capable ports. hotplug driver is *only* involved when dealing with recovery of hotplug capable ports. For hotplug capable ports, when DPC is triggered and link goes down, DLLSC handler in pciehp driver will remove the affected devices/drivers. Once the link comes back it will re-attach them. Need a better description of symptoms and what triggers hotplug driver to activate. For problem description, please check the following details Current pcie_do_recovery() implementation has following two issues: 1. Fatal (DPC) error recovery is currently broken for non-hotplug capable devices. Current fatal error recovery implementation relies on PCIe hotplug (pciehp) handler for detaching and re-enumerating the affected devices/drivers. pciehp handler listens for DLLSC state changes and handles device/driver detachment on DLLSC_LINK_DOWN event and re-enumeration on DLLSC_LINK_UP event. So when dealing with non-hotplug capable devices, recovery code does not restore the state of the affected devices correctly. Correct implementation should restore the device state and call report_slot_reset() function after resetting the link to restore the state of the device/driver. 
You can find the fatal non-hotplug related issues reported in the following links: https://lore.kernel.org/linux-pci/20200527083130.4137-1-zhiqiang@nxp.com/ https://lore.kernel.org/linux-pci/12115.1588207324@famine/ https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.158584.git.sathyanarayanan.kuppusw...@linux.intel.com/ 2. For non-fatal errors, if the report_error_detected() or report_mmio_enabled() functions request PCI_ERS_RESULT_NEED_RESET, then the current pcie_do_recovery() implementation does not do the requested explicit device reset, and instead just calls report_slot_reset() on all affected devices. Notifying about the reset via report_slot_reset() without doing the actual device reset is incorrect. To fix the above issues, use PCI_ERS_RESULT_NEED_RESET as the error state after a successful reset_link() operation. This will ensure ->slot_reset() is called after the reset_link() operation for fatal errors. Also call pci_bus_reset() to do the slot/bus reset before triggering the device-specific ->slot_reset() callback. Also, using pci_bus_reset() will restore the state of the devices after performing the reset operation. Even though using pci_bus_reset() will do a redundant reset operation after ->reset_link() for fatal errors, it should not affect the functional behavior. Can you expand this a little bit? -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
[PATCH v4 4/6] kasan: add tests for timer stack recording
Adds a test to verify timer stack recording and print it in KASAN report. The KASAN report was as follows(cleaned up slightly): BUG: KASAN: use-after-free in kasan_timer_uaf Freed by task 0: kasan_save_stack+0x24/0x50 kasan_set_track+0x24/0x38 kasan_set_free_info+0x20/0x40 __kasan_slab_free+0x10c/0x170 kasan_slab_free+0x10/0x18 kfree+0x98/0x270 kasan_timer_function+0x1c/0x28 Last potentially related work creation: kasan_save_stack+0x24/0x50 kasan_record_tmr_stack+0xa8/0xb8 init_timer_key+0xf0/0x248 kasan_timer_uaf+0x5c/0xd8 Signed-off-by: Walter Wu Acked-by: Marco Elver Reviewed-by: Dmitry Vyukov Reviewed-by: Andrey Konovalov Cc: Andrey Ryabinin Cc: Alexander Potapenko Cc: Matthias Brugger --- v4: - testcase has merge conflict, so that rebase onto the KASAN-KUNIT --- lib/test_kasan_module.c | 25 + 1 file changed, 25 insertions(+) diff --git a/lib/test_kasan_module.c b/lib/test_kasan_module.c index 2d68db6ae67b..d8234a1db8c9 100644 --- a/lib/test_kasan_module.c +++ b/lib/test_kasan_module.c @@ -12,6 +12,7 @@ #include #include #include +#include #include "../mm/kasan/kasan.h" @@ -91,6 +92,29 @@ static noinline void __init kasan_rcu_uaf(void) call_rcu(_rcu_ptr->rcu, kasan_rcu_reclaim); } +static noinline void __init kasan_timer_function(struct timer_list *timer) +{ + del_timer(timer); + kfree(timer); +} + +static noinline void __init kasan_timer_uaf(void) +{ + struct timer_list *timer; + + timer = kmalloc(sizeof(struct timer_list), GFP_KERNEL); + if (!timer) { + pr_err("Allocation failed\n"); + return; + } + + timer_setup(timer, kasan_timer_function, 0); + add_timer(timer); + msleep(100); + + pr_info("use-after-free on timer\n"); + ((volatile struct timer_list *)timer)->expires; +} static int __init test_kasan_module_init(void) { @@ -102,6 +126,7 @@ static int __init test_kasan_module_init(void) copy_user_test(); kasan_rcu_uaf(); + kasan_timer_uaf(); kasan_restore_multi_shot(multishot); return -EAGAIN; -- 2.18.0
[PATCH v4 0/6] kasan: add workqueue and timer stack for generic KASAN
Syzbot reports many UAF issues for workqueue or timer, see [1] and [2]. In some of these, the access/allocation happened in process_one_work(), and we see that the free stack is useless in the KASAN report; it doesn't help programmers to solve UAF on workqueues. The same may stand for timers. This patchset improves KASAN reports by making them include the workqueue queueing stack and timer stack information. It is useful for programmers to solve use-after-free or double-free memory issues. Generic KASAN records the last two workqueue and timer stacks and prints them in the KASAN report. This is only suitable for generic KASAN. [1]https://groups.google.com/g/syzkaller-bugs/search?q=%22use-after-free%22+process_one_work [2]https://groups.google.com/g/syzkaller-bugs/search?q=%22use-after-free%22%20expire_timers [3]https://bugzilla.kernel.org/show_bug.cgi?id=198437 Walter Wu (6): timer: kasan: record timer stack workqueue: kasan: record workqueue stack kasan: print timer and workqueue stack lib/test_kasan.c: add timer test case lib/test_kasan.c: add workqueue test case kasan: update documentation for generic kasan --- Changes since v3: - testcases had merge conflicts, so they need to be rebased onto the KASAN-KUNIT series. Changes since v2: - modify the kasan document to be more readable; thanks to Marco for the suggestion. Changes since v1: - Thanks to Marco and Thomas for the suggestions. - Remove unnecessary code and fix the commit log. - Reuse kasan_record_aux_stack() and aux_stack to record timer and workqueue stacks. - Change the aux stack title to a common name. --- Documentation/dev-tools/kasan.rst | 5 +++-- kernel/time/timer.c | 3 +++ kernel/workqueue.c | 3 +++ lib/test_kasan_module.c | 55 +++ mm/kasan/report.c | 4 ++-- 5 files changed, 66 insertions(+), 4 deletions(-)
Re: [PATCH] mm: swapfile: avoid split_swap_cluster() NULL pointer dereference
Rafael Aquini writes: > The bug here is quite simple: split_swap_cluster() misses checking for > lock_cluster() returning NULL before committing to change cluster_info->flags. I don't think so. We shouldn't run into this situation in the first place. So the "fix" hides the real bug instead of fixing it. Just like we call VM_BUG_ON_PAGE(!PageLocked(head), head) in split_huge_page_to_list() instead of returning if !PageLocked(head) silently. > The fundamental problem has nothing to do with allocating, or not allocating > a swap cluster, but it has to do with the fact that the THP deferred split scan > can transiently race with swapcache insertion, and the fact that when you run > your swap area on rotational storage cluster_info is _always_ NULL. > split_swap_cluster() needs to check for lock_cluster() returning NULL because > that's one possible case, and it clearly fails to do so. If there's a race, we should fix the race. But the code path for swapcache insertion is, add_to_swap() get_swap_page() /* Return if fails to allocate */ add_to_swap_cache() SetPageSwapCache() While the code path to split THP is, split_huge_page_to_list() if PageSwapCache() split_swap_cluster() Both code paths are protected by the page lock. So there should be some other reason to trigger the bug. And again, for HDD, a THP shouldn't have PageSwapCache() set in the first place. If so, the bug is that the flag is set and we should fix the setting. > Run a workload that causes multiple THP COW, and add a memory hogger to create > memory pressure so you'll force the reclaimers to kick the registered > shrinkers. The trigger is not heavy swapping, and that's probably why > most swap test cases don't hit it. The window is tight, but you will get the > NULL pointer dereference. Do you have a script to reproduce the bug? > Regardless of whether you find further bugs or not, this patch is needed to correct a > blunt coding mistake. As above, I don't agree with that. Best Regards, Huang, Ying
Re: [PATCH 7/7] perf inject: Remove stale build-id processing
Hi Adrian, Thanks for your review! On Wed, Sep 23, 2020 at 11:36 PM Adrian Hunter wrote: > > On 23/09/20 11:05 am, Namhyung Kim wrote: > > I think we don't need to call build_id__mark_dso_hit() in the > > perf_event__repipe_sample() as it's not used by -b option. In case of > > the -b option is used, it uses perf_event__inject_buildid() instead. > > This can remove unnecessary overhead of finding thread/map for each > > sample event. > > > > Also I suspect HEADER_BUILD_ID feature bit setting since we already > > generated/injected BUILD_ID event into the output stream. So this > > header information seems redundant. I'm not 100% sure about the > > auxtrace usage, but it looks like not related to this directly. > > > > And we now have --buildid-all so users can get the same behavior if > > they want. > > For a perf.data file, don't buildids get written to the HEADER_BUILD_ID > feature section by perf_session__write_header() if the feature flag is set > and if they are hit? Right, that's what perf record does unless -B option is used. But I'm more concerned about the pipe usage which doesn't use the header. > > So, unless -b is used, anything you don't hit you lose i.e. a buildid in the > HEADER_BUILD_ID feature section of the input file, will not be written to > the output file. But perf inject generates PERF_RECORD_HEADER_BUILD_ID events and puts them into the data stream when -b option is used. Do you say perf inject should process build-id even when -b is not used? 
> > > > Cc: Adrian Hunter > > Signed-off-by: Namhyung Kim > > --- > > tools/perf/builtin-inject.c | 12 > > 1 file changed, 12 deletions(-) > > > > diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c > > index 500428aaa576..0191d72be7c4 100644 > > --- a/tools/perf/builtin-inject.c > > +++ b/tools/perf/builtin-inject.c > > @@ -277,8 +277,6 @@ static int perf_event__repipe_sample(struct perf_tool > > *tool, > > return f(tool, event, sample, evsel, machine); > > } > > > > - build_id__mark_dso_hit(tool, event, sample, evsel, machine); > > - > > I guess that chunk would prevent losing a buildid in a perf.data file? > > > if (inject->itrace_synth_opts.set && sample->aux_sample.size) > > event = perf_inject__cut_auxtrace_sample(inject, event, > > sample); > > > > @@ -767,16 +765,6 @@ static int __cmd_inject(struct perf_inject *inject) > > return ret; > > > > if (!data_out->is_pipe) { > > - if (inject->build_ids) > > - perf_header__set_feat(>header, > > - HEADER_BUILD_ID); > > That could be due to confusion with 'perf buildid-list' which will not show > any buildids from synthesized buildid events unless "with hits" is selected, > so then it looks like there are no buildids. Yeah, it's confusing.. I think we should change the behavior to handle the pipe case properly. > > It could be an advantage to have the buildids also in the HEADER_BUILD_ID > feature section, because then then build-list can list them quickly. I'm not sure it's good to have duplicate build-id info. We may add an option just to update the header section and not inject BUILD_ID events. > > > - /* > > - * Keep all buildids when there is unprocessed AUX data > > because > > - * it is not known which ones the AUX trace hits. > > - */ > > - if (perf_header__has_feat(>header, HEADER_BUILD_ID) > > && > > - inject->have_auxtrace && !inject->itrace_synth_opts.set) > > - dsos__hit_all(session); > > I expect that is definitely needed. OK. 
Thanks Namhyung > > > /* > >* The AUX areas have been removed and replaced with > >* synthesized hardware events, so clear the feature flag and > > >
Re: [PATCH V3 2/3] arm64/mm/hotplug: Enable MEM_OFFLINE event handling
On 09/23/2020 12:01 PM, Gavin Shan wrote: > Hi Anshuman, > > On 9/21/20 10:05 PM, Anshuman Khandual wrote: >> This enables MEM_OFFLINE memory event handling. It will help intercept any >> possible error condition such as if boot memory somehow still got offlined >> even after an explicit notifier failure, potentially by a future change in >> the generic hot plug framework. This would help detect such scenarios and help >> debug further. >> >> Cc: Catalin Marinas >> Cc: Will Deacon >> Cc: Mark Rutland >> Cc: Marc Zyngier >> Cc: Steve Capper >> Cc: Mark Brown >> Cc: linux-arm-ker...@lists.infradead.org >> Cc: linux-kernel@vger.kernel.org >> Signed-off-by: Anshuman Khandual >> --- > > I'm not sure if it makes sense, since MEM_OFFLINE won't be triggered > after NOTIFY_BAD is returned from MEM_GOING_OFFLINE. NOTIFY_BAD means > the whole offline process is stopped. It would be guaranteed by the generic > framework from a semantics standpoint. Right, but the intent here is to catch any deviation in generic hotplug semantics going forward. > > However, this looks good if MEM_OFFLINE is triggered without calling > into MEM_GOING_OFFLINE previously, but that would be a bug in the generic > framework. Exactly, this will just ensure that we know about any change or a bug in the generic framework. But if required, this additional check can be enabled only with DEBUG_VM.
> >> arch/arm64/mm/mmu.c | 37 - >> 1 file changed, 32 insertions(+), 5 deletions(-) >> >> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c >> index df3b7415b128..6b171bd88bcf 100644 >> --- a/arch/arm64/mm/mmu.c >> +++ b/arch/arm64/mm/mmu.c >> @@ -1482,13 +1482,40 @@ static int prevent_bootmem_remove_notifier(struct >> notifier_block *nb, >> unsigned long end_pfn = arg->start_pfn + arg->nr_pages; >> unsigned long pfn = arg->start_pfn; >> - if (action != MEM_GOING_OFFLINE) >> + if ((action != MEM_GOING_OFFLINE) && (action != MEM_OFFLINE)) >> return NOTIFY_OK; >> - for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) { >> - ms = __pfn_to_section(pfn); >> - if (early_section(ms)) >> - return NOTIFY_BAD; >> + if (action == MEM_GOING_OFFLINE) { >> + for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) { >> + ms = __pfn_to_section(pfn); >> + if (early_section(ms)) { >> + pr_warn("Boot memory offlining attempted\n"); >> + return NOTIFY_BAD; >> + } >> + } >> + } else if (action == MEM_OFFLINE) { >> + for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) { >> + ms = __pfn_to_section(pfn); >> + if (early_section(ms)) { >> + >> + /* >> + * This should have never happened. Boot memory >> + * offlining should have been prevented by this >> + * very notifier. Probably some memory removal >> + * procedure might have changed which would then >> + * require further debug. >> + */ >> + pr_err("Boot memory offlined\n"); >> + >> + /* >> + * Core memory hotplug does not process a return >> + * code from the notifier for MEM_OFFLINE event. >> + * Error condition has been reported. Report as >> + * ignored. >> + */ >> + return NOTIFY_DONE; >> + } >> + } >> } >> return NOTIFY_OK; >> } >> > > It's pretty much irrelevant comment if the patch doesn't make sense: > the logical block for MEM_GOING_OFFLINE would be reused by MEM_OFFLINE > as they looks similar except the return value and error message :) This can be reorganized in the above mentioned format as well. 
Without much additional code or iteration, it might not need DEBUG_VM as well. for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) { ms = __pfn_to_section(pfn); if (!early_section(ms)) continue; if (action == MEM_GOING_OFFLINE) { pr_warn("Boot memory offlining attempted\n"); return NOTIFY_BAD; } else if (action == MEM_OFFLINE) { pr_err("Boot memory offlined\n"); return NOTIFY_DONE; } } return NOTIFY_OK;
[PATCH v19 08/20] mm: page_idle_get_page() does not need lru_lock
From: Hugh Dickins It is necessary for page_idle_get_page() to recheck PageLRU() after get_page_unless_zero(), but holding lru_lock around that serves no useful purpose, and adds to lru_lock contention: delete it. See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the discussion that led to lru_lock there; but __page_set_anon_rmap() now uses WRITE_ONCE(), and I see no other risk in page_idle_clear_pte_refs() using rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly but not entirely prevented by page_count() check in ksm.c's write_protect_page(): that risk being shared with page_referenced() and not helped by lru_lock). Signed-off-by: Hugh Dickins Signed-off-by: Alex Shi Cc: Andrew Morton Cc: Vladimir Davydov Cc: Vlastimil Babka Cc: Minchan Kim Cc: Alex Shi Cc: linux...@kvack.org Cc: linux-kernel@vger.kernel.org --- mm/page_idle.c | 4 ---- 1 file changed, 4 deletions(-) diff --git a/mm/page_idle.c b/mm/page_idle.c index 057c61df12db..64e5344a992c 100644 --- a/mm/page_idle.c +++ b/mm/page_idle.c @@ -32,19 +32,15 @@ static struct page *page_idle_get_page(unsigned long pfn) { struct page *page = pfn_to_online_page(pfn); - pg_data_t *pgdat; if (!page || !PageLRU(page) || !get_page_unless_zero(page)) return NULL; - pgdat = page_pgdat(page); - spin_lock_irq(&pgdat->lru_lock); if (unlikely(!PageLRU(page))) { put_page(page); page = NULL; } - spin_unlock_irq(&pgdat->lru_lock); return page; } -- 1.8.3.1
[PATCH v19 14/20] mm/mlock: remove __munlock_isolate_lru_page
The func only has one caller, remove it to clean up code and simplify code. Signed-off-by: Alex Shi Acked-by: Hugh Dickins Cc: Hugh Dickins Cc: Kirill A. Shutemov Cc: Vlastimil Babka Cc: Andrew Morton Cc: linux...@kvack.org Cc: linux-kernel@vger.kernel.org --- mm/mlock.c | 31 +-- 1 file changed, 9 insertions(+), 22 deletions(-) diff --git a/mm/mlock.c b/mm/mlock.c index 796c726a0407..d487aa864e86 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -106,26 +106,6 @@ void mlock_vma_page(struct page *page) } /* - * Isolate a page from LRU with optional get_page() pin. - * Assumes lru_lock already held and page already pinned. - */ -static bool __munlock_isolate_lru_page(struct page *page, bool getpage) -{ - if (PageLRU(page)) { - struct lruvec *lruvec; - - lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); - if (getpage) - get_page(page); - ClearPageLRU(page); - del_page_from_lru_list(page, lruvec, page_lru(page)); - return true; - } - - return false; -} - -/* * Finish munlock after successful page isolation * * Page must be locked. This is a wrapper for try_to_munlock() @@ -296,9 +276,16 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone) * We already have pin from follow_page_mask() * so we can spare the get_page() here. */ - if (__munlock_isolate_lru_page(page, false)) + if (PageLRU(page)) { + struct lruvec *lruvec; + + ClearPageLRU(page); + lruvec = mem_cgroup_page_lruvec(page, + page_pgdat(page)); + del_page_from_lru_list(page, lruvec, + page_lru(page)); continue; - else + } else __munlock_isolation_failed(page); } else { delta_munlocked++; -- 1.8.3.1
[PATCH v19 04/20] mm/thp: use head for head page in lru_add_page_tail
Since the first parameter is only used by the head page, it's better to make it explicit. Signed-off-by: Alex Shi Reviewed-by: Kirill A. Shutemov Acked-by: Hugh Dickins Cc: Andrew Morton Cc: Johannes Weiner Cc: Matthew Wilcox Cc: Hugh Dickins Cc: linux...@kvack.org Cc: linux-kernel@vger.kernel.org --- mm/huge_memory.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 211d50f09785..fa0753229b52 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2318,19 +2318,19 @@ static void remap_page(struct page *page) } } -static void lru_add_page_tail(struct page *page, struct page *page_tail, +static void lru_add_page_tail(struct page *head, struct page *page_tail, struct lruvec *lruvec, struct list_head *list) { - VM_BUG_ON_PAGE(!PageHead(page), page); - VM_BUG_ON_PAGE(PageCompound(page_tail), page); - VM_BUG_ON_PAGE(PageLRU(page_tail), page); + VM_BUG_ON_PAGE(!PageHead(head), head); + VM_BUG_ON_PAGE(PageCompound(page_tail), head); + VM_BUG_ON_PAGE(PageLRU(page_tail), head); lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock); if (!list) SetPageLRU(page_tail); - if (likely(PageLRU(head))) - list_add_tail(&page_tail->lru, &head->lru); else if (list) { /* page reclaim is reclaiming a huge page */ get_page(page_tail); -- 1.8.3.1
[PATCH v19 02/20] mm/memcg: bail early from swap accounting if memcg disabled
If we disabled memcg by cgroup_disable=memory, page->memcg will be NULL and so the charge is skipped and that will trigger a warning like below. Let's return from the funcs earlier. anon flags:0x5005b48008000d(locked|uptodate|dirty|swapbacked) raw: 005005b48008000d dead0100 dead0122 8897c7c76ad1 raw: 0022 0002 page dumped because: VM_WARN_ON_ONCE_PAGE(!memcg) ... RIP: 0010:vprintk_emit+0x1f7/0x260 Code: 00 84 d2 74 72 0f b6 15 27 58 64 01 48 c7 c0 00 d4 72 82 84 d2 74 09 f3 90 0f b6 10 84 d2 75 f7 e8 de 0d 00 00 4c 89 e7 57 9d <0f> 1f 44 00 00 e9 62 ff ff ff 80 3d 88 c9 3a 01 00 0f 85 54 fe ff RSP: 0018:c9000faab358 EFLAGS: 0202 RAX: 8272d400 RBX: 005e RCX: 88afd80d0040 RDX: RSI: 0002 RDI: 0202 RBP: c9000faab3a8 R08: 8272d440 R09: 00022480 R10: 00120c77be68bfac R11: 00cd7568 R12: 0202 R13: 0057c0080005 R14: 820a0130 R15: c9000faab3e8 ? vprintk_emit+0x140/0x260 vprintk_default+0x1a/0x20 vprintk_func+0x4f/0xc4 ? vprintk_func+0x4f/0xc4 printk+0x53/0x6a ? xas_load+0xc/0x80 __dump_page.cold.6+0xff/0x4ee ? xas_init_marks+0x23/0x50 ? xas_store+0x30/0x40 ? free_swap_slot+0x43/0xd0 ? put_swap_page+0x119/0x320 ? update_load_avg+0x82/0x580 dump_page+0x9/0xb mem_cgroup_try_charge_swap+0x16e/0x1d0 get_swap_page+0x130/0x210 add_to_swap+0x41/0xc0 shrink_page_list+0x99e/0xdf0 shrink_inactive_list+0x199/0x360 shrink_lruvec+0x40d/0x650 ? _cond_resched+0x14/0x30 ? _cond_resched+0x14/0x30 shrink_node+0x226/0x6e0 do_try_to_free_pages+0xd0/0x400 try_to_free_pages+0xef/0x130 __alloc_pages_slowpath.constprop.127+0x38d/0xbd0 ? ___slab_alloc+0x31d/0x6f0 __alloc_pages_nodemask+0x27f/0x2c0 alloc_pages_vma+0x75/0x220 shmem_alloc_page+0x46/0x90 ? release_pages+0x1ae/0x410 shmem_alloc_and_acct_page+0x77/0x1c0 shmem_getpage_gfp+0x162/0x910 shmem_fault+0x74/0x210 ? filemap_map_pages+0x29c/0x410 __do_fault+0x37/0x190 handle_mm_fault+0x120a/0x1770 exc_page_fault+0x251/0x450 ? 
asm_exc_page_fault+0x8/0x30 asm_exc_page_fault+0x1e/0x30 Signed-off-by: Alex Shi Reviewed-by: Roman Gushchin Acked-by: Michal Hocko Acked-by: Hugh Dickins Cc: Johannes Weiner Cc: Michal Hocko Cc: Vladimir Davydov Cc: Andrew Morton Cc: cgro...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-kernel@vger.kernel.org --- mm/memcontrol.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index f7094a04be07..368e6ac644f0 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -7102,6 +7102,9 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry) VM_BUG_ON_PAGE(PageLRU(page), page); VM_BUG_ON_PAGE(page_count(page), page); + if (mem_cgroup_disabled()) + return; + if (cgroup_subsys_on_dfl(memory_cgrp_subsys)) return; @@ -7166,6 +7169,9 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry) struct mem_cgroup *memcg; unsigned short oldid; + if (mem_cgroup_disabled()) + return 0; + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) return 0; -- 1.8.3.1
[PATCH v19 13/20] mm/mlock: remove lru_lock on TestClearPageMlocked
In the func munlock_vma_page, the comments mentioned that lru_lock was needed for serialization with split_huge_pages. But the page must be PageLocked here, as must the pages in the split_huge_page series of funcs. Thus PageLocked is enough to serialize both funcs. Furthermore, Hugh Dickins pointed out: before splitting in split_huge_page_to_list, the page is unmap_page()'d to remove the pmd/ptes which protect the page from munlock. Thus, there is no need to guard __split_huge_page_tail for mlock clearing; just keep the lru_lock there for isolation purposes. LKP found a preempt issue on __mod_zone_page_state, which needs changing to mod_zone_page_state. Thanks! Signed-off-by: Alex Shi Acked-by: Hugh Dickins Cc: Kirill A. Shutemov Cc: Vlastimil Babka Cc: Andrew Morton Cc: linux...@kvack.org Cc: linux-kernel@vger.kernel.org --- mm/mlock.c | 26 +- 1 file changed, 5 insertions(+), 21 deletions(-) diff --git a/mm/mlock.c b/mm/mlock.c index 884b1216da6a..796c726a0407 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -187,40 +187,24 @@ static void __munlock_isolation_failed(struct page *page) unsigned int munlock_vma_page(struct page *page) { int nr_pages; - pg_data_t *pgdat = page_pgdat(page); /* For try_to_munlock() and to serialize with page migration */ BUG_ON(!PageLocked(page)); - VM_BUG_ON_PAGE(PageTail(page), page); - /* -* Serialize with any parallel __split_huge_page_refcount() which -* might otherwise copy PageMlocked to part of the tail pages before -* we clear it in the head page. It also stabilizes thp_nr_pages().
-*/ - spin_lock_irq(&pgdat->lru_lock); - if (!TestClearPageMlocked(page)) { /* Potentially, PTE-mapped THP: do not skip the rest PTEs */ - nr_pages = 1; - goto unlock_out; + return 0; } nr_pages = thp_nr_pages(page); - __mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages); + mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages); - if (__munlock_isolate_lru_page(page, true)) { - spin_unlock_irq(&pgdat->lru_lock); + if (!isolate_lru_page(page)) __munlock_isolated_page(page); - goto out; - } - __munlock_isolation_failed(page); - -unlock_out: - spin_unlock_irq(&pgdat->lru_lock); + else + __munlock_isolation_failed(page); -out: return nr_pages - 1; } -- 1.8.3.1
[PATCH v19 11/20] mm/lru: move lock into lru_note_cost
We have to move lru_lock into lru_note_cost, since it cycles up the memcg tree, in preparation for the future per-lruvec lru_lock replacement. It's a bit ugly and may cost a bit more locking, but the benefit from multiple memcg locking could cover the loss. Signed-off-by: Alex Shi Acked-by: Hugh Dickins Cc: Johannes Weiner Cc: Andrew Morton Cc: linux...@kvack.org Cc: linux-kernel@vger.kernel.org --- mm/swap.c | 3 +++ mm/vmscan.c | 4 +--- mm/workingset.c | 2 -- 3 files changed, 4 insertions(+), 5 deletions(-) diff --git a/mm/swap.c b/mm/swap.c index 4e33873975a6..6a4b00267384 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -268,7 +268,9 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages) { do { unsigned long lrusize; + struct pglist_data *pgdat = lruvec_pgdat(lruvec); + spin_lock_irq(&pgdat->lru_lock); /* Record cost event */ if (file) lruvec->file_cost += nr_pages; @@ -292,6 +294,7 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages) lruvec->file_cost /= 2; lruvec->anon_cost /= 2; } + spin_unlock_irq(&pgdat->lru_lock); } while ((lruvec = parent_lruvec(lruvec))); } diff --git a/mm/vmscan.c b/mm/vmscan.c index 32102e5d354d..656558bbf399 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1971,19 +1971,17 @@ static int current_may_throttle(void) &stat, false); spin_lock_irq(&pgdat->lru_lock); - move_pages_to_lru(lruvec, &page_list); __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken); - lru_note_cost(lruvec, file, stat.nr_pageout); item = current_is_kswapd() ?
PGSTEAL_KSWAPD : PGSTEAL_DIRECT; if (!cgroup_reclaim(sc)) __count_vm_events(item, nr_reclaimed); __count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed); __count_vm_events(PGSTEAL_ANON + file, nr_reclaimed); - spin_unlock_irq(&pgdat->lru_lock); + lru_note_cost(lruvec, file, stat.nr_pageout); mem_cgroup_uncharge_list(&page_list); free_unref_page_list(&page_list); diff --git a/mm/workingset.c b/mm/workingset.c index 92e66113a577..32e24cda1b4f 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -381,9 +381,7 @@ void workingset_refault(struct page *page, void *shadow) if (workingset) { SetPageWorkingset(page); /* XXX: Move to lru_cache_add() when it supports new vs putback */ - spin_lock_irq(&page_pgdat(page)->lru_lock); lru_note_cost_page(page); - spin_unlock_irq(&page_pgdat(page)->lru_lock); inc_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file); } out: -- 1.8.3.1
[PATCH] crypto: x86/poly1305 - Remove assignments with no effect
On Mon, Sep 21, 2020 at 04:56:52PM +0800, kernel test robot wrote: > tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git > master > head: eb5f95f1593f7c22dac681b19e815828e2af3efd > commit: d7d7b853566254648df59f7ea27ea05952a6cfa8 crypto: x86/poly1305 - wire > up faster implementations for kernel > date: 8 months ago > :: branch date: 14 hours ago > :: commit date: 8 months ago > compiler: gcc-9 (Debian 9.3.0-15) 9.3.0 > > If you fix the issue, kindly add following tag as appropriate > Reported-by: kernel test robot > > > cppcheck warnings: (new ones prefixed by >>) > >arch/x86/crypto/poly1305_glue.c:165:4: warning: Assignment of function > parameter has no effect outside the function. Did you forget dereferencing > it? [uselessAssignmentPtrArg] > inp += POLY1305_BLOCK_SIZE; > ^ > >> arch/x86/crypto/poly1305_glue.c:166:4: warning: Assignment of function > >> parameter has no effect outside the function. [uselessAssignmentArg] > len -= POLY1305_BLOCK_SIZE; > ^ This patch should fix the problem. ---8<--- This patch removes a few ineffectual assignments from the function crypto_poly1305_setdctxkey. Reported-by: kernel test robot Signed-off-by: Herbert Xu diff --git a/arch/x86/crypto/poly1305_glue.c b/arch/x86/crypto/poly1305_glue.c index f85885d6ccd8..e508dbd91813 100644 --- a/arch/x86/crypto/poly1305_glue.c +++ b/arch/x86/crypto/poly1305_glue.c @@ -158,9 +158,6 @@ static unsigned int crypto_poly1305_setdctxkey(struct poly1305_desc_ctx *dctx, dctx->s[1] = get_unaligned_le32(&inp[4]); dctx->s[2] = get_unaligned_le32(&inp[8]); dctx->s[3] = get_unaligned_le32(&inp[12]); - inp += POLY1305_BLOCK_SIZE; - len -= POLY1305_BLOCK_SIZE; - acc += POLY1305_BLOCK_SIZE; dctx->sset = true; } } -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
[PATCH v19 19/20] mm/lru: introduce the relock_page_lruvec function
From: Alexander Duyck Use this new function to replace repeated instances of the same code; no functional change. When testing for relock we can avoid the need for RCU locking if we simply compare the page pgdat and memcg pointers versus those that the lruvec is holding. By doing this we can avoid the extra pointer walks and accesses of the memory cgroup. In addition we can avoid the checks entirely if lruvec is currently NULL. Signed-off-by: Alexander Duyck Signed-off-by: Alex Shi Acked-by: Hugh Dickins Cc: Johannes Weiner Cc: Andrew Morton Cc: Thomas Gleixner Cc: Andrey Ryabinin Cc: Matthew Wilcox Cc: Mel Gorman Cc: Konstantin Khlebnikov Cc: Hugh Dickins Cc: Tejun Heo Cc: linux-kernel@vger.kernel.org Cc: cgro...@vger.kernel.org Cc: linux...@kvack.org --- include/linux/memcontrol.h | 52 ++ mm/mlock.c | 11 +- mm/swap.c | 33 +++-- mm/vmscan.c | 12 ++- 4 files changed, 62 insertions(+), 46 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 7b170e9028b5..bd8fdeccf6b5 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -488,6 +488,22 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg, struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *); +static inline bool lruvec_holds_page_lru_lock(struct page *page, + struct lruvec *lruvec) +{ + pg_data_t *pgdat = page_pgdat(page); + const struct mem_cgroup *memcg; + struct mem_cgroup_per_node *mz; + + if (mem_cgroup_disabled()) + return lruvec == &pgdat->__lruvec; + + mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec); + memcg = page->mem_cgroup ?
: root_mem_cgroup; + + return lruvec->pgdat == pgdat && mz->memcg == memcg; +} + struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p); struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm); @@ -1023,6 +1039,14 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page, return &pgdat->__lruvec; } +static inline bool lruvec_holds_page_lru_lock(struct page *page, + struct lruvec *lruvec) +{ + pg_data_t *pgdat = page_pgdat(page); + + return lruvec == &pgdat->__lruvec; +} + static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg) { return NULL; } @@ -1469,6 +1493,34 @@ static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec, spin_unlock_irqrestore(&lruvec->lru_lock, flags); } +/* Don't lock again iff page's lruvec locked */ +static inline struct lruvec *relock_page_lruvec_irq(struct page *page, + struct lruvec *locked_lruvec) +{ + if (locked_lruvec) { + if (lruvec_holds_page_lru_lock(page, locked_lruvec)) + return locked_lruvec; + + unlock_page_lruvec_irq(locked_lruvec); + } + + return lock_page_lruvec_irq(page); +} + +/* Don't lock again iff page's lruvec locked */ +static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page, + struct lruvec *locked_lruvec, unsigned long *flags) +{ + if (locked_lruvec) { + if (lruvec_holds_page_lru_lock(page, locked_lruvec)) + return locked_lruvec; + + unlock_page_lruvec_irqrestore(locked_lruvec, *flags); + } + + return lock_page_lruvec_irqsave(page, flags); +} + #ifdef CONFIG_CGROUP_WRITEBACK struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb); diff --git a/mm/mlock.c b/mm/mlock.c index ab164a675c25..55b3b3672977 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -277,16 +277,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone) * so we can spare the get_page() here.
*/ if (TestClearPageLRU(page)) { - struct lruvec *new_lruvec; - - new_lruvec = mem_cgroup_page_lruvec(page, - page_pgdat(page)); - if (new_lruvec != lruvec) { - if (lruvec) - unlock_page_lruvec_irq(lruvec); - lruvec = lock_page_lruvec_irq(page); - } - + lruvec = relock_page_lruvec_irq(page, lruvec); del_page_from_lru_list(page, lruvec, page_lru(page)); continue; diff --git a/mm/swap.c b/mm/swap.c index 622218bd18e5..375fb2e0683a 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -210,19 +210,12 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
[PATCH v19 12/20] mm/vmscan: remove lruvec reget in move_pages_to_lru
An isolated page can't be recharged to another memcg, since memcg migration is not possible while the page is isolated. So remove the unnecessary lruvec re-get. Thanks to Alexander Duyck for pointing this out. Signed-off-by: Alex Shi Acked-by: Hugh Dickins Cc: Alexander Duyck Cc: Andrew Morton Cc: Konstantin Khlebnikov Cc: Michal Hocko Cc: Hugh Dickins Cc: Johannes Weiner Cc: linux...@kvack.org Cc: linux-kernel@vger.kernel.org Cc: cgro...@vger.kernel.org --- mm/vmscan.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 656558bbf399..953a068b3fc7 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1885,7 +1885,8 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, continue; } - lruvec = mem_cgroup_page_lruvec(page, pgdat); + VM_BUG_ON_PAGE(mem_cgroup_page_lruvec(page, page_pgdat(page)) + != lruvec, page); lru = page_lru(page); nr_pages = thp_nr_pages(page); -- 1.8.3.1
[PATCH v19 15/20] mm/lru: introduce TestClearPageLRU
Currently lru_lock still guards both the lru list and the page's lru bit, which is fine. But if we want to use a per-lruvec lock for the page, we need to pin down the page's lruvec/memcg while locking. Just taking the lruvec lock first can be undermined by a concurrent memcg charge/migration of the page. To fix this, we clear the lru bit outside of the lock and use it as a pin that blocks page isolation during a memcg change. The standard steps of page isolation now become: 1, get_page(); # pin the page to keep it from being freed 2, TestClearPageLRU(); # block other isolation, e.g. a memcg change 3, spin_lock on lru_lock; # serialize lru list access 4, delete page from lru list; Step 2 can be optimized or replaced in scenarios where the page is unlikely to be accessed or moved between memcgs. This patch starts with the first part: TestClearPageLRU, which combines the PageLRU check and ClearPageLRU into one macro function, TestClearPageLRU. This function will be used as a page-isolation precondition to fence off other isolations elsewhere. Since there may then be !PageLRU pages on an lru list, the BUG() checks need to be removed accordingly. There are now two rules for the lru bit: 1, the lru bit still indicates whether a page is on an lru list; only during a temporary window (isolation) may a page be on an lru list without the lru bit set, but a page with the lru bit set must be on an lru list. 2, the lru bit must be cleared before the page is deleted from the lru list. As Andrew Morton mentioned, this change dirties the cacheline for a page that isn't on the LRU.
But the lost would be acceptable in Rong Chen report: https://lore.kernel.org/lkml/20200304090301.GB5972@shao2-debian/ Suggested-by: Johannes Weiner Signed-off-by: Alex Shi Acked-by: Hugh Dickins Cc: Hugh Dickins Cc: Johannes Weiner Cc: Michal Hocko Cc: Vladimir Davydov Cc: Andrew Morton Cc: linux-kernel@vger.kernel.org Cc: cgro...@vger.kernel.org Cc: linux...@kvack.org --- include/linux/page-flags.h | 1 + mm/mlock.c | 3 +-- mm/vmscan.c| 39 +++ 3 files changed, 21 insertions(+), 22 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 6be1aa559b1e..9554ed1387dc 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -326,6 +326,7 @@ static inline void page_init_poison(struct page *page, size_t size) PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, PF_HEAD) __CLEARPAGEFLAG(Dirty, dirty, PF_HEAD) PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD) + TESTCLEARFLAG(LRU, lru, PF_HEAD) PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD) TESTCLEARFLAG(Active, active, PF_HEAD) PAGEFLAG(Workingset, workingset, PF_HEAD) diff --git a/mm/mlock.c b/mm/mlock.c index d487aa864e86..7b0e6334be6f 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -276,10 +276,9 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone) * We already have pin from follow_page_mask() * so we can spare the get_page() here. */ - if (PageLRU(page)) { + if (TestClearPageLRU(page)) { struct lruvec *lruvec; - ClearPageLRU(page); lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); del_page_from_lru_list(page, lruvec, diff --git a/mm/vmscan.c b/mm/vmscan.c index 953a068b3fc7..9d06c609c60e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1540,7 +1540,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone, */ int __isolate_lru_page(struct page *page, isolate_mode_t mode) { - int ret = -EINVAL; + int ret = -EBUSY; /* Only take pages on the LRU. 
*/ if (!PageLRU(page)) @@ -1550,8 +1550,6 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode) if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE)) return ret; - ret = -EBUSY; - /* * To minimise LRU disruption, the caller can indicate that it only * wants to isolate pages it will be able to operate on without @@ -1598,8 +1596,10 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode) * sure the page is not being freed elsewhere -- the * page release code relies on it. */ - ClearPageLRU(page); - ret = 0; + if (TestClearPageLRU(page)) + ret = 0; + else + put_page(page); } return ret; @@ -1665,8 +1665,6 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, page = lru_to_page(src); prefetchw_prev_lru_page(page, src, flags); -
[PATCH v19 17/20] mm/swap.c: serialize memcg changes in pagevec_lru_move_fn
Hugh Dickins' found a memcg change bug on original version: If we want to change the pgdat->lru_lock to memcg's lruvec lock, we have to serialize mem_cgroup_move_account during pagevec_lru_move_fn. The possible bad scenario would like: cpu 0 cpu 1 lruvec = mem_cgroup_page_lruvec() if (!isolate_lru_page()) mem_cgroup_move_account spin_lock_irqsave(>lru_lock <== wrong lock. So we need TestClearPageLRU to block isolate_lru_page(), that serializes the memcg change. and then removing the PageLRU check in move_fn callee as the consequence. __pagevec_lru_add_fn() is different from the others, because the pages it deals with are, by definition, not yet on the lru. TestClearPageLRU is not needed and would not work, so __pagevec_lru_add() goes its own way. Reported-by: Hugh Dickins Signed-off-by: Alex Shi Acked-by: Hugh Dickins Cc: Andrew Morton Cc: linux...@kvack.org Cc: linux-kernel@vger.kernel.org --- mm/swap.c | 44 +++- 1 file changed, 35 insertions(+), 9 deletions(-) diff --git a/mm/swap.c b/mm/swap.c index 6a4b00267384..50418dcc5e24 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -222,8 +222,14 @@ static void pagevec_lru_move_fn(struct pagevec *pvec, spin_lock_irqsave(>lru_lock, flags); } + /* block memcg migration during page moving between lru */ + if (!TestClearPageLRU(page)) + continue; + lruvec = mem_cgroup_page_lruvec(page, pgdat); (*move_fn)(page, lruvec); + + SetPageLRU(page); } if (pgdat) spin_unlock_irqrestore(>lru_lock, flags); @@ -233,7 +239,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec, static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec) { - if (PageLRU(page) && !PageUnevictable(page)) { + if (!PageUnevictable(page)) { del_page_from_lru_list(page, lruvec, page_lru(page)); ClearPageActive(page); add_page_to_lru_list_tail(page, lruvec, page_lru(page)); @@ -306,7 +312,7 @@ void lru_note_cost_page(struct page *page) static void __activate_page(struct page *page, struct lruvec *lruvec) { - if (PageLRU(page) && !PageActive(page) && 
!PageUnevictable(page)) { + if (!PageActive(page) && !PageUnevictable(page)) { int lru = page_lru_base_type(page); int nr_pages = thp_nr_pages(page); @@ -362,7 +368,8 @@ void activate_page(struct page *page) page = compound_head(page); spin_lock_irq(>lru_lock); - __activate_page(page, mem_cgroup_page_lruvec(page, pgdat)); + if (PageLRU(page)) + __activate_page(page, mem_cgroup_page_lruvec(page, pgdat)); spin_unlock_irq(>lru_lock); } #endif @@ -521,9 +528,6 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec) bool active; int nr_pages = thp_nr_pages(page); - if (!PageLRU(page)) - return; - if (PageUnevictable(page)) return; @@ -564,7 +568,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec) static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec) { - if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) { + if (PageActive(page) && !PageUnevictable(page)) { int lru = page_lru_base_type(page); int nr_pages = thp_nr_pages(page); @@ -581,7 +585,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec) static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec) { - if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) && + if (PageAnon(page) && PageSwapBacked(page) && !PageSwapCache(page) && !PageUnevictable(page)) { bool active = PageActive(page); int nr_pages = thp_nr_pages(page); @@ -979,7 +983,29 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec) */ void __pagevec_lru_add(struct pagevec *pvec) { - pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn); + int i; + struct pglist_data *pgdat = NULL; + struct lruvec *lruvec; + unsigned long flags = 0; + + for (i = 0; i < pagevec_count(pvec); i++) { + struct page *page = pvec->pages[i]; + struct pglist_data *pagepgdat = page_pgdat(page); + + if (pagepgdat != pgdat) { + if (pgdat) + spin_unlock_irqrestore(>lru_lock, flags); + pgdat = pagepgdat; + spin_lock_irqsave(>lru_lock, flags); + } + 
+ lruvec = mem_cgroup_page_lruvec(page, pgdat); + __pagevec_lru_add_fn(page,
[PATCH v19 10/20] mm/swap.c: fold vm event PGROTATED into pagevec_move_tail_fn
Fold the PGROTATED event collection into pagevec_move_tail_fn call back func like other funcs does in pagevec_lru_move_fn. Thus we could save func call pagevec_move_tail(). Now all usage of pagevec_lru_move_fn are same and no needs of its 3rd parameter. It's just simply the calling. No functional change. [l...@intel.com: found a build issue in the original patch, thanks] Signed-off-by: Alex Shi Acked-by: Hugh Dickins Cc: Andrew Morton Cc: linux...@kvack.org Cc: linux-kernel@vger.kernel.org --- mm/swap.c | 65 ++- 1 file changed, 23 insertions(+), 42 deletions(-) diff --git a/mm/swap.c b/mm/swap.c index 1bc2fc5d8b7c..4e33873975a6 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -204,8 +204,7 @@ int get_kernel_page(unsigned long start, int write, struct page **pages) EXPORT_SYMBOL_GPL(get_kernel_page); static void pagevec_lru_move_fn(struct pagevec *pvec, - void (*move_fn)(struct page *page, struct lruvec *lruvec, void *arg), - void *arg) + void (*move_fn)(struct page *page, struct lruvec *lruvec)) { int i; struct pglist_data *pgdat = NULL; @@ -224,7 +223,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec, } lruvec = mem_cgroup_page_lruvec(page, pgdat); - (*move_fn)(page, lruvec, arg); + (*move_fn)(page, lruvec); } if (pgdat) spin_unlock_irqrestore(>lru_lock, flags); @@ -232,35 +231,22 @@ static void pagevec_lru_move_fn(struct pagevec *pvec, pagevec_reinit(pvec); } -static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec, -void *arg) +static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec) { - int *pgmoved = arg; - if (PageLRU(page) && !PageUnevictable(page)) { del_page_from_lru_list(page, lruvec, page_lru(page)); ClearPageActive(page); add_page_to_lru_list_tail(page, lruvec, page_lru(page)); - (*pgmoved) += thp_nr_pages(page); + __count_vm_events(PGROTATED, thp_nr_pages(page)); } } /* - * pagevec_move_tail() must be called with IRQ disabled. - * Otherwise this may cause nasty races. 
- */ -static void pagevec_move_tail(struct pagevec *pvec) -{ - int pgmoved = 0; - - pagevec_lru_move_fn(pvec, pagevec_move_tail_fn, ); - __count_vm_events(PGROTATED, pgmoved); -} - -/* * Writeback is about to end against a page which has been marked for immediate * reclaim. If it still appears to be reclaimable, move it to the tail of the * inactive list. + * + * rotate_reclaimable_page() must disable IRQs, to prevent nasty races. */ void rotate_reclaimable_page(struct page *page) { @@ -273,7 +259,7 @@ void rotate_reclaimable_page(struct page *page) local_lock_irqsave(_rotate.lock, flags); pvec = this_cpu_ptr(_rotate.pvec); if (!pagevec_add(pvec, page) || PageCompound(page)) - pagevec_move_tail(pvec); + pagevec_lru_move_fn(pvec, pagevec_move_tail_fn); local_unlock_irqrestore(_rotate.lock, flags); } } @@ -315,8 +301,7 @@ void lru_note_cost_page(struct page *page) page_is_file_lru(page), thp_nr_pages(page)); } -static void __activate_page(struct page *page, struct lruvec *lruvec, - void *arg) +static void __activate_page(struct page *page, struct lruvec *lruvec) { if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) { int lru = page_lru_base_type(page); @@ -340,7 +325,7 @@ static void activate_page_drain(int cpu) struct pagevec *pvec = _cpu(lru_pvecs.activate_page, cpu); if (pagevec_count(pvec)) - pagevec_lru_move_fn(pvec, __activate_page, NULL); + pagevec_lru_move_fn(pvec, __activate_page); } static bool need_activate_page_drain(int cpu) @@ -358,7 +343,7 @@ void activate_page(struct page *page) pvec = this_cpu_ptr(_pvecs.activate_page); get_page(page); if (!pagevec_add(pvec, page) || PageCompound(page)) - pagevec_lru_move_fn(pvec, __activate_page, NULL); + pagevec_lru_move_fn(pvec, __activate_page); local_unlock(_pvecs.lock); } } @@ -374,7 +359,7 @@ void activate_page(struct page *page) page = compound_head(page); spin_lock_irq(>lru_lock); - __activate_page(page, mem_cgroup_page_lruvec(page, pgdat), NULL); + __activate_page(page, 
mem_cgroup_page_lruvec(page, pgdat)); spin_unlock_irq(>lru_lock); } #endif @@ -527,8 +512,7 @@ void lru_cache_add_inactive_or_unevictable(struct page *page, * be write it out by flusher threads as this is much more effective * than the single-page writeout from reclaim. */ -static void
[PATCH v19 06/20] mm/thp: narrow lru locking
lru_lock and page cache xa_lock have no obvious reason to be taken one way round or the other: until now, lru_lock has been taken before page cache xa_lock, when splitting a THP; but nothing else takes them together. Reverse that ordering: let's narrow the lru locking - but leave local_irq_disable to block interrupts throughout, like before. Hugh Dickins point: split_huge_page_to_list() was already silly, to be using the _irqsave variant: it's just been taking sleeping locks, so would already be broken if entered with interrupts enabled. So we can save passing flags argument down to __split_huge_page(). Why change the lock ordering here? That was hard to decide. One reason: when this series reaches per-memcg lru locking, it relies on the THP's memcg to be stable when taking the lru_lock: that is now done after the THP's refcount has been frozen, which ensures page memcg cannot change. Another reason: previously, lock_page_memcg()'s move_lock was presumed to nest inside lru_lock; but now lru_lock must nest inside (page cache lock inside) move_lock, so it becomes possible to use lock_page_memcg() to stabilize page memcg before taking its lru_lock. That is not the mechanism used in this series, but it is an option we want to keep open. [Hugh Dickins: rewrite commit log] Signed-off-by: Alex Shi Reviewed-by: Kirill A. Shutemov Acked-by: Hugh Dickins Cc: Hugh Dickins Cc: Kirill A. 
Shutemov Cc: Andrea Arcangeli Cc: Johannes Weiner Cc: Matthew Wilcox Cc: Andrew Morton Cc: linux...@kvack.org Cc: linux-kernel@vger.kernel.org --- mm/huge_memory.c | 25 + 1 file changed, 13 insertions(+), 12 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 8b92cd197218..63af7611afaf 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2402,7 +2402,7 @@ static void __split_huge_page_tail(struct page *head, int tail, } static void __split_huge_page(struct page *page, struct list_head *list, - pgoff_t end, unsigned long flags) + pgoff_t end) { struct page *head = compound_head(page); pg_data_t *pgdat = page_pgdat(head); @@ -2411,8 +2411,6 @@ static void __split_huge_page(struct page *page, struct list_head *list, unsigned long offset = 0; int i; - lruvec = mem_cgroup_page_lruvec(head, pgdat); - /* complete memcg works before add pages to LRU */ mem_cgroup_split_huge_fixup(head); @@ -2424,6 +2422,11 @@ static void __split_huge_page(struct page *page, struct list_head *list, xa_lock(_cache->i_pages); } + /* prevent PageLRU to go away from under us, and freeze lru stats */ + spin_lock(>lru_lock); + + lruvec = mem_cgroup_page_lruvec(head, pgdat); + for (i = HPAGE_PMD_NR - 1; i >= 1; i--) { __split_huge_page_tail(head, i, lruvec, list); /* Some pages can be beyond i_size: drop them from page cache */ @@ -2443,6 +2446,8 @@ static void __split_huge_page(struct page *page, struct list_head *list, } ClearPageCompound(head); + spin_unlock(>lru_lock); + /* Caller disabled irqs, so they are still disabled here */ split_page_owner(head, HPAGE_PMD_ORDER); @@ -2460,8 +2465,7 @@ static void __split_huge_page(struct page *page, struct list_head *list, page_ref_add(head, 2); xa_unlock(>mapping->i_pages); } - - spin_unlock_irqrestore(>lru_lock, flags); + local_irq_enable(); remap_page(head); @@ -2600,12 +2604,10 @@ bool can_split_huge_page(struct page *page, int *pextra_pins) int split_huge_page_to_list(struct page *page, struct list_head *list) { struct page 
*head = compound_head(page); - struct pglist_data *pgdata = NODE_DATA(page_to_nid(head)); struct deferred_split *ds_queue = get_deferred_split_queue(head); struct anon_vma *anon_vma = NULL; struct address_space *mapping = NULL; int count, mapcount, extra_pins, ret; - unsigned long flags; pgoff_t end; VM_BUG_ON_PAGE(is_huge_zero_page(head), head); @@ -2666,9 +2668,8 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) unmap_page(head); VM_BUG_ON_PAGE(compound_mapcount(head), head); - /* prevent PageLRU to go away from under us, and freeze lru stats */ - spin_lock_irqsave(>lru_lock, flags); - + /* block interrupt reentry in xa_lock and spinlock */ + local_irq_disable(); if (mapping) { XA_STATE(xas, >i_pages, page_index(head)); @@ -2698,7 +2699,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) __dec_node_page_state(head, NR_FILE_THPS); } - __split_huge_page(page, list, end, flags); + __split_huge_page(page, list, end); if (PageSwapCache(head)) { swp_entry_t
[PATCH v19 16/20] mm/compaction: do page isolation first in compaction
Currently, compaction would get the lru_lock and then do page isolation which works fine with pgdat->lru_lock, since any page isoltion would compete for the lru_lock. If we want to change to memcg lru_lock, we have to isolate the page before getting lru_lock, thus isoltion would block page's memcg change which relay on page isoltion too. Then we could safely use per memcg lru_lock later. The new page isolation use previous introduced TestClearPageLRU() + pgdat lru locking which will be changed to memcg lru lock later. Hugh Dickins fixed following bugs in this patch's early version: Fix lots of crashes under compaction load: isolate_migratepages_block() must clean up appropriately when rejecting a page, setting PageLRU again if it had been cleared; and a put_page() after get_page_unless_zero() cannot safely be done while holding locked_lruvec - it may turn out to be the final put_page(), which will take an lruvec lock when PageLRU. And move __isolate_lru_page_prepare back after get_page_unless_zero to make trylock_page() safe: trylock_page() is not safe to use at this time: its setting PG_locked can race with the page being freed or allocated ("Bad page"), and can also erase flags being set by one of those "sole owners" of a freshly allocated page who use non-atomic __SetPageFlag(). 
Suggested-by: Johannes Weiner Signed-off-by: Alex Shi Acked-by: Hugh Dickins Cc: Andrew Morton Cc: Matthew Wilcox Cc: linux-kernel@vger.kernel.org Cc: linux...@kvack.org --- include/linux/swap.h | 2 +- mm/compaction.c | 42 +- mm/vmscan.c | 43 ++- 3 files changed, 56 insertions(+), 31 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 43e6b3458f58..550fdfdc3506 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -357,7 +357,7 @@ extern void lru_cache_add_inactive_or_unevictable(struct page *page, extern unsigned long zone_reclaimable_pages(struct zone *zone); extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, gfp_t gfp_mask, nodemask_t *mask); -extern int __isolate_lru_page(struct page *page, isolate_mode_t mode); +extern int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode); extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, unsigned long nr_pages, gfp_t gfp_mask, diff --git a/mm/compaction.c b/mm/compaction.c index 176dcded298e..16a9f024433d 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -887,6 +887,7 @@ static bool too_many_isolated(pg_data_t *pgdat) if (!valid_page && IS_ALIGNED(low_pfn, pageblock_nr_pages)) { if (!cc->ignore_skip_hint && get_pageblock_skip(page)) { low_pfn = end_pfn; + page = NULL; goto isolate_abort; } valid_page = page; @@ -968,6 +969,21 @@ static bool too_many_isolated(pg_data_t *pgdat) if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page)) goto isolate_fail; + /* +* Be careful not to clear PageLRU until after we're +* sure the page is not being freed elsewhere -- the +* page release code relies on it. 
+*/ + if (unlikely(!get_page_unless_zero(page))) + goto isolate_fail; + + if (__isolate_lru_page_prepare(page, isolate_mode) != 0) + goto isolate_fail_put; + + /* Try isolate the page */ + if (!TestClearPageLRU(page)) + goto isolate_fail_put; + /* If we already hold the lock, we can skip some rechecking */ if (!locked) { locked = compact_lock_irqsave(>lru_lock, @@ -980,10 +996,6 @@ static bool too_many_isolated(pg_data_t *pgdat) goto isolate_abort; } - /* Recheck PageLRU and PageCompound under lock */ - if (!PageLRU(page)) - goto isolate_fail; - /* * Page become compound since the non-locked check, * and it's on LRU. It can only be a THP so the order @@ -991,16 +1003,13 @@ static bool too_many_isolated(pg_data_t *pgdat) */ if (unlikely(PageCompound(page) && !cc->alloc_contig)) { low_pfn += compound_nr(page) - 1; - goto isolate_fail; + SetPageLRU(page); + goto isolate_fail_put;
[PATCH v19 07/20] mm/vmscan: remove unnecessary lruvec adding
We don't have to add a freeable page into lru and then remove from it. This change saves a couple of actions and makes the moving more clear. The SetPageLRU needs to be kept before put_page_testzero for list integrity, otherwise: #0 move_pages_to_lru #1 release_pages if !put_page_testzero if (put_page_testzero()) !PageLRU //skip lru_lock SetPageLRU() list_add(>lru,) list_add(>lru,) [a...@linux-foundation.org: coding style fixes] Signed-off-by: Alex Shi Acked-by: Hugh Dickins Cc: Andrew Morton Cc: Johannes Weiner Cc: Tejun Heo Cc: Matthew Wilcox Cc: Hugh Dickins Cc: linux...@kvack.org Cc: linux-kernel@vger.kernel.org --- mm/vmscan.c | 38 +- 1 file changed, 25 insertions(+), 13 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 466fc3144fff..32102e5d354d 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1850,26 +1850,30 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, while (!list_empty(list)) { page = lru_to_page(list); VM_BUG_ON_PAGE(PageLRU(page), page); + list_del(>lru); if (unlikely(!page_evictable(page))) { - list_del(>lru); spin_unlock_irq(>lru_lock); putback_lru_page(page); spin_lock_irq(>lru_lock); continue; } - lruvec = mem_cgroup_page_lruvec(page, pgdat); + /* +* The SetPageLRU needs to be kept here for list integrity. 
+* Otherwise: +* #0 move_pages_to_lru #1 release_pages +* if !put_page_testzero +*if (put_page_testzero()) +* !PageLRU //skip lru_lock +* SetPageLRU() +* list_add(>lru,) +*list_add(>lru,) +*/ SetPageLRU(page); - lru = page_lru(page); - nr_pages = thp_nr_pages(page); - update_lru_size(lruvec, lru, page_zonenum(page), nr_pages); - list_move(>lru, >lists[lru]); - - if (put_page_testzero(page)) { + if (unlikely(put_page_testzero(page))) { __ClearPageLRU(page); __ClearPageActive(page); - del_page_from_lru_list(page, lruvec, lru); if (unlikely(PageCompound(page))) { spin_unlock_irq(>lru_lock); @@ -1877,11 +1881,19 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, spin_lock_irq(>lru_lock); } else list_add(>lru, _to_free); - } else { - nr_moved += nr_pages; - if (PageActive(page)) - workingset_age_nonresident(lruvec, nr_pages); + + continue; } + + lruvec = mem_cgroup_page_lruvec(page, pgdat); + lru = page_lru(page); + nr_pages = thp_nr_pages(page); + + update_lru_size(lruvec, lru, page_zonenum(page), nr_pages); + list_add(>lru, >lists[lru]); + nr_moved += nr_pages; + if (PageActive(page)) + workingset_age_nonresident(lruvec, nr_pages); } /* -- 1.8.3.1
[PATCH v19 18/20] mm/lru: replace pgdat lru_lock with lruvec lock
This patch moves the per-node lru_lock into the lruvec, providing a lru_lock for each memcg on each node. So on a large machine, memcgs no longer have to suffer from contention on the per-node pgdat->lru_lock; each can go fast with its own lru_lock. After moving the memcg charge before lru insertion, page isolation serializes the page's memcg, so the per-memcg lruvec lock is stable and can replace the per-node lru lock. In isolate_migratepages_block(), compact_unlock_should_abort and lock_page_lruvec_irqsave are open coded to work with compact_control. Also add a debug function to the locking, which may give some clues if something gets out of hand. Daniel Jordan's testing shows a 62% improvement on a modified readtwice case on his 2P * 10 core * 2 HT Broadwell box. https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3lo...@ca-dmjordan1.us.oracle.com/ On a large machine with memcg enabled but not used, looking up the page's lruvec traverses a few extra pointers, which may increase lru_lock hold time and cause a slight regression. Hugh Dickins helped on patch polish, thanks!
Signed-off-by: Alex Shi Acked-by: Hugh Dickins Cc: Hugh Dickins Cc: Andrew Morton Cc: Johannes Weiner Cc: Michal Hocko Cc: Vladimir Davydov Cc: Yang Shi Cc: Matthew Wilcox Cc: Konstantin Khlebnikov Cc: Tejun Heo Cc: linux-kernel@vger.kernel.org Cc: linux...@kvack.org Cc: cgro...@vger.kernel.org --- include/linux/memcontrol.h | 58 + include/linux/mmzone.h | 3 +- mm/compaction.c| 56 +++- mm/huge_memory.c | 11 ++--- mm/memcontrol.c| 62 -- mm/mlock.c | 22 +++--- mm/mmzone.c| 1 + mm/page_alloc.c| 1 - mm/swap.c | 105 + mm/vmscan.c| 55 +++- 10 files changed, 249 insertions(+), 125 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index d0b036123c6a..7b170e9028b5 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -494,6 +494,19 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg, struct mem_cgroup *get_mem_cgroup_from_page(struct page *page); +struct lruvec *lock_page_lruvec(struct page *page); +struct lruvec *lock_page_lruvec_irq(struct page *page); +struct lruvec *lock_page_lruvec_irqsave(struct page *page, + unsigned long *flags); + +#ifdef CONFIG_DEBUG_VM +void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page); +#else +static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page) +{ +} +#endif + static inline struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){ return css ? 
container_of(css, struct mem_cgroup, css) : NULL; @@ -1035,6 +1048,31 @@ static inline void mem_cgroup_put(struct mem_cgroup *memcg) { } +static inline struct lruvec *lock_page_lruvec(struct page *page) +{ + struct pglist_data *pgdat = page_pgdat(page); + + spin_lock(>__lruvec.lru_lock); + return >__lruvec; +} + +static inline struct lruvec *lock_page_lruvec_irq(struct page *page) +{ + struct pglist_data *pgdat = page_pgdat(page); + + spin_lock_irq(>__lruvec.lru_lock); + return >__lruvec; +} + +static inline struct lruvec *lock_page_lruvec_irqsave(struct page *page, + unsigned long *flagsp) +{ + struct pglist_data *pgdat = page_pgdat(page); + + spin_lock_irqsave(>__lruvec.lru_lock, *flagsp); + return >__lruvec; +} + static inline struct mem_cgroup * mem_cgroup_iter(struct mem_cgroup *root, struct mem_cgroup *prev, @@ -1282,6 +1320,10 @@ static inline void count_memcg_page_event(struct page *page, void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx) { } + +static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page) +{ +} #endif /* CONFIG_MEMCG */ /* idx can be of type enum memcg_stat_item or node_stat_item */ @@ -1411,6 +1453,22 @@ static inline struct lruvec *parent_lruvec(struct lruvec *lruvec) return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec)); } +static inline void unlock_page_lruvec(struct lruvec *lruvec) +{ + spin_unlock(>lru_lock); +} + +static inline void unlock_page_lruvec_irq(struct lruvec *lruvec) +{ + spin_unlock_irq(>lru_lock); +} + +static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec, + unsigned long flags) +{ + spin_unlock_irqrestore(>lru_lock, flags); +} + #ifdef CONFIG_CGROUP_WRITEBACK struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb); diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 8379432f4f2f..7727f4c373f7 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -273,6 +273,8 @@ enum lruvec_flags { }; struct lruvec { + /* per lruvec lru_lock 
for memcg */ +
[PATCH v19 03/20] mm/thp: move lru_add_page_tail func to huge_memory.c
The func is only used in huge_memory.c, defining it in other file with a CONFIG_TRANSPARENT_HUGEPAGE macro restrict just looks weird. Let's move it THP. And make it static as Hugh Dickin suggested. Signed-off-by: Alex Shi Reviewed-by: Kirill A. Shutemov Acked-by: Hugh Dickins Cc: Andrew Morton Cc: Johannes Weiner Cc: Matthew Wilcox Cc: Hugh Dickins Cc: linux-kernel@vger.kernel.org Cc: linux...@kvack.org --- include/linux/swap.h | 2 -- mm/huge_memory.c | 30 ++ mm/swap.c| 33 - 3 files changed, 30 insertions(+), 35 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 661046994db4..43e6b3458f58 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -338,8 +338,6 @@ extern void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages); extern void lru_note_cost_page(struct page *); extern void lru_cache_add(struct page *); -extern void lru_add_page_tail(struct page *page, struct page *page_tail, -struct lruvec *lruvec, struct list_head *head); extern void activate_page(struct page *); extern void mark_page_accessed(struct page *); extern void lru_add_drain(void); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index faadc449cca5..211d50f09785 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2318,6 +2318,36 @@ static void remap_page(struct page *page) } } +static void lru_add_page_tail(struct page *page, struct page *page_tail, + struct lruvec *lruvec, struct list_head *list) +{ + VM_BUG_ON_PAGE(!PageHead(page), page); + VM_BUG_ON_PAGE(PageCompound(page_tail), page); + VM_BUG_ON_PAGE(PageLRU(page_tail), page); + lockdep_assert_held(_pgdat(lruvec)->lru_lock); + + if (!list) + SetPageLRU(page_tail); + + if (likely(PageLRU(page))) + list_add_tail(_tail->lru, >lru); + else if (list) { + /* page reclaim is reclaiming a huge page */ + get_page(page_tail); + list_add_tail(_tail->lru, list); + } else { + /* +* Head page has not yet been counted, as an hpage, +* so we must account for each subpage individually. 
+* +* Put page_tail on the list at the correct position +* so they all end up in order. +*/ + add_page_to_lru_list_tail(page_tail, lruvec, + page_lru(page_tail)); + } +} + static void __split_huge_page_tail(struct page *head, int tail, struct lruvec *lruvec, struct list_head *list) { diff --git a/mm/swap.c b/mm/swap.c index e7bdf094f76a..1bc2fc5d8b7c 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -935,39 +935,6 @@ void __pagevec_release(struct pagevec *pvec) } EXPORT_SYMBOL(__pagevec_release); -#ifdef CONFIG_TRANSPARENT_HUGEPAGE -/* used by __split_huge_page_refcount() */ -void lru_add_page_tail(struct page *page, struct page *page_tail, - struct lruvec *lruvec, struct list_head *list) -{ - VM_BUG_ON_PAGE(!PageHead(page), page); - VM_BUG_ON_PAGE(PageCompound(page_tail), page); - VM_BUG_ON_PAGE(PageLRU(page_tail), page); - lockdep_assert_held(_pgdat(lruvec)->lru_lock); - - if (!list) - SetPageLRU(page_tail); - - if (likely(PageLRU(page))) - list_add_tail(_tail->lru, >lru); - else if (list) { - /* page reclaim is reclaiming a huge page */ - get_page(page_tail); - list_add_tail(_tail->lru, list); - } else { - /* -* Head page has not yet been counted, as an hpage, -* so we must account for each subpage individually. -* -* Put page_tail on the list at the correct position -* so they all end up in order. -*/ - add_page_to_lru_list_tail(page_tail, lruvec, - page_lru(page_tail)); - } -} -#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ - static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec, void *arg) { -- 1.8.3.1
[PATCH v19 05/20] mm/thp: Simplify lru_add_page_tail()
Simplify lru_add_page_tail(): there are actually only two cases possible: split_huge_page_to_list(), with list supplied and head isolated from lru by its caller; or split_huge_page(), with NULL list and head on lru - because when head is racily isolated from lru, the isolator's reference will stop the split from getting any further than its page_ref_freeze(). So decide between the two cases by "list", but add VM_WARN_ON()s to verify that they match our lru expectations.

[Hugh Dickins: rewrite commit log]

Signed-off-by: Alex Shi
Reviewed-by: Kirill A. Shutemov
Acked-by: Hugh Dickins
Cc: Kirill A. Shutemov
Cc: Andrew Morton
Cc: Johannes Weiner
Cc: Matthew Wilcox
Cc: Hugh Dickins
Cc: Mika Penttilä
Cc: linux...@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/huge_memory.c | 21 ++++++---------------
 1 file changed, 6 insertions(+), 15 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fa0753229b52..8b92cd197218 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2326,25 +2326,16 @@ static void lru_add_page_tail(struct page *head, struct page *page_tail,
 	VM_BUG_ON_PAGE(PageLRU(page_tail), head);
 	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
 
-	if (!list)
-		SetPageLRU(page_tail);
-
-	if (likely(PageLRU(head)))
-		list_add_tail(&page_tail->lru, &head->lru);
-	else if (list) {
+	if (list) {
 		/* page reclaim is reclaiming a huge page */
+		VM_WARN_ON(PageLRU(head));
 		get_page(page_tail);
 		list_add_tail(&page_tail->lru, list);
 	} else {
-		/*
-		 * Head page has not yet been counted, as an hpage,
-		 * so we must account for each subpage individually.
-		 *
-		 * Put page_tail on the list at the correct position
-		 * so they all end up in order.
-		 */
-		add_page_to_lru_list_tail(page_tail, lruvec,
-					  page_lru(page_tail));
+		/* head is still on lru (and we have it frozen) */
+		VM_WARN_ON(!PageLRU(head));
+		SetPageLRU(page_tail);
+		list_add_tail(&page_tail->lru, &head->lru);
 	}
 }
-- 
1.8.3.1
[PATCH v19 20/20] mm/lru: revise the comments of lru_lock
From: Hugh Dickins Since we changed the pgdat->lru_lock to lruvec->lru_lock, it's time to fix the incorrect comments in code. Also fixed some zone->lru_lock comment error from ancient time. etc. I struggled to understand the comment above move_pages_to_lru() (surely it never calls page_referenced()), and eventually realized that most of it had got separated from shrink_active_list(): move that comment back. Signed-off-by: Hugh Dickins Signed-off-by: Alex Shi Cc: Andrew Morton Cc: Tejun Heo Cc: Andrey Ryabinin Cc: Jann Horn Cc: Mel Gorman Cc: Johannes Weiner Cc: Matthew Wilcox Cc: Hugh Dickins Cc: cgro...@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux...@kvack.org --- Documentation/admin-guide/cgroup-v1/memcg_test.rst | 15 ++-- Documentation/admin-guide/cgroup-v1/memory.rst | 21 +-- Documentation/trace/events-kmem.rst| 2 +- Documentation/vm/unevictable-lru.rst | 22 +--- include/linux/mm_types.h | 2 +- include/linux/mmzone.h | 3 +- mm/filemap.c | 4 +-- mm/rmap.c | 4 +-- mm/vmscan.c| 41 -- 9 files changed, 50 insertions(+), 64 deletions(-) diff --git a/Documentation/admin-guide/cgroup-v1/memcg_test.rst b/Documentation/admin-guide/cgroup-v1/memcg_test.rst index 3f7115e07b5d..0b9f91589d3d 100644 --- a/Documentation/admin-guide/cgroup-v1/memcg_test.rst +++ b/Documentation/admin-guide/cgroup-v1/memcg_test.rst @@ -133,18 +133,9 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. 8. LRU == -Each memcg has its own private LRU. Now, its handling is under global - VM's control (means that it's handled under global pgdat->lru_lock). - Almost all routines around memcg's LRU is called by global LRU's - list management functions under pgdat->lru_lock. - - A special function is mem_cgroup_isolate_pages(). This scans - memcg's private LRU and call __isolate_lru_page() to extract a page - from LRU. - - (By __isolate_lru_page(), the page is removed from both of global and - private LRU.) 
- + Each memcg has its own vector of LRUs (inactive anon, active anon, + inactive file, active file, unevictable) of pages from each node, + each LRU handled under a single lru_lock for that memcg and node. 9. Typical Tests. = diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst index 12757e63b26c..24450696579f 100644 --- a/Documentation/admin-guide/cgroup-v1/memory.rst +++ b/Documentation/admin-guide/cgroup-v1/memory.rst @@ -285,20 +285,17 @@ When oom event notifier is registered, event will be delivered. 2.6 Locking --- - lock_page_cgroup()/unlock_page_cgroup() should not be called under - the i_pages lock. +Lock order is as follows: - Other lock order is following: + Page lock (PG_locked bit of page->flags) +mm->page_table_lock or split pte_lock + lock_page_memcg (memcg->move_lock) +mapping->i_pages lock + lruvec->lru_lock. - PG_locked. - mm->page_table_lock - pgdat->lru_lock - lock_page_cgroup. - - In many cases, just lock_page_cgroup() is called. - - per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by - pgdat->lru_lock, it has no lock of its own. +Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by +lruvec->lru_lock; PG_lru bit of page->flags is cleared before +isolating a page from its LRU under lruvec->lru_lock. 2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM) --- diff --git a/Documentation/trace/events-kmem.rst b/Documentation/trace/events-kmem.rst index 555484110e36..68fa75247488 100644 --- a/Documentation/trace/events-kmem.rst +++ b/Documentation/trace/events-kmem.rst @@ -69,7 +69,7 @@ When pages are freed in batch, the also mm_page_free_batched is triggered. Broadly speaking, pages are taken off the LRU lock in bulk and freed in batch with a page list. Significant amounts of activity here could indicate that the system is under memory pressure and can also indicate -contention on the zone->lru_lock. +contention on the lruvec->lru_lock. 4. 
Per-CPU Allocator Activity = diff --git a/Documentation/vm/unevictable-lru.rst b/Documentation/vm/unevictable-lru.rst index 17d0861b0f1d..0e1490524f53 100644 --- a/Documentation/vm/unevictable-lru.rst +++ b/Documentation/vm/unevictable-lru.rst @@ -33,7 +33,7 @@ reclaim in Linux. The problems have been observed at customer sites on large memory x86_64 systems. To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of -main memory will have over 32 million 4k pages in a single zone. When a large +main memory will have over 32 million
[PATCH v19 09/20] mm/memcg: add debug checking in lock_page_memcg
Add debug checking in lock_page_memcg(), so that we get a warning if anything goes wrong here.

Suggested-by: Johannes Weiner
Signed-off-by: Alex Shi
Acked-by: Hugh Dickins
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Vladimir Davydov
Cc: Andrew Morton
Cc: cgro...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/memcontrol.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 368e6ac644f0..6acc6a60c52b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2097,6 +2097,12 @@ struct mem_cgroup *lock_page_memcg(struct page *page)
 	if (unlikely(!memcg))
 		return NULL;
 
+#ifdef CONFIG_PROVE_LOCKING
+	local_irq_save(flags);
+	might_lock(&memcg->move_lock);
+	local_irq_restore(flags);
+#endif
+
 	if (atomic_read(&memcg->moving_account) <= 0)
 		return memcg;
-- 
1.8.3.1
[PATCH v19 00/20] per memcg lru_lock
This new version is rebased on v5.9-rc6, with a line-by-line review by Hugh Dickins. Many thanks! The 4th part of the last version, which did post-merge optimization, has been removed; we can repost it after the main part gets merged. After about a year of coding and review, I now believe it's ready.

So this patchset now includes 3 parts:
1, some code cleanup and minimal optimization as preparation.
2, use TestClearPageLRU as the precondition for page isolation.
3, replace the per-node lru_lock with a per-memcg, per-node lru_lock.

The current lru_lock is one per node, pgdat->lru_lock, guarding the lru lists, but the lru lists were moved into memcg a long time ago. Still using a per-node lru_lock is clearly unscalable: pages on each memcg have to compete with every other memcg for the one shared lru_lock. This patchset uses a per-lruvec (per-memcg, per-node) lru_lock to replace the per-node lru_lock guarding the lru lists, making it scalable per memcg and gaining performance.

Currently lru_lock still guards both the lru list and the page's lru bit; that's fine. But if we want to take the lruvec-specific lock for a page, we need to pin down the page's lruvec/memcg while locking. Just taking the lruvec lock first can be undermined by a memcg charge/migration of the page. To fix this problem, we clear the page's lru bit first and use that as the pin-down action that blocks memcg changes. That's the reason for the new atomic function TestClearPageLRU. So isolating a page now needs both actions: TestClearPageLRU, then holding the lru_lock. The typical user of this is isolate_migratepages_block() in compaction.c: we have to take the lru bit before the lru lock, which serializes page isolation against memcg page charge/migration, since those change the page's lruvec and hence which lru_lock applies.

The above solution was suggested by Johannes Weiner and builds on his new memcg charge path, which led to this patchset. (Hugh Dickins tested and contributed much code, from compaction fixes to general code polish, thanks a lot!).
Daniel Jordan's testing show 62% improvement on modified readtwice case on his 2P * 10 core * 2 HT broadwell box. https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3lo...@ca-dmjordan1.us.oracle.com/ Thanks Hugh Dickins and Konstantin Khlebnikov, they both brought this idea 8 years ago, and others who give comments as well: Daniel Jordan, Mel Gorman, Shakeel Butt, Matthew Wilcox, Alexander Duyck etc. Thanks for Testing support from Intel 0day and Rong Chen, Fengguang Wu, and Yun Wang. Hugh Dickins also shared his kbuild-swap case. Thanks! Alex Shi (17): mm/memcg: warning on !memcg after readahead page charged mm/memcg: bail early from swap accounting if memcg disabled mm/thp: move lru_add_page_tail func to huge_memory.c mm/thp: use head for head page in lru_add_page_tail mm/thp: Simplify lru_add_page_tail() mm/thp: narrow lru locking mm/vmscan: remove unnecessary lruvec adding mm/memcg: add debug checking in lock_page_memcg mm/swap.c: fold vm event PGROTATED into pagevec_move_tail_fn mm/lru: move lock into lru_note_cost mm/vmscan: remove lruvec reget in move_pages_to_lru mm/mlock: remove lru_lock on TestClearPageMlocked mm/mlock: remove __munlock_isolate_lru_page mm/lru: introduce TestClearPageLRU mm/compaction: do page isolation first in compaction mm/swap.c: serialize memcg changes in pagevec_lru_move_fn mm/lru: replace pgdat lru_lock with lruvec lock Alexander Duyck (1): mm/lru: introduce the relock_page_lruvec function Hugh Dickins (2): mm: page_idle_get_page() does not need lru_lock mm/lru: revise the comments of lru_lock Documentation/admin-guide/cgroup-v1/memcg_test.rst | 15 +- Documentation/admin-guide/cgroup-v1/memory.rst | 21 +-- Documentation/trace/events-kmem.rst| 2 +- Documentation/vm/unevictable-lru.rst | 22 +-- include/linux/memcontrol.h | 110 +++ include/linux/mm_types.h | 2 +- include/linux/mmdebug.h| 13 ++ include/linux/mmzone.h | 6 +- include/linux/page-flags.h | 1 + include/linux/swap.h | 4 +- mm/compaction.c| 94 +++--- mm/filemap.c | 4 +- 
mm/huge_memory.c | 45 +++-- mm/memcontrol.c| 85 - mm/mlock.c | 63 ++- mm/mmzone.c| 1 + mm/page_alloc.c| 1 - mm/page_idle.c | 4 - mm/rmap.c | 4 +- mm/swap.c | 199 mm/vmscan.c| 203 +++-- mm/workingset.c| 2 - 22 files changed, 523 insertions(+), 378
[PATCH v19 01/20] mm/memcg: warning on !memcg after readahead page charged
Add VM_WARN_ON_ONCE_PAGE() macro. Since readahead page is charged on memcg too, in theory we don't have to check this exception now. Before safely remove them all, add a warning for the unexpected !memcg. Signed-off-by: Alex Shi Acked-by: Michal Hocko Acked-by: Hugh Dickins Cc: Johannes Weiner Cc: Michal Hocko Cc: Vladimir Davydov Cc: Andrew Morton Cc: cgro...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-kernel@vger.kernel.org --- include/linux/mmdebug.h | 13 + mm/memcontrol.c | 11 --- 2 files changed, 17 insertions(+), 7 deletions(-) diff --git a/include/linux/mmdebug.h b/include/linux/mmdebug.h index 2ad72d2c8cc5..4ed52879ce55 100644 --- a/include/linux/mmdebug.h +++ b/include/linux/mmdebug.h @@ -37,6 +37,18 @@ BUG(); \ } \ } while (0) +#define VM_WARN_ON_ONCE_PAGE(cond, page) ({ \ + static bool __section(.data.once) __warned; \ + int __ret_warn_once = !!(cond); \ + \ + if (unlikely(__ret_warn_once && !__warned)) { \ + dump_page(page, "VM_WARN_ON_ONCE_PAGE(" __stringify(cond)")");\ + __warned = true;\ + WARN_ON(1); \ + } \ + unlikely(__ret_warn_once); \ +}) + #define VM_WARN_ON(cond) (void)WARN_ON(cond) #define VM_WARN_ON_ONCE(cond) (void)WARN_ON_ONCE(cond) #define VM_WARN_ONCE(cond, format...) (void)WARN_ONCE(cond, format) @@ -48,6 +60,7 @@ #define VM_BUG_ON_MM(cond, mm) VM_BUG_ON(cond) #define VM_WARN_ON(cond) BUILD_BUG_ON_INVALID(cond) #define VM_WARN_ON_ONCE(cond) BUILD_BUG_ON_INVALID(cond) +#define VM_WARN_ON_ONCE_PAGE(cond, page) BUILD_BUG_ON_INVALID(cond) #define VM_WARN_ONCE(cond, format...) BUILD_BUG_ON_INVALID(cond) #define VM_WARN(cond, format...) BUILD_BUG_ON_INVALID(cond) #endif diff --git a/mm/memcontrol.c b/mm/memcontrol.c index cfa6cbad21d5..f7094a04be07 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1322,10 +1322,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd } memcg = page->mem_cgroup; - /* -* Swapcache readahead pages are added to the LRU - and -* possibly migrated - before they are charged. 
-*/ + VM_WARN_ON_ONCE_PAGE(!memcg, page); if (!memcg) memcg = root_mem_cgroup; @@ -6912,8 +6909,8 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage) if (newpage->mem_cgroup) return; - /* Swapcache readahead pages can get replaced before being charged */ memcg = oldpage->mem_cgroup; + VM_WARN_ON_ONCE_PAGE(!memcg, oldpage); if (!memcg) return; @@ -7110,7 +7107,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry) memcg = page->mem_cgroup; - /* Readahead page, never charged */ + VM_WARN_ON_ONCE_PAGE(!memcg, page); if (!memcg) return; @@ -7174,7 +7171,7 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry) memcg = page->mem_cgroup; - /* Readahead page, never charged */ + VM_WARN_ON_ONCE_PAGE(!memcg, page); if (!memcg) return 0; -- 1.8.3.1
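For readers unfamiliar with the warn-once pattern, the VM_WARN_ON_ONCE_PAGE() macro above behaves roughly like this Python model (a toy sketch, not kernel code): a per-call-site flag, playing the role of the static `__warned`, makes each site warn at most once, while the condition's value is still returned every time so callers can still branch on it.

```python
def make_warn_once():
    warned = [False]                 # plays the role of the static __warned

    def vm_warn_on_once_page(cond, page):
        if cond and not warned[0]:
            warned[0] = True
            print("WARN: condition hit for page", page)
        return cond                  # the condition is always returned

    return vm_warn_on_once_page


warn = make_warn_once()
warn(True, 0x1000)   # prints the warning
warn(True, 0x2000)   # condition still True, but now silent
```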
[RFC PATCH 23/24] vdpa_sim: filter destination mac address
Add a simple unicast filter that drops packets whose destination MAC does not match the one stored in the config.

Signed-off-by: Jason Wang
---
 drivers/vdpa/vdpa_sim/vdpa_sim.c | 49 ++++++++++++++++++++------------
 1 file changed, 31 insertions(+), 18 deletions(-)

diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index b21670e054ba..66d901fb4c57 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -171,6 +171,22 @@ static void vdpasim_complete(struct vdpasim_virtqueue *vq, size_t len)
 	local_bh_enable();
 }
 
+static bool receive_filter(struct vdpasim *vdpasim, size_t len)
+{
+	bool modern = vdpasim->features & (1ULL << VIRTIO_F_VERSION_1);
+	size_t hdr_len = modern ? sizeof(struct virtio_net_hdr_v1) :
+				  sizeof(struct virtio_net_hdr);
+
+	if (len < ETH_ALEN + hdr_len)
+		return false;
+
+	if (!strncmp(vdpasim->buffer + hdr_len,
+		     vdpasim->config.mac, ETH_ALEN))
+		return true;
+
+	return false;
+}
+
 static void vdpasim_work(struct work_struct *work)
 {
 	struct vdpasim *vdpasim = container_of(work, struct
@@ -178,7 +194,6 @@ static void vdpasim_work(struct work_struct *work)
 	struct vdpasim_virtqueue *txq = &vdpasim->vqs[1];
 	struct vdpasim_virtqueue *rxq = &vdpasim->vqs[0];
 	ssize_t read, write;
-	size_t total_write;
 	int pkts = 0;
 	int err;
 
@@ -191,36 +206,34 @@ static void vdpasim_work(struct work_struct *work)
 		goto out;
 
 	while (true) {
-		total_write = 0;
 		err = vringh_getdesc_iotlb(&txq->vring, &txq->riov, NULL,
 					   &txq->head, GFP_ATOMIC);
 		if (err <= 0)
 			break;
 
+		read = vringh_iov_pull_iotlb(&txq->vring, &txq->riov,
+					     vdpasim->buffer,
+					     PAGE_SIZE);
+
+		if (!receive_filter(vdpasim, read)) {
+			vdpasim_complete(txq, 0);
+			continue;
+		}
+
 		err = vringh_getdesc_iotlb(&rxq->vring, NULL, &rxq->wiov,
 					   &rxq->head, GFP_ATOMIC);
 		if (err <= 0) {
-			vringh_complete_iotlb(&txq->vring, txq->head, 0);
+			vdpasim_complete(txq, 0);
 			break;
 		}
 
-		while (true) {
-			read = vringh_iov_pull_iotlb(&txq->vring, &txq->riov,
-						     vdpasim->buffer,
-						     PAGE_SIZE);
-			if (read <= 0)
-				break;
-
-			write = vringh_iov_push_iotlb(&rxq->vring, &rxq->wiov,
-						      vdpasim->buffer, read);
-			if (write <= 0)
-				break;
-
-			total_write += write;
-		}
+		write = vringh_iov_push_iotlb(&rxq->vring, &rxq->wiov,
+					      vdpasim->buffer, read);
+		if (write <= 0)
+			break;
 
 		vdpasim_complete(txq, 0);
-		vdpasim_complete(rxq, total_write);
+		vdpasim_complete(rxq, write);
 
 		if (++pkts > 4) {
 			schedule_work(&vdpasim->work);
-- 
2.20.1
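The filter's logic can be sketched in a few lines of Python (a toy model, not the driver's code; the 12-byte header size is that of a modern device's struct virtio_net_hdr_v1). Note that the patch compares the MACs with strncmp(), which stops early at a 0x00 byte; a plain byte-wise compare, as modeled here, is the safer behavior since MAC bytes may legitimately be zero.

```python
ETH_ALEN = 6


def receive_filter(frame, dev_mac, hdr_len):
    # Accept the frame only if the destination MAC (the first 6 bytes
    # after the virtio-net header) equals the device's configured MAC.
    if len(frame) < hdr_len + ETH_ALEN:
        return False
    return frame[hdr_len:hdr_len + ETH_ALEN] == dev_mac


hdr_len = 12                                    # sizeof(struct virtio_net_hdr_v1)
mac = bytes.fromhex("525400123456")
ok = bytes(hdr_len) + mac + b"payload"
other = bytes(hdr_len) + bytes.fromhex("525400654321") + b"payload"
print(receive_filter(ok, mac, hdr_len),         # True
      receive_filter(other, mac, hdr_len))      # False
```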
Re: [PATCH v2 1/4] dt-bindings: arm: hisilicon: add binding for SD5203 SoC
On 2020/9/24 4:59, Rob Herring wrote: > On Sat, Sep 19, 2020 at 02:45:52PM +0800, Zhen Lei wrote: >> Add devicetree binding for Hisilicon SD5203 SoC. >> >> Signed-off-by: Zhen Lei >> --- >> Documentation/devicetree/bindings/arm/hisilicon/hisilicon.txt | 5 + >> 1 file changed, 5 insertions(+) > > Please convert this to DT schema format first. OK, I will do it. > >> >> diff --git a/Documentation/devicetree/bindings/arm/hisilicon/hisilicon.txt >> b/Documentation/devicetree/bindings/arm/hisilicon/hisilicon.txt >> index a97f643e7d1c760..5d80070bfb13fc0 100644 >> --- a/Documentation/devicetree/bindings/arm/hisilicon/hisilicon.txt >> +++ b/Documentation/devicetree/bindings/arm/hisilicon/hisilicon.txt >> @@ -56,6 +56,11 @@ HiP07 D05 Board >> Required root node properties: >> - compatible = "hisilicon,hip07-d05"; >> >> +SD5203 SoC >> +Required root node properties: >> +- compatible = "hisilicon,sd5203"; >> + >> + >> Hisilicon system controller >> >> Required properties: >> -- >> 1.8.3 >> >> > > . >
Re: [PATCH v5 5/6] dt-bindings: dw-apb-ictl: convert to json-schema
On 2020/9/24 4:49, Rob Herring wrote: > On Fri, Sep 18, 2020 at 07:22:01PM +0800, Zhen Lei wrote: >> Convert the Synopsys DesignWare APB interrupt controller (dw_apb_ictl) >> binding to DT schema format using json-schema. >> >> Signed-off-by: Zhen Lei >> --- >> .../interrupt-controller/snps,dw-apb-ictl.txt | 43 - >> .../interrupt-controller/snps,dw-apb-ictl.yaml | 75 >> ++ >> 2 files changed, 75 insertions(+), 43 deletions(-) >> delete mode 100644 >> Documentation/devicetree/bindings/interrupt-controller/snps,dw-apb-ictl.txt >> create mode 100644 >> Documentation/devicetree/bindings/interrupt-controller/snps,dw-apb-ictl.yaml >> >> diff --git >> a/Documentation/devicetree/bindings/interrupt-controller/snps,dw-apb-ictl.txt >> >> b/Documentation/devicetree/bindings/interrupt-controller/snps,dw-apb-ictl.txt >> deleted file mode 100644 >> index 2db59df9408f4c6..000 >> --- >> a/Documentation/devicetree/bindings/interrupt-controller/snps,dw-apb-ictl.txt >> +++ /dev/null >> @@ -1,43 +0,0 @@ >> -Synopsys DesignWare APB interrupt controller (dw_apb_ictl) >> - >> -Synopsys DesignWare provides interrupt controller IP for APB known as >> -dw_apb_ictl. The IP is used as secondary interrupt controller in some SoCs >> with >> -APB bus, e.g. Marvell Armada 1500. It can also be used as primary interrupt >> -controller in some SoCs, e.g. Hisilicon SD5203. >> - >> -Required properties: >> -- compatible: shall be "snps,dw-apb-ictl" >> -- reg: physical base address of the controller and length of memory mapped >> - region starting with ENABLE_LOW register >> -- interrupt-controller: identifies the node as an interrupt controller >> -- #interrupt-cells: number of cells to encode an interrupt-specifier, shall >> be 1 >> - >> -Additional required property when it's used as secondary interrupt >> controller: >> -- interrupts: interrupt reference to primary interrupt controller >> - >> -The interrupt sources map to the corresponding bits in the interrupt >> -registers, i.e. 
>> -- 0 maps to bit 0 of low interrupts, >> -- 1 maps to bit 1 of low interrupts, >> -- 32 maps to bit 0 of high interrupts, >> -- 33 maps to bit 1 of high interrupts, >> -- (optional) fast interrupts start at 64. >> - >> -Example: >> -/* dw_apb_ictl is used as secondary interrupt controller */ >> -aic: interrupt-controller@3000 { >> -compatible = "snps,dw-apb-ictl"; >> -reg = <0x3000 0xc00>; >> -interrupt-controller; >> -#interrupt-cells = <1>; >> -interrupt-parent = <>; >> -interrupts = ; >> -}; >> - >> -/* dw_apb_ictl is used as primary interrupt controller */ >> -vic: interrupt-controller@1013 { >> -compatible = "snps,dw-apb-ictl"; >> -reg = <0x1013 0x1000>; >> -interrupt-controller; >> -#interrupt-cells = <1>; >> -}; >> diff --git >> a/Documentation/devicetree/bindings/interrupt-controller/snps,dw-apb-ictl.yaml >> >> b/Documentation/devicetree/bindings/interrupt-controller/snps,dw-apb-ictl.yaml >> new file mode 100644 >> index 000..70c12979c843bf0 >> --- /dev/null >> +++ >> b/Documentation/devicetree/bindings/interrupt-controller/snps,dw-apb-ictl.yaml >> @@ -0,0 +1,75 @@ >> +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) >> +%YAML 1.2 >> +--- >> +$id: >> http://devicetree.org/schemas/interrupt-controller/snps,dw-apb-ictl.yaml# >> +$schema: http://devicetree.org/meta-schemas/core.yaml# >> + >> +title: Synopsys DesignWare APB interrupt controller (dw_apb_ictl) >> + >> +maintainers: >> + - Marc Zyngier > > Usually this would be an owner for this IP block, not the subsystem > maintainer. OK, I will change it to the author of the file "snps,dw-apb-ictl.txt". > >> + >> +description: >> + Synopsys DesignWare provides interrupt controller IP for APB known as >> + dw_apb_ictl. The IP is used as secondary interrupt controller in some SoCs >> + with APB bus, e.g. Marvell Armada 1500. It can also be used as primary >> + interrupt controller in some SoCs, e.g. Hisilicon SD5203. 
>> + >> +allOf: >> + - $ref: /schemas/interrupt-controller.yaml# > > You can drop this, it's already applied based on node name. But if we drop this, the "snps,dw-apb-ictl.yaml" can not require that the node name must match '^interrupt-controller(@[0-9a-f,]+)*$'. The problem of Patch 6/6 was discovered by this. > >> + >> +properties: >> + compatible: >> +const: snps,dw-apb-ictl >> + >> + interrupt-controller: true >> + >> + reg: >> +description: >> + Physical base address of the controller and length of memory mapped >> + region starting with ENABLE_LOW register. > > Need to define how many reg regions (maxItems: 1). OK, I will add it. > >> + >> + interrupts: >> +description: >> + Interrupt reference to primary interrupt controller. >> + >> + The interrupt sources map to the corresponding
[RFC PATCH 24/24] vdpasim: control virtqueue support
This patch introduces control virtqueue support for the vDPA simulator. This is a requirement for supporting advanced features like multiqueue.

A requirement for a control virtqueue is to isolate its memory accesses from the rx/tx virtqueues. This is because when a vDPA device is used by a VM, the control virtqueue may not be assigned to the VM directly but shadowed by the userspace VMM (Qemu). The isolation is done via the virtqueue groups and ASID support added to vDPA through vhost-vdpa.

The simulator is extended to have:

1) three virtqueues: RXVQ, TXVQ and CVQ (control virtqueue)
2) two virtqueue groups: group 0 contains RX/TX, group 1 contains CVQ
3) two address spaces; the simulator simply implements the address spaces by mapping them 1:1 to IOTLBs.

For the VM use case, userspace (Qemu) may set AS 0 to group 0 and AS 1 to group 1. So we have:

1) The IOTLB for virtqueue group 0 contains the mappings of the guest, so RX and TX can be assigned to the guest directly.
2) The IOTLB for virtqueue group 1 contains the mappings of the CVQ, i.e. buffers that are allocated and managed by userspace only, so the guest can not access the CVQ of vhost-vdpa.

For the other use cases, since AS 0 is associated with all virtqueue groups by default, all virtqueues share the same mapping by default.

To demonstrate the function, VIRTIO_NET_F_CTRL_MAC_ADDR is implemented in the simulator so the driver can set the mac address.
Signed-off-by: Jason Wang --- drivers/vdpa/vdpa_sim/vdpa_sim.c | 189 +++ 1 file changed, 166 insertions(+), 23 deletions(-) diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c index 66d901fb4c57..3459539c4460 100644 --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c @@ -56,14 +56,18 @@ struct vdpasim_virtqueue { #define VDPASIM_QUEUE_MAX 256 #define VDPASIM_DEVICE_ID 0x1 #define VDPASIM_VENDOR_ID 0 -#define VDPASIM_VQ_NUM 0x2 +#define VDPASIM_VQ_NUM 0x3 +#define VDPASIM_AS_NUM 0x2 +#define VDPASIM_GROUP_NUM 0x2 #define VDPASIM_NAME "vdpasim-netdev" static u64 vdpasim_features = (1ULL << VIRTIO_F_ANY_LAYOUT) | (1ULL << VIRTIO_F_VERSION_1) | (1ULL << VIRTIO_F_ACCESS_PLATFORM) | (1ULL << VIRTIO_NET_F_MTU) | - (1ULL << VIRTIO_NET_F_MAC); + (1ULL << VIRTIO_NET_F_MAC) | + (1ULL << VIRTIO_NET_F_CTRL_VQ) | + (1ULL << VIRTIO_NET_F_CTRL_MAC_ADDR); /* State of each vdpasim device */ struct vdpasim { @@ -143,11 +147,17 @@ static void vdpasim_reset(struct vdpasim *vdpasim) { int i; - for (i = 0; i < VDPASIM_VQ_NUM; i++) + spin_lock(>iommu_lock); + + for (i = 0; i < VDPASIM_VQ_NUM; i++) { vdpasim_vq_reset(>vqs[i]); + vringh_set_iotlb(>vqs[i].vring, +>iommu[0]); + } - spin_lock(>iommu_lock); - vhost_iotlb_reset(vdpasim->iommu); + for (i = 0; i < VDPASIM_AS_NUM; i++) { + vhost_iotlb_reset(>iommu[i]); + } spin_unlock(>iommu_lock); vdpasim->features = 0; @@ -187,6 +197,80 @@ static bool receive_filter(struct vdpasim *vdpasim, size_t len) return false; } +virtio_net_ctrl_ack vdpasim_handle_ctrl_mac(struct vdpasim *vdpasim, + u8 cmd) +{ + struct vdpasim_virtqueue *cvq = >vqs[2]; + virtio_net_ctrl_ack status = VIRTIO_NET_ERR; + size_t read; + + switch (cmd) { + case VIRTIO_NET_CTRL_MAC_ADDR_SET: + read = vringh_iov_pull_iotlb(>vring, >riov, +(void *)vdpasim->config.mac, +ETH_ALEN); + if (read == ETH_ALEN) + status = VIRTIO_NET_OK; + break; + default: + break; + } + + return status; +} + +static void vdpasim_handle_cvq(struct 
vdpasim *vdpasim) +{ + struct vdpasim_virtqueue *cvq = >vqs[2]; + virtio_net_ctrl_ack status = VIRTIO_NET_ERR; + struct virtio_net_ctrl_hdr ctrl; + size_t read, write; + int err; + + if (!(vdpasim->features & (1ULL << VIRTIO_NET_F_CTRL_VQ))) + return; + + if (!cvq->ready) + return; + + while (true) { + err = vringh_getdesc_iotlb(>vring, >riov, >wiov, + >head, GFP_ATOMIC); + if (err <= 0) + break; + + read = vringh_iov_pull_iotlb(>vring, >riov, , +sizeof(ctrl)); + if (read != sizeof(ctrl)) + break; + + switch (ctrl.class) { + case VIRTIO_NET_CTRL_MAC: + status = vdpasim_handle_ctrl_mac(vdpasim, ctrl.cmd);
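To make the group/ASID split concrete, here is a small Python model (invented for this sketch, not the simulator's code) of the mapping the commit message describes: every virtqueue group starts bound to address space 0, and userspace may rebind a group, e.g. the CVQ group, to a different ASID so its mappings are isolated from the guest-visible ones.

```python
class VdpaDevice:
    def __init__(self, ngroups, nas):
        self.nas = nas
        self.group_asid = [0] * ngroups            # default: all groups use AS 0
        self.iotlb = [dict() for _ in range(nas)]  # one mapping table per AS

    def set_group_asid(self, group, asid):
        if asid >= self.nas:
            raise ValueError("no such address space")
        self.group_asid[group] = asid

    def translate(self, group, iova):
        # a virtqueue's DMA goes through its group's current address space
        return self.iotlb[self.group_asid[group]].get(iova)


dev = VdpaDevice(ngroups=2, nas=2)
dev.iotlb[0][0x1000] = "guest page"       # guest mappings live in AS 0
dev.iotlb[1][0x1000] = "vmm cvq buffer"   # VMM-private mappings live in AS 1
dev.set_group_asid(1, 1)                  # bind the CVQ group to AS 1
print(dev.translate(0, 0x1000))           # guest page
print(dev.translate(1, 0x1000))           # vmm cvq buffer
```

The same IOVA resolves differently per group, which is exactly why the guest cannot reach the CVQ buffers.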
[RFC PATCH 22/24] vdpa_sim: factor out buffer completion logic
This patch factors out the buffer completion logic in order to support future features.

Signed-off-by: Jason Wang
---
 drivers/vdpa/vdpa_sim/vdpa_sim.c | 33 ++++++++++++++++++---------------
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index ca5c2d0db905..b21670e054ba 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -155,6 +155,22 @@ static void vdpasim_reset(struct vdpasim *vdpasim)
 	++vdpasim->generation;
 }
 
+static void vdpasim_complete(struct vdpasim_virtqueue *vq, size_t len)
+{
+	/* Make sure data is wrote before advancing index */
+	smp_wmb();
+
+	vringh_complete_iotlb(&vq->vring, vq->head, len);
+
+	/* Make sure used is visible before rasing the interrupt. */
+	smp_wmb();
+
+	local_bh_disable();
+	if (vq->cb)
+		vq->cb(vq->private);
+	local_bh_enable();
+}
+
 static void vdpasim_work(struct work_struct *work)
 {
 	struct vdpasim *vdpasim = container_of(work, struct
@@ -203,21 +219,8 @@ static void vdpasim_work(struct work_struct *work)
 			total_write += write;
 		}
 
-		/* Make sure data is wrote before advancing index */
-		smp_wmb();
-
-		vringh_complete_iotlb(&txq->vring, txq->head, 0);
-		vringh_complete_iotlb(&rxq->vring, rxq->head, total_write);
-
-		/* Make sure used is visible before rasing the interrupt. */
-		smp_wmb();
-
-		local_bh_disable();
-		if (txq->cb)
-			txq->cb(txq->private);
-		if (rxq->cb)
-			rxq->cb(rxq->private);
-		local_bh_enable();
+		vdpasim_complete(txq, 0);
+		vdpasim_complete(rxq, total_write);
 
 		if (++pkts > 4) {
 			schedule_work(&vdpasim->work);
-- 
2.20.1
[RFC PATCH 21/24] vdpa_sim: advertise VIRTIO_NET_F_MAC
We advertise mac address via config space, so let's advertise VIRTIO_NET_F_MAC. Signed-off-by: Jason Wang --- drivers/vdpa/vdpa_sim/vdpa_sim.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c index 4b2d0d3fbc87..ca5c2d0db905 100644 --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c @@ -62,7 +62,8 @@ struct vdpasim_virtqueue { static u64 vdpasim_features = (1ULL << VIRTIO_F_ANY_LAYOUT) | (1ULL << VIRTIO_F_VERSION_1) | (1ULL << VIRTIO_F_ACCESS_PLATFORM) | - (1ULL << VIRTIO_NET_F_MTU); + (1ULL << VIRTIO_NET_F_MTU) | + (1ULL << VIRTIO_NET_F_MAC); /* State of each vdpasim device */ struct vdpasim { -- 2.20.1
[RFC PATCH 20/24] vdpa_sim: advertise VIRTIO_NET_F_MTU
We've already reported maximum mtu via config space, so let's advertise the feature. Signed-off-by: Jason Wang --- drivers/vdpa/vdpa_sim/vdpa_sim.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c index d1764a64578d..4b2d0d3fbc87 100644 --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c @@ -61,7 +61,8 @@ struct vdpasim_virtqueue { static u64 vdpasim_features = (1ULL << VIRTIO_F_ANY_LAYOUT) | (1ULL << VIRTIO_F_VERSION_1) | - (1ULL << VIRTIO_F_ACCESS_PLATFORM); + (1ULL << VIRTIO_F_ACCESS_PLATFORM) | + (1ULL << VIRTIO_NET_F_MTU); /* State of each vdpasim device */ struct vdpasim { -- 2.20.1
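The two feature patches above each just OR one more bit into the device's 64-bit feature mask, which the driver then tests during negotiation. A minimal sketch (the bit numbers are the standard virtio-net values, VIRTIO_NET_F_MTU = 3 and VIRTIO_NET_F_MAC = 5):

```python
VIRTIO_NET_F_MTU = 3
VIRTIO_NET_F_MAC = 5

# device-side feature mask, built exactly as in vdpasim_features
features = (1 << VIRTIO_NET_F_MTU) | (1 << VIRTIO_NET_F_MAC)


def has_feature(features, bit):
    return bool(features & (1 << bit))


print(has_feature(features, VIRTIO_NET_F_MTU),   # True
      has_feature(features, VIRTIO_NET_F_MAC))   # True
```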
[RFC PATCH 16/24] vhost-vdpa: uAPI to get virtqueue group id
Following the support for virtqueue groups in vDPA, this patch introduces a uAPI to get the virtqueue group ID for a specific virtqueue in vhost-vdpa.

Signed-off-by: Jason Wang
---
 drivers/vhost/vdpa.c       | 8 ++++++++
 include/uapi/linux/vhost.h | 4 ++++
 2 files changed, 12 insertions(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 4d97a59824a1..a234d3783e16 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -433,6 +433,14 @@ static long vhost_vdpa_vring_ioctl(struct vhost_vdpa *v, unsigned int cmd,
 			return -EFAULT;
 		ops->set_vq_ready(vdpa, idx, s.num);
 		return 0;
+	case VHOST_VDPA_GET_VRING_GROUP:
+		s.index = idx;
+		s.num = ops->get_vq_group(vdpa, idx);
+		if (s.num >= vdpa->ngroups)
+			return -EIO;
+		else if (copy_to_user(argp, &s, sizeof s))
+			return -EFAULT;
+		return 0;
 	case VHOST_GET_VRING_BASE:
 		r = ops->get_vq_state(v->vdpa, idx, &vq_state);
 		if (r)
diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index 99bdf50efc50..d1c4b5561fee 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -147,4 +147,8 @@
 /* Get the number of address spaces. */
 #define VHOST_VDPA_GET_AS_NUM		_IOR(VHOST_VIRTIO, 0x79, unsigned int)
+
+/* Get the group for a virtqueue: read index, write group in num */
+#define VHOST_VDPA_GET_VRING_GROUP	_IOWR(VHOST_VIRTIO, 0x79, \
+					      struct vhost_vring_state)
 #endif
-- 
2.20.1
[RFC PATCH 19/24] vdpa_sim: use separated iov for reading and writing
In order to support a control virtqueue whose commands have both in and out descriptors, we need to use separate iovs for reading and writing in vdpa_sim.

Signed-off-by: Jason Wang
---
 drivers/vdpa/vdpa_sim/vdpa_sim.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index 5dc04ec271bb..d1764a64578d 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -40,7 +40,8 @@ MODULE_PARM_DESC(batch_mapping, "Batched mapping 1 -Enable; 0 - Disable");
 
 struct vdpasim_virtqueue {
 	struct vringh vring;
-	struct vringh_kiov iov;
+	struct vringh_kiov riov;
+	struct vringh_kiov wiov;
 	unsigned short head;
 	bool ready;
 	u64 desc_addr;
@@ -173,12 +174,12 @@ static void vdpasim_work(struct work_struct *work)
 
 	while (true) {
 		total_write = 0;
-		err = vringh_getdesc_iotlb(&txq->vring, &txq->iov, NULL,
+		err = vringh_getdesc_iotlb(&txq->vring, &txq->riov, NULL,
 					   &txq->head, GFP_ATOMIC);
 		if (err <= 0)
 			break;
 
-		err = vringh_getdesc_iotlb(&rxq->vring, NULL, &rxq->iov,
+		err = vringh_getdesc_iotlb(&rxq->vring, NULL, &rxq->wiov,
 					   &rxq->head, GFP_ATOMIC);
 		if (err <= 0) {
 			vringh_complete_iotlb(&txq->vring, txq->head, 0);
@@ -186,13 +187,13 @@ static void vdpasim_work(struct work_struct *work)
 		}
 
 		while (true) {
-			read = vringh_iov_pull_iotlb(&txq->vring, &txq->iov,
+			read = vringh_iov_pull_iotlb(&txq->vring, &txq->riov,
 						     vdpasim->buffer,
 						     PAGE_SIZE);
 			if (read <= 0)
 				break;
 
-			write = vringh_iov_push_iotlb(&rxq->vring, &rxq->iov,
+			write = vringh_iov_push_iotlb(&rxq->vring, &rxq->wiov,
 						      vdpasim->buffer, read);
 			if (write <= 0)
 				break;
-- 
2.20.1
[RFC PATCH 18/24] vhost-vdpa: support ASID based IOTLB API
This patch extends vhost-vdpa to support the ASID based IOTLB API. The vhost-vdpa device will allocate multiple IOTLBs for vDPA devices that support multiple address spaces. The IOTLBs and the vDPA device memory mappings are determined and maintained through the ASID. Note that we still don't support a vDPA device with more than one address space that depends on the platform IOMMU. This work will be done by moving the IOMMU logic from vhost-vDPA to the vDPA device driver. Signed-off-by: Jason Wang --- drivers/vhost/vdpa.c | 106 --- 1 file changed, 79 insertions(+), 27 deletions(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 978cf97dc03a..99ac13b2ed11 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -29,7 +29,8 @@ enum { VHOST_VDPA_BACKEND_FEATURES = (1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2) | - (1ULL << VHOST_BACKEND_F_IOTLB_BATCH), + (1ULL << VHOST_BACKEND_F_IOTLB_BATCH) | + (1ULL << VHOST_BACKEND_F_IOTLB_ASID), }; #define VHOST_VDPA_DEV_MAX (1U << MINORBITS) @@ -58,12 +59,20 @@ struct vhost_vdpa { struct eventfd_ctx *config_ctx; int in_batch; int used_as; + u32 batch_asid; }; static DEFINE_IDA(vhost_vdpa_ida); static dev_t vhost_vdpa_major; +static inline u32 iotlb_to_asid(struct vhost_iotlb *iotlb) +{ + struct vhost_vdpa_as *as = container_of(iotlb, struct + vhost_vdpa_as, iotlb); + return as->id; +} + static struct vhost_vdpa_as *asid_to_as(struct vhost_vdpa *v, u32 asid) { struct hlist_head *head = &v->as[asid % VHOST_VDPA_IOTLB_BUCKETS]; @@ -76,6 +85,16 @@ static struct vhost_vdpa_as *asid_to_as(struct vhost_vdpa *v, u32 asid) return NULL; } +static struct vhost_iotlb *asid_to_iotlb(struct vhost_vdpa *v, u32 asid) +{ + struct vhost_vdpa_as *as = asid_to_as(v, asid); + + if (!as) + return NULL; + + return &as->iotlb; +} + static struct vhost_vdpa_as *vhost_vdpa_alloc_as(struct vhost_vdpa *v, u32 asid) { struct hlist_head *head = &v->as[asid % VHOST_VDPA_IOTLB_BUCKETS]; @@ -84,6 +103,9 @@ static struct vhost_vdpa_as *vhost_vdpa_alloc_as(struct vhost_vdpa *v, 
u32 asid) if (asid_to_as(v, asid)) return NULL; + if (asid >= v->vdpa->nas) + return NULL; + as = kmalloc(sizeof(*as), GFP_KERNEL); if (!as) return NULL; @@ -96,13 +118,20 @@ static struct vhost_vdpa_as *vhost_vdpa_alloc_as(struct vhost_vdpa *v, u32 asid) return as; } -static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) +static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v, + u32 asid) { struct vhost_vdpa_as *as = asid_to_as(v, asid); - /* Remove default address space is not allowed */ - if (asid == 0) - return -EINVAL; + if (as) + return as; + + return vhost_vdpa_alloc_as(v, asid); +} + +static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) +{ + struct vhost_vdpa_as *as = asid_to_as(v, asid); if (!as) return -EINVAL; @@ -623,6 +652,7 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb, { struct vdpa_device *vdpa = v->vdpa; const struct vdpa_config_ops *ops = vdpa->config; + u32 asid = iotlb_to_asid(iotlb); int r = 0; r = vhost_iotlb_add_range(iotlb, iova, iova + size - 1, @@ -631,10 +661,10 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb, return r; if (ops->dma_map) { - r = ops->dma_map(vdpa, 0, iova, size, pa, perm); + r = ops->dma_map(vdpa, asid, iova, size, pa, perm); } else if (ops->set_map) { if (!v->in_batch) - r = ops->set_map(vdpa, 0, iotlb); + r = ops->set_map(vdpa, asid, iotlb); } else { r = iommu_map(v->domain, iova, pa, size, perm_to_iommu_flags(perm)); @@ -643,23 +673,32 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb, return r; } -static void vhost_vdpa_unmap(struct vhost_vdpa *v, -struct vhost_iotlb *iotlb, -u64 iova, u64 size) +static int vhost_vdpa_unmap(struct vhost_vdpa *v, + struct vhost_iotlb *iotlb, + u64 iova, u64 size) { struct vdpa_device *vdpa = v->vdpa; const struct vdpa_config_ops *ops = vdpa->config; + u32 asid = iotlb_to_asid(iotlb); + + if (!iotlb) + return -EINVAL; vhost_vdpa_iotlb_unmap(v, iotlb, iova, iova 
+ size - 1); if (ops->dma_map) { - ops->dma_unmap(vdpa, 0, iova, size); +
[RFC PATCH 17/24] vhost-vdpa: introduce uAPI to set group ASID
Following the vDPA support for associating an ASID with a specific virtqueue group, this patch adds a uAPI to support setting the association from userspace. Signed-off-by: Jason Wang --- drivers/vhost/vdpa.c | 8 include/uapi/linux/vhost.h | 4 2 files changed, 12 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index a234d3783e16..978cf97dc03a 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -441,6 +441,14 @@ static long vhost_vdpa_vring_ioctl(struct vhost_vdpa *v, unsigned int cmd, else if (copy_to_user(argp, &s, sizeof(s))) return -EFAULT; return 0; + case VHOST_VDPA_SET_GROUP_ASID: + if (copy_from_user(&s, argp, sizeof(s))) + return -EFAULT; + if (s.num >= vdpa->ngroups) + return -EINVAL; + if (!ops->set_group_asid) + return -ENOTSUPP; + return ops->set_group_asid(vdpa, idx, s.num); case VHOST_GET_VRING_BASE: r = ops->get_vq_state(v->vdpa, idx, &vq_state); if (r) diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h index d1c4b5561fee..f3de9e45c518 100644 --- a/include/uapi/linux/vhost.h +++ b/include/uapi/linux/vhost.h @@ -151,4 +151,8 @@ /* Get the group for a virtqueue: read index, write group in num */ #define VHOST_VDPA_GET_VRING_GROUP _IOWR(VHOST_VIRTIO, 0x79, \ struct vhost_vring_state) +/* Set the ASID for a virtqueue group. */ +#define VHOST_VDPA_SET_GROUP_ASID _IOW(VHOST_VIRTIO, 0x7A, \ +struct vhost_vring_state) + #endif -- 2.20.1
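Put together with the earlier patches in this series, the intended userspace flow is: query the group count, find the control virtqueue's group, then bind that group to its own ASID. The sketch below is an assumption-heavy illustration — the macro values mirror the RFC, `isolate_cvq()` is a hypothetical helper, and the exact meaning of `index`/`num` for SET_GROUP_ASID follows the series' "group in index, ASID in num" convention rather than any finalized ABI.

```c
#include <sys/ioctl.h>

/* Local mirrors of the RFC's proposed uAPI (assumed values; a merged
 * kernel may differ -- always use the real <linux/vhost.h>). */
struct vring_state { unsigned int index; unsigned int num; };
#define VHOST_VIRTIO 0xAF
#define VHOST_VDPA_GET_GROUP_NUM   _IOR(VHOST_VIRTIO, 0x78, unsigned int)
#define VHOST_VDPA_GET_VRING_GROUP _IOWR(VHOST_VIRTIO, 0x79, struct vring_state)
#define VHOST_VDPA_SET_GROUP_ASID  _IOW(VHOST_VIRTIO, 0x7A, struct vring_state)

/* Give the control virtqueue's group a private address space (ASID 1),
 * so its DMA mappings are isolated from the data virtqueues. */
static int isolate_cvq(int fd, unsigned int cvq_index)
{
	unsigned int ngroups;
	struct vring_state s = { .index = cvq_index };

	if (ioctl(fd, VHOST_VDPA_GET_GROUP_NUM, &ngroups) < 0 || ngroups < 2)
		return -1;		/* device cannot isolate the CVQ */
	if (ioctl(fd, VHOST_VDPA_GET_VRING_GROUP, &s) < 0)
		return -1;
	s.index = s.num;		/* index now names the group... */
	s.num = 1;			/* ...and num carries the ASID */
	return ioctl(fd, VHOST_VDPA_SET_GROUP_ASID, &s);
}
```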
Re: [PATCH V3 1/3] arm64/mm/hotplug: Register boot memory hot remove notifier earlier
On 09/23/2020 11:34 AM, Gavin Shan wrote: > Hi Anshuman, > > On 9/21/20 10:05 PM, Anshuman Khandual wrote: >> This moves memory notifier registration earlier in the boot process from >> device_initcall() to early_initcall() which will help in guarding against >> potential early boot memory offline requests. Even though there should not >> be any actual offlining requests till memory block devices are initialized >> with memory_dev_init(), the generic init sequence might just change in >> future. Hence an early registration for the memory event notifier would be >> helpful. While here, just skip the registration if CONFIG_MEMORY_HOTREMOVE >> is not enabled and also call out when memory notifier registration fails. >> >> Cc: Catalin Marinas >> Cc: Will Deacon >> Cc: Mark Rutland >> Cc: Marc Zyngier >> Cc: Steve Capper >> Cc: Mark Brown >> Cc: linux-arm-ker...@lists.infradead.org >> Cc: linux-kernel@vger.kernel.org >> Signed-off-by: Anshuman Khandual >> --- >> arch/arm64/mm/mmu.c | 14 -- >> 1 file changed, 12 insertions(+), 2 deletions(-) >> > > With the following nit-picky comments resolved: > > Reviewed-by: Gavin Shan > >> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c >> index 75df62fea1b6..df3b7415b128 100644 >> --- a/arch/arm64/mm/mmu.c >> +++ b/arch/arm64/mm/mmu.c >> @@ -1499,7 +1499,17 @@ static struct notifier_block >> prevent_bootmem_remove_nb = { >> static int __init prevent_bootmem_remove_init(void) >> { >> - return register_memory_notifier(&prevent_bootmem_remove_nb); >> + int ret = 0; >> + >> + if (!IS_ENABLED(CONFIG_MEMORY_HOTREMOVE)) >> + return ret; >> + >> + ret = register_memory_notifier(&prevent_bootmem_remove_nb); >> + if (!ret) >> + return ret; >> + >> + pr_err("Notifier registration failed - boot memory can be removed\n"); >> + return ret; >> } > > It might be cleaner if the duplicated return statements can be > avoided. Besides, it's always nice to print the errno even though Thought about it, just that the error message was too long. 
> zero is always returned from register_memory_notifier(). So I guess > you probably need something like below: > > ret = register_memory_notifier(&prevent_bootmem_remove_nb); > if (ret) > pr_err("%s: Error %d registering notifier\n", __func__, ret) > > return ret; Sure, will do. > > > register_memory_notifier # 0 is returned on > !CONFIG_MEMORY_HOTPLUG_SPARSE > blocking_notifier_chain_register > notifier_chain_register # 0 is always returned > >> -device_initcall(prevent_bootmem_remove_init); >> +early_initcall(prevent_bootmem_remove_init); >> #endif >> > > Cheers, > Gavin > >
[RFC PATCH 15/24] vhost-vdpa: introduce uAPI to get the number of address spaces
A new uAPI is introduced for userspace to query the number of address spaces supported by a specific device. Signed-off-by: Jason Wang --- drivers/vhost/vdpa.c | 3 +++ include/uapi/linux/vhost.h | 2 ++ 2 files changed, 5 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 4b8882f55bc9..4d97a59824a1 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -532,6 +532,9 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, r = copy_to_user(argp, &v->vdpa->ngroups, sizeof(v->vdpa->ngroups)); break; + case VHOST_VDPA_GET_AS_NUM: + r = copy_to_user(argp, &v->vdpa->nas, sizeof(v->vdpa->nas)); + break; case VHOST_SET_LOG_BASE: case VHOST_SET_LOG_FD: r = -ENOIOCTLCMD; diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h index 19f1acdfe3ea..99bdf50efc50 100644 --- a/include/uapi/linux/vhost.h +++ b/include/uapi/linux/vhost.h @@ -145,4 +145,6 @@ /* Get the number of virtqueue groups. */ #define VHOST_VDPA_GET_GROUP_NUM _IOR(VHOST_VIRTIO, 0x78, unsigned int) +/* Get the number of address spaces. */ +#define VHOST_VDPA_GET_AS_NUM _IOR(VHOST_VIRTIO, 0x79, unsigned int) #endif -- 2.20.1
[RFC PATCH 10/24] vdpa: introduce config operations for associating ASID to a virtqueue group
This patch introduces a new bus operation to allow the vDPA bus driver to associate an ASID to a virtqueue group. Signed-off-by: Jason Wang --- include/linux/vdpa.h | 10 ++ 1 file changed, 10 insertions(+) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index 1e1163daa352..e2394995a3cd 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -160,6 +160,12 @@ struct vdpa_device { * @get_generation:Get device config generation (optional) * @vdev: vdpa device * Returns u32: device generation + * @set_group_asid:Set address space identifier for a + * virtqueue group + * @vdev: vdpa device + * @group: virtqueue group + * @asid: address space id for this group + * Returns integer: success (0) or error (< 0) * @set_map: Set device memory mapping (optional) * Needed for device that using device * specific DMA translation (on-chip IOMMU) @@ -237,6 +243,10 @@ struct vdpa_config_ops { u64 iova, u64 size, u64 pa, u32 perm); int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid, u64 iova, u64 size); + int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group, + unsigned int asid); + + /* Free device resources */ void (*free)(struct vdpa_device *vdev); -- 2.20.1
Re: [PATCH v9 10/20] gpiolib: cdev: support GPIO_V2_LINE_SET_CONFIG_IOCTL
On Wed, Sep 23, 2020 at 07:15:46PM +0300, Andy Shevchenko wrote: > On Wed, Sep 23, 2020 at 7:14 PM Andy Shevchenko > wrote: > > > > On Tue, Sep 22, 2020 at 5:35 AM Kent Gibson wrote: > > > > > > Add support for GPIO_V2_LINE_SET_CONFIG_IOCTL, the uAPI v2 > > > line set config ioctl. > > > > +static long linereq_set_config_unlocked(struct linereq *lr, > > > + struct gpio_v2_line_config *lc) > > > +{ > > > + struct gpio_desc *desc; > > > + unsigned int i; > > > + u64 flags; > > > + bool polarity_change; > > > + int ret; > > > + > > > + for (i = 0; i < lr->num_lines; i++) { > > > + desc = lr->lines[i].desc; > > > + flags = gpio_v2_line_config_flags(lc, i); > > > > > + polarity_change = > > > + (test_bit(FLAG_ACTIVE_LOW, &desc->flags) != > > > +((flags & GPIO_V2_LINE_FLAG_ACTIVE_LOW) != 0)); > > > > Comparison > > Comparison between int / long (not all archs are agreed on this) and > boolean is not the best we can do. > There is no bool to int comparison here. There are two comparisons - the inner int vs int => bool and the outer bool vs bool. The "!= 0" is effectively an implicit cast to bool, as is your new_polarity initialisation below. > What about > > bool old_polarity = test_bit(FLAG_ACTIVE_LOW, &desc->flags); > bool new_polarity = flags & GPIO_V2_LINE_FLAG_ACTIVE_LOW; > > old_polarity ^ new_polarity > So using bitwise operators on bools is ok?? > and move this under INPUT conditional? > It has to be before the gpio_v2_line_config_flags_to_desc_flags() call, as that modifies the desc flags, including the new polarity, so polarity_change would then always be false :-). Cheers, Kent. > > > + > > > + gpio_v2_line_config_flags_to_desc_flags(flags, > > > &desc->flags); > > > + /* > > > +* Lines have to be requested explicitly for input > > > +* or output, else the line will be treated "as is". 
> > > +*/ > > > + if (flags & GPIO_V2_LINE_FLAG_OUTPUT) { > > > + int val = gpio_v2_line_config_output_value(lc, i); > > > + > > > + edge_detector_stop(&lr->lines[i]); > > > + ret = gpiod_direction_output(desc, val); > > > + if (ret) > > > + return ret; > > > + } else if (flags & GPIO_V2_LINE_FLAG_INPUT) { > > > + ret = gpiod_direction_input(desc); > > > + if (ret) > > > + return ret; > > > + > > > + ret = edge_detector_update(&lr->lines[i], > > > + flags & GPIO_V2_LINE_EDGE_FLAGS, > > > + polarity_change); > > > + if (ret) > > > + return ret; > > > + } > > > + > > > + blocking_notifier_call_chain(&lr->gdev->notifier, > > > +GPIO_V2_LINE_CHANGED_CONFIG, > > > +desc); > > > + } > > > + return 0; > > > +} > > -- > With Best Regards, > Andy Shevchenko
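The point under discussion — why both sides must be normalized to 0/1 before comparing, and why XOR on normalized bools is safe — can be demonstrated in isolation. This is a standalone sketch with made-up masks, not the gpiolib code: the actual kernel code is correct because `test_bit()` yields 0/1 and the `!= 0` normalizes the other side; the "buggy" variant shows what would go wrong without that normalization.

```c
#include <stdbool.h>

#define DESC_ACTIVE_LOW (1u << 5)  /* made-up stand-in for FLAG_ACTIVE_LOW */
#define UAPI_ACTIVE_LOW (1u << 1)  /* stand-in for GPIO_V2_LINE_FLAG_ACTIVE_LOW */

/* A bit test that may return any nonzero value, not only 1. */
static int raw_test(unsigned int mask, unsigned int flags)
{
	return flags & mask;
}

/* Wrong: compares an unnormalized int (e.g. 32) against a 0/1 value,
 * so two "true" results still compare unequal. */
static bool polarity_change_unnormalized(unsigned int desc_flags,
					 unsigned int uapi_flags)
{
	return raw_test(DESC_ACTIVE_LOW, desc_flags) !=
	       ((uapi_flags & UAPI_ACTIVE_LOW) != 0);
}

/* Correct: both sides normalized to bool first; XOR of two bools is
 * then a well-defined inequality test. */
static bool polarity_change(unsigned int desc_flags, unsigned int uapi_flags)
{
	bool old_low = raw_test(DESC_ACTIVE_LOW, desc_flags);
	bool new_low = uapi_flags & UAPI_ACTIVE_LOW;

	return old_low ^ new_low;
}
```

The bool initializers act as the implicit cast Kent describes; once both operands are genuinely 0 or 1, `!=` and `^` are interchangeable.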
[RFC PATCH 14/24] vhost-vdpa: introduce uAPI to get the number of virtqueue groups
Following the vDPA support for multiple address spaces, this patch introduces a uAPI for userspace to query the number of virtqueue groups supported by the vDPA device. Signed-off-by: Jason Wang --- drivers/vhost/vdpa.c | 4 include/uapi/linux/vhost.h | 4 2 files changed, 8 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 1ba7e95619b5..4b8882f55bc9 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -528,6 +528,10 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, case VHOST_VDPA_GET_VRING_NUM: r = vhost_vdpa_get_vring_num(v, argp); break; + case VHOST_VDPA_GET_GROUP_NUM: + r = copy_to_user(argp, &v->vdpa->ngroups, +sizeof(v->vdpa->ngroups)); + break; case VHOST_SET_LOG_BASE: case VHOST_SET_LOG_FD: r = -ENOIOCTLCMD; diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h index c26452782882..19f1acdfe3ea 100644 --- a/include/uapi/linux/vhost.h +++ b/include/uapi/linux/vhost.h @@ -141,4 +141,8 @@ /* Set event fd for config interrupt*/ #define VHOST_VDPA_SET_CONFIG_CALL _IOW(VHOST_VIRTIO, 0x77, int) + +/* Get the number of virtqueue groups. */ +#define VHOST_VDPA_GET_GROUP_NUM _IOR(VHOST_VIRTIO, 0x78, unsigned int) + #endif -- 2.20.1
[RFC PATCH 13/24] vhost-vdpa: introduce ASID based IOTLB
This patch introduces support for an ASID based IOTLB by tagging each IOTLB with a unique ASID. This is a must for supporting the ASID based vhost IOTLB API added by the following patches. IOTLBs are stored in a hlist; a new IOTLB is allocated when a new ASID is seen via the IOTLB API and destroyed when there's no mapping associated with the ASID. Signed-off-by: Jason Wang --- drivers/vhost/vdpa.c | 94 +--- 1 file changed, 72 insertions(+), 22 deletions(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 6552987544d7..1ba7e95619b5 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -34,13 +34,21 @@ enum { #define VHOST_VDPA_DEV_MAX (1U << MINORBITS) +#define VHOST_VDPA_IOTLB_BUCKETS 16 + +struct vhost_vdpa_as { + struct hlist_node hash_link; + struct vhost_iotlb iotlb; + u32 id; +}; + struct vhost_vdpa { struct vhost_dev vdev; struct iommu_domain *domain; struct vhost_virtqueue *vqs; struct completion completion; struct vdpa_device *vdpa; - struct vhost_iotlb *iotlb; + struct hlist_head as[VHOST_VDPA_IOTLB_BUCKETS]; struct device dev; struct cdev cdev; atomic_t opened; @@ -49,12 +57,64 @@ struct vhost_vdpa { int minor; struct eventfd_ctx *config_ctx; int in_batch; + int used_as; }; static DEFINE_IDA(vhost_vdpa_ida); static dev_t vhost_vdpa_major; +static struct vhost_vdpa_as *asid_to_as(struct vhost_vdpa *v, u32 asid) +{ + struct hlist_head *head = &v->as[asid % VHOST_VDPA_IOTLB_BUCKETS]; + struct vhost_vdpa_as *as; + + hlist_for_each_entry(as, head, hash_link) + if (as->id == asid) + return as; + + return NULL; +} + +static struct vhost_vdpa_as *vhost_vdpa_alloc_as(struct vhost_vdpa *v, u32 asid) +{ + struct hlist_head *head = &v->as[asid % VHOST_VDPA_IOTLB_BUCKETS]; + struct vhost_vdpa_as *as; + + if (asid_to_as(v, asid)) + return NULL; + + as = kmalloc(sizeof(*as), GFP_KERNEL); + if (!as) + return NULL; + + vhost_iotlb_init(&as->iotlb, 0, 0); + as->id = asid; + hlist_add_head(&as->hash_link, head); + ++v->used_as; + + return as; +} + +static int 
vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) +{ + struct vhost_vdpa_as *as = asid_to_as(v, asid); + + /* Remove default address space is not allowed */ + if (asid == 0) + return -EINVAL; + + if (!as) + return -EINVAL; + + hlist_del(>hash_link); + vhost_iotlb_reset(>iotlb); + kfree(as); + --v->used_as; + + return 0; +} + static void handle_vq_kick(struct vhost_work *work) { struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue, @@ -513,15 +573,6 @@ static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, } } -static void vhost_vdpa_iotlb_free(struct vhost_vdpa *v) -{ - struct vhost_iotlb *iotlb = v->iotlb; - - vhost_vdpa_iotlb_unmap(v, iotlb, 0ULL, 0ULL - 1); - kfree(v->iotlb); - v->iotlb = NULL; -} - static int perm_to_iommu_flags(u32 perm) { int flags = 0; @@ -681,7 +732,8 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, u32 asid, struct vhost_vdpa *v = container_of(dev, struct vhost_vdpa, vdev); struct vdpa_device *vdpa = v->vdpa; const struct vdpa_config_ops *ops = vdpa->config; - struct vhost_iotlb *iotlb = v->iotlb; + struct vhost_vdpa_as *as = asid_to_as(v, 0); + struct vhost_iotlb *iotlb = >iotlb; int r = 0; if (asid != 0) @@ -775,6 +827,7 @@ static void vhost_vdpa_cleanup(struct vhost_vdpa *v) { vhost_dev_cleanup(>vdev); kfree(v->vdev.vqs); + vhost_vdpa_remove_as(v, 0); } static int vhost_vdpa_open(struct inode *inode, struct file *filep) @@ -807,23 +860,18 @@ static int vhost_vdpa_open(struct inode *inode, struct file *filep) vhost_dev_init(dev, vqs, nvqs, 0, 0, 0, false, vhost_vdpa_process_iotlb_msg); - dev->iotlb = vhost_iotlb_alloc(0, 0); - if (!dev->iotlb) { - r = -ENOMEM; - goto err_init_iotlb; - } + if (!vhost_vdpa_alloc_as(v, 0)) + goto err_alloc_as; r = vhost_vdpa_alloc_domain(v); if (r) - goto err_alloc_domain; + goto err_alloc_as; filep->private_data = v; return 0; -err_alloc_domain: - vhost_vdpa_iotlb_free(v); -err_init_iotlb: +err_alloc_as: vhost_vdpa_cleanup(v); err: atomic_dec(>opened); @@ -851,7 
+899,6 @@ static int vhost_vdpa_release(struct inode *inode, struct file *filep) filep->private_data = NULL; vhost_vdpa_reset(v); vhost_dev_stop(&v->vdev); - vhost_vdpa_iotlb_free(v); vhost_vdpa_free_domain(v);
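The bucket scheme in this patch (`asid % VHOST_VDPA_IOTLB_BUCKETS` indexing an hlist) is a plain fixed-size chained hash with a find/alloc-on-first-use/remove lifecycle. A userspace sketch of the same idea, simplified to singly linked lists with no locking and illustrative names:

```c
#include <stdlib.h>

#define BUCKETS 16

struct as_entry {
	struct as_entry *next;
	unsigned int id;	/* the ASID */
};

static struct as_entry *table[BUCKETS];

/* Look an ASID up in its bucket (cf. asid_to_as). */
static struct as_entry *find_as(unsigned int asid)
{
	struct as_entry *as;

	for (as = table[asid % BUCKETS]; as; as = as->next)
		if (as->id == asid)
			return as;
	return NULL;
}

/* Allocate on first sight of an ASID (cf. vhost_vdpa_alloc_as). */
static struct as_entry *find_or_alloc_as(unsigned int asid)
{
	struct as_entry *as = find_as(asid);

	if (as)
		return as;
	as = malloc(sizeof(*as));
	if (!as)
		return NULL;
	as->id = asid;
	as->next = table[asid % BUCKETS];
	table[asid % BUCKETS] = as;
	return as;
}

/* Destroy once nothing references the ASID (cf. vhost_vdpa_remove_as). */
static int remove_as(unsigned int asid)
{
	struct as_entry **pp = &table[asid % BUCKETS];

	for (; *pp; pp = &(*pp)->next) {
		if ((*pp)->id == asid) {
			struct as_entry *dead = *pp;

			*pp = dead->next;
			free(dead);
			return 0;
		}
	}
	return -1;
}
```

With 16 buckets, ASIDs 5 and 21 collide and chain in the same bucket, which is exactly the case the hlist walk in `asid_to_as()` handles.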
[RFC PATCH 11/24] vhost_iotlb: split out IOTLB initialization
This patch splits out the IOTLB initialization logic into a new helper. This allows vhost to implement device specific IOTLB allocation logic. Signed-off-by: Jason Wang --- drivers/vhost/iotlb.c | 23 ++- include/linux/vhost_iotlb.h | 2 ++ 2 files changed, 20 insertions(+), 5 deletions(-) diff --git a/drivers/vhost/iotlb.c b/drivers/vhost/iotlb.c index 1f0ca6e44410..36d785efd038 100644 --- a/drivers/vhost/iotlb.c +++ b/drivers/vhost/iotlb.c @@ -98,6 +98,23 @@ void vhost_iotlb_del_range(struct vhost_iotlb *iotlb, u64 start, u64 last) } EXPORT_SYMBOL_GPL(vhost_iotlb_del_range); +/** + * vhost_iotlb_init - initialize a vhost IOTLB + * @iotlb: the IOTLB that needs to be initialized + * @limit: maximum number of IOTLB entries + * @flags: VHOST_IOTLB_FLAG_XXX + */ +void vhost_iotlb_init(struct vhost_iotlb *iotlb, unsigned int limit, + unsigned int flags) +{ + iotlb->root = RB_ROOT_CACHED; + iotlb->limit = limit; + iotlb->nmaps = 0; + iotlb->flags = flags; + INIT_LIST_HEAD(&iotlb->list); +} +EXPORT_SYMBOL_GPL(vhost_iotlb_init); + /** * vhost_iotlb_alloc - add a new vhost IOTLB * @limit: maximum number of IOTLB entries @@ -112,11 +129,7 @@ struct vhost_iotlb *vhost_iotlb_alloc(unsigned int limit, unsigned int flags) if (!iotlb) return NULL; - iotlb->root = RB_ROOT_CACHED; - iotlb->limit = limit; - iotlb->nmaps = 0; - iotlb->flags = flags; - INIT_LIST_HEAD(&iotlb->list); + vhost_iotlb_init(iotlb, limit, flags); return iotlb; } diff --git a/include/linux/vhost_iotlb.h b/include/linux/vhost_iotlb.h index 6b09b786a762..c0df193ec3e1 100644 --- a/include/linux/vhost_iotlb.h +++ b/include/linux/vhost_iotlb.h @@ -33,6 +33,8 @@ int vhost_iotlb_add_range(struct vhost_iotlb *iotlb, u64 start, u64 last, u64 addr, unsigned int perm); void vhost_iotlb_del_range(struct vhost_iotlb *iotlb, u64 start, u64 last); +void vhost_iotlb_init(struct vhost_iotlb *iotlb, unsigned int limit, + unsigned int flags); struct vhost_iotlb *vhost_iotlb_alloc(unsigned int limit, unsigned int flags); void vhost_iotlb_free(struct 
vhost_iotlb *iotlb); void vhost_iotlb_reset(struct vhost_iotlb *iotlb); -- 2.20.1
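The refactor follows a common C pattern: keep an `_init()` that works on caller-provided storage, and make `_alloc()` a thin heap wrapper around it, so an owner can embed the struct (as `vhost_vdpa_as` embeds its `iotlb` in the next patch) instead of paying for a second allocation. A generic sketch of the pattern, with illustrative names rather than the vhost API:

```c
#include <stdlib.h>

struct tlb {
	unsigned int limit;
	unsigned int nmaps;
	unsigned int flags;
};

/* Initialize caller-provided storage (cf. vhost_iotlb_init). */
static void tlb_init(struct tlb *t, unsigned int limit, unsigned int flags)
{
	t->limit = limit;
	t->nmaps = 0;
	t->flags = flags;
}

/* Heap-allocating convenience wrapper (cf. vhost_iotlb_alloc). */
static struct tlb *tlb_alloc(unsigned int limit, unsigned int flags)
{
	struct tlb *t = malloc(sizeof(*t));

	if (t)
		tlb_init(t, limit, flags);
	return t;
}

/* An owner can now embed the struct and initialize it in place. */
struct address_space {
	struct tlb iotlb;	/* embedded, like vhost_vdpa_as's iotlb */
	unsigned int id;
};
```

Both call sites share one initializer, so a future change to the init logic cannot drift between the embedded and heap-allocated paths.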
[RFC PATCH 12/24] vhost: support ASID in IOTLB API
This patch allows userspace to send ASID based IOTLB messages to vhost. The idea is to use the reserved u32 field in the existing V2 IOTLB message. A vhost device should advertise this capability via the VHOST_BACKEND_F_IOTLB_ASID backend feature. Signed-off-by: Jason Wang --- drivers/vhost/vdpa.c | 5 - drivers/vhost/vhost.c| 23 ++- drivers/vhost/vhost.h| 4 ++-- include/uapi/linux/vhost_types.h | 5 - 4 files changed, 28 insertions(+), 9 deletions(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index eeefcd971e3f..6552987544d7 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -675,7 +675,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v, return ret; } -static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, +static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, u32 asid, struct vhost_iotlb_msg *msg) { struct vhost_vdpa *v = container_of(dev, struct vhost_vdpa, vdev); @@ -684,6 +684,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, struct vhost_iotlb *iotlb = v->iotlb; int r = 0; + if (asid != 0) + return -EINVAL; + r = vhost_dev_check_owner(dev); if (r) return r; diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index b45519ca66a7..060662b12de3 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -463,7 +463,7 @@ void vhost_dev_init(struct vhost_dev *dev, struct vhost_virtqueue **vqs, int nvqs, int iov_limit, int weight, int byte_weight, bool use_worker, - int (*msg_handler)(struct vhost_dev *dev, + int (*msg_handler)(struct vhost_dev *dev, u32 asid, struct vhost_iotlb_msg *msg)) { struct vhost_virtqueue *vq; @@ -1079,11 +1079,14 @@ static bool umem_access_ok(u64 uaddr, u64 size, int access) return true; } -static int vhost_process_iotlb_msg(struct vhost_dev *dev, +static int vhost_process_iotlb_msg(struct vhost_dev *dev, u16 asid, struct vhost_iotlb_msg *msg) { int ret = 0; + if (asid != 0) + return -EINVAL; + mutex_lock(&dev->mutex); vhost_dev_lock_vqs(dev); switch 
(msg->type) { @@ -1130,6 +1133,7 @@ ssize_t vhost_chr_write_iter(struct vhost_dev *dev, struct vhost_iotlb_msg msg; size_t offset; int type, ret; + u16 asid = 0; ret = copy_from_iter(, sizeof(type), from); if (ret != sizeof(type)) { @@ -1145,7 +1149,16 @@ ssize_t vhost_chr_write_iter(struct vhost_dev *dev, offset = offsetof(struct vhost_msg, iotlb) - sizeof(int); break; case VHOST_IOTLB_MSG_V2: - offset = sizeof(__u32); + if (vhost_backend_has_feature(dev->vqs[0], + VHOST_BACKEND_F_IOTLB_ASID)) { + ret = copy_from_iter(, sizeof(asid), from); + if (ret != sizeof(asid)) { + ret = -EINVAL; + goto done; + } + offset = sizeof(__u16); + } else + offset = sizeof(__u32); break; default: ret = -EINVAL; @@ -1160,9 +1173,9 @@ ssize_t vhost_chr_write_iter(struct vhost_dev *dev, } if (dev->msg_handler) - ret = dev->msg_handler(dev, ); + ret = dev->msg_handler(dev, asid, ); else - ret = vhost_process_iotlb_msg(dev, ); + ret = vhost_process_iotlb_msg(dev, asid, ); if (ret) { ret = -EFAULT; goto done; diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index 9032d3c2a9f4..05e7aaf6071b 100644 --- a/drivers/vhost/vhost.h +++ b/drivers/vhost/vhost.h @@ -162,7 +162,7 @@ struct vhost_dev { int byte_weight; u64 kcov_handle; bool use_worker; - int (*msg_handler)(struct vhost_dev *dev, + int (*msg_handler)(struct vhost_dev *dev, u32 asid, struct vhost_iotlb_msg *msg); }; @@ -170,7 +170,7 @@ bool vhost_exceeds_weight(struct vhost_virtqueue *vq, int pkts, int total_len); void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int nvqs, int iov_limit, int weight, int byte_weight, bool use_worker, - int (*msg_handler)(struct vhost_dev *dev, + int (*msg_handler)(struct vhost_dev *dev, u32 asid, struct vhost_iotlb_msg *msg)); long vhost_dev_set_owner(struct vhost_dev *dev); bool vhost_dev_has_owner(struct vhost_dev *dev); diff --git
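The "reserved u32" trick can be checked with a struct sketch: carving the reserved field into a u16 ASID plus a u16 reserve leaves the iotlb payload at the same offset, which is why old and new userspace stay wire-compatible, and why the reader above consumes `sizeof(__u16)` for the ASID and then skips `sizeof(__u16)` more. The types below are simplified stand-ins for the uapi structs, not the real `vhost_types.h` definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct vhost_iotlb_msg. */
struct iotlb_msg {
	uint64_t iova, size, uaddr;
	uint8_t perm, type;
};

/* V2 layout before this patch: a fully reserved 32-bit hole. */
struct msg_v2_classic {
	uint32_t type;
	uint32_t reserved;
	struct iotlb_msg iotlb;
};

/* V2 layout with VHOST_BACKEND_F_IOTLB_ASID negotiated: the same hole
 * now carries a 16-bit ASID followed by 16 reserved bits. */
struct msg_v2_asid {
	uint32_t type;
	uint16_t asid;
	uint16_t reserved;
	struct iotlb_msg iotlb;
};
```

Because only the interpretation of the hole changes, a device that does not advertise the ASID feature keeps parsing the classic layout and sees ASID 0 semantics, matching the `asid != 0 -> -EINVAL` checks in the patch.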
[RFC PATCH 09/24] vdpa: multiple address spaces support
This patch introduces multiple address spaces support for vDPA devices. The idea is to identify a specific address space via a dedicated identifier - ASID. During vDPA device allocation, the vDPA device driver needs to report the number of address spaces supported by the device; the DMA mapping ops of the vDPA device are then extended to take an ASID. This helps to isolate the DMA among the virtqueues, e.g. in the case of virtio-net, the control virtqueue will not be assigned directly to the guest. This RFC patch only converts devices that want their own IOMMU/DMA translation logic, so it rejects a device with more than one address space that depends on the platform IOMMU. The plan is to move all the DMA mapping logic to the vDPA device driver instead of doing it in vhost-vDPA (otherwise it could result in a very complicated API, and vhost-vDPA actually doesn't care about how the composition/emulation is done in the device driver). Signed-off-by: Jason Wang --- drivers/vdpa/ifcvf/ifcvf_main.c | 2 +- drivers/vdpa/mlx5/net/mlx5_vnet.c | 5 +++-- drivers/vdpa/vdpa.c | 4 +++- drivers/vdpa/vdpa_sim/vdpa_sim.c | 10 ++ drivers/vhost/vdpa.c | 14 +- include/linux/vdpa.h | 23 --- 6 files changed, 38 insertions(+), 20 deletions(-) diff --git a/drivers/vdpa/ifcvf/ifcvf_main.c b/drivers/vdpa/ifcvf/ifcvf_main.c index e6a0be374e51..86cdf5f8bcae 100644 --- a/drivers/vdpa/ifcvf/ifcvf_main.c +++ b/drivers/vdpa/ifcvf/ifcvf_main.c @@ -440,7 +440,7 @@ static int ifcvf_probe(struct pci_dev *pdev, const struct pci_device_id *id) adapter = vdpa_alloc_device(struct ifcvf_adapter, vdpa, dev, &ifc_vdpa_ops, - IFCVF_MAX_QUEUE_PAIRS * 2, 1); + IFCVF_MAX_QUEUE_PAIRS * 2, 1, 1); if (adapter == NULL) { IFCVF_ERR(pdev, "Failed to allocate vDPA structure"); diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index 4e480f4f754e..db7404e121bf 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -1788,7 +1788,8 @@ static
u32 mlx5_vdpa_get_generation(struct vdpa_device *vdev) return mvdev->generation; } -static int mlx5_vdpa_set_map(struct vdpa_device *vdev, struct vhost_iotlb *iotlb) +static int mlx5_vdpa_set_map(struct vdpa_device *vdev, unsigned int asid, +struct vhost_iotlb *iotlb) { struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev); struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev); @@ -1931,7 +1932,7 @@ void *mlx5_vdpa_add_dev(struct mlx5_core_dev *mdev) max_vqs = min_t(u32, max_vqs, MLX5_MAX_SUPPORTED_VQS); ndev = vdpa_alloc_device(struct mlx5_vdpa_net, mvdev.vdev, mdev->device, _vdpa_ops, -2 * mlx5_vdpa_max_qps(max_vqs), 1); +2 * mlx5_vdpa_max_qps(max_vqs), 1, 1); if (IS_ERR(ndev)) return ndev; diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c index 46399746ec7c..05195fa7865d 100644 --- a/drivers/vdpa/vdpa.c +++ b/drivers/vdpa/vdpa.c @@ -63,6 +63,7 @@ static void vdpa_release_dev(struct device *d) * @config: the bus operations that is supported by this device * @nvqs: number of virtqueues supported by this device * @ngroups: number of groups supported by this device + * @nas: number of address spaces supported by this device * @size: size of the parent structure that contains private data * * Driver should use vdpa_alloc_device() wrapper macro instead of @@ -74,7 +75,7 @@ static void vdpa_release_dev(struct device *d) struct vdpa_device *__vdpa_alloc_device(struct device *parent, const struct vdpa_config_ops *config, int nvqs, unsigned int ngroups, - size_t size) + unsigned int nas, size_t size) { struct vdpa_device *vdev; int err = -EINVAL; @@ -102,6 +103,7 @@ struct vdpa_device *__vdpa_alloc_device(struct device *parent, vdev->features_valid = false; vdev->nvqs = nvqs; vdev->ngroups = ngroups; + vdev->nas = nas; err = dev_set_name(>dev, "vdpa%u", vdev->index); if (err) diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c index 6669c561bc6e..5dc04ec271bb 100644 --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c @@ 
-354,7 +354,7 @@ static struct vdpasim *vdpasim_create(void) ops = _net_config_ops; vdpasim = vdpa_alloc_device(struct vdpasim, vdpa, NULL, ops, - VDPASIM_VQ_NUM, 1); + VDPASIM_VQ_NUM, 1, 1); if (!vdpasim) goto err_alloc; @@ -581,7 +581,7
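The `vdpa_alloc_device(struct ifcvf_adapter, vdpa, ...)` calls in this patch rely on a container-allocation idiom: allocate the driver's wrapper struct in one shot, initialize the embedded generic device (now including `ngroups` and `nas`), and recover the wrapper later with `container_of()`. A self-contained sketch of the idiom follows; the names are illustrative stand-ins, not the vdpa API, and it assumes the core is the wrapper's first member purely to keep the allocator short.

```c
#include <stddef.h>
#include <stdlib.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Generic core object, like struct vdpa_device, carrying the number of
 * virtqueues, vq groups, and address spaces the backend reported. */
struct core_dev {
	unsigned int nvqs;
	unsigned int ngroups;
	unsigned int nas;
};

/* Driver wrapper embedding the core, like struct ifcvf_adapter. */
struct adapter {
	struct core_dev core;	/* must be first for this simple allocator */
	int private_state;
};

/* Allocate the wrapper but hand back the embedded core, mirroring
 * __vdpa_alloc_device(parent, config, nvqs, ngroups, nas, size). */
static struct core_dev *core_alloc(unsigned int nvqs, unsigned int ngroups,
				   unsigned int nas, size_t wrapper_size)
{
	struct adapter *a = calloc(1, wrapper_size);

	if (!a)
		return NULL;
	a->core.nvqs = nvqs;
	a->core.ngroups = ngroups;
	a->core.nas = nas;
	return &a->core;
}
```

The extra `1, 1` arguments the patch threads through each driver correspond to `ngroups` and `nas` here: existing devices declare one virtqueue group and one address space, preserving today's behavior.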