Re: [PATCH] riscv: fix bugon.cocci warnings
On Sun, Feb 28, 2021 at 12:10:22PM +0100, Julia Lawall wrote: > From: kernel test robot > > Use BUG_ON instead of an if condition followed by BUG. > > Generated by: scripts/coccinelle/misc/bugon.cocci > > Fixes: c22b0bcb1dd0 ("riscv: Add kprobes supported") > CC: Guo Ren > Reported-by: kernel test robot > Signed-off-by: kernel test robot > Signed-off-by: Julia Lawall Reviewed-by: Pekka Enberg
Re: [PATCH] riscv: Fixup CONFIG_GENERIC_TIME_VSYSCALL
On Sat, Jan 2, 2021 at 3:26 PM wrote: > > From: Guo Ren > > This patch fixes commit ad5d112 ("riscv: use vDSO common flow to > reduce the latency of the time-related functions"). > > The GENERIC_TIME_VSYSCALL should be CONFIG_GENERIC_TIME_VSYSCALL > or vgettimeofday won't work. > > Signed-off-by: Guo Ren > Cc: Atish Patra > Cc: Palmer Dabbelt > Cc: Vincent Chen Reviewed-by: Pekka Enberg
Re: [PATCH] riscv: mm: abort uaccess retries upon fatal signal
On Wed, Dec 30, 2020 at 4:40 PM wrote: > From: Guo Ren > > Pick up the patch from the 'Link' made by Mark Rutland. Keep the > same as x86, arm, arm64, arc, sh, power. > > Link: > https://lore.kernel.org/linux-arm-kernel/1499782763-31418-1-git-send-email-mark.rutl...@arm.com/ > Signed-off-by: Guo Ren > Cc: Mark Rutland > Cc: Pekka Enberg > Cc: Palmer Dabbelt Reviewed-by: Pekka Enberg
Re: [PATCH v2 0/2] Let illegal access to user-space memory die
Hi Eric, On Thu, Dec 3, 2020 at 8:51 AM Eric Lin wrote: > > Accesses to user-space memory without calling uaccess routines > lead to hanging in the page fault handler. Like arm64, we let it > die earlier in the page fault handler. > > Changes in v2: > -Add a die_kernel_fault() helper > -Split one long line of code into two Please also make no_context() use the new helper. Other than that: Reviewed-by: Pekka Enberg
Re: [PATCH] riscv/mm: Prevent kernel module access user-space memory without uaccess routines
On Mon, Nov 30, 2020 at 7:33 AM Eric Lin wrote: > > In the page fault handler, an access to user-space memory > without get/put_user() or copy_from/to_user() routines is > not resolved properly. Like arm and other architectures, > we need to let it die earlier in the page fault handler. Fix looks good to me. Can you elaborate on how you found the issue and how the bug manifests itself? > > Signed-off-by: Eric Lin > Cc: Alan Kao > --- > arch/riscv/mm/fault.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c > index 3c8b9e433c67..a452cfa266a2 100644 > --- a/arch/riscv/mm/fault.c > +++ b/arch/riscv/mm/fault.c > @@ -232,6 +232,9 @@ asmlinkage void do_page_fault(struct pt_regs *regs) > if (user_mode(regs)) > flags |= FAULT_FLAG_USER; > > + if (!user_mode(regs) && addr < TASK_SIZE && unlikely(!(regs->status & SR_SUM))) > + die(regs, "Accessing user space memory without uaccess routines\n"); Let's introduce a die_kernel_fault() helper (similar to arm64, for example) to ensure the same semantics for the different kernel faults. You can extract the helper from no_context(). > + > perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr); > > if (cause == EXC_STORE_PAGE_FAULT) > -- > 2.17.0
Re: [PATCH v3] riscv: fix pfn_to_virt err in do_page_fault().
On Mon, Oct 26, 2020 at 08:26:54PM +0800, liush wrote: > From: Liu Shaohua > > The argument to pfn_to_virt() should be pfn not the value of CSR_SATP. > > Reviewed-by: Palmer Dabbelt > Reviewed-by: Anup Patel > Signed-off-by: liush Reviewed-by: Pekka Enberg > --- > arch/riscv/mm/fault.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c > index 1359e21..3c8b9e4 100644 > --- a/arch/riscv/mm/fault.c > +++ b/arch/riscv/mm/fault.c > @@ -86,6 +86,7 @@ static inline void vmalloc_fault(struct pt_regs *regs, int > code, unsigned long a > pmd_t *pmd, *pmd_k; > pte_t *pte_k; > int index; > + unsigned long pfn; > > /* User mode accesses just cause a SIGSEGV */ > if (user_mode(regs)) > @@ -100,7 +101,8 @@ static inline void vmalloc_fault(struct pt_regs *regs, > int code, unsigned long a >* of a task switch. >*/ > index = pgd_index(addr); > - pgd = (pgd_t *)pfn_to_virt(csr_read(CSR_SATP)) + index; > + pfn = csr_read(CSR_SATP) & SATP_PPN; > + pgd = (pgd_t *)pfn_to_virt(pfn) + index; > pgd_k = init_mm.pgd + index; > > if (!pgd_present(*pgd_k)) { > -- > 2.7.4 >
Re: [PATCH -next] riscv/mm/fault: fix old-style-declaration warning
Hi, On Wed, Sep 9, 2020 at 2:20 PM Wei Yongjun wrote: > > gcc reports a build warning as follows: > > arch/riscv/mm/fault.c:81:1: warning: > 'inline' is not at beginning of declaration [-Wold-style-declaration] >81 | static void inline vmalloc_fault(struct pt_regs *regs, int code, > unsigned long addr) > | ^~ > > This commit fixes it by moving 'inline' after 'static'. > > Signed-off-by: Wei Yongjun Thanks for the fix, but Palmer committed a similar fix from me to for-next: https://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux.git/commit/?h=for-next&id=6f36a9635082e1d6910bc7853d0c9fd12d7890b5 - Pekka
Re: [PATCH v4 0/3] Get cache information from userland
For the series: Reviewed-by: Pekka Enberg - Pekka
Re: [PATCH v3 1/3] riscv: Set more data to cacheinfo
Hi, On Mon, Aug 31, 2020 at 9:15 AM Zong Li wrote: > If the sets is one, it means that the cache is fully associative, then > we don't need to fill the ways number, just keep way number as zero, > so here we want to find the fully associative case first and make the > if expression fail at the beginning. We might also rewrite it as > follow: [snip] Thanks for the explanation! The rewritten version is much easier to read, at least to me. - Pekka
Re: [PATCH v3 3/3] riscv: Add cache information in AUX vector
On Fri, Aug 28, 2020 at 10:09 AM Zong Li wrote: > +uintptr_t get_cache_geometry(u32 level, enum cache_type type) > +{ > + struct cacheinfo *this_leaf = get_cacheinfo(level, type); > + uintptr_t ret = (this_leaf->ways_of_associativity << 16 | > +this_leaf->coherency_line_size); You are dereferencing "this_leaf" without checking if it's NULL here. > + > + return this_leaf ? ret : 0; > +} - Pekka
Re: [PATCH v3 1/3] riscv: Set more data to cacheinfo
Hi, On Fri, Aug 28, 2020 at 10:09 AM Zong Li wrote: > > Set cacheinfo.{size,sets,line_size} for each cache node, then we can > get this information from userland through auxiliary vector. > > Signed-off-by: Zong Li > --- > arch/riscv/kernel/cacheinfo.c | 59 ++- > 1 file changed, 44 insertions(+), 15 deletions(-) > > diff --git a/arch/riscv/kernel/cacheinfo.c b/arch/riscv/kernel/cacheinfo.c > index bd0f122965c3..8b85abfbd77a 100644 > --- a/arch/riscv/kernel/cacheinfo.c > +++ b/arch/riscv/kernel/cacheinfo.c > @@ -25,12 +25,46 @@ cache_get_priv_group(struct cacheinfo *this_leaf) > return NULL; > } > > -static void ci_leaf_init(struct cacheinfo *this_leaf, > -struct device_node *node, > -enum cache_type type, unsigned int level) > +static void ci_leaf_init(struct cacheinfo *this_leaf, enum cache_type type, > +unsigned int level, unsigned int size, > +unsigned int sets, unsigned int line_size) > { > this_leaf->level = level; > this_leaf->type = type; > + this_leaf->size = size; > + this_leaf->number_of_sets = sets; > + this_leaf->coherency_line_size = line_size; > + > + /* > +* If the cache is fully associative, there is no need to > +* check the other properties. > +*/ > + if (!(sets == 1) && (sets > 0 && size > 0 && line_size > 0)) Can you explain what this is attempting to do? AFAICT, the if expression can be reduced to "sets > 1 && size > 0 && line_size > 0", but what do you mean with the comment about fully associative caches? > + this_leaf->ways_of_associativity = (size / sets) / line_size; > +}
Re: [PATCH v2 17/23] riscv: use asm-generic/mmu_context.h for no-op implementations
On Wed, Aug 26, 2020 at 5:54 PM Nicholas Piggin wrote: > > Cc: Paul Walmsley > Cc: Palmer Dabbelt > Cc: Albert Ou > Cc: linux-ri...@lists.infradead.org > Acked-by: Palmer Dabbelt > Signed-off-by: Nicholas Piggin Reviewed-by: Pekka Enberg
[RFC PATCH] x86/mm/fault: Inline page fault paths to reduce kernel text size
From: Pekka Enberg The commit 92181f190b649f7ef2b79cbf5c00f26ccc66da2a ("x86: optimise x86's do_page_fault (C entry point for the page fault path)") from 2009 shows significant stack savings when infrequent page fault handling paths are moved out of line with the "noinline" annotation, with some increase in kernel text size. However, a decade of GCC improvements (and changes in the code) has eliminated such wins in stack usage: With "noinline": 0b30 : b30: 41 57 push %r15 b32: 41 56 push %r14 b34: 41 55 push %r13 b36: 49 89 d5mov%rdx,%r13 b39: 41 54 push %r12 b3b: 49 89 fcmov%rdi,%r12 b3e: 55 push %rbp b3f: 48 89 f5mov%rsi,%rbp b42: 53 push %rbx b43: 65 48 8b 04 25 00 00mov%gs:0x0,%rax b4a: 00 00 b4c: 48 83 ec 18 sub$0x18,%rsp With "inline": 08a0 : 8a0: 41 57 push %r15 8a2: 41 56 push %r14 8a4: 41 55 push %r13 8a6: 49 89 d5mov%rdx,%r13 8a9: 41 54 push %r12 8ab: 49 89 fcmov%rdi,%r12 8ae: 55 push %rbp 8af: 48 89 f5mov%rsi,%rbp 8b2: 53 push %rbx 8b3: 65 48 8b 04 25 00 00mov%gs:0x0,%rax 8ba: 00 00 8bc: 48 83 ec 18 sub$0x18,%rsp So all we're left with is the increase in kernel text: add/remove: 1/5 grow/shrink: 2/1 up/down: 1049/-1144 (-95) Function old new delta spurious_kernel_fault.part - 559+559 do_user_addr_fault 9881446+458 do_kern_addr_fault 103 135 +32 bad_area_nosemaphore 13 - -13 bad_area 66 - -66 pgtable_bad 107 --107 mm_fault_error 199 --199 bad_area_access_error226 --226 spurious_kernel_fault553 20-533 Total: Before=7359, After=7264, chg -1.29% Therefore, let's inline the page fault handling paths to reduce kernel text size. Cc: Dave Hansen Cc: Andy Lutomirski Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: "H. 
Peter Anvin" Signed-off-by: Pekka Enberg --- arch/x86/mm/fault.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 35f1498e9832..b8893684d80c 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -533,7 +533,7 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long ad dump_pagetable(address); } -static noinline void +static inline void pgtable_bad(struct pt_regs *regs, unsigned long error_code, unsigned long address) { @@ -577,7 +577,7 @@ static void set_signal_archinfo(unsigned long address, tsk->thread.cr2 = address; } -static noinline void +static inline void no_context(struct pt_regs *regs, unsigned long error_code, unsigned long address, int signal, int si_code) { @@ -788,7 +788,7 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code, no_context(regs, error_code, address, SIGSEGV, si_code); } -static noinline void +static inline void bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code, unsigned long address) { @@ -809,7 +809,7 @@ __bad_area(struct pt_regs *regs, unsigned long error_code, __bad_area_nosemaphore(regs, error_code, address, pkey, si_code); } -static noinline void +static inline void bad_area(struct pt_regs *regs, unsigned long error_code, unsigned long address) { __bad_area(regs, error_code, address, 0, SEGV_MAPERR); @@ -832,7 +832,7 @@ static inline bool bad_area_access_from_pkeys(unsigned long error_code, return false; } -static noinline void +static inline void bad_area_access_error(struct pt_regs *regs, unsigned long error_code, unsigned long address, struct vm_area_struct *vma) { @@ -905,7 +905,7 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address, force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *)address); } -static noinline void +static inline void mm_fault_error(struct pt_regs *regs, unsigned long error_code, unsign
Re: [PATCH v2 0/3] mm/slub: Fix count_partial() problem
On Mon, Aug 10, 2020 at 3:18 PM Xunlei Pang wrote: > > v1->v2: > - Improved changelog and variable naming for PATCH 1~2. > - PATCH3 adds per-cpu counter to avoid performance regression > in concurrent __slab_free(). > > [Testing] > On my 32-cpu 2-socket physical machine: > Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz > perf stat --null --repeat 10 -- hackbench 20 thread 2 > > == original, no patched > 19.211637055 seconds time elapsed >( +- 0.57% ) > > == patched with patch1~2 > Performance counter stats for 'hackbench 20 thread 2' (10 runs): > > 21.731833146 seconds time elapsed >( +- 0.17% ) > > == patched with patch1~3 > Performance counter stats for 'hackbench 20 thread 2' (10 runs): > > 19.112106847 seconds time elapsed >( +- 0.64% ) > > > Xunlei Pang (3): > mm/slub: Introduce two counters for partial objects > mm/slub: Get rid of count_partial() > mm/slub: Use percpu partial free counter > > mm/slab.h | 2 + > mm/slub.c | 124 > +++--- > 2 files changed, 89 insertions(+), 37 deletions(-) We probably need to wrap the counters under CONFIG_SLUB_DEBUG because AFAICT all the code that uses them is also wrapped under it. An alternative approach for this patch would be to somehow make the lock in count_partial() more granular, but I don't know how feasible that actually is. Anyway, I am OK with this approach: Reviewed-by: Pekka Enberg You still need to convince Christoph, though, because he had objections over this approach. - Pekka
Re: [PATCH 1/2] mm/slub: Introduce two counters for the partial objects
Hi Christopher, On Tue, Aug 11, 2020 at 3:52 PM Christopher Lameter wrote: > > On Fri, 7 Aug 2020, Pekka Enberg wrote: > > > Why do you consider this to be a fast path? This is all partial list > > accounting when we allocate/deallocate a slab, no? Just like > > ___slab_alloc() says, I assumed this to be the slow path... What am I > > missing? > > I thought these were per object counters? If you just want to count the > number of slabs then you do not need the lock at all. We already have a > counter for the number of slabs. The patch attempts to speed up count_partial(), which holds on to the "n->list_lock" (with IRQs off) for the whole duration it takes to walk the partial slab list: spin_lock_irqsave(&n->list_lock, flags); list_for_each_entry(page, &n->partial, slab_list) x += get_count(page); spin_unlock_irqrestore(&n->list_lock, flags); It's counting the number of *objects*, but the counters are only updated in bulk when we add/remove a slab to/from the partial list. The counter updates are therefore *not* in the fast-path AFAICT. Xunlei, please correct me if I'm reading your patches wrong. On Tue, Aug 11, 2020 at 3:52 PM Christopher Lameter wrote: > > No objections to alternative fixes, of course, but wrapping the > > counters under CONFIG_DEBUG seems like just hiding the actual issue... > > CONFIG_DEBUG is on by default. It just compiles in the debug code and > disables it so we can enable it with a kernel boot option. This is because > we have had numerous issues in the past with "production" kernels that > could not be recompiled with debug options. So just running the prod > kernel with another option will allow you to find hard to debug issues in > a full scale production deployment with potentially proprietary modules > etc. Yeah, it's been too long since I last looked at the code and did not realize even count_partial() is wrapped in CONFIG_DEBUG. So by all means, let's also wrap the counters with that too. - Pekka
Re: [PATCH] mm: sort freelist by rank number
Hi KyongHo and David, On 07.08.20 09:08, Pekka Enberg wrote: > > > I think having more knowledge of DRAM controller details in the OS > > > would be potentially beneficial for better page allocation policy, so > > > maybe try to come up with something more generic, even if the fallback to > > > providing this information is a kernel command line option. On Tue, Aug 18, 2020 at 12:02 PM Cho KyongHo wrote: > I looked for a way to deliver detailed DRAM information > through ACPI, ATAG or something similar, but I couldn't find one. Perhaps that's still the case today then. In the context of this patch, what I am hoping is that we'd turn the "dram_rank_granule" parameter -- even if it's still a manually configured thing -- into an abstraction that we can later extend. IOW, something similar to what the "energy model" (kernel/power/energy_model.c) is today. On Mon, Aug 10, 2020 at 09:32:18AM +0200, David Hildenbrand wrote: > I guess one universal approach is by measuring access times ... not what > we might be looking for :) I don't think it's an unreasonable approach, but measurement accuracy and speed will determine if it's actually practical. There are examples of this kind of approach elsewhere too. For example, MCTOP [1], which aims to provide topology-aware OS abstractions, uses cache latency measurements to infer the topology. 1. https://timharris.uk/papers/2017-eurosys-mctop.pdf Regards, - Pekka
Re: [PATCH -next] x86/mpparse: remove duplicate include
On Wed, Aug 19, 2020 at 07:29:10PM +0800, Wang Hai wrote: > Remove asm/io_apic.h which is included more than once > > Reported-by: Hulk Robot > Signed-off-by: Wang Hai Reviewed-by: Pekka Enberg
Re: [PATCH -next] mm: slab: Remove duplicate include
On Tue, Aug 18, 2020 at 2:43 PM YueHaibing wrote: > > Remove duplicate header which is included twice. > > Signed-off-by: YueHaibing Reviewed-by: Pekka Enberg
Re: [PATCH] mm/slub: fix missing ALLOC_SLOWPATH stat when bulk alloc
On Tue, Aug 11, 2020 at 5:25 AM wrote: > > From: Abel Wu > > The ALLOC_SLOWPATH statistics is missing in bulk allocation now. > Fix it by doing statistics in alloc slow path. > > Signed-off-by: Abel Wu Reviewed-by: Pekka Enberg
Re: Odd-sized kmem_cache_alloc and slub_debug=Z
Hi Marco and Kees, On Fri, Aug 07, 2020 at 08:06PM +0300, Pekka Enberg wrote: > > Anything interesting in your .config? The fault does not reproduce > > with 5.8.0 + x86-64 defconfig. On Fri, Aug 7, 2020 at 8:18 PM Marco Elver wrote: > It's quite close to defconfig, just some extra options for my test > environment. But none that I'd imagine change this behaviour -- but > maybe I missed something. I've attached my config. Also, just in case, > I'm on mainline from Tuesday: 2324d50d051ec0f14a548e78554fb02513d6dcef. Yeah, it reproduces with defconfig too, as long as you remember to pass "slub_debug=Z"... :-/ The following seems to be the culprit: commit 3202fa62fb43087387c65bfa9c100feffac74aa6 Author: Kees Cook Date: Wed Apr 1 21:04:27 2020 -0700 slub: relocate freelist pointer to middle of object Reverting this commit and one of its follow-up fixes from Kees from v5.8 makes the issue go away for me. Btw, please note that caches with size 24 and larger do not trigger this bug, so the issue is that with small enough object size, we're stomping on allocator metadata (I assume part of the freelist). - Pekka
Re: [PATCH 1/2] mm/slub: Introduce two counters for the partial objects
Hi Christopher, On Fri, 7 Aug 2020, Pekka Enberg wrote: > > I think we can just default to the counters. After all, if I > > understood correctly, we're talking about up to 100 ms time period > > with IRQs disabled when count_partial() is called. As this is > > triggerable from user space, that's a performance bug whatever way you > > look at it. On Fri, Aug 7, 2020 at 4:02 PM Christopher Lameter wrote: > Well yes under extreme conditions and this is only happening for sysfs > counter retrieval. You will likely get some stall even in less extreme conditions, and in any case, the kernel should not allow user space to trigger such a stall. On Fri, Aug 7, 2020 at 4:02 PM Christopher Lameter wrote: > There could be other solutions to this. This solution here is penalizing > every hotpath slab allocation for the sake of relatively infrequently > used counter monitoring. There is the possibility of not traversing the list > and simply estimating the value based on the number of slab pages > allocated on that node. Why do you consider this to be a fast path? This is all partial list accounting when we allocate/deallocate a slab, no? Just like ___slab_alloc() says, I assumed this to be the slow path... What am I missing? No objections to alternative fixes, of course, but wrapping the counters under CONFIG_DEBUG seems like just hiding the actual issue... - Pekka
Re: Odd-sized kmem_cache_alloc and slub_debug=Z
Hi Marco, On Fri, Aug 7, 2020 at 7:07 PM Marco Elver wrote: > I found that the below debug-code using kmem_cache_alloc(), when using > slub_debug=Z, results in the following crash: > > general protection fault, probably for non-canonical address > 0xcca41caea170: [#1] PREEMPT SMP PTI > CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.8.0+ #1 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 > 04/01/2014 > RIP: 0010:freelist_dereference mm/slub.c:272 [inline] > RIP: 0010:get_freepointer mm/slub.c:278 [inline] > RIP: 0010:deactivate_slab+0x54/0x460 mm/slub.c:2111 > Code: 8b bc c7 e0 00 00 00 48 85 d2 0f 84 00 01 00 00 49 89 d5 31 c0 > 48 89 44 24 08 66 66 2e 0f 1f 84 00 00 00 00 00 90 44 8b 43 20 <4b> 8b 44 05 > 00 48 85 c0 0f 84 1e 01 00 00 4c 89 ed 49 89 c5 8b 43 > RSP: :a7e03e18 EFLAGS: 00010046 > RAX: RBX: a3a41c972340 RCX: > RDX: cca41caea160 RSI: e7c6a072ba80 RDI: a3a41c972340 > RBP: a3a41caea008 R08: 0010 R09: a3a41caea01d > R10: a7f8dc50 R11: a68f44c0 R12: a3a41c972340 > R13: cca41caea160 R14: e7c6a072ba80 R15: a3a41c96d540 > FS: () GS:a3a41fc0() > knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: a3a051c01000 CR3: 00045140a001 CR4: 00770ef0 > DR0: DR1: DR2: > DR3: DR6: fffe0ff0 DR7: 0400 > PKRU: > Call Trace: > ___slab_alloc+0x336/0x340 mm/slub.c:2690 > __slab_alloc mm/slub.c:2714 [inline] > slab_alloc_node mm/slub.c:2788 [inline] > slab_alloc mm/slub.c:2832 [inline] > kmem_cache_alloc+0x135/0x200 mm/slub.c:2837 > start_kernel+0x3d6/0x44e init/main.c:1049 > secondary_startup_64+0xb6/0xc0 arch/x86/kernel/head_64.S:243 > > Any ideas what might be wrong? > > This does not crash when redzones are not enabled. 
> > Thanks, > -- Marco > > -- >8 -- > > diff --git a/init/main.c b/init/main.c > index 15bd0efff3df..f4aa5bb3f2ec 100644 > --- a/init/main.c > +++ b/init/main.c > @@ -1041,6 +1041,16 @@ asmlinkage __visible void __init start_kernel(void) > sfi_init_late(); > kcsan_init(); > > + /* DEBUG CODE */ > + { > + struct kmem_cache *c = kmem_cache_create("test", 21, 1, 0, > NULL); > + char *buf; > + BUG_ON(!c); > + buf = kmem_cache_alloc(c, GFP_KERNEL); > + kmem_cache_free(c, buf); > + kmem_cache_destroy(c); > + } > + > /* Do the rest non-__init'ed, we're now alive */ > arch_call_rest_init(); Anything interesting in your .config? The fault does not reproduce with 5.8.0 + x86-64 defconfig. - Pekka
Re: [PATCH 1/2] mm/slub: Introduce two counters for the partial objects
On Thu, Aug 6, 2020 at 3:42 PM Vlastimil Babka wrote: > > On 7/2/20 10:32 AM, Xunlei Pang wrote: > > The node list_lock in count_partial() spend long time iterating > > in case of large amount of partial page lists, which can cause > > thunder herd effect to the list_lock contention, e.g. it cause > > business response-time jitters when accessing "/proc/slabinfo" > > in our production environments. > > > > This patch introduces two counters to maintain the actual number > > of partial objects dynamically instead of iterating the partial > > page lists with list_lock held. > > > > New counters of kmem_cache_node are: pfree_objects, ptotal_objects. > > The main operations are under list_lock in slow path, its performance > > impact is minimal. > > > > Co-developed-by: Wen Yang > > Signed-off-by: Xunlei Pang > > This or similar things seem to be reported every few months now, last time was > here [1] AFAIK. The solution was to just stop counting at some point. > > Shall we perhaps add these counters under CONFIG_SLUB_DEBUG then and be done > with it? If anyone needs the extreme performance and builds without > CONFIG_SLUB_DEBUG, I'd assume they also don't have userspace programs reading > /proc/slabinfo periodically anyway? I think we can just default to the counters. After all, if I understood correctly, we're talking about up to 100 ms time period with IRQs disabled when count_partial() is called. As this is triggerable from user space, that's a performance bug whatever way you look at it. Whoever needs to eliminate these counters from fast-path, can wrap them in a CONFIG_MAKE_SLABINFO_EXTREMELY_SLOW option. So for this patch, with updated information about the severity of the problem, and the hackbench numbers: Acked-by: Pekka Enberg Christoph, others, any objections? - Pekka
Re: [PATCH] mm: sort freelist by rank number
Hi Cho and David, On Mon, Aug 3, 2020 at 10:57 AM David Hildenbrand wrote: > > On 03.08.20 08:10, pullip@samsung.com wrote: > > From: Cho KyongHo > > > > LPDDR5 introduces rank switch delay. If three successive DRAM accesses > > happen and the first and the second ones access one rank and the last > > access happens on the other rank, the latency of the last access will > > be longer than the second one. > > To address this penalty, we can sort the freelist so that a specific > > rank is allocated prior to another rank. We expect the page allocator > > can allocate the pages from the same rank successively with this > > change. It will hopefully improves the proportion of the consecutive > > memory accesses to the same rank. > > This certainly needs performance numbers to justify ... and I am sorry, > "hopefully improves" is not a valid justification :) > > I can imagine that this works well initially, when there hasn't been a > lot of memory fragmentation going on. But quickly after your system is > under stress, I doubt this will be very useful. Prove me wrong. ;) > > ... I dislike this manual setting of "dram_rank_granule". Yet another mm > feature that can only be enabled by a magic command line parameter where > users have to guess the right values. > > (side note, there have been similar research approaches to improve > energy consumption by switching off ranks when not needed). I was thinking of the exact same thing. PALLOC [1] comes to mind, but perhaps there are more recent ones? I also dislike the manual knob, but is there a way for the OS to detect this by itself? My (perhaps outdated) understanding was that the DRAM address mapping scheme, for example, is not exposed to the OS. I think having more knowledge of DRAM controller details in the OS would be potentially beneficial for better page allocation policy, so maybe try to come up with something more generic, even if the fallback to providing this information is a kernel command line option.
[1] http://cs-people.bu.edu/rmancuso/files/papers/palloc-rtas2014.pdf - Pekka
Re: [PATCH v2 1/2] riscv: Fix build warning for mm/init
On Thu, Jul 16, 2020 at 4:06 PM Zong Li wrote: > > Add static keyword for resource_init, this function is only used in this > object file. > > The warning message as follow (with W=1 build): > > arch/riscv/mm/init.c:520:13: > warning: no previous prototype for 'resource_init' [-Wmissing-prototypes] > > Signed-off-by: Zong Li Reviewed-by: Pekka Enberg - Pekka
Re: [PATCH 2/2] riscv: fix build warning of mm/pageattr
On Thu, Jul 16, 2020 at 10:11 AM Pekka Enberg wrote: > > On Thu, Jul 16, 2020 at 9:16 AM Zong Li wrote: > > > > Add header for missing prototype. Also, static keyword should be at > > beginning of declaration. > > > > Signed-off-by: Zong Li > > Which prototype is missing? Aah, never mind, you mention them in the cover letter. I think the patch description would be a better place to ensure they end up in git logs. For both patches: Reviewed-by: Pekka Enberg - Pekka
Re: [PATCH 2/2] riscv: fix build warning of mm/pageattr
On Thu, Jul 16, 2020 at 9:16 AM Zong Li wrote: > > Add header for missing prototype. Also, static keyword should be at > beginning of declaration. > > Signed-off-by: Zong Li Which prototype is missing? - Pekka
Re: [PATCH 17/26] mm/riscv: Use general page fault accounting
Hi Palmer, On Sat, Jul 11, 2020 at 10:43 PM Palmer Dabbelt wrote: > This still slightly changes the accounting numbers, but I don't think it does > so in a way that's meaningful enough to care about. SIGBUS is the only one > that might happen frequently enough to notice, I doubt anyone cares about > whether faults are accounted for during OOM. What do you mean? AFAICT, this patch _fixes_ the accounting to be correct on page fault restarts. I sent a patch to fix this up some time ago, but of course this generic solutions is better: http://lists.infradead.org/pipermail/linux-riscv/2020-June/000775.html - Pekka
Re: [PATCH 1/2] mm/slub: Introduce two counters for the partial objects
Hi! (Sorry for the delay, I missed your response.) On Fri, Jul 3, 2020 at 12:38 PM xunlei wrote: > > On 2020/7/2 PM 7:59, Pekka Enberg wrote: > > On Thu, Jul 2, 2020 at 11:32 AM Xunlei Pang > > wrote: > >> The node list_lock in count_partial() spend long time iterating > >> in case of large amount of partial page lists, which can cause > >> thunder herd effect to the list_lock contention, e.g. it cause > >> business response-time jitters when accessing "/proc/slabinfo" > >> in our production environments. > > > > Would you have any numbers to share to quantify this jitter? I have no > > We have HSF RT(High-speed Service Framework Response-Time) monitors, the > RT figures fluctuated randomly, then we deployed a tool detecting "irq > off" and "preempt off" to dump the culprit's calltrace, capturing the > list_lock cost up to 100ms with irq off issued by "ss", this also caused > network timeouts. Thanks for the follow up. This sounds like a good enough motivation for this patch, but please include it in the changelog. > > objections to this approach, but I think the original design > > deliberately made reading "/proc/slabinfo" more expensive to avoid > > atomic operations in the allocation/deallocation paths. It would be > > good to understand what is the gain of this approach before we switch > > to it. Maybe even run some slab-related benchmark (not sure if there's > > something better than hackbench these days) to see if the overhead of > > this approach shows up. > > I thought that before, but most atomic operations are serialized by the > list_lock. Another possible way is to hold list_lock in __slab_free(), > then these two counters can be changed from atomic to long. > > I also have no idea what's the standard SLUB benchmark for the > regression test, any specific suggestion? I don't know what people use these days. When I did benchmarking in the past, hackbench and netperf were known to be slab-allocation intensive macro-benchmarks. 
Christoph also had some SLUB micro-benchmarks, but I don't think we ever merged them into the tree. - Pekka
Re: [PATCH v2] mm/page_alloc: skip setting nodemask when we are in interrupt
On Mon, Jul 6, 2020 at 5:59 AM Muchun Song wrote: > When we are in the interrupt context, it is irrelevant to the > current task context. If we use current task's mems_allowed, we > can fail to alloc pages in the fast path and fall back to slow > path memory allocation when the current node (which is the current > task mems_allowed) does not have enough memory to allocate. In > this case, it slows down the memory allocation speed of interrupt > context. So we can skip setting the nodemask to allow any node > to allocate memory, so that fast path allocation can succeed. > > Signed-off-by: Muchun Song Reviewed-by: Pekka Enberg
Re: [PATCH V1 0/5] riscv: Add k/uprobe supported
Hi Guo, On Sat, Jul 4, 2020 at 6:34 AM wrote: > > > There is no single step exception in riscv ISA, so utilize ebreak to > > > simulate. Some pc related instructions couldn't be executed out of line > > > and some system/fence instructions couldn't be a trace site at all. > > > So we give out a reject list and simulate list in decode-insn.c. On Sat, Jul 4, 2020 at 2:40 PM Pekka Enberg wrote: > > Can you elaborate on what you mean by this? Why would you need a > > single-step facility for kprobes? Is it for executing the instruction > > that was replaced with a probe breakpoint? On Sat, Jul 4, 2020 at 5:55 PM Guo Ren wrote: > It's the single-step exception, not single-step facility! Aah, right, I didn't read the specification carefully enough. Thanks for taking the time to clarify this! FWIW, for the whole series: Reviewed-by: Pekka Enberg - Pekka
Re: [PATCH V1 0/5] riscv: Add k/uprobe supported
On Sat, Jul 4, 2020 at 6:34 AM wrote: > The patchset includes kprobe/uprobe support and some related fixups. Nice! On Sat, Jul 4, 2020 at 6:34 AM wrote: > There is no single step exception in riscv ISA, so utilize ebreak to > simulate. Some pc related instructions couldn't be executed out of line > and some system/fence instructions couldn't be a trace site at all. > So we give out a reject list and simulate list in decode-insn.c. Can you elaborate on what you mean by this? Why would you need a single-step facility for kprobes? Is it for executing the instruction that was replaced with a probe breakpoint? Also, the "Debug Specification" [1] specifies a single-step facility for RISC-V -- why is that not useful for implementing kprobes? 1. https://riscv.org/specifications/debug-specification/ - Pekka
Re: [PATCH RESEND] mm/page_alloc: skip setting nodemask when we are in interrupt
On 03.07.20 08:34, Pekka Enberg wrote: > > if (cpusets_enabled()) { > > *alloc_mask |= __GFP_HARDWALL; > > if (!in_interrupt() && !ac->nodemask) > > ac->nodemask = &cpuset_current_mems_allowed; > > else > > *alloc_flags |= ALLOC_CPUSET; > > } On Fri, Jul 3, 2020 at 10:20 AM David Hildenbrand wrote: > ^ looks much cleaner as well. Do we want to add a summarizing comment? I see no harm in adding one. I'm sure the next person starting a journey in the maze some call the page allocator will appreciate it. - Pekka
Re: [PATCH RESEND] mm/page_alloc: skip setting nodemask when we are in interrupt
On Fri, Jul 3, 2020 at 9:14 AM Muchun Song wrote: > > When we are in the interrupt context, it is irrelevant to the > current task context. If we use current task's mems_allowed, we > can fair to alloc pages in the fast path and fall back to slow > path memory allocation when the current node(which is the current > task mems_allowed) does not have enough memory to allocate. In > this case, it slows down the memory allocation speed of interrupt > context. So we can skip setting the nodemask to allow any node > to allocate memory, so that fast path allocation can success. > > Signed-off-by: Muchun Song > --- > mm/page_alloc.c | 8 +--- > 1 file changed, 5 insertions(+), 3 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index b48336e20bdcd..a6c36cd557d1d 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -4726,10 +4726,12 @@ static inline bool prepare_alloc_pages(gfp_t > gfp_mask, unsigned int order, > > if (cpusets_enabled()) { > *alloc_mask |= __GFP_HARDWALL; > - if (!ac->nodemask) > - ac->nodemask = &cpuset_current_mems_allowed; > - else > + if (!ac->nodemask) { > + if (!in_interrupt()) > + ac->nodemask = &cpuset_current_mems_allowed; If !ac->nodemask and in_interrupt() the ALLOC_CPUSET flag is not set, which by-passes the __cpuset_zone_allowed() check for allocations. This works fine because in the case if in_interrupt() the function allows allocation on any zone/node. > + } else { > *alloc_flags |= ALLOC_CPUSET; > + } > } However, if you write the condition as follows: if (cpusets_enabled()) { *alloc_mask |= __GFP_HARDWALL; if (!in_interrupt() && !ac->nodemask) ac->nodemask = &cpuset_current_mems_allowed; else *alloc_flags |= ALLOC_CPUSET; } then the code is future-proof in case of __cpuset_zone_allowed() is one day extended to support IRQ context too (it probably should eventually respect IRQ SMP affinity). > > fs_reclaim_acquire(gfp_mask); > -- > 2.11.0 > >
Re: [PATCH 1/2] mm/slub: Introduce two counters for the partial objects
On Thu, Jul 2, 2020 at 11:32 AM Xunlei Pang wrote: > The node list_lock in count_partial() spend long time iterating > in case of large amount of partial page lists, which can cause > thunder herd effect to the list_lock contention, e.g. it cause > business response-time jitters when accessing "/proc/slabinfo" > in our production environments. Would you have any numbers to share to quantify this jitter? I have no objections to this approach, but I think the original design deliberately made reading "/proc/slabinfo" more expensive to avoid atomic operations in the allocation/deallocation paths. It would be good to understand what is the gain of this approach before we switch to it. Maybe even run some slab-related benchmark (not sure if there's something better than hackbench these days) to see if the overhead of this approach shows up. > This patch introduces two counters to maintain the actual number > of partial objects dynamically instead of iterating the partial > page lists with list_lock held. > > New counters of kmem_cache_node are: pfree_objects, ptotal_objects. > The main operations are under list_lock in slow path, its performance > impact is minimal. > > Co-developed-by: Wen Yang > Signed-off-by: Xunlei Pang > --- > mm/slab.h | 2 ++ > mm/slub.c | 38 +- > 2 files changed, 39 insertions(+), 1 deletion(-) > > diff --git a/mm/slab.h b/mm/slab.h > index 7e94700..5935749 100644 > --- a/mm/slab.h > +++ b/mm/slab.h > @@ -616,6 +616,8 @@ struct kmem_cache_node { > #ifdef CONFIG_SLUB > unsigned long nr_partial; > struct list_head partial; > + atomic_long_t pfree_objects; /* partial free objects */ > + atomic_long_t ptotal_objects; /* partial total objects */ You could rename these to "nr_partial_free_objs" and "nr_partial_total_objs" for readability. - Pekka
Re: [PATCH v3] mm, slab: Check GFP_SLAB_BUG_MASK before alloc_pages in kmalloc_order
On Wed, Jul 1, 2020 at 6:19 PM Long Li wrote: > > kmalloc cannot allocate memory from HIGHMEM. Allocating large amounts > of memory currently bypasses the check and will simply leak the memory > when page_address() returns NULL. To fix this, factor the > GFP_SLAB_BUG_MASK check out of slab & slub, and call it from > kmalloc_order() as well. In order to make the code clear, the warning > message is put in one place. > > Signed-off-by: Long Li Reviewed-by: Pekka Enberg
Re: [PATCH 17/26] mm/riscv: Use general page fault accounting
Hi Peter, On Sat, Jun 27, 2020 at 1:36 AM Peter Xu wrote: > Use the general page fault accounting by passing regs into handle_mm_fault(). > It naturally solve the issue of multiple page fault accounting when page fault > retry happened. I sent a patch to fix up riscv page fault accounting some days ago: http://lists.infradead.org/pipermail/linux-riscv/2020-June/000775.html However, your fix is obviously even better. For the generic and riscv parts: Reviewed-by: Pekka Enberg - Pekka
Re: [PATCH 0/8] mm: cleanup usage of
On Sat, Jun 27, 2020 at 5:35 PM Mike Rapoport wrote: > Most architectures have very similar versions of pXd_alloc_one() and > pXd_free_one() for intermediate levels of page table. > These patches add generic versions of these functions in > and enable use of the generic functions where > appropriate. Very nice cleanup series to the page table code! FWIW: Reviewed-by: Pekka Enberg
Re: [PATCH] mm: Fix kobject memleak in SLUB
On 28/04/2019 2.40, Tobin C. Harding wrote: Currently error return from kobject_init_and_add() is not followed by a call to kobject_put(). This means there is a memory leak. Add call to kobject_put() in error path of kobject_init_and_add(). Signed-off-by: Tobin C. Harding Reviewed-by: Pekka Enberg
Re: [PATCH 0/1] mm: Remove the SLAB allocator
Hi, On 4/11/19 10:55 AM, Michal Hocko wrote: Please please have it more rigorous then what happened when SLUB was forced to become a default This is the hard part. Even if you are able to show that SLUB is as fast as SLAB for all the benchmarks you run, there's bound to be that one workload where SLUB regresses. You will then have people complaining about that (rightly so) and you're again stuck with two allocators. To move forward, I think we should look at possible *pathological* cases where we think SLAB might have an advantage. For example, SLUB had much more difficulties with remote CPU frees than SLAB. Now I don't know if this is the case, but it should be easy to construct a synthetic benchmark to measure this. For example, have a userspace process that does networking, which is often memory allocation intensive, so that we know that SKBs traverse between CPUs. You can do this by making sure that the NIC queues are mapped to CPU N (so that network softirqs have to run on that CPU) but the process is pinned to CPU M. It's, of course, worth thinking about other pathological cases too. Workloads that cause large allocations is one. Workloads that cause lots of slab cache shrinking is another. - Pekka
Re: [PATCH v4] kmemleak: survive in a low-memory situation
Hi Catalin, On 27/03/2019 2.59, Qian Cai wrote: Unless there is a brave soul to reimplement the kmemleak to embed it's metadata into the tracked memory itself in a foreseeable future, this provides a good balance between enabling kmemleak in a low-memory situation and not introducing too much hackiness into the existing code for now. On Thu, Mar 28, 2019 at 08:05:31AM +0200, Pekka Enberg wrote: Unfortunately I am not that brave soul, but I'm wondering what the complication here is? It shouldn't be too hard to teach calculate_sizes() in SLUB about a new SLAB_KMEMLEAK flag that reserves spaces for the metadata. On 28/03/2019 12.30, Catalin Marinas wrote:> I don't think it's the calculate_sizes() that's the hard part. The way kmemleak is designed assumes that the metadata has a longer lifespan than the slab object it is tracking (and refcounted via get_object/put_object()). We'd have to replace some of the rcu_read_(un)lock() regions with a full kmemleak_lock together with a few more tweaks to allow the release of kmemleak_lock during memory scanning (which can take minutes; so it needs to be safe w.r.t. metadata freeing, currently relying on a deferred RCU freeing). Right. I think SLUB already supports delaying object freeing because of KASAN (see the slab_free_freelist_hook() function) so the issue with metadata outliving object is solvable (although will consume more memory). I can't say I remember enough details from kmemleak to comment on the locking complications you point out, though. - Pekka
Re: [PATCH v4] kmemleak: survive in a low-memory situation
Hi, On 27/03/2019 2.59, Qian Cai wrote: Unless there is a brave soul to reimplement the kmemleak to embed it's metadata into the tracked memory itself in a foreseeable future, this provides a good balance between enabling kmemleak in a low-memory situation and not introducing too much hackiness into the existing code for now. Unfortunately I am not that brave soul, but I'm wondering what the complication here is? It shouldn't be too hard to teach calculate_sizes() in SLUB about a new SLAB_KMEMLEAK flag that reserves space for the metadata. - Pekka
Re: [PATCH] slub: untag object before slab end
On 13/02/2019 4.05, Qian Cai wrote: get_freepointer() could return NULL if there is no more free objects in the slab. However, it could return a tagged pointer (like 0x2200) with KASAN_SW_TAGS which would escape the NULL object checking in check_valid_pointer() and trigger errors below, so untag the object before checking for a NULL object there. Reviewed-by: Pekka Enberg
Re: [PATCH] mm/slab: Increase width of first /proc/slabinfo column
Hi, On 01/02/2019 4.34, Christopher Lameter wrote: On Fri, 1 Feb 2019, Tobin C. Harding wrote: Currently when displaying /proc/slabinfo if any cache names are too long then the output columns are not aligned. We could do something fancy to get the maximum length of any cache name in the system or we could just increase the hardcoded width. Currently it is 17 characters. Monitors are wide these days so lets just increase it to 30 characters. Hmm.. I wonder if there are any tools that depend on the field width here? It's possible, but it's more likely that userspace parses by whitespace because it's easier to write it that way. At least procps, which is used by slabtop, is prepared to parse a cache name of 128 characters. See the scanf() call in parse_slabinfo20() function in proc/slab.c of procps: http://procps.sourceforge.net/ Of course, testing with slabtop would make sense. - Pekka
Re: [PATCH 0/3] slub: Do trivial comments fixes
On 31/01/2019 6.10, Tobin C. Harding wrote: From: "Tobin C. Harding" Hi Christopher, Here is a trivial patchset to wet my toes. This is my first patchset to mm, if there are some mm specific nuances in relation to when in the dev cycle (if ever) that minor (*cough* trivial) pathsets are acceptable please say so This patchset fixes comments strings in the SLUB subsystem. As per discussion at LCA I am working on getting my head around the SLUB allocator. If you specifically do *not* want me to do minor clean up while I'm reading please say so, I will not be offended. For the series: Reviewed-by: Pekka Enberg
Re: [PATCH] mm: Prevent mapping slab pages to userspace
On 25/01/2019 19.38, Matthew Wilcox wrote: It's never appropriate to map a page allocated by SLAB into userspace. A buggy device driver might try this, or an attacker might be able to find a way to make it happen. Signed-off-by: Matthew Wilcox Acked-by: Pekka Enberg A WARN_ON_ONCE() would be nice here to let those buggy drivers know that they will no longer work. --- mm/memory.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/memory.c b/mm/memory.c index e11ca9dd823f..ce8c90b752be 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1451,7 +1451,7 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr, spinlock_t *ptl; retval = -EINVAL; - if (PageAnon(page)) + if (PageAnon(page) || PageSlab(page)) goto out; retval = -ENOMEM; flush_dcache_page(page);
Re: [PATCH] mm/slub.c: freelist is ensured to be NULL when new_slab() fails
On 29/12/2018 8.25, Peng Wang wrote: new_slab_objects() will return immediately if freelist is not NULL. if (freelist) return freelist; One more assignment operation could be avoided. Signed-off-by: Peng Wang Reviewed-by: Pekka Enberg --- mm/slub.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/mm/slub.c b/mm/slub.c index 36c0befeebd8..cf2ef4ababff 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -2463,8 +2463,7 @@ static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags, stat(s, ALLOC_SLAB); c->page = page; *pc = c; - } else - freelist = NULL; + } return freelist; }
Re: [PATCH] mm, slab: reschedule cache_reap() on the same CPU
On 11/04/2018 10.00, Vlastimil Babka wrote: cache_reap() is initially scheduled in start_cpu_timer() via schedule_delayed_work_on(). But then the next iterations are scheduled via schedule_delayed_work(), i.e. using WORK_CPU_UNBOUND. Thus since commit ef557180447f ("workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs") there is no guarantee the future iterations will run on the originally intended cpu, although it's still preferred. I was able to demonstrate this with /sys/module/workqueue/parameters/debug_force_rr_cpu. IIUC, it may also happen due to migrating timers in nohz context. As a result, some cpu's would be calling cache_reap() more frequently and others never. This patch uses schedule_delayed_work_on() with the current cpu when scheduling the next iteration. Signed-off-by: Vlastimil Babka Fixes: ef557180447f ("workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs") CC: Cc: Joonsoo Kim Cc: David Rientjes Cc: Pekka Enberg Cc: Christoph Lameter Cc: Tejun Heo Cc: Lai Jiangshan Cc: John Stultz Cc: Thomas Gleixner Cc: Stephen Boyd Acked-by: Pekka Enberg --- mm/slab.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/slab.c b/mm/slab.c index 9095c3945425..a76006aae857 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -4074,7 +4074,8 @@ static void cache_reap(struct work_struct *w) next_reap_node(); out: /* Set up the next iteration */ - schedule_delayed_work(work, round_jiffies_relative(REAPTIMEOUT_AC)); + schedule_delayed_work_on(smp_processor_id(), work, + round_jiffies_relative(REAPTIMEOUT_AC)); } void get_slabinfo(struct kmem_cache *cachep, struct slabinfo *sinfo)
Re: [PATCH 0/4] zswap: Optimize compressed pool memory utilization
On Wed, Aug 17, 2016 at 1:03 PM, Srividya Desireddy wrote: > This series of patches optimize the memory utilized by zswap for storing > the swapped out pages. > > Zswap is a cache which compresses the pages that are being swapped out > and stores them into a dynamically allocated RAM-based memory pool. > Experiments have shown that around 10-15% of pages stored in zswap are > duplicates which results in 10-12% more RAM required to store these > duplicate compressed pages. Around 10-20% of pages stored in zswap > are zero-filled pages, but these pages are handled as normal pages by > compressing and allocating memory in the pool. > > The following patch-set optimizes memory utilized by zswap by avoiding the > storage of duplicate pages and zero-filled pages in zswap compressed memory > pool. > > Patch 1/4: zswap: Share zpool memory of duplicate pages > This patch shares compressed pool memory of the duplicate pages. When a new > page is requested for swap-out to zswap; search for an identical page in > the pages already stored in zswap. If an identical page is found then share > the compressed page data of the identical page with the new page. This > avoids allocation of memory in the compressed pool for a duplicate page. > This feature is tested on devices with 1GB, 2GB and 3GB RAM by executing > performance test at low memory conditions. Around 15-20% of the pages > swapped are duplicate of the pages existing in zswap, resulting in 15% > saving of zswap memory pool when compared to the baseline version. > > Test Parameters BaselineWith patch Improvement > Total RAM 955MB 955MB > Available RAM 254MB 269MB 15MB > Avg. App entry time 2.469sec2.207sec7% > Avg. App close time 1.151sec1.085sec6% > Apps launched in 1sec 5 12 7 > > There is little overhead in zswap store function due to the search > operation for finding duplicate pages. However, if duplicate page is > found it saves the compression and allocation time of the page. 
The average > overhead per zswap_frontswap_store() function call in the experimental > device is 9us. There is no overhead in case of zswap_frontswap_load() > operation. > > Patch 2/4: zswap: Enable/disable sharing of duplicate pages at runtime > This patch adds a module parameter to enable or disable the sharing of > duplicate zswap pages at runtime. > > Patch 3/4: zswap: Zero-filled pages handling > This patch checks if a page to be stored in zswap is a zero-filled page > (i.e. contents of the page are all zeros). If such page is found, > compression and allocation of memory for the compressed page is avoided > and instead the page is just marked as zero-filled page. > Although, compressed size of a zero-filled page using LZO compressor is > very less (52 bytes including zswap_header), this patch saves compression > and allocation time during store operation and decompression time during > zswap load operation for zero-filled pages. Experiments have shown that > around 10-20% of pages stored in zswap are zero-filled. Aren't zero-filled pages already handled by patch 1/4 as their contents match? So the overall memory saving is 52 bytes? - Pekka
Re: [PATCH 3/4] zswap: Zero-filled pages handling
On Wed, Aug 17, 2016 at 1:18 PM, Srividya Desireddy wrote: >> This patch adds a check in zswap_frontswap_store() to identify zero-filled >> page before compression of the page. If the page is a zero-filled page, set >> zswap_entry.zeroflag and skip the compression of the page and alloction >> of memory in zpool. In zswap_frontswap_load(), check if the zeroflag is >> set for the page in zswap_entry. If the flag is set, memset the page with >> zero. This saves the decompression time during load. >> >> The overall overhead caused due to zero-filled page check is very minimal >> when compared to the time saved by avoiding compression and allocation in >> case of zero-filled pages. The load time of a zero-filled page is reduced >> by 80% when compared to baseline. On Wed, Aug 17, 2016 at 3:25 PM, Pekka Enberg wrote: > AFAICT, that's an overall improvement only if there are a lot of > zero-filled pages because it's just overhead for pages that we *need* > to compress, no? So I suppose the question is, are there a lot of > zero-filled pages that we need to swap and why is that the case? I suppose reading your cover letter would have been helpful before sending out my email: "Experiments have shown that around 10-15% of pages stored in zswap are duplicates which results in 10-12% more RAM required to store these duplicate compressed pages." But I still don't understand why we have zero-filled pages that we are swapping out. - Pekka
Re: [PATCH 3/4] zswap: Zero-filled pages handling
On Wed, Aug 17, 2016 at 1:18 PM, Srividya Desireddy wrote: > This patch adds a check in zswap_frontswap_store() to identify zero-filled > page before compression of the page. If the page is a zero-filled page, set > zswap_entry.zeroflag and skip the compression of the page and alloction > of memory in zpool. In zswap_frontswap_load(), check if the zeroflag is > set for the page in zswap_entry. If the flag is set, memset the page with > zero. This saves the decompression time during load. > > The overall overhead caused due to zero-filled page check is very minimal > when compared to the time saved by avoiding compression and allocation in > case of zero-filled pages. The load time of a zero-filled page is reduced > by 80% when compared to baseline. AFAICT, that's an overall improvement only if there are a lot of zero-filled pages because it's just overhead for pages that we *need* to compress, no? So I suppose the question is, are there a lot of zero-filled pages that we need to swap and why is that the case? > @@ -1314,6 +1347,13 @@ static int zswap_frontswap_load(unsigned type, pgoff_t > offset, > } > spin_unlock(&tree->lock); > > + if (entry->zeroflag) { > + dst = kmap_atomic(page); > + memset(dst, 0, PAGE_SIZE); > + kunmap_atomic(dst); > + goto freeentry; > + } Don't we need the same thing in zswap_writeback_entry() for the ZSWAP_SWAPCACHE_NEW case? > + > /* decompress */ > dlen = PAGE_SIZE; > src = (u8 *)zpool_map_handle(entry->pool->zpool, > entry->zhandle->handle, > @@ -1327,6 +1367,7 @@ static int zswap_frontswap_load(unsigned type, pgoff_t > offset, > zpool_unmap_handle(entry->pool->zpool, entry->zhandle->handle); > BUG_ON(ret); - Pekka
Re: [PATCH, REGRESSION v3] mm: make apply_to_page_range more robust
On 01/22/2016 01:12 AM, David Rientjes wrote: NACK to your patch as it is just covering up buggy code silently. The problem needs to be addressed in change_memory_common() to return if there is no size to change (numpages == 0). It's a two-line fix to that function. So add a WARN_ON there to *warn* about such situations. There's really no need to BUG_ON here. - Pekka
Re: [PATCH] mm: slab: convert slab_is_available to boolean
On 9/15/15 8:50 PM, Denis Kirjanov wrote: A good one candidate to return a boolean result Signed-off-by: Denis Kirjanov Reviewed-by: Pekka Enberg -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 04/14] cxgb4: Use kvfree() in t4_free_mem()
Use kvfree() instead of open-coding it.

Cc: Hariprasad S
Signed-off-by: Pekka Enberg
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 803d91b..9143917 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -1121,10 +1121,7 @@ void *t4_alloc_mem(size_t size)
  */
 void t4_free_mem(void *addr)
 {
-	if (is_vmalloc_addr(addr))
-		vfree(addr);
-	else
-		kfree(addr);
+	kvfree(addr);
 }
 
 /* Send a Work Request to write the filter at a specified index. We construct
-- 
2.1.0
[PATCH 07/14] drm/nouveau/gem: Use kvfree() in u_free()
Use kvfree() instead of open-coding it.

Cc: David Airlie
Signed-off-by: Pekka Enberg
---
 drivers/gpu/drm/nouveau/nouveau_gem.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_gem.c b/drivers/gpu/drm/nouveau/nouveau_gem.c
index 0e690bf..af1ee51 100644
--- a/drivers/gpu/drm/nouveau/nouveau_gem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_gem.c
@@ -555,10 +555,7 @@ nouveau_gem_pushbuf_validate(struct nouveau_channel *chan,
 static inline void
 u_free(void *addr)
 {
-	if (!is_vmalloc_addr(addr))
-		kfree(addr);
-	else
-		vfree(addr);
+	kvfree(addr);
 }
 
 static inline void *
-- 
2.1.0
[PATCH 03/14] cxgb3: Use kvfree() in cxgb_free_mem()
Use kvfree() instead of open-coding it.

Cc: Santosh Raspatur
Signed-off-by: Pekka Enberg
---
 drivers/net/ethernet/chelsio/cxgb3/cxgb3_offload.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb3/cxgb3_offload.c b/drivers/net/ethernet/chelsio/cxgb3/cxgb3_offload.c
index b0cbb2b..76684dc 100644
--- a/drivers/net/ethernet/chelsio/cxgb3/cxgb3_offload.c
+++ b/drivers/net/ethernet/chelsio/cxgb3/cxgb3_offload.c
@@ -1169,10 +1169,7 @@ void *cxgb_alloc_mem(unsigned long size)
  */
 void cxgb_free_mem(void *addr)
 {
-	if (is_vmalloc_addr(addr))
-		vfree(addr);
-	else
-		kfree(addr);
+	kvfree(addr);
 }
 
 /*
-- 
2.1.0
[PATCH 09/14] IB/ehca: Use kvfree() in ipz_queue_{cd}tor()
Use kvfree() instead of open-coding it.

Cc: Hoang-Nam Nguyen
Cc: Christoph Raisch
Signed-off-by: Pekka Enberg
---
 drivers/infiniband/hw/ehca/ipz_pt_fn.c | 10 ++
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ipz_pt_fn.c b/drivers/infiniband/hw/ehca/ipz_pt_fn.c
index 8d59451..7ffc748 100644
--- a/drivers/infiniband/hw/ehca/ipz_pt_fn.c
+++ b/drivers/infiniband/hw/ehca/ipz_pt_fn.c
@@ -245,10 +245,7 @@ int ipz_queue_ctor(struct ehca_pd *pd, struct ipz_queue *queue,
 ipz_queue_ctor_exit0:
 	ehca_gen_err("Couldn't alloc pages queue=%p "
 		 "nr_of_pages=%x", queue, nr_of_pages);
-	if (is_vmalloc_addr(queue->queue_pages))
-		vfree(queue->queue_pages);
-	else
-		kfree(queue->queue_pages);
+	kvfree(queue->queue_pages);
 
 	return 0;
 }
@@ -270,10 +267,7 @@ int ipz_queue_dtor(struct ehca_pd *pd, struct ipz_queue *queue)
 			free_page((unsigned long)queue->queue_pages[i]);
 	}
 
-	if (is_vmalloc_addr(queue->queue_pages))
-		vfree(queue->queue_pages);
-	else
-		kfree(queue->queue_pages);
+	kvfree(queue->queue_pages);
 
 	return 1;
 }
-- 
2.1.0
[PATCH 08/14] drivers/input/evdev.c: Use kvfree() in evdev_release()
Use kvfree() instead of open-coding it.

Cc: Dmitry Torokhov
Signed-off-by: Pekka Enberg
---
 drivers/input/evdev.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/input/evdev.c b/drivers/input/evdev.c
index a18f41b..9d35499 100644
--- a/drivers/input/evdev.c
+++ b/drivers/input/evdev.c
@@ -422,10 +422,7 @@ static int evdev_release(struct inode *inode, struct file *file)
 
 	evdev_detach_client(evdev, client);
 
-	if (is_vmalloc_addr(client))
-		vfree(client);
-	else
-		kfree(client);
+	kvfree(client);
 
 	evdev_close_device(evdev);
 
-- 
2.1.0
[PATCH 01/14] ipc/util.c: Use kvfree() in ipc_rcu_free()
Use kvfree() instead of open-coding it.

Signed-off-by: Pekka Enberg
---
 ipc/util.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/ipc/util.c b/ipc/util.c
index ff3323e..537a41c 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -467,10 +467,7 @@ void ipc_rcu_free(struct rcu_head *head)
 {
 	struct ipc_rcu *p = container_of(head, struct ipc_rcu, rcu);
 
-	if (is_vmalloc_addr(p))
-		vfree(p);
-	else
-		kfree(p);
+	kvfree(p);
 }
 
 /**
-- 
2.1.0
[PATCH 10/14] target: Use kvfree() in session alloc and free
Use kvfree() instead of open-coding it.

Cc: "Nicholas A. Bellinger"
Signed-off-by: Pekka Enberg
---
 drivers/target/target_core_transport.c | 10 ++
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
index 3fe5cb2..9c058c0 100644
--- a/drivers/target/target_core_transport.c
+++ b/drivers/target/target_core_transport.c
@@ -280,10 +280,7 @@ int transport_alloc_session_tags(struct se_session *se_sess,
 	if (rc < 0) {
 		pr_err("Unable to init se_sess->sess_tag_pool,"
 			" tag_num: %u\n", tag_num);
-		if (is_vmalloc_addr(se_sess->sess_cmd_map))
-			vfree(se_sess->sess_cmd_map);
-		else
-			kfree(se_sess->sess_cmd_map);
+		kvfree(se_sess->sess_cmd_map);
 		se_sess->sess_cmd_map = NULL;
 		return -ENOMEM;
 	}
@@ -490,10 +487,7 @@ void transport_free_session(struct se_session *se_sess)
 {
 	if (se_sess->sess_cmd_map) {
 		percpu_ida_destroy(&se_sess->sess_tag_pool);
-		if (is_vmalloc_addr(se_sess->sess_cmd_map))
-			vfree(se_sess->sess_cmd_map);
-		else
-			kfree(se_sess->sess_cmd_map);
+		kvfree(se_sess->sess_cmd_map);
 	}
 	kmem_cache_free(se_sess_cache, se_sess);
 }
-- 
2.1.0
[PATCH 12/14] dm: Use kvfree() in dm_kvfree()
Use kvfree() instead of open-coding it.

Cc: Alasdair Kergon
Cc: Mike Snitzer
Signed-off-by: Pekka Enberg
---
 drivers/md/dm-stats.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/md/dm-stats.c b/drivers/md/dm-stats.c
index f478a4c..492fe6a 100644
--- a/drivers/md/dm-stats.c
+++ b/drivers/md/dm-stats.c
@@ -160,10 +160,7 @@ static void dm_kvfree(void *ptr, size_t alloc_size)
 
 	free_shared_memory(alloc_size);
 
-	if (is_vmalloc_addr(ptr))
-		vfree(ptr);
-	else
-		kfree(ptr);
+	kvfree(ptr);
 }
 
 static void dm_stat_free(struct rcu_head *head)
-- 
2.1.0
[PATCH 11/14] libcxgbi: Use kvfree() in cxgbi_free_big_mem()
Use kvfree() instead of open-coding it.

Cc: "James E.J. Bottomley"
Signed-off-by: Pekka Enberg
---
 drivers/scsi/cxgbi/libcxgbi.h | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/scsi/cxgbi/libcxgbi.h b/drivers/scsi/cxgbi/libcxgbi.h
index aba1af7..c2eb7ea 100644
--- a/drivers/scsi/cxgbi/libcxgbi.h
+++ b/drivers/scsi/cxgbi/libcxgbi.h
@@ -685,10 +685,7 @@ static inline void *cxgbi_alloc_big_mem(unsigned int size,
 
 static inline void cxgbi_free_big_mem(void *addr)
 {
-	if (is_vmalloc_addr(addr))
-		vfree(addr);
-	else
-		kfree(addr);
+	kvfree(addr);
 }
 
 static inline void cxgbi_set_iscsi_ipv4(struct cxgbi_hba *chba, __be32 ipaddr)
-- 
2.1.0
[PATCH 13/14] bcache: Use kvfree() in various places
Use kvfree() instead of open-coding it. Cc: Kent Overstreet Signed-off-by: Pekka Enberg --- drivers/md/bcache/super.c | 10 ++ drivers/md/bcache/util.h | 10 ++ 2 files changed, 4 insertions(+), 16 deletions(-) diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 4dd2bb7..94980bf 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -760,14 +760,8 @@ static void bcache_device_free(struct bcache_device *d) bio_split_pool_free(&d->bio_split_hook); if (d->bio_split) bioset_free(d->bio_split); - if (is_vmalloc_addr(d->full_dirty_stripes)) - vfree(d->full_dirty_stripes); - else - kfree(d->full_dirty_stripes); - if (is_vmalloc_addr(d->stripe_sectors_dirty)) - vfree(d->stripe_sectors_dirty); - else - kfree(d->stripe_sectors_dirty); + kvfree(d->full_dirty_stripes); + kvfree(d->stripe_sectors_dirty); closure_debug_destroy(&d->cl); } diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h index 98df757..1d04c48 100644 --- a/drivers/md/bcache/util.h +++ b/drivers/md/bcache/util.h @@ -52,10 +52,7 @@ struct closure; #define free_heap(heap) \ do { \ - if (is_vmalloc_addr((heap)->data)) \ - vfree((heap)->data);\ - else\ - kfree((heap)->data);\ + kvfree((heap)->data); \ (heap)->data = NULL;\ } while (0) @@ -163,10 +160,7 @@ do { \ #define free_fifo(fifo) \ do { \ - if (is_vmalloc_addr((fifo)->data)) \ - vfree((fifo)->data);\ - else\ - kfree((fifo)->data);\ + kvfree((fifo)->data); \ (fifo)->data = NULL;\ } while (0) -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 14/14] NTFS: Use kvfree() in ntfs_free()
Use kvfree() instead of open-coding it. Cc: Anton Altaparmakov Signed-off-by: Pekka Enberg --- fs/ntfs/malloc.h | 7 +-- 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/fs/ntfs/malloc.h b/fs/ntfs/malloc.h index a44b14c..ab172e5 100644 --- a/fs/ntfs/malloc.h +++ b/fs/ntfs/malloc.h @@ -85,12 +85,7 @@ static inline void *ntfs_malloc_nofs_nofail(unsigned long size) static inline void ntfs_free(void *addr) { - if (!is_vmalloc_addr(addr)) { - kfree(addr); - /* free_page((unsigned long)addr); */ - return; - } - vfree(addr); + kvfree(addr); } #endif /* _LINUX_NTFS_MALLOC_H */ -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 06/14] drm: Use kvfree() in drm_free_large()
Use kvfree() instead of open-coding it. Cc: David Airlie Signed-off-by: Pekka Enberg --- include/drm/drm_mem_util.h | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/include/drm/drm_mem_util.h b/include/drm/drm_mem_util.h index 19a2404..e42495a 100644 --- a/include/drm/drm_mem_util.h +++ b/include/drm/drm_mem_util.h @@ -56,10 +56,7 @@ static __inline__ void *drm_malloc_ab(size_t nmemb, size_t size) static __inline void drm_free_large(void *ptr) { - if (!is_vmalloc_addr(ptr)) - return kfree(ptr); - - vfree(ptr); + kvfree(ptr); } #endif -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 02/14] kernel/relay.c: Use kvfree() in relay_free_page_array()
Use kvfree() instead of open-coding it. Signed-off-by: Pekka Enberg --- kernel/relay.c | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/kernel/relay.c b/kernel/relay.c index e9dbaeb..0b4570c 100644 --- a/kernel/relay.c +++ b/kernel/relay.c @@ -81,10 +81,7 @@ static struct page **relay_alloc_page_array(unsigned int n_pages) */ static void relay_free_page_array(struct page **array) { - if (is_vmalloc_addr(array)) - vfree(array); - else - kfree(array); + kvfree(array); } /** -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 05/14] ceph: Use kvfree() in ceph_put_page_vector()
Use kvfree instead of open-coding it. Cc: "Yan, Zheng" Cc: Sage Weil Signed-off-by: Pekka Enberg --- net/ceph/pagevec.c | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/net/ceph/pagevec.c b/net/ceph/pagevec.c index 096d914..d4f5f22 100644 --- a/net/ceph/pagevec.c +++ b/net/ceph/pagevec.c @@ -51,10 +51,7 @@ void ceph_put_page_vector(struct page **pages, int num_pages, bool dirty) set_page_dirty_lock(pages[i]); put_page(pages[i]); } - if (is_vmalloc_addr(pages)) - vfree(pages); - else - kfree(pages); + kvfree(pages); } EXPORT_SYMBOL(ceph_put_page_vector); -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
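Every patch in this series (target, dm, libcxgbi, bcache, NTFS, drm, relay, ceph) collapses the same open-coded branch into one kvfree() call. A userspace sketch of why the conversions are behavior-preserving — the kernel primitives are stubbed out here with call counters, since the real is_vmalloc_addr()/vfree()/kfree() live in the kernel:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the kernel primitives, instrumented with call counters.
 * This only models the dispatch; nothing is actually freed. */
static int vfree_calls, kfree_calls;
static char vmalloc_area[64];   /* pretend this range came from vmalloc() */

static bool is_vmalloc_addr(const void *p)
{
	const char *c = p;
	return c >= vmalloc_area && c < vmalloc_area + sizeof(vmalloc_area);
}

static void vfree(const void *p) { (void)p; vfree_calls++; }
static void kfree(const void *p) { (void)p; kfree_calls++; }

/* The open-coded pattern this series removes: */
static void free_open_coded(const void *p)
{
	if (is_vmalloc_addr(p))
		vfree(p);
	else
		kfree(p);
}

/* kvfree() in mm/util.c has essentially this body, so each conversion
 * keeps the exact same dispatch while deleting the duplicated branch. */
static void kvfree(const void *p)
{
	if (is_vmalloc_addr(p))
		vfree(p);
	else
		kfree(p);
}
```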
Re: [PATCH v2] perf kmem: Show warning when trying to run stat without record
On 05/05/2015 05:07 PM, Arnaldo Carvalho de Melo wrote:

Em Tue, May 05, 2015 at 09:58:12AM +0900, Namhyung Kim escreveu:

Sometimes one can mistakenly run perf kmem stat without perf kmem record
before, or with a different configuration, like recording --slab and
stat --page. Show a warning message like below to inform the user:

  # perf kmem stat --page --caller
  Not found page events. Have you run 'perf kmem record --page' before?

Acked-by: Pekka Enberg
Signed-off-by: Namhyung Kim

Thanks, applied. I just found the messages a bit odd sounding, perhaps:

  # perf kmem stat --page --caller
  No page allocation events found. Have you run 'perf kmem record --page'?

Pekka?

Sure, that sounds less confusing.

- Pekka
Re: [PATCH 0/13] Parallel struct page initialisation v4
On Tue, Apr 28, 2015 at 5:36 PM, Mel Gorman wrote:
> Struct page initialisation had been identified as one of the reasons why
> large machines take a long time to boot. Patches were posted a long time ago
> to defer initialisation until they were first used. This was rejected on
> the grounds it should not be necessary to hurt the fast paths. This series
> reuses much of the work from that time but defers the initialisation of
> memory to kswapd so that one thread per node initialises memory local to
> that node.
>
> After applying the series and setting the appropriate Kconfig variable I
> see this in the boot log on a 64G machine
>
> [7.383764] kswapd 0 initialised deferred memory in 188ms
> [7.404253] kswapd 1 initialised deferred memory in 208ms
> [7.411044] kswapd 3 initialised deferred memory in 216ms
> [7.411551] kswapd 2 initialised deferred memory in 216ms
>
> On a 1TB machine, I see
>
> [8.406511] kswapd 3 initialised deferred memory in 1116ms
> [8.428518] kswapd 1 initialised deferred memory in 1140ms
> [8.435977] kswapd 0 initialised deferred memory in 1148ms
> [8.437416] kswapd 2 initialised deferred memory in 1148ms
>
> Once booted the machine appears to work as normal. Boot times were measured
> from the time shutdown was called until ssh was available again. In the
> 64G case, the boot time savings are negligible. On the 1TB machine, the
> savings were 16 seconds.

FWIW,

Acked-by: Pekka Enberg

for the whole series.

- Pekka
Re: [PATCHSET 0/6] perf kmem: Implement page allocation analysis (v7)
On Tue, Apr 14, 2015 at 5:52 AM, Namhyung Kim wrote:
> Currently perf kmem command only analyzes SLAB memory allocation. And
> I'd like to introduce page allocation analysis also. Users can use
> --slab and/or --page option to select it. If none of these options
> are used, it does slab allocation analysis for backward compatibility.

Nice addition!

Acked-by: Pekka Enberg for the whole series.

- Pekka
Re: [PATCH v6 2/4] perf tools: add Java demangling support
Hi Stephane,

On Tue, Mar 31, 2015 at 1:19 AM, Stephane Eranian wrote:
> +#define BASE_ENT(c, n) [c-'A']=n
> +static const char *base_types['Z'-'A' + 1]={
> +	BASE_ENT('B', "byte" ),
> +	BASE_ENT('C', "char" ),
> +	BASE_ENT('D', "double" ),
> +	BASE_ENT('F', "float" ),
> +	BASE_ENT('I', "int" ),
> +	BASE_ENT('J', "long" ),
> +	BASE_ENT('S', "short" ),
> +	BASE_ENT('Z', "bool" ),

It's "boolean", not "bool" in JVM speak.

> +static char *
> +__demangle_java_sym(const char *str, const char *end, char *buf, int maxlen, int mode)
> +{
> +	int rlen = 0;
> +	int array = 0;
> +	int narg = 0;
> +	const char *q;
> +
> +	if (!end)
> +		end = str + strlen(str);
> +
> +	for (q = str; q != end; q++) {
> +
> +		if (rlen == (maxlen - 1))
> +			break;
> +
> +		switch (*q) {
> +		case 'L':
> +			if (mode == MODE_PREFIX || mode == MODE_CTYPE) {
> +				if (mode == MODE_CTYPE) {
> +					if (narg)
> +						rlen += scnprintf(buf + rlen, maxlen - rlen, ", ");
> +					narg++;
> +				}
> +				rlen += scnprintf(buf + rlen, maxlen - rlen, "class ");
> +				if (mode == MODE_PREFIX)
> +					mode = MODE_CLASS;
> +			} else
> +				buf[rlen++] = *q;
> +			break;

This looks odd to me. "L" marks the beginning of a class name and it's
terminated by ";". You could just strchr() the terminator and simply
copy the name to "buf", and drop cases ';', '/', and the default label
from the switch statement.

> +char *
> +java_demangle_sym(const char *str, int flags)
> +{

[snip]

> +	/*
> +	 * expansion factor estimated to 3x
> +	 */
> +	len = strlen(str) * 3 + 1;
> +	buf = malloc(len);
> +	if (!buf)
> +		return NULL;

Truncated symbols are lame. Can't you use realloc() to ensure that never
happens?

- Pekka
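The review point about 'Z' can be made concrete with a standalone version of the table: per JVM terminology the descriptor 'Z' denotes "boolean". The lookup helper below is illustrative only, not part of the posted patch:

```c
#include <assert.h>
#include <string.h>

/* JVM primitive-type descriptors, indexed by descriptor letter.
 * 'Z' maps to "boolean" -- the JVM term -- not "bool". */
#define BASE_ENT(c, n) [c - 'A'] = n
static const char *base_types['Z' - 'A' + 1] = {
	BASE_ENT('B', "byte"),
	BASE_ENT('C', "char"),
	BASE_ENT('D', "double"),
	BASE_ENT('F', "float"),
	BASE_ENT('I', "int"),
	BASE_ENT('J', "long"),
	BASE_ENT('S', "short"),
	BASE_ENT('Z', "boolean"),
};

/* Look up a descriptor letter; returns NULL for letters with no entry
 * (the zero-initialized gaps in the table, e.g. 'L', which starts an
 * object type and is handled separately by the demangler). */
static const char *lookup_base_type(char c)
{
	if (c < 'A' || c > 'Z')
		return NULL;
	return base_types[c - 'A'];
}
```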
Re: stand-alone kvmtool
On 2/26/15 1:02 PM, Alex Bennée wrote:

If you can have it all it would be nice to preserve buildability all
through your history for bisecting (and the moon on a stick please ;-)

Is the dependency on the kernel sources something that has been stable
over the project's history or something that's been declining/increasing
over time?

AFAICT, it's been mostly stable after the initial sweep to use
kernel-provided data structures such as lists and rbtrees.

- Pekka
Re: stand-alone kvmtool
Hi,

On 2/18/15 5:50 PM, Will Deacon wrote:

Thanks for doing this. Since it looks unlikely that kvmtool will ever be
merged back into the kernel tree, it makes sense to cut the dependency
in my opinion. I am certainly OK with a standalone repository which
preserves the history.

Will, would you like to take over the proposed new repository and put it
somewhere on git.kernel.org, perhaps?

- Pekka
Re: [PATCH v3 1/5] slab: Correct size_index table before replacing the bootstrap kmem_cache_node.
On 02/04/2015 11:06 PM, Daniel Sanders wrote: This patch moves the initialization of the size_index table slightly earlier so that the first few kmem_cache_node's can be safely allocated when KMALLOC_MIN_SIZE is large. There are currently two ways to generate indices into kmalloc_caches (via kmalloc_index() and via the size_index table in slab_common.c) and on some arches (possibly only MIPS) they potentially disagree with each other until create_kmalloc_caches() has been called. It seems that the intention is that the size_index table is a fast equivalent to kmalloc_index() and that create_kmalloc_caches() patches the table to return the correct value for the cases where kmalloc_index()'s if-statements apply. The failing sequence was: * kmalloc_caches contains NULL elements * kmem_cache_init initialises the element that 'struct kmem_cache_node' will be allocated to. For 32-bit Mips, this is a 56-byte struct and kmalloc_index returns KMALLOC_SHIFT_LOW (7). * init_list is called which calls kmalloc_node to allocate a 'struct kmem_cache_node'. * kmalloc_slab selects the kmem_caches element using size_index[size_index_elem(size)]. For MIPS, size is 56, and the expression returns 6. * This element of kmalloc_caches is NULL and allocation fails. * If it had not already failed, it would have called create_kmalloc_caches() at this point which would have changed size_index[size_index_elem(size)] to 7. Signed-off-by: Daniel Sanders Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Andrew Morton Cc: linux...@kvack.org Cc: linux-kernel@vger.kernel.org Acked-by: Pekka Enberg -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
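The failing sequence in the changelog can be modeled in isolation. A hedged userspace sketch — constants chosen to mimic the 32-bit MIPS configuration described above, where KMALLOC_SHIFT_LOW is 7 and the smallest real cache is kmalloc-128; the table values and size_index_elem() follow mm/slab_common.c, while the fixup loop is a simplified stand-in for what create_kmalloc_caches() does:

```c
#include <assert.h>

#define KMALLOC_SHIFT_LOW 7              /* first real cache: kmalloc-128 */
#define KMALLOC_MIN_SIZE  (1 << KMALLOC_SHIFT_LOW)

/* Bootstrap size_index table: maps (size-1)/8 to a kmalloc cache index.
 * Entry 6 covers sizes 49..56 and points at cache 6 (kmalloc-64). */
static signed char size_index[24] = {
	3, 4, 5, 5, 6, 6, 6, 6,          /* sizes   1..64  */
	1, 1, 1, 1, 7, 7, 7, 7,          /* sizes  65..128 */
	2, 2, 2, 2, 2, 2, 2, 2,          /* sizes 129..192 */
};

static int size_index_elem(unsigned int bytes)
{
	return (bytes - 1) / 8;
}

/* Sketch of the fixup applied during kmalloc cache creation: every size
 * below KMALLOC_MIN_SIZE must fall through to the smallest real cache.
 * Before this runs, a 56-byte allocation resolves to cache 6, which is
 * NULL when KMALLOC_MIN_SIZE is 128 -- the reported failure. */
static void fixup_size_index(void)
{
	unsigned int i;

	for (i = 8; i < KMALLOC_MIN_SIZE; i += 8)
		size_index[size_index_elem(i)] = KMALLOC_SHIFT_LOW;
}
```

The patch under discussion moves this fixup earlier so the bootstrap kmem_cache_node allocation already sees the corrected table.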
Re: [PATCH 1/5] LLVMLinux: Correct size_index table before replacing the bootstrap kmem_cache_node.
On Wed, Feb 4, 2015 at 10:38 PM, Daniel Sanders wrote:
> I don't believe the bug to be LLVM specific but GCC doesn't normally
> encounter the problem. I haven't been able to identify exactly what GCC is
> doing better (probably inlining) but it seems that GCC is managing to
> optimize to the point that it eliminates the problematic allocations. This
> theory is supported by the fact that GCC can be made to fail in the same way
> by changing inline, __inline, __inline__, and __always_inline in
> include/linux/compiler-gcc.h such that they don't actually inline things.

OK, makes sense. Please include that explanation in the changelog, and
drop the confusing "LLVMLinux" prefix from the subject line in favor of
the proper "slab" prefix.

- Pekka
Re: [PATCH 1/5] LLVMLinux: Correct size_index table before replacing the bootstrap kmem_cache_node.
On 2/3/15 3:37 PM, Daniel Sanders wrote:

This patch moves the initialization of the size_index table slightly
earlier so that the first few kmem_cache_node's can be safely allocated
when KMALLOC_MIN_SIZE is large.

The patch looks OK to me but how is this related to LLVM?

- Pekka
Re: [PATCH -mm v2 1/3] slub: never fail to shrink cache
On 1/28/15 6:31 PM, Christoph Lameter wrote:

On Wed, 28 Jan 2015, Vladimir Davydov wrote:

This patch therefore makes __kmem_cache_shrink() allocate the array on
stack instead of calling kmalloc, which may fail. The array size is
chosen to be equal to 32, because most SLUB caches store not more than
32 objects per slab page. Slab pages with <= 32 free objects are sorted
using the array by the number of objects in use and promoted to the head
of the partial list, while slab pages with > 32 free objects are left in
the end of the list without any ordering imposed on them.

Acked-by: Christoph Lameter

Acked-by: Pekka Enberg
Re: [PATCH 3/7] slub: Do not use c->page on free
On Wed, Dec 10, 2014 at 7:08 PM, Christoph Lameter wrote:
>> > +{
>> > +	long d = p - page->address;
>> > +
>> > +	return d > 0 && d < (1 << MAX_ORDER) && d < (compound_order(page) << PAGE_SHIFT);
>> > +}
>>
>> Can you elaborate on what this is doing? I don't really understand it.
>
> Checks if the pointer points to the slab page. Also it tries to avoid
> having to call compound_order needlessly. Not sure if that optimization is
> worth it.

Aah, it's the (1 << MAX_ORDER) optimization that confused me. Perhaps add
a comment there to make it more obvious? I'm fine with the optimization:

Reviewed-by: Pekka Enberg

- Pekka
Re: [PATCH 6/7] slub: Drop ->page field from kmem_cache_cpu
On Wed, Dec 10, 2014 at 6:30 PM, Christoph Lameter wrote: > Dropping the page field is possible since the page struct address > of an object or a freelist pointer can now always be calcualted from > the address. No freelist pointer will be NULL anymore so use > NULL to signify the condition that the current cpu has no > percpu slab attached to it. > > Signed-off-by: Christoph Lameter Reviewed-by: Pekka Enberg -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 5/7] slub: Use end_token instead of NULL to terminate freelists
On Wed, Dec 10, 2014 at 6:30 PM, Christoph Lameter wrote: > Ending a list with NULL means that the termination of a list is the same > for all slab pages. The pointers of freelists otherwise always are > pointing to the address space of the page. Make termination of a > list possible by setting the lowest bit in the freelist address > and use the start address of a page if no other address is available > for list termination. > > This will allow us to determine the page struct address from a > freelist pointer in the future. > > Signed-off-by: Christoph Lameter Reviewed-by: Pekka Enberg -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/7] slub: Avoid using the page struct address in allocation fastpath
On Wed, Dec 10, 2014 at 6:30 PM, Christoph Lameter wrote: > We can use virt_to_page there and only invoke the costly function if > actually a node is specified and we have to check the NUMA locality. > > Increases the cost of allocating on a specific NUMA node but then that > was never cheap since we may have to dump our caches and retrieve memory > from the correct node. > > Signed-off-by: Christoph Lameter > > Index: linux/mm/slub.c > === > --- linux.orig/mm/slub.c2014-12-09 12:27:49.414686959 -0600 > +++ linux/mm/slub.c 2014-12-09 12:27:49.414686959 -0600 > @@ -2097,6 +2097,15 @@ static inline int node_match(struct page > return 1; > } > > +static inline int node_match_ptr(void *p, int node) > +{ > +#ifdef CONFIG_NUMA > + if (!p || (node != NUMA_NO_NODE && page_to_nid(virt_to_page(p)) != > node)) You already test that object != NULL before calling node_match_ptr(). > + return 0; > +#endif > + return 1; > +} > + > #ifdef CONFIG_SLUB_DEBUG > static int count_free(struct page *page) > { > @@ -2410,7 +2419,7 @@ redo: > > object = c->freelist; > page = c->page; > - if (unlikely(!object || !node_match(page, node))) { > + if (unlikely(!object || !node_match_ptr(object, node))) { > object = __slab_alloc(s, gfpflags, node, addr, c); > stat(s, ALLOC_SLOWPATH); > } else { > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majord...@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: mailto:"d...@kvack.org";> em...@kvack.org -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/7] slub: Do not use c->page on free
On Wed, Dec 10, 2014 at 6:30 PM, Christoph Lameter wrote: > Avoid using the page struct address on free by just doing an > address comparison. That is easily doable now that the page address > is available in the page struct and we already have the page struct > address of the object to be freed calculated. > > Signed-off-by: Christoph Lameter > > Index: linux/mm/slub.c > === > --- linux.orig/mm/slub.c2014-12-09 12:25:45.770405462 -0600 > +++ linux/mm/slub.c 2014-12-09 12:25:45.766405582 -0600 > @@ -2625,6 +2625,13 @@ slab_empty: > discard_slab(s, page); > } > > +static bool same_slab_page(struct kmem_cache *s, struct page *page, void *p) Why are you passing a pointer to struct kmem_cache here? You don't seem to use it. > +{ > + long d = p - page->address; > + > + return d > 0 && d < (1 << MAX_ORDER) && d < (compound_order(page) << > PAGE_SHIFT); > +} Can you elaborate on what this is doing? I don't really understand it. - Pekka -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
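A hedged model of the check being questioned above: `d` is the object's byte offset from the start of the slab page, and the constant bound is checked first so most non-matching pointers never reach the costlier compound-order lookup. The constants and the order stub below are assumptions for the sketch, and the bounds express the intended semantics (offset within the compound page's size) rather than the literal posted expression:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define MAX_ORDER  11

/* Stub: order of the compound (multi-page) slab; the kernel reads this
 * from struct page instead of a global. */
static unsigned int page_order;

static unsigned int compound_order(void) { return page_order; }

/* Is the object pointer inside the slab page that starts at 'base'?
 * The constant PAGE_SIZE << MAX_ORDER bound is a cheap first filter;
 * only pointers that pass it pay for the compound_order() lookup.
 * (Offsets start at 0 here; the posted patch used d > 0.) */
static bool same_slab_page(const char *base, const char *p)
{
	ptrdiff_t d = p - base;

	return d >= 0 && d < (ptrdiff_t)(PAGE_SIZE << MAX_ORDER) &&
	       d < (ptrdiff_t)(PAGE_SIZE << compound_order());
}
```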
Re: [PATCH 2/7] slub: Use page-mapping to store address of page frame like done in SLAB
On Wed, Dec 10, 2014 at 6:30 PM, Christoph Lameter wrote: > SLAB uses the mapping field of the page struct to store a pointer to the > begining of the objects in the page frame. Use the same field to store > the address of the objects in SLUB as well. This allows us to avoid a > number of invocations of page_address(). Those are mostly only used for > debugging though so this should have no performance benefit. > > Signed-off-by: Christoph Lameter Reviewed-by: Pekka Enberg -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/7] slub: Remove __slab_alloc code duplication
On Wed, Dec 10, 2014 at 6:30 PM, Christoph Lameter wrote: > Somehow the two branches in __slab_alloc do the same. > Unify them. > > Signed-off-by: Christoph Lameter Reviewed-by: Pekka Enberg -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] slab: Fix nodeid bounds check for non-contiguous node IDs
On 12/1/14 6:28 AM, Paul Mackerras wrote:

---
v2: include the oops message in the patch description

 mm/slab.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/slab.c b/mm/slab.c
index eb2b2ea..f34e053 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3076,7 +3076,7 @@ static void *cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
 	void *obj;
 	int x;
 
-	VM_BUG_ON(nodeid > num_online_nodes());
+	VM_BUG_ON(nodeid < 0 || nodeid >= MAX_NUMNODES);
 
 	n = get_node(cachep, nodeid);
 	BUG_ON(!n);

Reviewed-by: Pekka Enberg
Re: [PATCH] SLAB Maintainer update
On 06/04/2014 10:20 PM, Christoph Lameter wrote:

As discussed in various threads on the side: Remove one inactive
maintainer, add two new ones and update my email address. Plus add
Andrew. And fix the glob to include files like mm/slab_common.c

Signed-off-by: Christoph Lameter

Acked-by: Pekka Enberg
Re: slab_common: fix the check for duplicate slab names
On 05/23/2014 11:16 PM, Mike Snitzer wrote: On Tue, Mar 25 2014 at 2:07pm -0400, Christoph Lameter wrote: On Tue, 25 Mar 2014, Mike Snitzer wrote: This patch still isn't upstream. Who should be shepherding it to Linus? Pekka usually does that. Acked-by: Christoph Lameter This still hasn't gotten upstream. Pekka, any chance you can pick it up? Here it is in dm-devel's kernel.org patchwork: https://patchwork.kernel.org/patch/3768901/ (Though it looks like it needs to be rebased due to the recent commit 794b1248, should Mikulas rebase and re-send?) I applied it and fixed the conflict by hand. Please double-check commit 694617474e33b8603fc76e090ed7d09376514b1a in my tree: https://git.kernel.org/cgit/linux/kernel/git/penberg/linux.git/ - Pekka -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: randconfig build error with next-20140512, in mm/slub.c
On 05/12/2014 09:47 PM, Christoph Lameter wrote: A patch was posted today for this issue. AFAICT, it's coming from -mm. Andrew, can you pick up the fix? Date: Mon, 12 May 2014 09:36:30 -0300 From: Fabio Estevam To: a...@linux-foundation.org Cc: linux...@kvack.org, feste...@gmail.com, Fabio Estevam ,Christoph Lameter , David Rientjes , Pekka Enberg Subject: [PATCH] mm: slub: Place count_partial() outside CONFIG_SLUB_DEBUG if block On Mon, 12 May 2014, Jim Davis wrote: Building with the attached random configuration file, mm/slub.c: In function ‘show_slab_objects’: mm/slub.c:4361:5: error: implicit declaration of function ‘count_partial’ [-Werr or=implicit-function-declaration] x = count_partial(n, count_total); ^ cc1: some warnings being treated as errors make[1]: *** [mm/slub.o] Error 1 > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] slab: Fix off by one in object max number tests.
On Tue, May 6, 2014 at 6:32 AM, Linus Torvalds wrote:
> On Mon, May 5, 2014 at 8:25 PM, David Miller wrote:
>>
>>> Sam Ravnborg wrote:
>>>> There is a related patch in this area which I think is not yet applied.
>>>> See: https://lkml.org/lkml/2014/4/18/28
>>>> Maybe this is related.
>>>
>>> Thanks, I'm testing with that patch added as well.
>>
>> I can confirm that this fixes my problems completely.
>
> Ugh. That patch has been out two weeks already. Pekka?

Sorry, I missed that completely.

> I guess I'll apply both patches directly.

Thanks!
[PROBLEM] perf report --gtk doesn't work
Hello,

I'm seeing the following with v3.15-rc2:

  $ ~/bin/perf report --gtk
  GTK browser requested but could not find libperf-gtk.so

The library file is in $HOME/lib64 and perf attempts to look it up.
However, printing out dlerror() output shows the following:

  [penberg@localhost hornet]$ ~/bin/perf report --gtk
  /home/penberg/lib64/libperf-gtk.so: undefined symbol: sort_sym
  GTK browser requested but could not find libperf-gtk.so

I'm not sure how this is supposed to work. sort_sym is defined in the
'perf' executable, so how is libperf-gtk.so going to see it?

- Pekka
Re: [PATCH 4/4] kmemcheck: Switch to using kernel disassembler
On 04/14/2014 08:44 PM, Sasha Levin wrote:

kmemcheck used to do its own basic instruction decoding, which is just a
duplication of the work done in arch/x86/lib/insn.c. Instead, switch it
to using the already existing disassembler, and switch the magic opcode
numbers into something meaningful.

Signed-off-by: Sasha Levin

Vegard probably should take a closer look at this but:

Acked-by: Pekka Enberg
[GIT PULL] SLAB changes for v3.15-rc1
Hi Linus, Please pull the latest SLAB tree from: git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux.git slab/next The biggest change is byte-sized freelist indices which reduces slab freelist memory usage: https://lkml.org/lkml/2013/12/2/64 Pekka --> The following changes since commit 38dbfb59d1175ef458d006556061adeaa8751b72: Linus 3.14-rc1 (2014-02-02 16:42:13 -0800) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux.git slab/next for you to fetch changes up to 34bf6ef94a835a8f1d8abd3e7d38c6c08d205867: mm: slab/slub: use page->list consistently instead of page->lru (2014-04-11 10:06:06 +0300) Dave Hansen (1): mm: slab/slub: use page->list consistently instead of page->lru Jianyu Zhan (1): mm/slab.c: cleanup outdated comments and unify variables naming Joe Perches (1): slab: Make allocations with GFP_ZERO slightly more efficient Joonsoo Kim (7): slab: factor out calculate nr objects in cache_estimate slab: introduce helper functions to get/set free object slab: restrict the number of objects in a slab slab: introduce byte sized index for the freelist of a slab slab: make more slab management structure off the slab slub: fix high order page allocation problem with __GFP_NOFAIL slab: fix wrongly used macro include/linux/mm_types.h | 3 +- include/linux/slab.h | 11 +++ mm/slab.c| 183 +++ mm/slob.c| 10 +-- mm/slub.c| 5 +- 5 files changed, 128 insertions(+), 84 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: oops in slab/leaks_show
On 03/11/2014 10:30 AM, Joonsoo Kim wrote:

-8<-
From ff6fe77fb764ca5bf8705bf53d07d38e4111e84c Mon Sep 17 00:00:00 2001
From: Joonsoo Kim
Date: Tue, 11 Mar 2014 14:14:25 +0900
Subject: [PATCH] slab: remove kernel_map_pages() optimization in slab poisoning

If CONFIG_DEBUG_PAGEALLOC is enabled, the slab poisoning functionality
uses kernel_map_pages(), instead of real poisoning, to detect memory
corruption with low overhead. But in that case the slab leak detector
triggers an oops. The reason is that the slab leak detector accesses
all active objects, in particular objects in cpu slab caches, to get
the caller information. These objects have already been unmapped via
kernel_map_pages() to detect memory corruption, so an oops can be
triggered.

Following is the oops message reported by Dave. It blew up when
something tried to read /proc/slab_allocators (just cat it, and you
should see the oops below):

Oops: [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in: fuse hidp snd_seq_dummy tun rfcomm bnep llc2 af_key can_raw ipt_ULOG can_bcm nfnetlink scsi_transport_iscsi nfc caif_socket caif af_802154 phonet af_rxrpc can pppoe pppox ppp_generic slhc irda crc_ccitt rds rose x25 atm netrom appletalk ipx p8023 psnap p8022 llc ax25 cfg80211 xfs coretemp hwmon x86_pkg_temp_thermal kvm_intel kvm crct10dif_pclmul crc32c_intel ghash_clmulni_intel libcrc32c usb_debug microcode snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic pcspkr btusb bluetooth 6lowpan_iphc rfkill snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer e1000e snd ptp shpchp soundcore pps_core serio_raw
CPU: 1 PID: 9386 Comm: trinity-c33 Not tainted 3.14.0-rc5+ #131
task: 8801aa46e890 ti: 880076924000 task.ti: 880076924000
RIP: 0010:[] [] handle_slab+0x8a/0x180
RSP: 0018:880076925de0 EFLAGS: 00010002
RAX: 1000 RBX: RCX: 5ce85ce7
RDX: ea00079be100 RSI: 1000 RDI: 880107458000
RBP: 880076925e18 R08: 0001 R09: R10: R11: 000f
R12: 8801e6f84000 R13: ea00079be100 R14: 880107458000 R15: 88022bb8d2c0
FS: 7fb769e45740() GS:88024d04() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 8801e6f84ff8 CR3: a22db000 CR4: 001407e0
DR0: 02695000 DR1: 02695000 DR2: DR3: DR6: fffe0ff0 DR7: 00070602
Stack:
 8802339dcfc0 88022bb8d2c0 880107458000 88022bb8d2c0
 8802339dd008 8802339dcfc0 ea00079be100 880076925e68
 aa1ad9be 880203fe4f00 88022bb8d318 76925e98
Call Trace:
 [] leaks_show+0xce/0x240
 [] seq_read+0x28e/0x490
 [] proc_reg_read+0x3d/0x80
 [] vfs_read+0x9b/0x160
 [] SyS_read+0x58/0xb0
 [] tracesys+0xd4/0xd9
Code: f5 00 00 00 0f 1f 44 00 00 48 63 c8 44 3b 0c 8a 0f 84 e3 00 00 00 83 c0 01 44 39 c0 72 eb 41 f6 47 1a 01 0f 84 e9 00 00 00 89 f0 <4d> 8b 4c 04 f8 4d 85 c9 0f 84 88 00 00 00 49 8b 7e 08 4d 8d 46
RIP [] handle_slab+0x8a/0x180
RSP
CR2: 8801e6f84ff8

There are two solutions to fix the problem. One is to disable
CONFIG_DEBUG_SLAB_LEAK if CONFIG_DEBUG_PAGEALLOC=y. The other is to
remove the kernel_map_pages() optimization from slab poisoning. I think
the second one is better, since we keep all the functionality at the
cost of some more overhead. Slab poisoning is already a heavy operation,
so adding more overhead doesn't weaken its value.

Reported-by: Dave Jones
Signed-off-by: Joonsoo Kim

Joonsoo, can you please resend against slab/next? I'm seeing some
rejects here.

- Pekka

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] x86, kmemcheck: Use kstrtoint() instead of sscanf()
On 02/20/2014 12:14 AM, David Rientjes wrote:

Kmemcheck should use the preferred interface for parsing command line
arguments, kstrto*(), rather than sscanf() itself. Use it appropriately.

Signed-off-by: David Rientjes

Acked-by: Pekka Enberg

Andrew, can you pick this up?

---
 arch/x86/mm/kmemcheck/kmemcheck.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/kmemcheck/kmemcheck.c b/arch/x86/mm/kmemcheck/kmemcheck.c
--- a/arch/x86/mm/kmemcheck/kmemcheck.c
+++ b/arch/x86/mm/kmemcheck/kmemcheck.c
@@ -78,10 +78,16 @@ early_initcall(kmemcheck_init);
  */
 static int __init param_kmemcheck(char *str)
 {
+	int val;
+	int ret;
+
 	if (!str)
 		return -EINVAL;
 
-	sscanf(str, "%d", &kmemcheck_enabled);
+	ret = kstrtoint(str, 0, &val);
+	if (ret)
+		return ret;
+	kmemcheck_enabled = val;
 
 	return 0;
 }
Re: [PATCH 02/13] perf ui/gtk: Reuse generic __hpp__fmt() code
On 03/03/2014 03:14 AM, Namhyung Kim wrote:

The __hpp__color_fmt used in the gtk code can be replaced by the
generic code with a small change in the print_fn callback. This is a
preparation for upcoming changes; no functional changes are intended.

Cc: Jiri Olsa
Cc: Pekka Enberg
Signed-off-by: Namhyung Kim

Acked-by: Pekka Enberg
Re: Memory allocator semantics
On Tue, Feb 11, 2014 at 2:14 PM, Paul E. McKenney wrote:
> In contrast, from kfree() to a kmalloc() returning some of the kfree()ed
> memory, I believe the kfree()/kmalloc() implementation must do any needed
> synchronization and ordering. But that is a different set of examples,
> for example, this one:
>
>	CPU 0			CPU 1
>	p->a = 42;		q = kmalloc(...); /* returning p */
>	kfree(p);		q->a = 5;
>				BUG_ON(q->a != 5);
>
> Unlike the situation with (A), (B), and (C), in this case I believe
> that it is kfree()'s and kmalloc()'s responsibility to ensure that
> the BUG_ON() never triggers.
>
> Make sense?

I'm not sure... It's the caller's responsibility not to touch "p" after
it's handed over to kfree() - otherwise that's a "use-after-free" error.
If there's some reordering going on here, I'm tempted to blame the
caller for lack of locking.

Pekka
Re: Memory allocator semantics
Hi Paul,

On Sun, Feb 9, 2014 at 4:00 AM, Paul E. McKenney wrote:
> From what I can see, (A) works by accident, but is kind of useless because
> you allocate and free the memory without touching it. (B) and (C) are the
> lightest touches I could imagine, and as you say, both are bad. So I
> believe that it is reasonable to prohibit (A).
>
> Or is there some use for (A) that I am missing?

So again, there's nothing in (A) that the memory allocator is concerned
about. kmalloc() makes no guarantees whatsoever about the visibility of
"r1" across CPUs. If you're saying that there's an implicit barrier
between kmalloc() and kfree(), that's an unintended side-effect, not a
design decision AFAICT.

Pekka
Re: Memory allocator semantics
Hi Paul,

On 01/02/2014 10:33 PM, Paul E. McKenney wrote:

From what I can see, the Linux-kernel's SLAB, SLOB, and SLUB memory
allocators would deal with the following sort of race:

A.	CPU 0: r1 = kmalloc(...); ACCESS_ONCE(gp) = r1;
	CPU 1: r2 = ACCESS_ONCE(gp); if (r2) kfree(r2);

However, my guess is that this should be considered an accident of the
current implementation rather than a feature. The reason for this is
that I cannot see how you would usefully do (A) above without also
allowing (B) and (C) below, both of which look to me to be quite
destructive:

B.	CPU 0: r1 = kmalloc(...); ACCESS_ONCE(shared_x) = r1;
	CPU 1: r2 = ACCESS_ONCE(shared_x); if (r2) kfree(r2);
	CPU 2: r3 = ACCESS_ONCE(shared_x); if (r3) kfree(r3);

	This results in the memory being on two different freelists.

C.	CPU 0: r1 = kmalloc(...); ACCESS_ONCE(shared_x) = r1;
	CPU 1: r2 = ACCESS_ONCE(shared_x); r2->a = 1; r2->b = 2;
	CPU 2: r3 = ACCESS_ONCE(shared_x); if (r3) kfree(r3);
	CPU 3: r4 = kmalloc(...); r4->s = 3; r4->t = 4;

	This results in the memory being used by two different CPUs,
	each of which believes that it has sole access.

But I thought I should ask the experts. So, am I correct that kernel
hackers are required to avoid "drive-by" kfree()s of kmalloc()ed
memory?

So to be completely honest, I don't understand what the race in (A) is
that concerns the *memory allocator*. I also don't see what the memory
allocator can do about (B) and (C), which look like a double-free and a
use-after-free, respectively, to me. :-)

Pekka
Re: [PATCH v3 5/5] slab: make more slab management structure off the slab
On Fri, Dec 13, 2013 at 9:03 AM, Joonsoo Kim wrote:
> Hello, Pekka.
>
> Below is the updated patch for 5/5 in this series. Now I have acks
> from Christoph for all patches in this series, so could you merge this
> patchset? :) If you want me to resend the whole set with proper acks,
> I will do it with pleasure.

Applied, thanks!
[GIT PULL] SLAB changes for v3.14-rc1
Hi Linus,

Please pull the latest SLAB tree from:

  git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux.git slab/next

It contains random bug fixes that have accumulated in my inbox over the
past few months.

			Pekka

-->
The following changes since commit 6ce4eac1f600b34f2f7f58f9cd8f0503d79e42ae:

  Linux 3.13-rc1 (2013-11-22 11:30:55 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux.git slab/next

for you to fetch changes up to cb8ee1a3d429f8898972c869dd4792afb04e961a:

  mm: Fix warning on make htmldocs caused by slab.c (2014-01-31 13:52:25 +0200)

Dave Hansen (2):
      mm: sl[uo]b: fix misleading comments
      mm: slub: work around unneeded lockdep warning

Li Zefan (1):
      slub: Fix calculation of cpu slabs

Masanari Iida (1):
      mm: Fix warning on make htmldocs caused by slab.c

Peter Zijlstra (1):
      slub: use lockdep_assert_held

Randy Dunlap (1):
      slab.h: remove duplicate kmalloc declaration and fix kernel-doc warnings

Tetsuo Handa (1):
      slub: Fix possible format string bug.

 include/linux/slab.h | 110 +++
 mm/slab.c            |   2 +-
 mm/slub.c            |  56 +++---
 3 files changed, 85 insertions(+), 83 deletions(-)