Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
On Wednesday 03/27 at 00:24 +0900, Masami Hiramatsu wrote:
> On Tue, 26 Mar 2024 14:46:10 +
> Mark Rutland wrote:
>
> > Hi Masami,
> >
> > On Mon, Mar 25, 2024 at 11:56:32AM +0900, Masami Hiramatsu wrote:
> > > Hi Jarkko,
> > >
> > > On Sun, 24 Mar 2024 01:29:08 +0200
> > > Jarkko Sakkinen wrote:
> > >
> > > > Tracing with kprobes while running a monolithic kernel is currently
> > > > impossible due to the kernel module allocator dependency.
> > > >
> > > > Address the issue by allowing architectures to implement module_alloc()
> > > > and module_memfree() independent of the module subsystem. An arch tree
> > > > can signal this by setting HAVE_KPROBES_ALLOC in its Kconfig file.
> > > >
> > > > Realize the feature on RISC-V by separating allocator to module_alloc.c
> > > > and implementing module_memfree().
> > >
> > > Even though, this involves changes in arch-independent part. So it should
> > > be solved by generic way. Did you check Calvin's thread?
> > >
> > > https://lore.kernel.org/all/cover.1709676663.git.jcalvinow...@gmail.com/
> > >
> > > I think, we'd better to introduce `alloc_execmem()`,
> > > CONFIG_HAVE_ALLOC_EXECMEM and CONFIG_ALLOC_EXECMEM at first
> > >
> > >   config HAVE_ALLOC_EXECMEM
> > >         bool
> > >
> > >   config ALLOC_EXECMEM
> > >         bool "Executable trampoline memory allocation"
> > >         depends on MODULES || HAVE_ALLOC_EXECMEM
> > >
> > > And define fallback macro to module_alloc() like this.
> > >
> > > #ifndef CONFIG_HAVE_ALLOC_EXECMEM
> > > #define alloc_execmem(size, gfp)        module_alloc(size)
> > > #endif
> >
> > Please can we *not* do this? I think this is abstracting at the wrong level
> > (as I mentioned on the prior execmem proposals).
> >
> > Different executable allocations can have different requirements. For
> > example, on arm64 modules need to be within 2G of the kernel image, but the
> > kprobes XOL areas can be anywhere in the kernel VA space.
> >
> > Forcing those behind the same interface makes things *harder* for
> > architectures and/or makes the common code more complicated (if that ends
> > up having to track all those different requirements). From my PoV it'd be
> > much better to have separate kprobes_alloc_*() functions for kprobes which
> > an architecture can then choose to implement using a common library if it
> > wants to.
> >
> > I took a look at doing that using the core ifdeffery fixups from Jarkko's
> > v6, and it looks pretty clean to me (and works in testing on arm64):
> >
> >   https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kprobes/without-modules
> >
> > Could we please start with that approach, with kprobe-specific alloc/free
> > code provided by the architecture?

Heh, I also noticed that dead !RWX branch in arm64 patch_map(), I was
about to send a patch to remove it.

> OK, as far as I can read the code, this method also works and neat!
> (and minimum intrusion). I actually found that exposing CONFIG_ALLOC_EXECMEM
> to user does not help, it should be an internal change. So hiding this change
> from user is better choice. Then there is no reason to introduce the new
> alloc_execmem, but just expand kprobe_alloc_insn_page() is reasonable.

I'm happy with this, it solves the first half of my problem. But I want
eBPF to work in the !MODULES case too.

I think Mark's approach can work for bpf as well, without needing to
touch module_alloc() at all? So I might be able to drop that first patch
entirely.

https://lore.kernel.org/all/a6b162aed1e6fea7f565ef9dd0204d6f2284bcce.1709676663.git.jcalvinow...@gmail.com/

Thanks,
Calvin

> Mark, can you send this series here, so that others can review/test it?
>
> Thank you!
>
> >
> > Thanks,
> > Mark.
>
> --
> Masami Hiramatsu (Google)
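Mark's point that different executable allocations have different placement requirements can be made concrete with a small userspace sketch. Only the two constraints come from the message above (arm64 modules within 2G of the kernel image, kprobes XOL areas anywhere in the kernel VA space); `struct exec_range`, `module_range()`, `kprobes_range()`, `range_allows()` and the addresses used to exercise them are hypothetical names and values invented for illustration, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_2G 0x80000000ULL

/* Hypothetical descriptor: where one kind of executable allocation may live. */
struct exec_range {
	uint64_t start;	/* lowest permitted address */
	uint64_t end;	/* one past the highest permitted address */
};

/* Modules: every byte must be within 2G of every byte of the kernel image. */
static struct exec_range module_range(uint64_t image_start, uint64_t image_end)
{
	struct exec_range r;

	r.start = image_end > SZ_2G ? image_end - SZ_2G : 0;
	r.end = image_start > UINT64_MAX - SZ_2G ? UINT64_MAX
						 : image_start + SZ_2G;
	return r;
}

/* kprobes XOL slots: anywhere inside the (assumed) kernel VA window. */
static struct exec_range kprobes_range(uint64_t va_start, uint64_t va_end)
{
	struct exec_range r = { va_start, va_end };

	return r;
}

/* True if [addr, addr + size) fits in the range; written to avoid overflow. */
static bool range_allows(struct exec_range r, uint64_t addr, uint64_t size)
{
	return addr >= r.start &&
	       size <= r.end - r.start &&
	       addr <= r.end - size;
}
```

With a made-up image at 0xffff800080000000, an address 256M above the image is acceptable for a module, while an address far away in the VA space is acceptable only for a kprobes slot. A single shared allocator interface would have to carry both descriptors, which is the extra tracking Mark objects to.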
Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
On Monday 03/25 at 11:56 +0900, Masami Hiramatsu wrote:
> Hi Jarkko,
>
> On Sun, 24 Mar 2024 01:29:08 +0200
> Jarkko Sakkinen wrote:
>
> > Tracing with kprobes while running a monolithic kernel is currently
> > impossible due to the kernel module allocator dependency.
> >
> > Address the issue by allowing architectures to implement module_alloc()
> > and module_memfree() independent of the module subsystem. An arch tree
> > can signal this by setting HAVE_KPROBES_ALLOC in its Kconfig file.
> >
> > Realize the feature on RISC-V by separating allocator to module_alloc.c
> > and implementing module_memfree().
>
> Even though, this involves changes in arch-independent part. So it should
> be solved by generic way. Did you check Calvin's thread?
>
> https://lore.kernel.org/all/cover.1709676663.git.jcalvinow...@gmail.com/

FYI, I should have v2 of that series out later this week.

Thanks,
Calvin

> I think, we'd better to introduce `alloc_execmem()`,
> CONFIG_HAVE_ALLOC_EXECMEM and CONFIG_ALLOC_EXECMEM at first
>
>   config HAVE_ALLOC_EXECMEM
>         bool
>
>   config ALLOC_EXECMEM
>         bool "Executable trampoline memory allocation"
>         depends on MODULES || HAVE_ALLOC_EXECMEM
>
> And define fallback macro to module_alloc() like this.
>
> #ifndef CONFIG_HAVE_ALLOC_EXECMEM
> #define alloc_execmem(size, gfp)        module_alloc(size)
> #endif
>
> Then, introduce a new dependency to kprobes
>
>   config KPROBES
>         bool "Kprobes"
>         select ALLOC_EXECMEM
>
> and update kprobes to use alloc_execmem and remove module related
> code from it.
>
> You also should consider using IS_ENABLED(CONFIG_MODULES) in the code to
> avoid using #ifdefs.
>
> Finally, you can add RISCV implementation patch of HAVE_ALLOC_EXECMEM in the
> next patch.
>
> Thank you,
>
> >
> > Link: https://www.sochub.fi # for power on testing new SoC's with a minimal stack
> > Link: https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ # continuation
> > Signed-off-by: Jarkko Sakkinen
> > ---
> > v2:
> > - Better late than never right? :-)
> > - Focus only to RISC-V for now to make the patch more digestible. This
> >   is the arch where I use the patch on a daily basis to help with QA.
> > - Introduce HAVE_KPROBES_ALLOC flag to help with more gradual migration.
> > ---
> >  arch/Kconfig                     |  8 +++-
> >  arch/riscv/Kconfig               |  1 +
> >  arch/riscv/kernel/Makefile       |  5 +
> >  arch/riscv/kernel/module.c       | 11 ---
> >  arch/riscv/kernel/module_alloc.c | 28
> >  kernel/kprobes.c                 | 10 ++
> >  kernel/trace/trace_kprobe.c      | 18 --
> >  7 files changed, 67 insertions(+), 14 deletions(-)
> >  create mode 100644 arch/riscv/kernel/module_alloc.c
> >
> > diff --git a/arch/Kconfig b/arch/Kconfig
> > index a5af0edd3eb8..c931f1de98a7 100644
> > --- a/arch/Kconfig
> > +++ b/arch/Kconfig
> > @@ -52,7 +52,7 @@ config GENERIC_ENTRY
> >
> >  config KPROBES
> >         bool "Kprobes"
> > -       depends on MODULES
> > +       depends on MODULES || HAVE_KPROBES_ALLOC
> >         depends on HAVE_KPROBES
> >         select KALLSYMS
> >         select TASKS_RCU if PREEMPTION
> > @@ -215,6 +215,12 @@ config HAVE_OPTPROBES
> >  config HAVE_KPROBES_ON_FTRACE
> >         bool
> >
> > +config HAVE_KPROBES_ALLOC
> > +       bool
> > +       help
> > +         Architectures that select this option are capable of allocating memory
> > +         for kprobes without the kernel module allocator.
> > +
> >  config ARCH_CORRECT_STACKTRACE_ON_KRETPROBE
> >         bool
> >         help
> > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> > index e3142ce531a0..4f1b925e83d8 100644
> > --- a/arch/riscv/Kconfig
> > +++ b/arch/riscv/Kconfig
> > @@ -132,6 +132,7 @@ config RISCV
> >         select HAVE_KPROBES if !XIP_KERNEL
> >         select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL
> >         select HAVE_KRETPROBES if !XIP_KERNEL
> > +       select HAVE_KPROBES_ALLOC if !XIP_KERNEL
> >         # https://github.com/ClangBuiltLinux/linux/issues/1881
> >         select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD
> >         select HAVE_MOVE_PMD
> > diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> > index 604d6bf7e476..46318194bce1 100644
> > --- a/arch/riscv/kernel/Makefile
> > +++ b/arch/riscv/kernel/Makefile
> > @@ -73,6 +73,11 @@ obj-$(CONFIG_SMP) += cpu_ops.o
> >
> >  obj-$(CONFIG_RISCV_BOOT_SPINWAIT) += cpu_ops_spinwait.o
> >  obj-$(CONFIG_MODULES) += module.o
> > +ifeq ($(CONFIG_MODULES),y)
> > +obj-y += module_alloc.o
> > +else
> > +obj-$(CONFIG_KPROBES) += module_alloc.o
> > +endif
> >  obj-$(CONFIG_MODULE_SECTIONS) += module-sections.o
> >
> >  obj-$(CONFIG_CPU_PM) += suspend_entry.o suspend.o
> > diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
> > index 5e5a82644451..cc324b450f2e 100644
> > --- a/arch/riscv/kernel/module.c
> >
Re: [RFC][PATCH 2/4] bpf: Allow BPF_JIT with CONFIG_MODULES=n
On Thursday 03/07 at 22:09 +, Christophe Leroy wrote:
>
>
> On 06/03/2024 at 21:05, Calvin Owens wrote:
> > [You don't often get email from jcalvinow...@gmail.com. Learn why this
> > is important at https://aka.ms/LearnAboutSenderIdentification ]
> >
> > No BPF code has to change, except in struct_ops (for module refs).
> >
> > This conflicts with bpf-next because of this (relevant) series:
> >
> >     https://lore.kernel.org/all/20240119225005.668602-1-thinker...@gmail.com/
> >
> > If something like this is merged down the road, it can go through
> > bpf-next at leisure once the module_alloc change is in: it's a one-way
> > dependency.
> >
> > Signed-off-by: Calvin Owens
> > ---
> >  kernel/bpf/Kconfig          |  2 +-
> >  kernel/bpf/bpf_struct_ops.c | 28
> >  2 files changed, 25 insertions(+), 5 deletions(-)
> >
> > diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig
> > index 6a906ff93006..77df483a8925 100644
> > --- a/kernel/bpf/Kconfig
> > +++ b/kernel/bpf/Kconfig
> > @@ -42,7 +42,7 @@ config BPF_JIT
> >         bool "Enable BPF Just In Time compiler"
> >         depends on BPF
> >         depends on HAVE_CBPF_JIT || HAVE_EBPF_JIT
> > -       depends on MODULES
> > +       select MODULE_ALLOC
> >         help
> >           BPF programs are normally handled by a BPF interpreter. This option
> >           allows the kernel to generate native code when a program is loaded
> > diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> > index 02068bd0e4d9..fbf08a1bb00c 100644
> > --- a/kernel/bpf/bpf_struct_ops.c
> > +++ b/kernel/bpf/bpf_struct_ops.c
> > @@ -108,11 +108,30 @@ const struct bpf_prog_ops bpf_struct_ops_prog_ops = {
> >  #endif
> >  };
> >
> > +#if IS_ENABLED(CONFIG_MODULES)
>
> Can you avoid ifdefs as much as possible ?

Similar to the other one, this was just a misguided attempt to avoid
triggering -Wunused, I'll clean it up.

This particular patch will look very different when rebased on bpf-next.

> >  static const struct btf_type *module_type;
> >
> > +static int bpf_struct_module_type_init(struct btf *btf)
> > +{
> > +       s32 module_id;
>
> Could be:
>
>         if (!IS_ENABLED(CONFIG_MODULES))
>                 return 0;
>
> > +
> > +       module_id = btf_find_by_name_kind(btf, "module", BTF_KIND_STRUCT);
> > +       if (module_id < 0)
> > +               return 1;
> > +
> > +       module_type = btf_type_by_id(btf, module_id);
> > +       return 0;
> > +}
> > +#else
> > +static int bpf_struct_module_type_init(struct btf *btf)
> > +{
> > +       return 0;
> > +}
> > +#endif
> > +
> >  void bpf_struct_ops_init(struct btf *btf, struct bpf_verifier_log *log)
> >  {
> > -       s32 type_id, value_id, module_id;
> > +       s32 type_id, value_id;
> >         const struct btf_member *member;
> >         struct bpf_struct_ops *st_ops;
> >         const struct btf_type *t;
> > @@ -125,12 +144,10 @@ void bpf_struct_ops_init(struct btf *btf, struct bpf_verifier_log *log)
> >  #include "bpf_struct_ops_types.h"
> >  #undef BPF_STRUCT_OPS_TYPE
> >
> > -       module_id = btf_find_by_name_kind(btf, "module", BTF_KIND_STRUCT);
> > -       if (module_id < 0) {
> > +       if (bpf_struct_module_type_init(btf)) {
> >                 pr_warn("Cannot find struct module in btf_vmlinux\n");
> >                 return;
> >         }
> > -       module_type = btf_type_by_id(btf, module_id);
> >
> >         for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
> >                 st_ops = bpf_struct_ops[i];
> > @@ -433,12 +450,15 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
> >
> >                 moff = __btf_member_bit_offset(t, member) / 8;
> >                 ptype = btf_type_resolve_ptr(btf_vmlinux, member->type, NULL);
> > +
> > +#if IS_ENABLED(CONFIG_MODULES)
>
> Can't see anything depending on CONFIG_MODULES here, can you instead do:
>
>         if (IS_ENABLED(CONFIG_MODULES) && ptype == module_type) {
>
> >                 if (ptype == module_type) {
> >                         if (*(void **)(udata + moff))
> >                                 goto reset_unlock;
> >                         *(void **)(kdata + moff) = BPF_MODULE_OWNER;
> >                         continue;
> >                 }
> > +#endif
> >
> >                 err = st_ops->init_member(t, member, kdata, udata);
> >                 if (err < 0)
> > --
> > 2.43.0
> >
> >
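Christophe's suggestion to replace the `#if`/`#else` pair with an ordinary `if (!IS_ENABLED(...))` can be tried outside the kernel with the userspace sketch below. It re-creates the preprocessor trick behind `IS_ENABLED()` in simplified form (the real macro also treats `=m` options as enabled, which is ignored here). The `DEMO_*` helpers, the `CONFIG_DEMO_*` options and `demo_find_module_type()` are made-up stand-ins, not the code from the patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified userspace copy of the idea behind the kernel's IS_ENABLED():
 * expands to 1 if the option is defined as 1, and to 0 if it is undefined.
 */
#define DEMO_ARG_PLACEHOLDER_1 0,
#define DEMO_TAKE_SECOND(ignored, val, ...) val
#define DEMO_IS_DEFINED3(arg1_or_junk) DEMO_TAKE_SECOND(arg1_or_junk 1, 0, 0)
#define DEMO_IS_DEFINED2(val) DEMO_IS_DEFINED3(DEMO_ARG_PLACEHOLDER_##val)
#define DEMO_IS_DEFINED(x) DEMO_IS_DEFINED2(x)
#define IS_ENABLED(option) DEMO_IS_DEFINED(option)

#define CONFIG_DEMO_BPF 1	/* stands in for an enabled option */
/* CONFIG_DEMO_MODULES is deliberately left undefined, i.e. "=n". */

static const char *module_type;	/* stand-in for the btf_type pointer */

/* Hypothetical lookup; the patch calls btf_find_by_name_kind() here. */
static int demo_find_module_type(const char **out)
{
	*out = "struct module";
	return 0;
}

/* One function body for both configurations, no #if/#else pair. */
static int demo_module_type_init(void)
{
	if (!IS_ENABLED(CONFIG_DEMO_MODULES))
		return 0;	/* constant condition: the rest is dead code */

	return demo_find_module_type(&module_type);
}
```

Because the condition is a compile-time constant, the compiler still parses and type-checks the disabled branch before discarding it, and nothing is reported as unused, which is the -Wunused concern Calvin mentions.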
Re: [RFC][PATCH 3/4] kprobes: Allow kprobes with CONFIG_MODULES=n
On Thursday 03/07 at 22:16 +, Christophe Leroy wrote:
>
>
> On 06/03/2024 at 21:05, Calvin Owens wrote:
> > [You don't often get email from jcalvinow...@gmail.com. Learn why this
> > is important at https://aka.ms/LearnAboutSenderIdentification ]
> >
> > If something like this is merged down the road, it can go in at leisure
> > once the module_alloc change is in: it's a one-way dependency.
>
> Too many #ifdef, please reorganise stuff to avoid that and avoid
> changing prototypes based on CONFIG_MODULES.
>
> Other few comments below.

TBH the ugliness here was just me trying not to trigger -Wunused, but
that was silly: as you point out below, it's unnecessary. I'll clean it
up.

> >
> > Signed-off-by: Calvin Owens
> > ---
> >  arch/Kconfig                |  2 +-
> >  kernel/kprobes.c            | 22 ++
> >  kernel/trace/trace_kprobe.c | 11 +++
> >  3 files changed, 34 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/Kconfig b/arch/Kconfig
> > index cfc24ced16dd..e60ce984d095 100644
> > --- a/arch/Kconfig
> > +++ b/arch/Kconfig
> > @@ -52,8 +52,8 @@ config GENERIC_ENTRY
> >
> >  config KPROBES
> >         bool "Kprobes"
> > -       depends on MODULES
> >         depends on HAVE_KPROBES
> > +       select MODULE_ALLOC
> >         select KALLSYMS
> >         select TASKS_RCU if PREEMPTION
> >         help
> > diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> > index 9d9095e81792..194270e17d57 100644
> > --- a/kernel/kprobes.c
> > +++ b/kernel/kprobes.c
> > @@ -1556,8 +1556,12 @@ static bool is_cfi_preamble_symbol(unsigned long addr)
> >                 str_has_prefix("__pfx_", symbuf);
> >  }
> >
> > +#if IS_ENABLED(CONFIG_MODULES)
> >  static int check_kprobe_address_safe(struct kprobe *p,
> >                                      struct module **probed_mod)
> > +#else
> > +static int check_kprobe_address_safe(struct kprobe *p)
> > +#endif
>
> A bit ugly to have to change the prototype, why not just keep probed_mod
> at all time ?
>
> When CONFIG_MODULES is not selected, __module_text_address() returns
> NULL so it should work without that many #ifdefs.
>
> >  {
> >         int ret;
> >
> > @@ -1580,6 +1584,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
> >                 goto out;
> >         }
> >
> > +#if IS_ENABLED(CONFIG_MODULES)
> >         /* Check if 'p' is probing a module. */
> >         *probed_mod = __module_text_address((unsigned long) p->addr);
> >         if (*probed_mod) {
> > @@ -1603,6 +1608,8 @@ static int check_kprobe_address_safe(struct kprobe *p,
> >                         ret = -ENOENT;
> >                 }
> >         }
> > +#endif
> > +
> >  out:
> >         preempt_enable();
> >         jump_label_unlock();
> > @@ -1614,7 +1621,9 @@ int register_kprobe(struct kprobe *p)
> >  {
> >         int ret;
> >         struct kprobe *old_p;
> > +#if IS_ENABLED(CONFIG_MODULES)
> >         struct module *probed_mod;
> > +#endif
> >         kprobe_opcode_t *addr;
> >         bool on_func_entry;
> >
> > @@ -1633,7 +1642,11 @@ int register_kprobe(struct kprobe *p)
> >         p->nmissed = 0;
> >         INIT_LIST_HEAD(&p->list);
> >
> > +#if IS_ENABLED(CONFIG_MODULES)
> >         ret = check_kprobe_address_safe(p, &probed_mod);
> > +#else
> > +       ret = check_kprobe_address_safe(p);
> > +#endif
> >         if (ret)
> >                 return ret;
> >
> > @@ -1676,8 +1689,10 @@ int register_kprobe(struct kprobe *p)
> >  out:
> >         mutex_unlock(&kprobe_mutex);
> >
> > +#if IS_ENABLED(CONFIG_MODULES)
> >         if (probed_mod)
> >                 module_put(probed_mod);
> > +#endif
> >
> >         return ret;
> >  }
> > @@ -2482,6 +2497,7 @@ int kprobe_add_area_blacklist(unsigned long start, unsigned long end)
> >         return 0;
> >  }
> >
> > +#if IS_ENABLED(CONFIG_MODULES)
> >  /* Remove all symbols in given area from kprobe blacklist */
> >  static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end)
> >  {
> > @@ -2499,6 +2515,7 @@ static void kprobe_remove_ksym_blacklist(unsigned long entry)
> >  {
> >         kprobe_remove_area_blacklist(entry, entry + 1);
> >  }
>
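Christophe's remark that the prototype can stay the same, because the module lookup returns NULL when modules are disabled, amounts to the usual "stub in the header" pattern. The userspace sketch below shows the shape of it; every `demo_*` / `DEMO_*` name is hypothetical, and the reference counting merely stands in for the real module get/put calls.

```c
#include <assert.h>
#include <stddef.h>

struct module { int refcount; };

/* Define DEMO_CONFIG_MODULES to model CONFIG_MODULES=y. */
#ifdef DEMO_CONFIG_MODULES
/* The real lookup would live in the module code. */
struct module *demo_module_text_address(unsigned long addr);
#else
/* Stub: with no module support, no address can belong to a module. */
static inline struct module *demo_module_text_address(unsigned long addr)
{
	(void)addr;
	return NULL;
}
#endif

/* Same prototype in both configurations. */
static int demo_check_address_safe(unsigned long addr,
				   struct module **probed_mod)
{
	*probed_mod = demo_module_text_address(addr);
	if (*probed_mod)
		(*probed_mod)->refcount++;	/* stands in for taking a module ref */
	return 0;
}

static int demo_register(unsigned long addr)
{
	struct module *probed_mod;
	int ret = demo_check_address_safe(addr, &probed_mod);

	if (probed_mod)
		probed_mod->refcount--;		/* stands in for module_put() */
	return ret;
}
```

The callers contain no conditional compilation at all; the only `#ifdef` sits next to the stub, and the NULL checks that the code already had make the module paths fall away when modules are off.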
Re: [RFC][PATCH 3/4] kprobes: Allow kprobes with CONFIG_MODULES=n
On Friday 03/08 at 11:46 +0900, Masami Hiramatsu wrote:
> On Wed, 6 Mar 2024 12:05:10 -0800
> Calvin Owens wrote:
>
> > If something like this is merged down the road, it can go in at leisure
> > once the module_alloc change is in: it's a one-way dependency.
> >
> > Signed-off-by: Calvin Owens
> > ---
> >  arch/Kconfig                |  2 +-
> >  kernel/kprobes.c            | 22 ++
> >  kernel/trace/trace_kprobe.c | 11 +++
> >  3 files changed, 34 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/Kconfig b/arch/Kconfig
> > index cfc24ced16dd..e60ce984d095 100644
> > --- a/arch/Kconfig
> > +++ b/arch/Kconfig
> > @@ -52,8 +52,8 @@ config GENERIC_ENTRY
> >
> >  config KPROBES
> >         bool "Kprobes"
> > -       depends on MODULES
> >         depends on HAVE_KPROBES
> > +       select MODULE_ALLOC
>
> OK, if we use EXEC_ALLOC,
>
> config EXEC_ALLOC
>         depends on HAVE_EXEC_ALLOC
>
> And
>
>   config KPROBES
>         bool "Kprobes"
>         depends on MODULES || EXEC_ALLOC
>         select EXEC_ALLOC if HAVE_EXEC_ALLOC
>
> then kprobes can be enabled either modules supported or exec_alloc is
> supported.
> (new arch does not need to implement exec_alloc)
>
> Maybe we also need something like
>
> #ifdef CONFIG_EXEC_ALLOC
> #define module_alloc(size) exec_alloc(size)
> #endif
>
> in kprobes.h, or just add `replacing module_alloc with exec_alloc` patch.
>
> Thank you,

The example was helpful, thanks. I see what you mean with
HAVE_EXEC_ALLOC, I'll implement it like that in the next version.

> >         select KALLSYMS
> >         select TASKS_RCU if PREEMPTION
> >         help
> > diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> > index 9d9095e81792..194270e17d57 100644
> > --- a/kernel/kprobes.c
> > +++ b/kernel/kprobes.c
> > @@ -1556,8 +1556,12 @@ static bool is_cfi_preamble_symbol(unsigned long addr)
> >                 str_has_prefix("__pfx_", symbuf);
> >  }
> >
> > +#if IS_ENABLED(CONFIG_MODULES)
> >  static int check_kprobe_address_safe(struct kprobe *p,
> >                                      struct module **probed_mod)
> > +#else
> > +static int check_kprobe_address_safe(struct kprobe *p)
> > +#endif
> >  {
> >         int ret;
> >
> > @@ -1580,6 +1584,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
> >                 goto out;
> >         }
> >
> > +#if IS_ENABLED(CONFIG_MODULES)
> >         /* Check if 'p' is probing a module. */
> >         *probed_mod = __module_text_address((unsigned long) p->addr);
> >         if (*probed_mod) {
> > @@ -1603,6 +1608,8 @@ static int check_kprobe_address_safe(struct kprobe *p,
> >                         ret = -ENOENT;
> >                 }
> >         }
> > +#endif
> > +
> >  out:
> >         preempt_enable();
> >         jump_label_unlock();
> > @@ -1614,7 +1621,9 @@ int register_kprobe(struct kprobe *p)
> >  {
> >         int ret;
> >         struct kprobe *old_p;
> > +#if IS_ENABLED(CONFIG_MODULES)
> >         struct module *probed_mod;
> > +#endif
> >         kprobe_opcode_t *addr;
> >         bool on_func_entry;
> >
> > @@ -1633,7 +1642,11 @@ int register_kprobe(struct kprobe *p)
> >         p->nmissed = 0;
> >         INIT_LIST_HEAD(&p->list);
> >
> > +#if IS_ENABLED(CONFIG_MODULES)
> >         ret = check_kprobe_address_safe(p, &probed_mod);
> > +#else
> > +       ret = check_kprobe_address_safe(p);
> > +#endif
> >         if (ret)
> >                 return ret;
> >
> > @@ -1676,8 +1689,10 @@ int register_kprobe(struct kprobe *p)
> >  out:
> >         mutex_unlock(&kprobe_mutex);
> >
> > +#if IS_ENABLED(CONFIG_MODULES)
> >         if (probed_mod)
> >                 module_put(probed_mod);
> > +#endif
> >
> >         return ret;
> >  }
> > @@ -2482,6 +2497,7 @@ int kprobe_add_area_blacklist(unsigned long start, unsigned long end)
> >         return 0;
> >  }
> >
> > +#if IS_ENABLED(CONFIG_MODULES)
> >  /* Remove all symbols in given area from kprobe blacklist */
> >  static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end)
> >  {
> > @@ -2499,6 +2515,7 @@ static void kprobe_remove_ksym_blacklist(unsigned long entry)
> >  {
> >         kprobe_remove_area_blacklist(entry, entry + 1);
> >  }
> > +#endif
> >
> >  int __weak arch_kprobe_get_kallsym(unsigned int *symnum, unsigned long *value, &
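Masami's fallback-macro idea from this message (`#define module_alloc(size) exec_alloc(size)` when EXEC_ALLOC is enabled) can be modelled in userspace as below. This is only a sketch of the aliasing mechanism: `DEMO_CONFIG_EXEC_ALLOC`, the call counter and `demo_alloc_insn_page()` are invented for illustration, and `malloc()` stands in for a real executable-memory allocator.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_CONFIG_EXEC_ALLOC 1	/* models CONFIG_EXEC_ALLOC=y */

static unsigned int exec_alloc_calls;

/* Hypothetical arch-provided allocator; malloc() is a placeholder. */
static void *exec_alloc(size_t size)
{
	exec_alloc_calls++;
	return malloc(size);
}

#ifdef DEMO_CONFIG_EXEC_ALLOC
/* Existing callers keep spelling it module_alloc() but reach exec_alloc(). */
#define module_alloc(size) exec_alloc(size)
#endif

/* A caller that was written against the old name and is left untouched. */
static void *demo_alloc_insn_page(void)
{
	return module_alloc(4096);
}
```

The alias keeps the first patch small, at the cost of leaving a module-flavoured name in code that no longer depends on modules; the alternative Masami mentions is a separate patch that replaces the name outright.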
Re: [RFC][PATCH 1/4] module: mm: Make module_alloc() generally available
On Thursday 03/07 at 14:43 +, Christophe Leroy wrote:
> Hi Calvin,
>
> On 06/03/2024 at 21:05, Calvin Owens wrote:
> > [You don't often get email from jcalvinow...@gmail.com. Learn why this
> > is important at https://aka.ms/LearnAboutSenderIdentification ]
> >
> > Both BPF_JIT and KPROBES depend on CONFIG_MODULES, but only require
> > module_alloc() itself, which can be easily separated into a standalone
> > allocator for executable kernel memory.
>
> Easily maybe, but not as easily as you think, see below.
>
> >
> > Thomas Gleixner sent a patch to do that for x86 as part of a larger
> > series a couple years ago:
> >
> >     https://lore.kernel.org/all/20220716230953.442937...@linutronix.de/
> >
> > I've simply extended that approach to the whole kernel.
> >
> > Signed-off-by: Calvin Owens
> > ---
> >  arch/Kconfig                     |   2 +-
> >  arch/arm/kernel/module.c         |  35 -
> >  arch/arm/mm/Makefile             |   2 +
> >  arch/arm/mm/module_alloc.c       |  40 ++
> >  arch/arm64/kernel/module.c       | 127 --
> >  arch/arm64/mm/Makefile           |   1 +
> >  arch/arm64/mm/module_alloc.c     | 130 +++
> >  arch/loongarch/kernel/module.c   |   6 --
> >  arch/loongarch/mm/Makefile       |   2 +
> >  arch/loongarch/mm/module_alloc.c |  10 +++
> >  arch/mips/kernel/module.c        |  10 ---
> >  arch/mips/mm/Makefile            |   2 +
> >  arch/mips/mm/module_alloc.c      |  13
> >  arch/nios2/kernel/module.c       |  20 -
> >  arch/nios2/mm/Makefile           |   2 +
> >  arch/nios2/mm/module_alloc.c     |  22 ++
> >  arch/parisc/kernel/module.c      |  12 ---
> >  arch/parisc/mm/Makefile          |   1 +
> >  arch/parisc/mm/module_alloc.c    |  15
> >  arch/powerpc/kernel/module.c     |  36 -
> >  arch/powerpc/mm/Makefile         |   1 +
> >  arch/powerpc/mm/module_alloc.c   |  41 ++
>
> Missing several powerpc changes to make it work. You must audit every
> use of CONFIG_MODULES inside powerpc. Here are a few examples:
>
> Function get_patch_pfn() to enable text code patching.
>
> arch/powerpc/Kconfig : select KASAN_VMALLOC if KASAN && MODULES
>
> arch/powerpc/include/asm/kasan.h:
>
> #if defined(CONFIG_MODULES) && defined(CONFIG_PPC32)
> #define KASAN_KERN_START        ALIGN_DOWN(PAGE_OFFSET - SZ_256M, SZ_256M)
> #else
> #define KASAN_KERN_START        PAGE_OFFSET
> #endif
>
> arch/powerpc/kernel/head_8xx.S and arch/powerpc/kernel/head_book3s_32.S:
> InstructionTLBMiss interrupt handler must know that there is executable
> kernel text outside kernel core.
>
> Function is_module_segment() to identify segments used for module text
> and set NX (NoExec) MMU flag on non-module segments.

Thanks Christophe, I'll fix that up.

I'm sure there are many other issues like this in the arch stuff here,
I'm going to run them all through QEMU to catch everything I can before
the next respin.

> >  arch/riscv/kernel/module.c       |  11 ---
> >  arch/riscv/mm/Makefile           |   1 +
> >  arch/riscv/mm/module_alloc.c     |  17
> >  arch/s390/kernel/module.c        |  37 -
> >  arch/s390/mm/Makefile            |   1 +
> >  arch/s390/mm/module_alloc.c      |  42 ++
> >  arch/sparc/kernel/module.c       |  31
> >  arch/sparc/mm/Makefile           |   2 +
> >  arch/sparc/mm/module_alloc.c     |  31
> >  arch/x86/kernel/ftrace.c         |   2 +-
> >  arch/x86/kernel/module.c         |  56 -
> >  arch/x86/mm/Makefile             |   2 +
> >  arch/x86/mm/module_alloc.c       |  59 ++
> >  fs/proc/kcore.c                  |   2 +-
> >  kernel/module/Kconfig            |   1 +
> >  kernel/module/main.c             |  17
> >  mm/Kconfig                       |   3 +
> >  mm/Makefile                      |   1 +
> >  mm/module_alloc.c                |  21 +
> >  mm/vmalloc.c                     |   2 +-
> >  42 files changed, 467 insertions(+), 402 deletions(-)
>
> ...
>
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index ffc3a2ba3a8c..92bfb5ae2e95 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -1261,6 +1261,9 @@ config LOCK_MM_AND_FIND_VMA
> >  config IOMMU_MM_DATA
> >         bool
> >
> > +config MODULE_ALLOC
> > +       def_bool n
> > +
>
> I'd call it something else than CONFIG_MODULE_ALLOC as you want to use
> it when CONFIG_MODULES is not selected.
>
> Something like CONFIG_EXECMEM_ALLOC or CONFIG_DYNAMIC_EXECMEM ?
>
>
>
> Christophe
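To see why the `KASAN_KERN_START` definition Christophe quotes matters, it helps to work the arithmetic once. The sketch below assumes a PPC32 `PAGE_OFFSET` of 0xc0000000 purely as an example value (the thread does not state one), models the two preprocessor conditions with a plain flag, and uses a power-of-two round-down that gives the same result as the kernel's `ALIGN_DOWN()` for such alignments. `demo_kasan_kern_start()` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_256M 0x10000000u
/* Round x down to a multiple of a; valid for power-of-two a only. */
#define ALIGN_DOWN(x, a) ((x) & ~((a) - 1u))

/*
 * modules_on_ppc32 models "CONFIG_MODULES && CONFIG_PPC32": when set,
 * the covered region starts 256M below PAGE_OFFSET so that the area
 * used for module text is included.
 */
static uint32_t demo_kasan_kern_start(uint32_t page_offset,
				      int modules_on_ppc32)
{
	if (modules_on_ppc32)
		return ALIGN_DOWN(page_offset - SZ_256M, SZ_256M);
	return page_offset;
}
```

With the assumed value, 0xc0000000 - 0x10000000 = 0xb0000000, which is already 256M-aligned, so the start moves from 0xc0000000 down to 0xb0000000 when modules are enabled. If executable memory is still allocated from that window while CONFIG_MODULES=n, the `#else` branch applies and the window is left out, which is one of the places Christophe says must be audited.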
Re: [RFC][PATCH 1/4] module: mm: Make module_alloc() generally available
On Friday 03/08 at 11:16 +0900, Masami Hiramatsu wrote:
> Hi Calvin,
>
> On Wed, 6 Mar 2024 12:05:08 -0800
> Calvin Owens wrote:
>
> > Both BPF_JIT and KPROBES depend on CONFIG_MODULES, but only require
> > module_alloc() itself, which can be easily separated into a standalone
> > allocator for executable kernel memory.
>
> Thanks for your work!
> As Luis pointed, it is better to use different name because this
> is not only for modules and it does not depend on CONFIG_MODULES.
>
> >
> > Thomas Gleixner sent a patch to do that for x86 as part of a larger
> > series a couple years ago:
> >
> >     https://lore.kernel.org/all/20220716230953.442937...@linutronix.de/
> >
> > I've simply extended that approach to the whole kernel.
>
> I would like to see a series of patches for each architecture so that
> architecture maintainers carefully check and test this feature.
>
> What about introducing CONFIG_HAVE_EXEC_ALLOC and enable it on
> each architecture? Then you can start small set of major architectures
> and expand it later.

Thanks Masami.

That makes sense to me, I'll do it. I'm also working on getting the
other architectures running in QEMU, so hopefully I'll be able to iron
out more of the arch problems on my own before the next respin.

> Thank you,
>
> >
> > Signed-off-by: Calvin Owens
> > ---
> >  arch/Kconfig                     |   2 +-
> >  arch/arm/kernel/module.c         |  35 -
> >  arch/arm/mm/Makefile             |   2 +
> >  arch/arm/mm/module_alloc.c       |  40 ++
> >  arch/arm64/kernel/module.c       | 127 --
> >  arch/arm64/mm/Makefile           |   1 +
> >  arch/arm64/mm/module_alloc.c     | 130 +++
> >  arch/loongarch/kernel/module.c   |   6 --
> >  arch/loongarch/mm/Makefile       |   2 +
> >  arch/loongarch/mm/module_alloc.c |  10 +++
> >  arch/mips/kernel/module.c        |  10 ---
> >  arch/mips/mm/Makefile            |   2 +
> >  arch/mips/mm/module_alloc.c      |  13
> >  arch/nios2/kernel/module.c       |  20 -
> >  arch/nios2/mm/Makefile           |   2 +
> >  arch/nios2/mm/module_alloc.c     |  22 ++
> >  arch/parisc/kernel/module.c      |  12 ---
> >  arch/parisc/mm/Makefile          |   1 +
> >  arch/parisc/mm/module_alloc.c    |  15
> >  arch/powerpc/kernel/module.c     |  36 -
> >  arch/powerpc/mm/Makefile         |   1 +
> >  arch/powerpc/mm/module_alloc.c   |  41 ++
> >  arch/riscv/kernel/module.c       |  11 ---
> >  arch/riscv/mm/Makefile           |   1 +
> >  arch/riscv/mm/module_alloc.c     |  17
> >  arch/s390/kernel/module.c        |  37 -
> >  arch/s390/mm/Makefile            |   1 +
> >  arch/s390/mm/module_alloc.c      |  42 ++
> >  arch/sparc/kernel/module.c       |  31
> >  arch/sparc/mm/Makefile           |   2 +
> >  arch/sparc/mm/module_alloc.c     |  31
> >  arch/x86/kernel/ftrace.c         |   2 +-
> >  arch/x86/kernel/module.c         |  56 -
> >  arch/x86/mm/Makefile             |   2 +
> >  arch/x86/mm/module_alloc.c       |  59 ++
> >  fs/proc/kcore.c                  |   2 +-
> >  kernel/module/Kconfig            |   1 +
> >  kernel/module/main.c             |  17
> >  mm/Kconfig                       |   3 +
> >  mm/Makefile                      |   1 +
> >  mm/module_alloc.c                |  21 +
> >  mm/vmalloc.c                     |   2 +-
> >  42 files changed, 467 insertions(+), 402 deletions(-)
> >  create mode 100644 arch/arm/mm/module_alloc.c
> >  create mode 100644 arch/arm64/mm/module_alloc.c
> >  create mode 100644 arch/loongarch/mm/module_alloc.c
> >  create mode 100644 arch/mips/mm/module_alloc.c
> >  create mode 100644 arch/nios2/mm/module_alloc.c
> >  create mode 100644 arch/parisc/mm/module_alloc.c
> >  create mode 100644 arch/powerpc/mm/module_alloc.c
> >  create mode 100644 arch/riscv/mm/module_alloc.c
> >  create mode 100644 arch/s390/mm/module_alloc.c
> >  create mode 100644 arch/sparc/mm/module_alloc.c
> >  create mode 100644 arch/x86/mm/module_alloc.c
> >  create mode 100644 mm/module_alloc.c
> >
> > diff --git a/arch/Kconfig b/arch/Kconfig
> > index a5af0edd3eb8..cfc24ced16dd 100644
> > --- a/arch/Kconfig
> > +++ b/arch/Kconfig
> > @@ -1305,7 +1305,7 @@ config ARCH_HAS_STRICT_MODULE_RWX
> >
> >  config STRICT_MODULE_RWX
> >         bool "Set loadable kernel module data as NX and text as RO" if ARCH_OPTION
Re: [RFC][PATCH 3/4] kprobes: Allow kprobes with CONFIG_MODULES=n
On Thursday 03/07 at 09:22 +0200, Mike Rapoport wrote:
> On Wed, Mar 06, 2024 at 12:05:10PM -0800, Calvin Owens wrote:
> > If something like this is merged down the road, it can go in at leisure
> > once the module_alloc change is in: it's a one-way dependency.
> >
> > Signed-off-by: Calvin Owens
> > ---
> >  arch/Kconfig                |  2 +-
> >  kernel/kprobes.c            | 22 ++
> >  kernel/trace/trace_kprobe.c | 11 +++
> >  3 files changed, 34 insertions(+), 1 deletion(-)
>
> When I did this in my last execmem posting, I think I've got slightly less
> ugly ifdeffery, you may want to take a look at that:
>
> https://lore.kernel.org/all/20230918072955.2507221-13-r...@kernel.org

Thanks Mike, I definitely agree. I'm annoyed at myself for not finding
your patches, I spent some time looking for prior work and I really
don't know how I missed it...

> > diff --git a/arch/Kconfig b/arch/Kconfig
> > index cfc24ced16dd..e60ce984d095 100644
> > --- a/arch/Kconfig
> > +++ b/arch/Kconfig
> > @@ -52,8 +52,8 @@ config GENERIC_ENTRY
> >
> >  config KPROBES
> >         bool "Kprobes"
> > -       depends on MODULES
> >         depends on HAVE_KPROBES
> > +       select MODULE_ALLOC
> >         select KALLSYMS
> >         select TASKS_RCU if PREEMPTION
> >         help
> > diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> > index 9d9095e81792..194270e17d57 100644
> > --- a/kernel/kprobes.c
> > +++ b/kernel/kprobes.c
> > @@ -1556,8 +1556,12 @@ static bool is_cfi_preamble_symbol(unsigned long addr)
> >                 str_has_prefix("__pfx_", symbuf);
> >  }
> >
> > +#if IS_ENABLED(CONFIG_MODULES)
> >  static int check_kprobe_address_safe(struct kprobe *p,
> >                                      struct module **probed_mod)
> > +#else
> > +static int check_kprobe_address_safe(struct kprobe *p)
> > +#endif
> >  {
> >         int ret;
> >
> > @@ -1580,6 +1584,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
> >                 goto out;
> >         }
> >
> > +#if IS_ENABLED(CONFIG_MODULES)
>
> Plain #ifdef will do here and below. IS_ENABLED is for usage within the
> code, like
>
>         if (IS_ENABLED(CONFIG_MODULES))
>                 ;
>
> >         /* Check if 'p' is probing a module. */
> >         *probed_mod = __module_text_address((unsigned long) p->addr);
> >         if (*probed_mod) {
>
> --
> Sincerely yours,
> Mike.
Re: [RFC][PATCH 0/4] Make bpf_jit and kprobes work with CONFIG_MODULES=n
On Thursday 03/07 at 18:55 -0800, Luis Chamberlain wrote:
> On Thu, Mar 7, 2024 at 6:50 PM Masami Hiramatsu wrote:
> >
> > On Wed, 6 Mar 2024 17:58:14 -0800
> > Song Liu wrote:
> >
> > > Hi Calvin,
> > >
> > > It is great to hear from you! :)
> > >
> > > On Wed, Mar 6, 2024 at 3:23 PM Calvin Owens wrote:
> > > >
> > > > On Wednesday 03/06 at 13:34 -0800, Luis Chamberlain wrote:
> > > > > On Wed, Mar 06, 2024 at 12:05:07PM -0800, Calvin Owens wrote:
> > > > > > Hello all,
> > > > > >
> > > > > > This patchset makes it possible to use bpftrace with kprobes on kernels
> > > > > > built without loadable module support.
> > > > >
> > > > > This is a step in the right direction for another reason: clearly the
> > > > > module_alloc() is not about modules, and we have special reasons for it
> > > > > now beyond modules. The effort to share a generalize a huge page for
> > > > > these things is also another reason for some of this but that is more
> > > > > long term.
> > > > >
> > > > > I'm all for minor changes here so to avoid regressions but it seems a
> > > > > rename is in order -- if we're going to all this might as well do it
> > > > > now. And for that I'd just like to ask you paint the bikeshed with
> > > > > Song Liu as he's been the one slowly making way to help us get there
> > > > > with the "module: replace module_layout with module_memory",
> > > > > and Mike Rapoport as he's had some follow up attempts [0]. As I see it,
> > > > > the EXECMEM stuff would be what we use instead then. Mike kept the
> > > > > module_alloc() and the execmem was just a wrapper but your move of the
> > > > > arch stuff makes sense as well and I think would complement his series
> > > > > nicely.
> > > >
> > > > I apologize for missing that. I think these are the four most recent
> > > > versions of the different series referenced from that LWN link:
> > > >
> > > >   a) https://lore.kernel.org/all/20230918072955.2507221-1-r...@kernel.org/
> > > >   b) https://lore.kernel.org/all/20230526051529.3387103-1-s...@kernel.org/
> > > >   c) https://lore.kernel.org/all/20221107223921.3451913-1-s...@kernel.org/
> > > >   d) https://lore.kernel.org/all/20201120202426.18009-1-rick.p.edgeco...@intel.com/
> > > >
> > > > Song and Mike, please correct me if I'm wrong, but I think what I've
> > > > done here (see [1], sorry for not adding you initially) is compatible
> > > > with everything both of you have recently proposed above. How do you
> > > > feel about this as a first step?
> > >
> > > I agree that the work here is compatible with other efforts. I have no
> > > objection to making this the first step.
> > >
> > > >
> > > > For naming, execmem_alloc() seems reasonable to me? I have no strong
> > > > feelings at all, I'll just use that going forward unless somebody else
> > > > expresses an opinion.
> > >
> > > I am not good at naming things. No objection from me to "execmem_alloc".
> >
> > Hm, it sounds good to me too. I think we should add a patch which just
> > rename the module_alloc/module_memfree with execmem_alloc/free first.
>
> I think that would be cleaner, yes. Leaving the possible move to a
> secondary patch and placing the testing more on the later part.

Makes sense to me.
Re: [RFC][PATCH 0/4] Make bpf_jit and kprobes work with CONFIG_MODULES=n
On Wednesday 03/06 at 13:34 -0800, Luis Chamberlain wrote: > On Wed, Mar 06, 2024 at 12:05:07PM -0800, Calvin Owens wrote: > > Hello all, > > > > This patchset makes it possible to use bpftrace with kprobes on kernels > > built without loadable module support. > > This is a step in the right direction for another reason: clearly the > module_alloc() is not about modules, and we have special reasons for it > now beyond modules. The effort to share a generalize a huge page for > these things is also another reason for some of this but that is more > long term. > > I'm all for minor changes here so to avoid regressions but it seems a > rename is in order -- if we're going to all this might as well do it > now. And for that I'd just like to ask you paint the bikeshed with > Song Liu as he's been the one slowly making way to help us get there > with the "module: replace module_layout with module_memory", > and Mike Rapoport as he's had some follow up attempts [0]. As I see it, > the EXECMEM stuff would be what we use instead then. Mike kept the > module_alloc() and the execmem was just a wrapper but your move of the > arch stuff makes sense as well and I think would complement his series > nicely. I apologize for missing that. I think these are the four most recent versions of the different series referenced from that LWN link: a) https://lore.kernel.org/all/20230918072955.2507221-1-r...@kernel.org/ b) https://lore.kernel.org/all/20230526051529.3387103-1-s...@kernel.org/ c) https://lore.kernel.org/all/20221107223921.3451913-1-s...@kernel.org/ d) https://lore.kernel.org/all/20201120202426.18009-1-rick.p.edgeco...@intel.com/ Song and Mike, please correct me if I'm wrong, but I think what I've done here (see [1], sorry for not adding you initially) is compatible with everything both of you have recently proposed above. How do you feel about this as a first step? For naming, execmem_alloc() seems reasonable to me? 
I have no strong feelings at all, I'll just use that going forward unless somebody else expresses an opinion. [1] https://lore.kernel.org/lkml/cover.1709676663.git.jcalvinow...@gmail.com/T/#m337096e158a5f771d0c7c2fb15a3b80a4443226a > If you're gonna split code up to move to another place, it'd be nice > if you can add copyright headers as was done with the kernel/module.c > split into kernel/module/*.c Silly question: should it be the same copyright header as the original corresponding module.c, or a new one? I tried to preserve the license header because I wasn't sure what to do about it. Thanks, Calvin > Can we start with some small basic stuff we can all agree on? > > [0] https://lwn.net/Articles/944857/ > > Luis
[RFC][PATCH 4/4] selftests/bpf: Support testing the !MODULES case
This symlinks bpf_testmod into the main source, so it can be built-in
for running selftests in the new !MODULES case.

To be clear, no changes to the existing selftests are required: this
only exists to enable testing the new case which was not previously
possible.

I'm sure somebody will be able to suggest a less ugly way I can do
this...

Signed-off-by: Calvin Owens
---
 include/trace/events/bpf_testmod.h | 1 +
 kernel/bpf/Kconfig | 9 ++
 kernel/bpf/Makefile | 2 ++
 kernel/bpf/bpf_testmod/Makefile | 1 +
 kernel/bpf/bpf_testmod/bpf_testmod.c | 1 +
 kernel/bpf/bpf_testmod/bpf_testmod.h | 1 +
 kernel/bpf/bpf_testmod/bpf_testmod_kfunc.h | 1 +
 net/bpf/test_run.c | 2 ++
 tools/testing/selftests/bpf/Makefile | 28 +--
 .../selftests/bpf/bpf_testmod/Makefile | 2 +-
 .../bpf/bpf_testmod/bpf_testmod-events.h | 6
 .../selftests/bpf/bpf_testmod/bpf_testmod.c | 4 +++
 .../bpf/bpf_testmod/bpf_testmod_kfunc.h | 2 ++
 tools/testing/selftests/bpf/config | 5
 tools/testing/selftests/bpf/config.mods | 5
 tools/testing/selftests/bpf/config.nomods | 1 +
 .../selftests/bpf/progs/btf_type_tag_percpu.c | 2 ++
 .../selftests/bpf/progs/btf_type_tag_user.c | 2 ++
 tools/testing/selftests/bpf/progs/core_kern.c | 2 ++
 .../selftests/bpf/progs/iters_testmod_seq.c | 2 ++
 .../bpf/progs/test_core_reloc_module.c | 2 ++
 .../selftests/bpf/progs/test_ldsx_insn.c | 2 ++
 .../selftests/bpf/progs/test_module_attach.c | 3 ++
 .../selftests/bpf/progs/tracing_struct.c | 2 ++
 tools/testing/selftests/bpf/testing_helpers.c | 14 ++
 tools/testing/selftests/bpf/vmtest.sh | 24 ++--
 26 files changed, 110 insertions(+), 16 deletions(-)
 create mode 120000 include/trace/events/bpf_testmod.h
 create mode 100644 kernel/bpf/bpf_testmod/Makefile
 create mode 120000 kernel/bpf/bpf_testmod/bpf_testmod.c
 create mode 120000 kernel/bpf/bpf_testmod/bpf_testmod.h
 create mode 120000 kernel/bpf/bpf_testmod/bpf_testmod_kfunc.h
 create mode 100644 tools/testing/selftests/bpf/config.mods
 create mode 100644 tools/testing/selftests/bpf/config.nomods

diff --git a/include/trace/events/bpf_testmod.h b/include/trace/events/bpf_testmod.h
new file mode 120000
index 000000000000..ae237a90d381
--- /dev/null
+++ b/include/trace/events/bpf_testmod.h
@@ -0,0 +1 @@
+../../../tools/testing/selftests/bpf/bpf_testmod/bpf_testmod-events.h
\ No newline at end of file
diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig
index 77df483a8925..d5ba795182e5 100644
--- a/kernel/bpf/Kconfig
+++ b/kernel/bpf/Kconfig
@@ -100,4 +100,13 @@ config BPF_LSM

 	  If you are unsure how to answer this question, answer N.

+config BPF_TEST_MODULE
+	bool "Build the module for BPF selftests as a built-in"
+	depends on BPF_SYSCALL
+	depends on BPF_JIT
+	depends on !MODULES
+	default n
+	help
+	  This allows most of the bpf selftests to run without modules.
+
 endmenu # "BPF subsystem"
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index f526b7573e97..04b3e50ff940 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -46,3 +46,5 @@ obj-$(CONFIG_BPF_PRELOAD) += preload/
 obj-$(CONFIG_BPF_SYSCALL) += relo_core.o
 $(obj)/relo_core.o: $(srctree)/tools/lib/bpf/relo_core.c FORCE
 	$(call if_changed_rule,cc_o_c)
+
+obj-$(CONFIG_BPF_TEST_MODULE) += bpf_testmod/
diff --git a/kernel/bpf/bpf_testmod/Makefile b/kernel/bpf/bpf_testmod/Makefile
new file mode 100644
index 000000000000..55a73fd8443e
--- /dev/null
+++ b/kernel/bpf/bpf_testmod/Makefile
@@ -0,0 +1 @@
+obj-y += bpf_testmod.o
diff --git a/kernel/bpf/bpf_testmod/bpf_testmod.c b/kernel/bpf/bpf_testmod/bpf_testmod.c
new file mode 120000
index 000000000000..ca3baca5d9c4
--- /dev/null
+++ b/kernel/bpf/bpf_testmod/bpf_testmod.c
@@ -0,0 +1 @@
+../../../tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
\ No newline at end of file
diff --git a/kernel/bpf/bpf_testmod/bpf_testmod.h b/kernel/bpf/bpf_testmod/bpf_testmod.h
new file mode 120000
index 000000000000..f8d3df98b6a5
--- /dev/null
+++ b/kernel/bpf/bpf_testmod/bpf_testmod.h
@@ -0,0 +1 @@
+../../../tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
\ No newline at end of file
diff --git a/kernel/bpf/bpf_testmod/bpf_testmod_kfunc.h b/kernel/bpf/bpf_testmod/bpf_testmod_kfunc.h
new file mode 120000
index 000000000000..fdf42f5eaeb0
--- /dev/null
+++ b/kernel/bpf/bpf_testmod/bpf_testmod_kfunc.h
@@ -0,0 +1 @@
+../../../tools/testing/selftests/bpf/bpf_testmod/bpf_testmod_kfunc.h
\ No newline at end of file
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index dfd919374017..33029c91bf92 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -573,10 +573,12 @@ __bpf_kfunc int bpf_modify_return_test2(int a, int *b, short c, int d,
[RFC][PATCH 3/4] kprobes: Allow kprobes with CONFIG_MODULES=n
If something like this is merged down the road, it can go in at leisure
once the module_alloc change is in: it's a one-way dependency.

Signed-off-by: Calvin Owens
---
 arch/Kconfig                | 2 +-
 kernel/kprobes.c            | 22 ++
 kernel/trace/trace_kprobe.c | 11 +++
 3 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index cfc24ced16dd..e60ce984d095 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -52,8 +52,8 @@ config GENERIC_ENTRY

 config KPROBES
 	bool "Kprobes"
-	depends on MODULES
 	depends on HAVE_KPROBES
+	select MODULE_ALLOC
 	select KALLSYMS
 	select TASKS_RCU if PREEMPTION
 	help
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 9d9095e81792..194270e17d57 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1556,8 +1556,12 @@ static bool is_cfi_preamble_symbol(unsigned long addr)
 		str_has_prefix("__pfx_", symbuf);
 }

+#if IS_ENABLED(CONFIG_MODULES)
 static int check_kprobe_address_safe(struct kprobe *p,
 				     struct module **probed_mod)
+#else
+static int check_kprobe_address_safe(struct kprobe *p)
+#endif
 {
 	int ret;

@@ -1580,6 +1584,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
 		goto out;
 	}

+#if IS_ENABLED(CONFIG_MODULES)
 	/* Check if 'p' is probing a module. */
 	*probed_mod = __module_text_address((unsigned long) p->addr);
 	if (*probed_mod) {
@@ -1603,6 +1608,8 @@ static int check_kprobe_address_safe(struct kprobe *p,
 			ret = -ENOENT;
 		}
 	}
+#endif
+
 out:
 	preempt_enable();
 	jump_label_unlock();
@@ -1614,7 +1621,9 @@ int register_kprobe(struct kprobe *p)
 {
 	int ret;
 	struct kprobe *old_p;
+#if IS_ENABLED(CONFIG_MODULES)
 	struct module *probed_mod;
+#endif
 	kprobe_opcode_t *addr;
 	bool on_func_entry;

@@ -1633,7 +1642,11 @@ int register_kprobe(struct kprobe *p)
 	p->nmissed = 0;
 	INIT_LIST_HEAD(&p->list);

+#if IS_ENABLED(CONFIG_MODULES)
 	ret = check_kprobe_address_safe(p, &probed_mod);
+#else
+	ret = check_kprobe_address_safe(p);
+#endif
 	if (ret)
 		return ret;

@@ -1676,8 +1689,10 @@ int register_kprobe(struct kprobe *p)
 out:
 	mutex_unlock(&kprobe_mutex);

+#if IS_ENABLED(CONFIG_MODULES)
 	if (probed_mod)
 		module_put(probed_mod);
+#endif

 	return ret;
 }
@@ -2482,6 +2497,7 @@ int kprobe_add_area_blacklist(unsigned long start, unsigned long end)
 	return 0;
 }

+#if IS_ENABLED(CONFIG_MODULES)
 /* Remove all symbols in given area from kprobe blacklist */
 static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end)
 {
@@ -2499,6 +2515,7 @@ static void kprobe_remove_ksym_blacklist(unsigned long entry)
 {
 	kprobe_remove_area_blacklist(entry, entry + 1);
 }
+#endif

 int __weak arch_kprobe_get_kallsym(unsigned int *symnum, unsigned long *value,
 				   char *type, char *sym)
@@ -2564,6 +2581,7 @@ static int __init populate_kprobe_blacklist(unsigned long *start,
 	return ret ? : arch_populate_kprobe_blacklist();
 }

+#if IS_ENABLED(CONFIG_MODULES)
 static void add_module_kprobe_blacklist(struct module *mod)
 {
 	unsigned long start, end;
@@ -2665,6 +2683,7 @@ static struct notifier_block kprobe_module_nb = {
 	.notifier_call = kprobes_module_callback,
 	.priority = 0
 };
+#endif /* IS_ENABLED(CONFIG_MODULES) */

 void kprobe_free_init_mem(void)
 {
@@ -2724,8 +2743,11 @@ static int __init init_kprobes(void)
 	err = arch_init_kprobes();
 	if (!err)
 		err = register_die_notifier(&kprobe_exceptions_nb);
+
+#if IS_ENABLED(CONFIG_MODULES)
 	if (!err)
 		err = register_module_notifier(&kprobe_module_nb);
+#endif

 	kprobes_initialized = (err == 0);
 	kprobe_sysctls_init();
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index c4c6e0e0068b..dd4598f775b9 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -102,6 +102,7 @@ static nokprobe_inline bool trace_kprobe_has_gone(struct trace_kprobe *tk)
 	return kprobe_gone(&tk->rp.kp);
 }

+#if IS_ENABLED(CONFIG_MODULES)
 static nokprobe_inline bool trace_kprobe_within_module(struct trace_kprobe *tk,
 						 struct module *mod)
 {
@@ -129,6 +130,12 @@ static nokprobe_inline bool trace_kprobe_module_exist(struct trace_kprobe *tk)

 	return ret;
 }
+#else
+static nokprobe_inline bool trace_kprobe_module_exist(struct trace_kprobe *tk)
+{
+	return true;
+}
+#endif

 static bool trace_kprobe_is_busy(struct dyn_event *ev)
 {
@@ -670,6 +677,7 @@ static int register_trace_kprobe(struct trace_kprobe *tk)
 	return ret;
 }

+#if IS_ENABLED(CONFIG_MODULES)
 /* Module notifier call ba
[RFC][PATCH 1/4] module: mm: Make module_alloc() generally available
Both BPF_JIT and KPROBES depend on CONFIG_MODULES, but only require module_alloc() itself, which can be easily separated into a standalone allocator for executable kernel memory. Thomas Gleixner sent a patch to do that for x86 as part of a larger series a couple years ago: https://lore.kernel.org/all/20220716230953.442937...@linutronix.de/ I've simply extended that approach to the whole kernel. Signed-off-by: Calvin Owens --- arch/Kconfig | 2 +- arch/arm/kernel/module.c | 35 - arch/arm/mm/Makefile | 2 + arch/arm/mm/module_alloc.c | 40 ++ arch/arm64/kernel/module.c | 127 -- arch/arm64/mm/Makefile | 1 + arch/arm64/mm/module_alloc.c | 130 +++ arch/loongarch/kernel/module.c | 6 -- arch/loongarch/mm/Makefile | 2 + arch/loongarch/mm/module_alloc.c | 10 +++ arch/mips/kernel/module.c| 10 --- arch/mips/mm/Makefile| 2 + arch/mips/mm/module_alloc.c | 13 arch/nios2/kernel/module.c | 20 - arch/nios2/mm/Makefile | 2 + arch/nios2/mm/module_alloc.c | 22 ++ arch/parisc/kernel/module.c | 12 --- arch/parisc/mm/Makefile | 1 + arch/parisc/mm/module_alloc.c| 15 arch/powerpc/kernel/module.c | 36 - arch/powerpc/mm/Makefile | 1 + arch/powerpc/mm/module_alloc.c | 41 ++ arch/riscv/kernel/module.c | 11 --- arch/riscv/mm/Makefile | 1 + arch/riscv/mm/module_alloc.c | 17 arch/s390/kernel/module.c| 37 - arch/s390/mm/Makefile| 1 + arch/s390/mm/module_alloc.c | 42 ++ arch/sparc/kernel/module.c | 31 arch/sparc/mm/Makefile | 2 + arch/sparc/mm/module_alloc.c | 31 arch/x86/kernel/ftrace.c | 2 +- arch/x86/kernel/module.c | 56 - arch/x86/mm/Makefile | 2 + arch/x86/mm/module_alloc.c | 59 ++ fs/proc/kcore.c | 2 +- kernel/module/Kconfig| 1 + kernel/module/main.c | 17 mm/Kconfig | 3 + mm/Makefile | 1 + mm/module_alloc.c| 21 + mm/vmalloc.c | 2 +- 42 files changed, 467 insertions(+), 402 deletions(-) create mode 100644 arch/arm/mm/module_alloc.c create mode 100644 arch/arm64/mm/module_alloc.c create mode 100644 arch/loongarch/mm/module_alloc.c create mode 100644 arch/mips/mm/module_alloc.c create mode 100644 
arch/nios2/mm/module_alloc.c create mode 100644 arch/parisc/mm/module_alloc.c create mode 100644 arch/powerpc/mm/module_alloc.c create mode 100644 arch/riscv/mm/module_alloc.c create mode 100644 arch/s390/mm/module_alloc.c create mode 100644 arch/sparc/mm/module_alloc.c create mode 100644 arch/x86/mm/module_alloc.c create mode 100644 mm/module_alloc.c diff --git a/arch/Kconfig b/arch/Kconfig index a5af0edd3eb8..cfc24ced16dd 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -1305,7 +1305,7 @@ config ARCH_HAS_STRICT_MODULE_RWX config STRICT_MODULE_RWX bool "Set loadable kernel module data as NX and text as RO" if ARCH_OPTIONAL_KERNEL_RWX - depends on ARCH_HAS_STRICT_MODULE_RWX && MODULES + depends on ARCH_HAS_STRICT_MODULE_RWX && MODULE_ALLOC default !ARCH_OPTIONAL_KERNEL_RWX || ARCH_OPTIONAL_KERNEL_RWX_DEFAULT help If this is set, module text and rodata memory will be made read-only, diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c index e74d84f58b77..1c8798732d12 100644 --- a/arch/arm/kernel/module.c +++ b/arch/arm/kernel/module.c @@ -4,15 +4,12 @@ * * Copyright (C) 2002 Russell King. * Modified for nommu by Hyok S. Choi - * - * Module allocation method suggested by Andi Kleen. */ #include #include #include #include #include -#include #include #include #include @@ -22,38 +19,6 @@ #include #include -#ifdef CONFIG_XIP_KERNEL -/* - * The XIP kernel text is mapped in the module area for modules and - * some other stuff to work without any indirect relocations. - * MODULES_VADDR is redefined here and not in asm/memory.h to avoid - * recompiling the whole kernel when CONFIG_XIP_KERNEL is turned on/off. 
- */ -#undef MODULES_VADDR -#define MODULES_VADDR (((unsigned long)_exiprom + ~PMD_MASK) & PMD_MASK) -#endif - -#ifdef CONFIG_MMU -void *module_alloc(unsigned long size) -{ - gfp_t gfp_mask = GFP_KERNEL; - void *p; - - /* Silence the initial allocation */ - if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) - gfp_mask |= __GFP_NOWARN; - - p = __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, - gfp_mask, PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE, - __builtin_return_addres
[RFC][PATCH 2/4] bpf: Allow BPF_JIT with CONFIG_MODULES=n
No BPF code has to change, except in struct_ops (for module refs). This conflicts with bpf-next because of this (relevant) series: https://lore.kernel.org/all/20240119225005.668602-1-thinker...@gmail.com/ If something like this is merged down the road, it can go through bpf-next at leisure once the module_alloc change is in: it's a one-way dependency. Signed-off-by: Calvin Owens --- kernel/bpf/Kconfig | 2 +- kernel/bpf/bpf_struct_ops.c | 28 2 files changed, 25 insertions(+), 5 deletions(-) diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig index 6a906ff93006..77df483a8925 100644 --- a/kernel/bpf/Kconfig +++ b/kernel/bpf/Kconfig @@ -42,7 +42,7 @@ config BPF_JIT bool "Enable BPF Just In Time compiler" depends on BPF depends on HAVE_CBPF_JIT || HAVE_EBPF_JIT - depends on MODULES + select MODULE_ALLOC help BPF programs are normally handled by a BPF interpreter. This option allows the kernel to generate native code when a program is loaded diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c index 02068bd0e4d9..fbf08a1bb00c 100644 --- a/kernel/bpf/bpf_struct_ops.c +++ b/kernel/bpf/bpf_struct_ops.c @@ -108,11 +108,30 @@ const struct bpf_prog_ops bpf_struct_ops_prog_ops = { #endif }; +#if IS_ENABLED(CONFIG_MODULES) static const struct btf_type *module_type; +static int bpf_struct_module_type_init(struct btf *btf) +{ + s32 module_id; + + module_id = btf_find_by_name_kind(btf, "module", BTF_KIND_STRUCT); + if (module_id < 0) + return 1; + + module_type = btf_type_by_id(btf, module_id); + return 0; +} +#else +static int bpf_struct_module_type_init(struct btf *btf) +{ + return 0; +} +#endif + void bpf_struct_ops_init(struct btf *btf, struct bpf_verifier_log *log) { - s32 type_id, value_id, module_id; + s32 type_id, value_id; const struct btf_member *member; struct bpf_struct_ops *st_ops; const struct btf_type *t; @@ -125,12 +144,10 @@ void bpf_struct_ops_init(struct btf *btf, struct bpf_verifier_log *log) #include "bpf_struct_ops_types.h" #undef 
BPF_STRUCT_OPS_TYPE - module_id = btf_find_by_name_kind(btf, "module", BTF_KIND_STRUCT); - if (module_id < 0) { + if (bpf_struct_module_type_init(btf)) { pr_warn("Cannot find struct module in btf_vmlinux\n"); return; } - module_type = btf_type_by_id(btf, module_id); for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) { st_ops = bpf_struct_ops[i]; @@ -433,12 +450,15 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, moff = __btf_member_bit_offset(t, member) / 8; ptype = btf_type_resolve_ptr(btf_vmlinux, member->type, NULL); + +#if IS_ENABLED(CONFIG_MODULES) if (ptype == module_type) { if (*(void **)(udata + moff)) goto reset_unlock; *(void **)(kdata + moff) = BPF_MODULE_OWNER; continue; } +#endif err = st_ops->init_member(t, member, kdata, udata); if (err < 0) -- 2.43.0
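For readers skimming the struct_ops hunk above, the pattern it relies on (a real helper when the feature is configured in, a stub that reports success when it is not) can be tried outside the kernel. The following is only a userspace sketch: HAVE_MODULES, struct type_table, find_type_id() and module_type_init() are hypothetical stand-ins invented for this illustration, not kernel or BTF APIs.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for CONFIG_MODULES. Leave it undefined to get
 * the stub, or define it to build the real lookup.
 */
/* #define HAVE_MODULES 1 */

/* Hypothetical, tiny replacement for a type database such as BTF. */
struct type_table {
	const char **names;
	size_t count;
};

#ifdef HAVE_MODULES
static int cached_module_id = -1;

/* Linear search by name; returns the index or -1 if not found. */
static int find_type_id(const struct type_table *t, const char *name)
{
	for (size_t i = 0; i < t->count; i++)
		if (strcmp(t->names[i], name) == 0)
			return (int)i;
	return -1;
}

/* Real path: fail (non-zero) if the "module" type is missing. */
static int module_type_init(const struct type_table *t)
{
	int id = find_type_id(t, "module");

	if (id < 0)
		return 1;
	cached_module_id = id;
	return 0;
}
#else
/* Stub path: nothing to look up, so initialisation always succeeds. */
static int module_type_init(const struct type_table *t)
{
	(void)t;
	return 0;
}
#endif
```

Because both variants share one signature, the caller needs no #if of its own, which is what lets the patch keep a single call site in bpf_struct_ops_init().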
[RFC][PATCH 0/4] Make bpf_jit and kprobes work with CONFIG_MODULES=n
Hello all,

This patchset makes it possible to use bpftrace with kprobes on kernels
built without loadable module support.

On a Raspberry Pi 4b, this saves about 700KB of memory where BPF is
needed but loadable module support is not. These two kernels had
identical configurations, except CONFIG_MODULES was off in the second:

	- Linux version 6.8.0-rc7
	- Memory: 3330672K/4050944K available (16576K kernel code, 2390K rwdata,
	-	12364K rodata, 5632K init, 675K bss, 195984K reserved, 524288K cma-reserved)
	+ Linux version 6.8.0-rc7-00003-g2af01251ca21
	+ Memory: 3331400K/4050944K available (16512K kernel code, 2384K rwdata,
	+	11728K rodata, 5632K init, 673K bss, 195256K reserved, 524288K cma-reserved)

I don't intend to present an exhaustive list of !MODULES usecases, since
I'm sure there are many I'm not aware of. Performance is a common one,
the primary justification being that static text is mapped on hugepages
and module text is not. Security is another, since rootkits are much
harder to implement without modules.

The first patch is the interesting one: it moves module_alloc() into its
own file with its own Kconfig option, so it can be utilized even when
loadable module support is disabled. I got the idea from an unmerged
patch from a few years ago I found on lkml (see [1/4] for details). I
think this also has value in its own right, since I suspect there are
potential users beyond bpf, hopefully we will hear from some.

Patches 2-3 are proofs of concept to demonstrate the first patch is
sufficient to achieve my goal (full ebpf functionality without modules).

Patch 4 adds a new "-n" argument to vmtest.sh to run the BPF selftests
without modules, so the prior three patches can be rigorously tested.

If something like the first patch were to eventually be merged, the rest
could go through the normal bpf-next process as I clean them up: I've
only based them on Linus' tree and combined them into a series here to
introduce the idea.
If you prefer to fetch the patches via git: [1/4] https://github.com/jcalvinowens/linux.git work/module-alloc +[2/4]+[3/4] https://github.com/jcalvinowens/linux.git work/nomodule-bpf +[4/4] https://github.com/jcalvinowens/linux.git testing/nomodule-bpf-ci In addition to the automated BPF selftests, I've lightly tested this on my laptop (x86_64), a Raspberry Pi 4b (arm64), and a Raspberry Pi Zero W (arm). The other architectures have only been compile tested. I didn't want to spam all the arch maintainers with what I expect will be a discussion mostly about modules and bpf, so I've left them off this first submission. I will be sure to add them on future submissions of the first patch. Of course, feedback on the arch bits is welcome here. In addition to feedback on the patches themselves, I'm interested in hearing from anybody else who might find this functionality useful. Thanks, Calvin Calvin Owens (4): module: mm: Make module_alloc() generally available bpf: Allow BPF_JIT with CONFIG_MODULES=n kprobes: Allow kprobes with CONFIG_MODULES=n selftests/bpf: Support testing the !MODULES case arch/Kconfig | 4 +- arch/arm/kernel/module.c | 35 - arch/arm/mm/Makefile | 2 + arch/arm/mm/module_alloc.c| 40 ++ arch/arm64/kernel/module.c| 127 - arch/arm64/mm/Makefile| 1 + arch/arm64/mm/module_alloc.c | 130 ++ arch/loongarch/kernel/module.c| 6 - arch/loongarch/mm/Makefile| 2 + arch/loongarch/mm/module_alloc.c | 10 ++ arch/mips/kernel/module.c | 10 -- arch/mips/mm/Makefile | 2 + arch/mips/mm/module_alloc.c | 13 ++ arch/nios2/kernel/module.c| 20 --- arch/nios2/mm/Makefile| 2 + arch/nios2/mm/module_alloc.c | 22 +++ arch/parisc/kernel/module.c | 12 -- arch/parisc/mm/Makefile | 1 + arch/parisc/mm/module_alloc.c | 15 ++ arch/powerpc/kernel/module.c | 36 - arch/powerpc/mm/Makefile | 1 + arch/powerpc/mm/module_alloc.c| 41 ++ arch/riscv/kernel/module.c| 11 -- arch/riscv/mm/Makefile| 1 + arch/riscv/mm/module_alloc.c | 17 +++ arch/s390/kernel/module.c | 37 - arch/s390/mm/Makefile | 1 + 
arch/s390/mm/module_alloc.c | 42 ++ arch/sparc/kernel/module.c| 31 - arch/sparc/mm/Makefile| 2 + arch/sparc/mm/module_alloc.c | 31 + arch/x86/kernel/ftrace.c | 2 +- arch/x86/kerne
Re: [PATCH 3/4] printk: Add consoles to a virtual "console" bus
On Monday 03/11 at 14:33 +0100, Petr Mladek wrote:
> On Fri 2019-03-01 16:48:19, Calvin Owens wrote:
> > This patch embeds a device struct in the console struct, and registers
> > them on a "console" bus so we can expose attributes in sysfs.
> >
> > Currently, most drivers declare static console structs, and that is
> > incompatible with the dev refcount model. So we end up needing to patch
> > all of the console drivers to:
> >
> > 	1. Dynamically allocate the console struct using a new helper
> > 	2. Handle the allocation in (1) possibly failing
> > 	3. Dispose of (1) with put_device()
> >
> > Early console structures must still be static, since they're required
> > before we're able to allocate memory. The least ugly way I can come up
> > with to handle this is an "is_static" flag in the structure which makes
> > the gets and puts NOPs, and is checked in ->release() to catch mistakes.
> >
> > diff --git a/drivers/char/lp.c b/drivers/char/lp.c
> > index 5c8d780637bd..e09cb192a469 100644
> > --- a/drivers/char/lp.c
> > +++ b/drivers/char/lp.c
> > @@ -857,12 +857,12 @@ static void lp_console_write(struct console *co,
> > const char *s,
> >  	parport_release(dev);
> >  }
> >
> > -static struct console lpcons = {
> > -	.name = "lp",
> > +static const struct console_operations lp_cons_ops = {
> >  	.write = lp_console_write,
> > -	.flags = CON_PRINTBUFFER,
> >  };
> >
> > +static struct console *lpcons;
>
> I have got the following compilation error (see below):
>
>   CC      drivers/char/lp.o
> drivers/char/lp.c: In function ‘lp_register’:
> drivers/char/lp.c:925:2: error: ‘lpcons’ undeclared (first use in this
> function)
>   lpcons = allocate_console_dfl(&lp_cons_ops, "lp", NULL);
>   ^
> drivers/char/lp.c:925:2: note: each undeclared identifier is reported only
> once for each function it appears in
> In file included from drivers/char/lp.c:125:0:
> drivers/char/lp.c:925:33: error: ‘lp_cons_ops’ undeclared (first use in this
> function)

D'oh, will fix.

>
> >  #endif /* console on line printer */
> >
> >  /* --- initialisation code - */
> > @@ -921,6 +921,11 @@ static int lp_register(int nr, struct parport *port)
> >  	_cb, nr);
> >  	if (lp_table[nr].dev == NULL)
> >  		return 1;
> > +
> > +	lpcons = allocate_console_dfl(&lp_cons_ops, "lp", NULL);
> > +	if (!lpcons)
> > +		return -ENOMEM;
>
> This should be done inside #ifdef CONFIG_LP_CONSOLE
> to avoid the above compilation error.
>
> > +
> >  	lp_table[nr].flags |= LP_EXIST;
> >
> >  	if (reset)
>
> [...]
> > diff --git a/include/linux/console.h b/include/linux/console.h
> > index 3c27a4a29b8c..382591683033 100644
> > --- a/include/linux/console.h
> > +++ b/include/linux/console.h
> > @@ -142,20 +143,28 @@ static inline int con_debug_leave(void)
> >  #define CON_BRL		(32) /* Used for a braille device */
> >  #define CON_EXTENDED	(64) /* Use the extended output format a la
> > /dev/kmsg */
> >
> > -struct console {
> > -	char	name[16];
> > +struct console;
> > +
> > +struct console_operations {
> >  	void	(*write)(struct console *, const char *, unsigned);
> >  	int	(*read)(struct console *, char *, unsigned);
> >  	struct tty_driver *(*device)(struct console *, int *);
> >  	void	(*unblank)(void);
> >  	int	(*setup)(struct console *, char *);
> >  	int	(*match)(struct console *, char *name, int idx, char *options);
> > +};
> > +
> > +struct console {
> > +	char	name[16];
> >  	short	flags;
> >  	short	index;
> >  	int	cflag;
> >  	void	*data;
> >  	struct console *next;
> >  	int	level;
> > +	const struct console_operations *ops;
> > +	struct device dev;
> > +	int is_static;
> >  };
> >
> >  /*
> > @@ -167,6 +176,29 @@ struct console {
> >  extern int console_set_on_cmdline;
> >  extern struct console *early_console;
> >
> > +extern struct console *allocate_console(const struct console_operations
> > *ops,
> > +					const char *name, short flags,
> > +					short index, void *data);
> > +
> > +#define allocate_console_dfl(ops, name, data) \
> >
Re: [PATCH 1/4] printk: Introduce per-console loglevel setting
On Friday 03/08 at 12:10 +0900, Sergey Senozhatsky wrote:
> On (03/01/19 16:48), Calvin Owens wrote:
> [..]
> >  		msg = log_from_idx(console_idx);
> > -		if (suppress_message_printing(msg->level)) {
> > -			/*
> > -			 * Skip record we have buffered and already printed
> > -			 * directly to the console when we received it, and
> > -			 * record that has level above the console loglevel.
> > -			 */
> > -			console_idx = log_next(console_idx);
> > -			console_seq++;
> > -			goto skip;
> > -		}
> >
> >  		/* Output to all consoles once old messages replayed. */
> >  		if (unlikely(exclusive_console &&
> > @@ -2405,7 +2402,7 @@ void console_unlock(void)
> >  		console_lock_spinning_enable();
> >
> >  		stop_critical_timings();	/* don't trace print latency */
> > -		call_console_drivers(ext_text, ext_len, text, len);
> > +		call_console_drivers(ext_text, ext_len, text, len, msg->level);
> >  		start_critical_timings();
>
> So it seems that now we always format the text and ext message (if
> needed) and only then check if there is at least one console we can
> print that message on.
>
> Can we iterate the consoles first and check if msg is worth
> the effort (per console suppress_message_printing()) and only
> if it is do all the formatting and call console drivers?

Makes sense, will do.

Thanks,
Calvin

> -ss
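The check suggested in the exchange above (walk the consoles first, and only format the message if at least one of them would print it) might look roughly like the following. This is a standalone sketch, not the printk code: struct console_sketch, console_wants() and any_console_wants() are hypothetical names made up for the illustration, and the "lower number is more severe" convention is assumed to match kernel loglevels.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified console: only the fields this sketch needs. */
struct console_sketch {
	int level;			/* per-console loglevel */
	bool enabled;
	struct console_sketch *next;
};

/*
 * A console prints a message only if it is enabled and the message is
 * more severe (numerically lower) than the console's own loglevel.
 */
static bool console_wants(const struct console_sketch *con, int msg_level)
{
	return con->enabled && msg_level < con->level;
}

/* True if at least one console in the list would print this level. */
static bool any_console_wants(const struct console_sketch *head, int msg_level)
{
	for (const struct console_sketch *con = head; con; con = con->next)
		if (console_wants(con, msg_level))
			return true;
	return false;
}
```

A caller would test any_console_wants() before doing any formatting work and skip the record entirely when it returns false.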
Re: [PATCH 3/4] printk: Add consoles to a virtual "console" bus
On Friday 03/08 at 16:53 +0100, Petr Mladek wrote:
> On Fri 2019-03-01 16:48:19, Calvin Owens wrote:
> > This patch embeds a device struct in the console struct, and registers
> > them on a "console" bus so we can expose attributes in sysfs.
> >
> > Early console structures must still be static, since they're required
> > before we're able to allocate memory. The least ugly way I can come up
> > with to handle this is an "is_static" flag in the structure which makes
> > the gets and puts NOPs, and is checked in ->release() to catch mistakes.
>
> I wonder if it might get detected by is_kernel_inittext().

I don't think inittext() in particular would work, since these actually
need to exist forever if you pass "earlyprintk=[...],keep" so they
aren't __init.

But I bet you're right that we could catch the static case without
needing the explicit flag, something like is_module_address() (but it
would also need to work for the built-in case). I'll see if I can get
this to work.

Thanks,
Calvin

> Best Regards,
> Petr
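To make the "is_static" idea from the quoted patch description concrete, here is a small userspace sketch of reference counting where gets and puts are no-ops for statically allocated objects. It is an illustration under assumptions, not the patch's code: struct con_obj, con_alloc(), con_get() and con_put() are hypothetical names, and a plain int stands in for the device refcount.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical object with an explicit "statically allocated" marker. */
struct con_obj {
	int refcount;
	bool is_static;
};

/* Allocate a dynamic object holding one reference; NULL on failure. */
static struct con_obj *con_alloc(void)
{
	struct con_obj *c = calloc(1, sizeof(*c));

	if (c)
		c->refcount = 1;
	return c;
}

/* Taking a reference is a no-op for static objects. */
static void con_get(struct con_obj *c)
{
	if (!c->is_static)
		c->refcount++;
}

/*
 * Drop a reference. Returns true only if this call freed the object.
 * Static objects are never freed, so the refcount is left untouched.
 */
static bool con_put(struct con_obj *c)
{
	if (c->is_static)
		return false;
	if (--c->refcount > 0)
		return false;
	free(c);
	return true;
}
```

The address-based detection discussed in the reply would replace the is_static field with a check on where the object lives; that variant is not sketched here because it depends on kernel-internal section symbols.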
Re: [PATCH 3/4] printk: Add consoles to a virtual "console" bus
On Friday 03/08 at 17:34 +0100, Greg Kroah-Hartman wrote:
> On Fri, Mar 08, 2019 at 04:58:14PM +0100, Petr Mladek wrote:
> > On Fri 2019-03-08 03:56:19, John Ogness wrote:
> > > On 2019-03-02, Calvin Owens wrote:
> > > > This patch embeds a device struct in the console struct, and registers
> > > > them on a "console" bus so we can expose attributes in sysfs.
> > >
> > > I expect that "class" would be more appropriate than "bus". These
> > > devices really are grouped together based on their function and not the
> > > medium by which they are accessed.
> >
> > Good point. "class" looks better to me as well.
> >
> > Greg, any opinion, where to put the entries for struct console ?
>
> Hang them off of the device that the console belongs to?
>
> Classes and busses are almost identical except:
>   - busses is the binding of a driver to a device (usb, pci, etc.)
>   - classes are usually userspace interactions to a device (input,
>     tty, etc.)
>
> So this sounds like a class to me.

Sounds good, will make it a class.

> If you want me to review this, I'll be glad to so do once 5.1-rc1 is
> out...

Yeah, I realized after sending this the timing was pretty terrible, I'll
wait for 5.1-rc1 before rebasing/resending.

Thanks,
Calvin

> thanks,
>
> greg k-h
Re: [PATCH] tpm: Make timeout logic simpler and more robust
On Tuesday 03/12 at 13:04 -0400, Mimi Zohar wrote:
> On Mon, 2019-03-11 at 16:54 -0700, Calvin Owens wrote:
> > We're having lots of problems with TPM commands timing out, and we're
> > seeing these problems across lots of different hardware (both v1/v2).
> >
> > I instrumented the driver to collect latency data, but I wasn't able to
> > find any specific timeout to fix: it seems like many of them are too
> > aggressive. So I tried replacing all the timeout logic with a single
> > universal long timeout, and found that makes our TPMs 100% reliable.
> >
> > Given that this timeout logic is very complex, problematic, and appears
> > to serve no real purpose, I propose simply deleting all of it.
>
> Normally before sending such a massive change like this, included in
> the bug report or patch description, there would be some indication as
> to which kernel introduced a regression. Has this always been a
> problem? Is this something new? How new?

Honestly we've always had problems with flakiness from these devices,
but it seems to have regressed sometime between 4.11 and 4.16.

I wish I had a better answer for you: we need on the order of a hundred
machines to see the difference, and setting up these 100+ machine tests
is unfortunately involved enough that e.g. bisecting it just isn't
feasible :/

What I can say for sure is that this patch makes everything much better
for us.
If there's anything in particular you'd like me to test, I have an army of machines I'm happy to put to use, let me know :) Thanks, Calvin > Mimi > > > > > Signed-off-by: Calvin Owens > > --- > > drivers/char/tpm/st33zp24/st33zp24.c | 28 +- > > drivers/char/tpm/tpm-interface.c | 41 +-- > > drivers/char/tpm/tpm-sysfs.c | 34 --- > > drivers/char/tpm/tpm.h | 60 +--- > > drivers/char/tpm/tpm1-cmd.c | 423 ++- > > drivers/char/tpm/tpm2-cmd.c | 120 > > drivers/char/tpm/tpm_crb.c | 20 +- > > drivers/char/tpm/tpm_i2c_atmel.c | 6 - > > drivers/char/tpm/tpm_i2c_infineon.c | 33 +-- > > drivers/char/tpm/tpm_i2c_nuvoton.c | 42 +-- > > drivers/char/tpm/tpm_nsc.c | 6 +- > > drivers/char/tpm/tpm_tis_core.c | 96 +- > > drivers/char/tpm/xen-tpmfront.c | 17 +- > > 13 files changed, 108 insertions(+), 818 deletions(-) > > > > diff --git a/drivers/char/tpm/st33zp24/st33zp24.c > > b/drivers/char/tpm/st33zp24/st33zp24.c > > index 64dc560859f2..433b9a72f0ef 100644 > > --- a/drivers/char/tpm/st33zp24/st33zp24.c > > +++ b/drivers/char/tpm/st33zp24/st33zp24.c > > @@ -154,13 +154,13 @@ static int request_locality(struct tpm_chip *chip) > > if (ret < 0) > > return ret; > > > > - stop = jiffies + chip->timeout_a; > > + stop = jiffies + TPM_UNIVERSAL_TIMEOUT_JIFFIES; > > > > /* Request locality is usually effective after the request */ > > do { > > if (check_locality(chip)) > > return tpm_dev->locality; > > - msleep(TPM_TIMEOUT); > > + msleep(TPM_TIMEOUT_POLL_MS); > > } while (time_before(jiffies, stop)); > > > > /* could not get locality */ > > @@ -193,7 +193,7 @@ static int get_burstcount(struct tpm_chip *chip) > > int burstcnt, status; > > u8 temp; > > > > - stop = jiffies + chip->timeout_d; > > + stop = jiffies + TPM_UNIVERSAL_TIMEOUT_JIFFIES; > > do { > > status = tpm_dev->ops->recv(tpm_dev->phy_id, TPM_STS + 1, > > , 1); > > @@ -209,7 +209,7 @@ static int get_burstcount(struct tpm_chip *chip) > > burstcnt |= temp << 8; > > if (burstcnt) > > return burstcnt; > > - msleep(TPM_TIMEOUT); > > + 
msleep(TPM_TIMEOUT_POLL_MS); > > } while (time_before(jiffies, stop)); > > return -EBUSY; > > } /* get_burstcount() */ > > @@ -248,11 +248,11 @@ static bool wait_for_tpm_stat_cond(struct tpm_chip > > *chip, u8 mask, > > * @param: check_cancel, does the command can be cancelled ? > > * @return: the tpm status, 0 if success, -ETIME if timeout is reached. > > */ > > -static int wait_for_stat(struct tpm_chip *chip, u8 mask, unsigned long > > timeout, > > +static int wait_for_stat(struct tpm_chip *chip, u8 mask, > > wait_queue_head_t *queue, bool check_cancel) > > { > > struct st33zp24_dev *tpm_dev = dev_get_drvdata(>dev); > > - unsigned long st
Re: [PATCH] tpm: Make timeout logic simpler and more robust
On Tuesday 03/12 at 17:39 +0200, Jarkko Sakkinen wrote:
> On Tue, Mar 12, 2019 at 07:42:46AM -0700, James Bottomley wrote:
> > On Tue, 2019-03-12 at 14:50 +0200, Jarkko Sakkinen wrote:
> > > On Mon, Mar 11, 2019 at 05:27:43PM -0700, James Bottomley wrote:
> > > > On Mon, 2019-03-11 at 16:54 -0700, Calvin Owens wrote:
> > > > > We're having lots of problems with TPM commands timing out, and
> > > > > we're seeing these problems across lots of different hardware
> > > > > (both v1/v2).
> > > > >
> > > > > I instrumented the driver to collect latency data, but I wasn't
> > > > > able to find any specific timeout to fix: it seems like many of
> > > > > them are too aggressive. So I tried replacing all the timeout
> > > > > logic with a single universal long timeout, and found that makes
> > > > > our TPMs 100% reliable.
> > > > >
> > > > > Given that this timeout logic is very complex, problematic, and
> > > > > appears to serve no real purpose, I propose simply deleting all
> > > > > of it.
> > > >
> > > > "no real purpose" is a bit strong given that all these timeouts are
> > > > standards mandated. The purpose stated by the standards is that
> > > > there needs to be a way of differentiating the TPM crashed from the
> > > > TPM is taking a very long time to respond. For a normally
> > > > functioning TPM it looks complex and unnecessary, but for a
> > > > malfunctioning one it's a lifesaver.
> > >
> > > Standards should be only followed when they make practical sense and
> > > ignored when not. The range is only up to 2s anyway.
> >
> > I don't disagree ... and I'm certainly not going to defend the TCG
> > because I do think the complexity of some of its standards contributed
> > to the lack of use of TPM 1.2.
> >
> > However, I am saying we should root cause this problem rather than take
> > a blind shot at the apparent timeout complexity. My timeout
> > instability is definitely related to the polling adjustments, so it's
> > not unreasonable to think Facebook's might be as well.
>
> Yeah, referring to my review comment, I think the very first thing
> that should be done is to split patch into two. Then we can probably
> give better feedback.

Absolutely, will do.

Thanks,
Calvin

> /Jarkko
Re: [PATCH] tpm: Make timeout logic simpler and more robust
On Monday 03/11 at 17:27 -0700, James Bottomley wrote:
> On Mon, 2019-03-11 at 16:54 -0700, Calvin Owens wrote:
> > We're having lots of problems with TPM commands timing out, and we're
> > seeing these problems across lots of different hardware (both v1/v2).
> >
> > I instrumented the driver to collect latency data, but I wasn't able
> > to find any specific timeout to fix: it seems like many of them are
> > too aggressive. So I tried replacing all the timeout logic with a
> > single universal long timeout, and found that makes our TPMs 100%
> > reliable.
> >
> > Given that this timeout logic is very complex, problematic, and
> > appears to serve no real purpose, I propose simply deleting all of
> > it.
>
> "no real purpose" is a bit strong given that all these timeouts are
> standards mandated.

Sure, in fairness I said "appears to" ;)

We tested this on roughly a hundred machines with a variety of hardware,
they were flaky before and essentially perfectly reliable after this
patch. So that's where I'm coming from here.

> The purpose stated by the standards is that there needs to be a way of
> differentiating the TPM crashed from the TPM is taking a very long
> time to respond. For a normally functioning TPM it looks complex and
> unnecessary, but for a malfunctioning one it's a lifesaver.

Does getting -EWHATEVER some 2-3 seconds more quickly really make much of
a difference? That's all we're talking about changing here, right?

> Could you first check it's not a problem we introduced with our polling
> changes? My nuvoton still doesn't work properly with the default poll
> timings but it works flawlessly if I use the patch below. I think my
> nuvoton is a bit out of spec (it's a very early model that was software
> upgraded from 1.2 to 2.0) because no-one else on the list seems to see
> the problems I see, but perhaps you are.

I did consider the polling changes. My thinking was that, since the poll
loops I was seeing time out are all gated on time_before(), it would only
potentially change how much the final poll overruns the target jiffies,
and wasn't as likely to help as changing the timeouts themselves.

The theory about poking it too aggressively making it fall off the bus
definitely makes sense, but the success of this "universal timeout"
approach suggests to me that the timeouts themselves are the root problem
with the flakiness we're seeing in production.

Thanks,
Calvin

> James
>
> ---
>
> From 249d60a9fafa8638433e545b50dab6987346cb26 Mon Sep 17 00:00:00 2001
> From: James Bottomley
> Date: Wed, 11 Jul 2018 10:11:14 -0700
> Subject: [PATCH] tpm.h: increase poll timings to fix tpm_tis regression
>
> tpm_tis regressed recently to the point where the TPM being driven by
> it falls off the bus and cannot be contacted after some hours of use.
> This is the failure trace:
>
> jejb@jarvis:~> dmesg|grep tpm
> [3.282605] tpm_tis MSFT0101:00: 2.0 TPM (device-id 0xFE, rev-id 2)
> [14566.626614] tpm tpm0: Operation Timed out
> [14566.626621] tpm tpm0: tpm2_load_context: failed with a system error -62
> [14568.626607] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> [14570.626594] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> [14570.626605] tpm tpm0: tpm2_load_context: failed with a system error -62
> [14572.626526] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> [14577.710441] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> ...
>
> The problem is caused by a change that caused us to poke the TPM far
> more often to see if it's ready. Apparently something about the bus
> it's on and the TPM means that it crashes or falls off the bus if you
> poke it too often and once this happens, only a reboot will recover
> it.
>
> The fix I've come up with is to adjust the timings so the TPM no
> longer falls off the bus. Obviously, this fix works for my Nuvoton
> NPCT6xxx but that's the only TPM I've tested it with.
>
> Fixes: 424eaf910c32 tpm: reduce polling time to usecs for even finer
> granularity
> Signed-off-by: James Bottomley
>
> diff --git a/drivers/char/tpm/tpm.h b/drivers/char/tpm/tpm.h
> index 4b104245afed..a6c806d98950 100644
> --- a/drivers/char/tpm/tpm.h
> +++ b/drivers/char/tpm/tpm.h
> @@ -64,8 +64,8 @@ enum tpm_timeout {
>  	TPM_TIMEOUT_RETRY = 100, /* msecs */
>  	TPM_TIMEOUT_RANGE_US = 300, /* usecs */
>  	TPM_TIMEOUT_POLL = 1, /* msecs */
> -	TPM_TIMEOUT_USECS_MIN = 100, /* usecs */
> -	TPM_TIMEOUT_USECS_MAX = 500 /* usecs */
> +	TPM_TIMEOUT_USECS_MIN = 750, /* usecs */
> +	TPM_TIMEOUT_USECS_MAX = 1000, /* usecs */
>  };
>
>  /* TPM addresses */
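The argument in this reply (the poll interval only changes how far the last sleep overruns the deadline, while the timeout decides whether a slow response is ever seen) can be illustrated with a small userspace model. This is only a sketch under stated assumptions: struct sim_tpm, sim_ready() and poll_until_ready() are hypothetical names, time is a simulated millisecond counter instead of jiffies, and sketch_time_before() imitates the wrap-safe comparison that the kernel's time_before() performs.

```c
#include <errno.h>

/* Wrap-safe "a is earlier than b", in the spirit of the kernel's time_before(). */
#define sketch_time_before(a, b)	((long)((a) - (b)) < 0)

/* Hypothetical simulated device: it becomes ready once 'now' reaches 'ready_at'. */
struct sim_tpm {
	unsigned long now;	/* simulated clock, in ms */
	unsigned long ready_at;	/* time at which the status bit gets set */
};

static int sim_ready(const struct sim_tpm *t)
{
	return !sketch_time_before(t->now, t->ready_at);
}

/*
 * Check, sleep, then re-test the deadline: the same shape as the do/while
 * loops touched by the patch. Returns 0 on success, or -ETIME if the
 * deadline passes before the device reports ready.
 */
static int poll_until_ready(struct sim_tpm *t, unsigned long timeout_ms,
			    unsigned long poll_ms)
{
	unsigned long stop = t->now + timeout_ms;

	do {
		if (sim_ready(t))
			return 0;
		t->now += poll_ms;	/* stands in for msleep(poll_ms) */
	} while (sketch_time_before(t->now, stop));

	return -ETIME;
}
```

In this model a device that needs 2500 ms fails against a 2000 ms deadline whether it is polled every 5 ms or every 750 ms; the coarser interval only moves the point at which the loop gives up (2250 ms instead of 2000 ms). Only a longer timeout turns the failure into a success.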
[PATCH] tpm: Make timeout logic simpler and more robust
We're having lots of problems with TPM commands timing out, and we're
seeing these problems across lots of different hardware (both v1/v2).

I instrumented the driver to collect latency data, but I wasn't able to
find any specific timeout to fix: it seems like many of them are too
aggressive. So I tried replacing all the timeout logic with a single
universal long timeout, and found that makes our TPMs 100% reliable.

Given that this timeout logic is very complex, problematic, and appears
to serve no real purpose, I propose simply deleting all of it.

Signed-off-by: Calvin Owens
---
 drivers/char/tpm/st33zp24/st33zp24.c |  28 +-
 drivers/char/tpm/tpm-interface.c     |  41 +--
 drivers/char/tpm/tpm-sysfs.c         |  34 ---
 drivers/char/tpm/tpm.h               |  60 +---
 drivers/char/tpm/tpm1-cmd.c          | 423 ++-
 drivers/char/tpm/tpm2-cmd.c          | 120
 drivers/char/tpm/tpm_crb.c           |  20 +-
 drivers/char/tpm/tpm_i2c_atmel.c     |   6 -
 drivers/char/tpm/tpm_i2c_infineon.c  |  33 +--
 drivers/char/tpm/tpm_i2c_nuvoton.c   |  42 +--
 drivers/char/tpm/tpm_nsc.c           |   6 +-
 drivers/char/tpm/tpm_tis_core.c      |  96 +-
 drivers/char/tpm/xen-tpmfront.c      |  17 +-
 13 files changed, 108 insertions(+), 818 deletions(-)

diff --git a/drivers/char/tpm/st33zp24/st33zp24.c b/drivers/char/tpm/st33zp24/st33zp24.c
index 64dc560859f2..433b9a72f0ef 100644
--- a/drivers/char/tpm/st33zp24/st33zp24.c
+++ b/drivers/char/tpm/st33zp24/st33zp24.c
@@ -154,13 +154,13 @@ static int request_locality(struct tpm_chip *chip)
 	if (ret < 0)
 		return ret;
 
-	stop = jiffies + chip->timeout_a;
+	stop = jiffies + TPM_UNIVERSAL_TIMEOUT_JIFFIES;
 
 	/* Request locality is usually effective after the request */
 	do {
 		if (check_locality(chip))
 			return tpm_dev->locality;
-		msleep(TPM_TIMEOUT);
+		msleep(TPM_TIMEOUT_POLL_MS);
 	} while (time_before(jiffies, stop));
 
 	/* could not get locality */
@@ -193,7 +193,7 @@ static int get_burstcount(struct tpm_chip *chip)
 	int burstcnt, status;
 	u8 temp;
 
-	stop = jiffies + chip->timeout_d;
+	stop = jiffies + TPM_UNIVERSAL_TIMEOUT_JIFFIES;
 	do {
 		status = tpm_dev->ops->recv(tpm_dev->phy_id, TPM_STS + 1,
 					    &temp, 1);
@@ -209,7 +209,7 @@ static int get_burstcount(struct tpm_chip *chip)
 		burstcnt |= temp << 8;
 		if (burstcnt)
 			return burstcnt;
-		msleep(TPM_TIMEOUT);
+		msleep(TPM_TIMEOUT_POLL_MS);
 	} while (time_before(jiffies, stop));
 	return -EBUSY;
 } /* get_burstcount() */
@@ -248,11 +248,11 @@ static bool wait_for_tpm_stat_cond(struct tpm_chip *chip, u8 mask,
  * @param: check_cancel, does the command can be cancelled ?
  * @return: the tpm status, 0 if success, -ETIME if timeout is reached.
  */
-static int wait_for_stat(struct tpm_chip *chip, u8 mask, unsigned long timeout,
+static int wait_for_stat(struct tpm_chip *chip, u8 mask,
 			wait_queue_head_t *queue, bool check_cancel)
 {
 	struct st33zp24_dev *tpm_dev = dev_get_drvdata(&chip->dev);
-	unsigned long stop;
+	unsigned long stop, timeout;
 	int ret = 0;
 	bool canceled = false;
 	bool condition;
@@ -264,7 +264,7 @@ static int wait_for_stat(struct tpm_chip *chip, u8 mask, unsigned long timeout,
 	if ((status & mask) == mask)
 		return 0;
 
-	stop = jiffies + timeout;
+	stop = jiffies + TPM_UNIVERSAL_TIMEOUT_JIFFIES;
 
 	if (chip->flags & TPM_CHIP_FLAG_IRQ) {
 		cur_intrs = tpm_dev->intrs;
@@ -296,7 +296,7 @@ static int wait_for_stat(struct tpm_chip *chip, u8 mask, unsigned long timeout,
 
 	} else {
 		do {
-			msleep(TPM_TIMEOUT);
+			msleep(TPM_TIMEOUT_POLL_MS);
 			status = chip->ops->status(chip);
 			if ((status & mask) == mask)
 				return 0;
@@ -321,7 +321,6 @@ static int recv_data(struct tpm_chip *chip, u8 *buf, size_t count)
 	while (size < count &&
 	       wait_for_stat(chip,
 			     TPM_STS_DATA_AVAIL | TPM_STS_VALID,
-			     chip->timeout_c,
 			     &tpm_dev->read_queue, true) == 0) {
 		burstcnt = get_burstcount(chip);
 		if (burstcnt < 0)
@@ -384,7 +383,7 @@ static int st33zp24_send(struct tpm_chip *chip, unsigned char *buf,
 	if ((status & TPM_STS_COMMAND_READY) == 0) {
 		st33zp24_cancel(chip);
 		if (wait_for_stat
-		    (chip, TPM_STS_COMMAND_READY, chip->timeout_b,
+		    (chip, TPM_STS_COMMAND_READY,
Re: [PATCH 4/4] printk: Add a device attribute for the per-console loglevel
On Monday 03/04 at 17:06 +0900, Sergey Senozhatsky wrote:
> On (03/01/19 16:48), Calvin Owens wrote:
> > +static struct attribute *console_sysfs_attrs[] = {
> > +	&dev_attr_loglevel.attr,
> > +	NULL,
> > +};
> > +ATTRIBUTE_GROUPS(console_sysfs);
> > +
> >  static struct bus_type console_subsys = {
> >  	.name = "console",
> > +	.dev_groups = console_sysfs_groups,
> >  };
>
> Do we really need to change this dynamically? Console options are
> traditionally static (boot param or DT). Can we also be happy with
> the static per-console loglevel?

It really does need to be runtime configurable: there are a lot of
usecases that enables, like turning the fast console up to KERN_DEBUG on
a pile of machines you want to take a closer look at.

The 'kernel.printk' global loglevel is also already changeable at
runtime, and since that setting interacts with this one it would be
strange if only the former were able to be changed.

I also want to add more attribute knobs related to extended consoles, so
the plumbing to get things exposed in sysfs is worth it for me.

Thanks,
Calvin
[PATCH 2/4] printk: Add ability to set loglevel via "console=" cmdline
This extends the "console=" interface to allow setting the per-console
loglevel by adding "/N" to the string, where N is the desired loglevel
expressed as a base 10 integer. Invalid values are silently ignored.

Signed-off-by: Calvin Owens
---
 .../admin-guide/kernel-parameters.txt |  6 ++--
 kernel/printk/console_cmdline.h       |  1 +
 kernel/printk/printk.c                | 30 +++
 3 files changed, 28 insertions(+), 9 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 858b6c0b9a15..afada61dcbce 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -612,10 +612,10 @@
 		ttyS<n>[,options]
 		ttyUSB0[,options]
 			Use the specified serial port. The options are of
-			the form "bbbbpnf", where "bbbb" is the baud rate,
+			the form "bbbbpnf/l", where "bbbb" is the baud rate,
 			"p" is parity ("n", "o", or "e"), "n" is number of
-			bits, and "f" is flow control ("r" for RTS or
-			omit it). Default is "9600n8".
+			bits, "f" is flow control ("r" for RTS or omit it),
+			and "l" is the loglevel on [0,7]. Default is "9600n8".
 
 			See Documentation/admin-guide/serial-console.rst for more
 			information. See
diff --git a/kernel/printk/console_cmdline.h b/kernel/printk/console_cmdline.h
index 11f19c466af5..fbf9b539366e 100644
--- a/kernel/printk/console_cmdline.h
+++ b/kernel/printk/console_cmdline.h
@@ -6,6 +6,7 @@ struct console_cmdline
 {
 	char	name[16];	/* Name of the driver */
 	int	index;		/* Minor dev. to use */
+	int	loglevel;	/* Loglevel to use */
 	char	*options;	/* Options for the driver */
 #ifdef CONFIG_A11Y_BRAILLE_CONSOLE
 	char	*brl_options;	/* Options for braille driver */
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 6ead14f8c2bc..2e0eb89f046c 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2057,7 +2057,7 @@ asmlinkage __visible void early_printk(const char *fmt, ...)
 #endif
 
 static int __add_preferred_console(char *name, int idx, char *options,
-				   char *brl_options)
+				   int loglevel, char *brl_options)
 {
 	struct console_cmdline *c;
 	int i;
@@ -2083,6 +2083,7 @@ static int __add_preferred_console(char *name, int idx, char *options,
 	c->options = options;
 	braille_set_options(c, brl_options);
 
+	c->loglevel = loglevel;
 	c->index = idx;
 	return 0;
 }
@@ -2104,8 +2105,8 @@ __setup("console_msg_format=", console_msg_format_setup);
 static int __init console_setup(char *str)
 {
 	char buf[sizeof(console_cmdline[0].name) + 4]; /* 4 for "ttyS" */
-	char *s, *options, *brl_options = NULL;
-	int idx;
+	char *s, *options, *llevel, *brl_options = NULL;
+	int idx, loglevel = LOGLEVEL_EMERG;
 
 	if (_braille_console_setup(&str, &brl_options))
 		return 1;
@@ -2123,6 +2124,14 @@ static int __init console_setup(char *str)
 	options = strchr(str, ',');
 	if (options)
 		*(options++) = 0;
+
+	llevel = strchr(str, '/');
+	if (llevel) {
+		*(llevel++) = 0;
+		if (kstrtoint(llevel, 10, &loglevel))
+			loglevel = LOGLEVEL_EMERG;
+	}
+
 #ifdef __sparc__
 	if (!strcmp(str, "ttya"))
 		strcpy(buf, "ttyS0");
@@ -2135,7 +2144,7 @@ static int __init console_setup(char *str)
 	idx = simple_strtoul(s, NULL, 10);
 	*s = 0;
 
-	__add_preferred_console(buf, idx, options, brl_options);
+	__add_preferred_console(buf, idx, options, loglevel, brl_options);
 	console_set_on_cmdline = 1;
 	return 1;
 }
@@ -2156,7 +2165,8 @@ __setup("console=", console_setup);
  */
 int add_preferred_console(char *name, int idx, char *options)
 {
-	return __add_preferred_console(name, idx, options, NULL);
+	return __add_preferred_console(name, idx, options, LOGLEVEL_EMERG,
+				       NULL);
 }
 
 bool console_suspend_enabled = true;
@@ -2574,6 +2584,7 @@ void register_console(struct console *newcon)
 	struct console *bcon = NULL;
 	struct console_cmdline *c;
 	static bool has_preferred;
+	bool cmdline_exists = false;
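To make the "/N" syntax concrete, here is a minimal userspace sketch of splitting a "console=" value into name, options and loglevel. It is not the kernel code: struct console_arg and parse_console_arg() are hypothetical names, strtol() stands in for kstrtoint(), and the sketch assumes the loglevel suffix follows the options, as the documentation hunk describes. As in the commit message, an invalid level is silently replaced by the default of 0 (the value of LOGLEVEL_EMERG).

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_LOGLEVEL_DEFAULT	0	/* mirrors LOGLEVEL_EMERG */

/* Hypothetical result of parsing one "console=" value. */
struct console_arg {
	char *name;	/* e.g. "ttyS0" */
	char *options;	/* e.g. "115200n8", or NULL if absent */
	int loglevel;	/* parsed "/N" suffix, or the default */
};

/* Splits 'str' in place, so the caller must pass a writable buffer. */
static void parse_console_arg(char *str, struct console_arg *out)
{
	char *slash, *comma, *end;
	long val;

	out->name = str;
	out->options = NULL;
	out->loglevel = SKETCH_LOGLEVEL_DEFAULT;

	/* Peel off the optional "/N" suffix first. */
	slash = strrchr(str, '/');
	if (slash) {
		*slash++ = '\0';
		errno = 0;
		val = strtol(slash, &end, 10);
		/* Anything that is not a clean base 10 int keeps the default. */
		if (errno == 0 && end != slash && *end == '\0' &&
		    val >= INT_MIN && val <= INT_MAX)
			out->loglevel = (int)val;
	}

	/* What remains is "name[,options]". */
	comma = strchr(str, ',');
	if (comma) {
		*comma++ = '\0';
		out->options = comma;
	}
}
```

With this sketch, "ttyS0,115200n8/7" yields the name "ttyS0", the options "115200n8" and loglevel 7, while "ttyS0/x" keeps the default level.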
[PATCH 1/4] printk: Introduce per-console loglevel setting
Not all consoles are created equal: depending on the actual hardware, the
latency of a printk() call can vary dramatically. The worst examples are
serial consoles, where it can spin for tens of milliseconds banging the
UART to emit a message, which can cause application-level problems when
the kernel spews onto the console.

At Facebook we use netconsole to monitor our fleet, but we still have
serial consoles attached on each host for live debugging, and the latter
has caused problems.

An obvious solution is to disable the kernel console output to ttyS0, but
this makes live debugging frustrating, since crashes become silent and
opaque to the ttyS0 user. Enabling it on the fly when needed isn't
feasible, since boxes you need to debug via serial are likely to be
borked in ways that make this impossible.

That puts us between a rock and a hard place: we'd love to set
kernel.printk to KERN_INFO and get all the logs. But while netconsole is
fast enough to permit that without perturbing userspace, ttyS0 is not,
and we're forced to limit console logging to KERN_WARNING and higher.

This patch introduces a new per-console loglevel setting, and changes
console_unlock() to use max(global_level, per_console_level) when
deciding whether or not to emit a given log message.

This lets us have our cake and eat it too: instead of being forced to
limit all consoles' verbosity based on the speed of the slowest one, we
can "promote" the faster console while still using a conservative system
loglevel setting to avoid disturbing applications.
Signed-off-by: Calvin Owens --- include/linux/console.h | 1 + kernel/printk/printk.c | 36 +++- 2 files changed, 20 insertions(+), 17 deletions(-) diff --git a/include/linux/console.h b/include/linux/console.h index ec9bdb3d7bab..3c27a4a29b8c 100644 --- a/include/linux/console.h +++ b/include/linux/console.h @@ -155,6 +155,7 @@ struct console { int cflag; void*data; struct console *next; + int level; }; /* diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index d3d170374ceb..6ead14f8c2bc 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -1164,9 +1164,14 @@ module_param(ignore_loglevel, bool, S_IRUGO | S_IWUSR); MODULE_PARM_DESC(ignore_loglevel, "ignore loglevel setting (prints all kernel messages to the console)"); -static bool suppress_message_printing(int level) +static int effective_loglevel(struct console *con) { - return (level >= console_loglevel && !ignore_loglevel); + return max(console_loglevel, con ? con->level : LOGLEVEL_EMERG); +} + +static bool suppress_message_printing(int level, struct console *con) +{ + return (level >= effective_loglevel(con) && !ignore_loglevel); } #ifdef CONFIG_BOOT_PRINTK_DELAY @@ -1198,7 +1203,7 @@ static void boot_delay_msec(int level) unsigned long timeout; if ((boot_delay == 0 || system_state >= SYSTEM_RUNNING) - || suppress_message_printing(level)) { + || suppress_message_printing(level, NULL)) { return; } @@ -1712,7 +1717,7 @@ static int console_trylock_spinning(void) * The console_lock must be held. 
*/ static void call_console_drivers(const char *ext_text, size_t ext_len, -const char *text, size_t len) +const char *text, size_t len, int level) { struct console *con; @@ -1731,6 +1736,8 @@ static void call_console_drivers(const char *ext_text, size_t ext_len, if (!cpu_online(smp_processor_id()) && !(con->flags & CON_ANYTIME)) continue; + if (suppress_message_printing(level, con)) + continue; if (con->flags & CON_EXTENDED) con->write(con, ext_text, ext_len); else @@ -2022,7 +2029,7 @@ static ssize_t msg_print_ext_body(char *buf, size_t size, static void console_lock_spinning_enable(void) { } static int console_lock_spinning_disable_and_check(void) { return 0; } static void call_console_drivers(const char *ext_text, size_t ext_len, -const char *text, size_t len) {} +const char *text, size_t len, int level) {} static size_t msg_print_text(const struct printk_log *msg, bool syslog, bool time, char *buf, size_t size) { return 0; } static bool suppress_message_printing(int level) { return false; } @@ -2358,21 +2365,11 @@ void console_unlock(void) } else { len = 0; } -skip: + if (console_seq == log_next_seq) break; msg = log_from_idx(console_idx); - if (suppress_message_printing(msg->level)) { -
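The max(global_level, per_console_level) rule from this patch can be tried out in isolation. The sketch below is a userspace model, not the kernel implementation: struct sketch_console and the function names are stand-ins, and the global loglevel and ignore_loglevel are passed as parameters instead of being globals. The comparison itself follows the patch: a message is suppressed when its level is numerically greater than or equal to the effective loglevel.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the one struct console field the patch adds. */
struct sketch_console {
	int level;	/* per-console loglevel; 0 defers to the global one */
};

/* Effective loglevel: the larger of the global and per-console settings. */
static int sketch_effective_loglevel(int global_level,
				     const struct sketch_console *con)
{
	int per_console = con ? con->level : 0;

	return global_level > per_console ? global_level : per_console;
}

/* A message is dropped when its level is >= the effective loglevel. */
static bool sketch_suppress(int msg_level, int global_level,
			    bool ignore_loglevel,
			    const struct sketch_console *con)
{
	return msg_level >= sketch_effective_loglevel(global_level, con) &&
	       !ignore_loglevel;
}
```

With a global loglevel of 5, a KERN_INFO message (level 6) is suppressed on a console left at 0 but printed on a console whose level is raised to 7. That is the "promote the faster console" case the commit message describes for netconsole next to ttyS0.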
[PATCH 4/4] printk: Add a device attribute for the per-console loglevel
Signed-off-by: Calvin Owens
---
 kernel/printk/printk.c | 40 
 1 file changed, 40 insertions(+)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 67e1e993ab80..e7e602fa2d0b 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2560,8 +2560,48 @@ static int __init keep_bootcon_setup(char *str)
 
 early_param("keep_bootcon", keep_bootcon_setup);
 
+static ssize_t loglevel_show(struct device *dev, struct device_attribute *attr,
+			     char *buf)
+{
+	struct console *con = container_of(dev, struct console, dev);
+	return sprintf(buf, "%d\n", con->level);
+}
+
+static ssize_t loglevel_store(struct device *dev, struct device_attribute *attr,
+			      const char *buf, size_t count)
+{
+	struct console *con = container_of(dev, struct console, dev);
+	ssize_t ret;
+	int tmp;
+
+	ret = kstrtoint(buf, 10, &tmp);
+	if (ret < 0)
+		return ret;
+
+	if (tmp < LOGLEVEL_EMERG)
+		return -ERANGE;
+
+	/*
+	 * Mimic the behavior of /dev/kmsg with respect to minimum_loglevel.
+	 */
+	if (tmp < minimum_console_loglevel)
+		tmp = minimum_console_loglevel;
+
+	con->level = tmp;
+	return ret;
+}
+
+static DEVICE_ATTR_RW(loglevel);
+
+static struct attribute *console_sysfs_attrs[] = {
+	&dev_attr_loglevel.attr,
+	NULL,
+};
+ATTRIBUTE_GROUPS(console_sysfs);
+
 static struct bus_type console_subsys = {
 	.name = "console",
+	.dev_groups = console_sysfs_groups,
 };
 
 static void console_release(struct device *dev)
-- 
2.17.1
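The validation order in loglevel_store() above (parse, reject negative values, clamp to the minimum) can be modelled outside the kernel. The following is a sketch under stated assumptions: sketch_store_loglevel() is a hypothetical name, strtol() stands in for kstrtoint(), the minimum is passed in as a parameter instead of being read from minimum_console_loglevel, and the function returns 0 on success instead of following the sysfs store convention.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse 'buf' as a base 10 loglevel and store it in '*level'.
 * Returns 0 on success, -EINVAL for malformed input, and -ERANGE for
 * values that overflow or are negative. '*level' is untouched on error.
 */
static int sketch_store_loglevel(const char *buf, int min_level, int *level)
{
	char *end;
	long val;

	errno = 0;
	val = strtol(buf, &end, 10);
	if (end == buf)
		return -EINVAL;		/* no digits at all */
	if (*end == '\n')
		end++;			/* tolerate one trailing newline */
	if (*end != '\0')
		return -EINVAL;		/* trailing garbage */
	if (errno == ERANGE || val > INT_MAX || val < INT_MIN)
		return -ERANGE;

	if (val < 0)			/* below 0, the value of LOGLEVEL_EMERG */
		return -ERANGE;
	if (val < min_level)
		val = min_level;	/* clamp instead of rejecting */

	*level = (int)val;
	return 0;
}
```

With a minimum of 1, writing "7\n" stores 7, writing "0" stores 1 (clamped), and writing "-1" fails with -ERANGE and leaves the previous value in place.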
[RFC][PATCH 0/4] Per-console loglevel support, console device bus
Hello all,

This is an extremely overdue refresh of this series:

	https://lkml.org/lkml/2017/9/28/770

The big change here is the 3rd patch, which actually wires up the console
drivers to support embedding a device structure, so we can place them on
a "console" bus and expose attributes in sysfs.

I left the very long list of driver maintainers off this first
submission, once there's agreement on the core idea here I'll add them.

Thanks,
Calvin

Calvin Owens (4):
  printk: Introduce per-console loglevel setting
  printk: Add ability to set loglevel via "console=" cmdline
  printk: Add consoles to a virtual "console" bus
  printk: Add a device attribute for the per-console loglevel

 131 files changed, 1859 insertions(+), 1061 deletions(-)

-- 
2.17.1
[PATCH] bnxt_en: Fix sources of spurious netpoll warnings
After applying 2270bc5da3497945 ("bnxt_en: Fix netpoll handling") and
903649e718f80da2 ("bnxt_en: Improve -ENOMEM logic in NAPI poll loop."),
we still see the following WARN fire:

  ------------[ cut here ]------------
  WARNING: CPU: 0 PID: 1875170 at net/core/netpoll.c:165 netpoll_poll_dev+0x15a/0x160
  bnxt_poll+0x0/0xd0 exceeded budget in poll
  Call Trace:
   [] dump_stack+0x4d/0x70
   [] __warn+0xd3/0xf0
   [] warn_slowpath_fmt+0x4f/0x60
   [] netpoll_poll_dev+0x15a/0x160
   [] netpoll_send_skb_on_dev+0x168/0x250
   [] netpoll_send_udp+0x2dc/0x440
   [] write_ext_msg+0x20e/0x250
   [] call_console_drivers.constprop.23+0xa5/0x110
   [] console_unlock+0x339/0x5b0
   [] vprintk_emit+0x2c8/0x450
   [] vprintk_default+0x1f/0x30
   [] printk+0x48/0x50
   [] edac_raw_mc_handle_error+0x563/0x5c0 [edac_core]
   [] edac_mc_handle_error+0x42b/0x6e0 [edac_core]
   [] sbridge_mce_output_error+0x410/0x10d0 [sb_edac]
   [] sbridge_check_error+0xac/0x130 [sb_edac]
   [] edac_mc_workq_function+0x3c/0x90 [edac_core]
   [] process_one_work+0x19b/0x480
   [] worker_thread+0x6a/0x520
   [] kthread+0xe4/0x100
   [] ret_from_fork+0x22/0x40

This happens because we increment rx_pkts on -ENOMEM and -EIO, resulting
in rx_pkts > 0. Fix this by only bumping rx_pkts if we were actually
given a non-zero budget.

Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index c5c38d4..f38160f 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -1883,7 +1883,7 @@ static int bnxt_poll_work(struct bnxt *bp, struct bnxt_napi *bnapi, int budget)
 			 * here forever if we consistently cannot allocate
 			 * buffers.
 			 */
-			else if (rc == -ENOMEM)
+			else if (rc == -ENOMEM && budget)
 				rx_pkts++;
 			else if (rc == -EBUSY)	/* partial completion */
 				break;
@@ -1969,7 +1969,7 @@ static int bnxt_poll_nitroa0(struct napi_struct *napi, int budget)
 				cpu_to_le32(RX_CMPL_ERRORS_CRC_ERROR);
 
 			rc = bnxt_rx_pkt(bp, bnapi, _cons, );
-			if (likely(rc == -EIO))
+			if (likely(rc == -EIO) && budget)
 				rx_pkts++;
 			else if (rc == -EBUSY)	/* partial completion */
 				break;
-- 
2.9.5
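The effect of the "&& budget" guard can be shown with a toy model of the accounting. This is a sketch only: sketch_poll_work() and sketch_exceeds_budget() are hypothetical names, the array of result codes stands in for successive bnxt_rx_pkt() return values, and the "work > budget" comparison is an assumed simplification of the netpoll check that produces the WARN above when a zero-budget poll reports work.

```c
#include <errno.h>

/*
 * Toy model of the rx accounting in a NAPI poll loop. 'results' holds one
 * return code per completion: >= 0 is a packet count, negative is an error.
 * Returns the amount of work reported back to the caller.
 */
static int sketch_poll_work(const int *results, int n, int budget)
{
	int rx_pkts = 0;
	int i;

	for (i = 0; i < n; i++) {
		int rc = results[i];

		if (rc >= 0)
			rx_pkts += rc;
		else if (rc == -ENOMEM && budget)
			rx_pkts++;	/* count it so a real poll makes progress */
		else if (rc == -EBUSY)
			break;		/* partial completion */

		if (budget && rx_pkts >= budget)
			break;
	}

	return rx_pkts;
}

/* Assumed simplification of the netpoll sanity check. */
static int sketch_exceeds_budget(int work, int budget)
{
	return work > budget;
}
```

With results of two -ENOMEM followed by -EBUSY, a normal poll with budget 64 reports 2, while a zero-budget poll reports 0 and therefore does not trip the check. Without the guard the zero-budget case would also report 2.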
Re: [PATCH 2/3] printk: Add /sys/consoles/ interface
On 11/03/2017 07:32 AM, Kroah-Hartman wrote: On Fri, Nov 03, 2017 at 03:21:14PM +0100, Petr Mladek wrote: On Thu 2017-09-28 17:43:56, Calvin Owens wrote: This adds a new sysfs interface that contains a directory for each console registered on the system. Each directory contains a single "loglevel" file for reading and setting the per-console loglevel. We can let kobject destruction race with console removal: if it does, loglevel_{show,store}() will safely fail with -ENODEV. This is a little weird, but avoids embedding the kobject and therefore needing to totally refactor the way we handle console struct lifetime. It looks like a sane approach. It might be worth a comment in the code. Documentation/ABI/testing/sysfs-consoles | 13 + include/linux/console.h | 1 + kernel/printk/printk.c | 88 3 files changed, 102 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-consoles diff --git a/Documentation/ABI/testing/sysfs-consoles b/Documentation/ABI/testing/sysfs-consoles new file mode 100644 index 000..6a1593e --- /dev/null +++ b/Documentation/ABI/testing/sysfs-consoles @@ -0,0 +1,13 @@ +What: /sys/consoles/ Eeek, what! I rather add Greg in CC. I am not 100% sure that the top level directory is the right thing to do. Neither do I. Sure. This is a placeholder I choose arbitrarily pending some real input on the location, sorry I didn't make that clear. Alternative might be to hide this under /sys/kernel/consoles/. No no no. +Date: September 2017 +KernelVersion: 4.15 +Contact: Calvin Owens <calvinow...@fb.com> +Description: The /sys/consoles tree contains a directory for each console + configured on the system. These directories contain the + following attributes: + + * "loglevel" Set the per-console loglevel: the kernel uses + max(system_loglevel, perconsole_loglevel) when + deciding whether to emit a given message. The + default is 0, which means max() always yields + the system setting in the kernel.printk sysctl. I would call the attribute "min_loglevel". 
The name "loglevel" should be reserved for the really used loglevel that depends also on the global loglevel value. diff --git a/include/linux/console.h b/include/linux/console.h index a5b5d79..76840be 100644 --- a/include/linux/console.h +++ b/include/linux/console.h @@ -148,6 +148,7 @@ struct console { void*data; struct console *next; int level; + struct kobject *kobj; Why are you using "raw" kobjects and not a "real" struct device? This is a device, use that interface instead please. If you need a console 'bus' to place them on, fine, but the virtual bus is probably best and simpler to use. The problem is that the console corresponds to no actual device (this is what Petr was getting at in the other mail). A console *may* be associated with a real TTY device, but this isn't universally true (for example, see netconsole_ext). Embedding a device struct in the console structure is problematic for the same reason embedding a raw kobject is: we'd need to rewrite all the code to deal with the new refcount/release semantics. While that's certainly possible, it ends up being a much bigger thorny change. If we deal with the "get()/deregister()" race in a safe way, it becomes very simple. (If it were as trivial as replacing kfrees with puts and adding release callbacks, that'd be the obvious way to go, but of course it doesn't end up being that nice...) That is if you _really_ feel you need sysfs interaction with the console layer (hint, I am not yet convinced...) How would you expose this setting if not via sysfs? 
All I care about is having the setting, how exactly userspace pokes it is not at all important :) }; /* diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index 3f1675e..488bda3 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -105,6 +105,8 @@ enum devkmsg_log_masks { static unsigned int __read_mostly devkmsg_log = DEVKMSG_LOG_MASK_DEFAULT; +static struct kobject *consoles_dir_kobj; static int __control_devkmsg(char *str) { if (!str) @@ -2371,6 +2373,82 @@ static int __init keep_bootcon_setup(char *str) early_param("keep_bootcon", keep_bootcon_setup); +static ssize_t loglevel_show(struct kobject *kobj, struct kobj_attribute *attr, +char *buf) +{ + struct console *con; + ssize_t ret = -ENODEV; + This might deserve a comment. Something like: /* * Find the related struct console a safe way. The kobject * desctruction is asynchronous. */ + console_
Re: [PATCH 1/3] printk: Introduce per-console loglevel setting
On 10/20/2017 01:05 AM, Petr Mladek wrote:
> On Thu 2017-10-19 16:40:45, Calvin Owens wrote:
> > On 09/28/2017 05:43 PM, Calvin Owens wrote:
> > > Not all consoles are created equal: depending on the actual hardware,
> > > the latency of a printk() call can vary dramatically. The worst
> > > examples are serial consoles, where it can spin for tens of
> > > milliseconds banging the UART to emit a message, which can cause
> > > application-level problems when the kernel spews onto the console.
> >
> > Any thoughts on this series? Happy to resend again, but if there are no
> > objections I'd love to see it merged sooner rather than later :)
> >
> > Happy to resend too, just let me know.
>
> There is no need to resend the patch. It is on my radar and I am going
> to look at it. Please, be patient, you hit conference, illness, after
> vacation season. We do not want to unnecessarily delay it but it is not
> a trivial change that might be accepted within minutes.

No worries, just wanted to make sure it hadn't been missed :)

Thanks,
Calvin

> Best Regards,
> Petr
Re: [PATCH 1/3] printk: Introduce per-console loglevel setting
On 09/28/2017 05:43 PM, Calvin Owens wrote: Not all consoles are created equal: depending on the actual hardware, the latency of a printk() call can vary dramatically. The worst examples are serial consoles, where it can spin for tens of milliseconds banging the UART to emit a message, which can cause application-level problems when the kernel spews onto the console. Any thoughts on this series? Happy to resend again, but if there are no objections I'd love to see it merged sooner rather than later :) Happy to resend too, just let me know. Thanks, Calvin At Facebook we use netconsole to monitor our fleet, but we still have serial consoles attached on each host for live debugging, and the latter has caused problems. An obvious solution is to disable the kernel console output to ttyS0, but this makes live debugging frustrating, since crashes become silent and opaque to the ttyS0 user. Enabling it on the fly when needed isn't feasible, since boxes you need to debug via serial are likely to be borked in ways that make this impossible. That puts us between a rock and a hard place: we'd love to set kernel.printk to KERN_INFO and get all the logs. But while netconsole is fast enough to permit that without perturbing userspace, ttyS0 is not, and we're forced to limit console logging to KERN_WARNING and higher. This patch introduces a new per-console loglevel setting, and changes console_unlock() to use max(global_level, per_console_level) when deciding whether or not to emit a given log message. This lets us have our cake and eat it too: instead of being forced to limit all consoles verbosity based on the speed of the slowest one, we can "promote" the faster console while still using a conservative system loglevel setting to avoid disturbing applications. 
Cc: Petr Mladek <pmla...@suse.com> Cc: Steven Rostedt <rost...@goodmis.org> Cc: Sergey Senozhatsky <sergey.senozhat...@gmail.com> Signed-off-by: Calvin Owens <calvinow...@fb.com> --- (V1: https://lkml.org/lkml/2017/4/4/783) Changes in V2: * Honor the ignore_loglevel setting in all cases * Change semantics to use max(global, console) as the loglevel for a console, instead of the previous patch where we treated the per-console one as a filter downstream of the global one. include/linux/console.h | 1 + kernel/printk/printk.c | 38 +++--- 2 files changed, 20 insertions(+), 19 deletions(-) diff --git a/include/linux/console.h b/include/linux/console.h index b8920a0..a5b5d79 100644 --- a/include/linux/console.h +++ b/include/linux/console.h @@ -147,6 +147,7 @@ struct console { int cflag; void*data; struct console *next; + int level; }; /* diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index 512f7c2..3f1675e 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -1141,9 +1141,14 @@ module_param(ignore_loglevel, bool, S_IRUGO | S_IWUSR); MODULE_PARM_DESC(ignore_loglevel, "ignore loglevel setting (prints all kernel messages to the console)"); -static bool suppress_message_printing(int level) +static int effective_loglevel(struct console *con) { - return (level >= console_loglevel && !ignore_loglevel); + return max(console_loglevel, con ? con->level : LOGLEVEL_EMERG); +} + +static bool suppress_message_printing(int level, struct console *con) +{ + return (level >= effective_loglevel(con) && !ignore_loglevel); } #ifdef CONFIG_BOOT_PRINTK_DELAY @@ -1175,7 +1180,7 @@ static void boot_delay_msec(int level) unsigned long timeout; if ((boot_delay == 0 || system_state >= SYSTEM_RUNNING) - || suppress_message_printing(level)) { + || suppress_message_printing(level, NULL)) { return; } @@ -1549,7 +1554,7 @@ SYSCALL_DEFINE3(syslog, int, type, char __user *, buf, int, len) * The console_lock must be held. 
*/ static void call_console_drivers(const char *ext_text, size_t ext_len, -const char *text, size_t len) +const char *text, size_t len, int level) { struct console *con; @@ -1568,6 +1573,8 @@ static void call_console_drivers(const char *ext_text, size_t ext_len, if (!cpu_online(smp_processor_id()) && !(con->flags & CON_ANYTIME)) continue; + if (suppress_message_printing(level, con)) + continue; if (con->flags & CON_EXTENDED) con->write(con, ext_text, ext_len); else @@ -1856,10 +1863,9 @@ static ssize_t msg_print_ext_body(char *buf, size_t size, char *dict, size_t dict_len, char *text, size_
[PATCH 1/3] printk: Introduce per-console loglevel setting
Not all consoles are created equal: depending on the actual hardware,
the latency of a printk() call can vary dramatically. The worst examples
are serial consoles, where it can spin for tens of milliseconds banging
the UART to emit a message, which can cause application-level problems
when the kernel spews onto the console.

At Facebook we use netconsole to monitor our fleet, but we still have
serial consoles attached on each host for live debugging, and the latter
has caused problems.

An obvious solution is to disable the kernel console output to ttyS0,
but this makes live debugging frustrating, since crashes become silent
and opaque to the ttyS0 user. Enabling it on the fly when needed isn't
feasible, since boxes you need to debug via serial are likely to be
borked in ways that make this impossible.

That puts us between a rock and a hard place: we'd love to set
kernel.printk to KERN_INFO and get all the logs. But while netconsole is
fast enough to permit that without perturbing userspace, ttyS0 is not,
and we're forced to limit console logging to KERN_WARNING and higher.

This patch introduces a new per-console loglevel setting, and changes
console_unlock() to use max(global_level, per_console_level) when
deciding whether or not to emit a given log message.

This lets us have our cake and eat it too: instead of being forced to
limit all consoles verbosity based on the speed of the slowest one, we
can "promote" the faster console while still using a conservative system
loglevel setting to avoid disturbing applications.

Cc: Petr Mladek <pmla...@suse.com>
Cc: Steven Rostedt <rost...@goodmis.org>
Cc: Sergey Senozhatsky <sergey.senozhat...@gmail.com>
Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
(V1: https://lkml.org/lkml/2017/4/4/783)

Changes in V2:
	* Honor the ignore_loglevel setting in all cases
	* Change semantics to use max(global, console) as the loglevel
	  for a console, instead of the previous patch where we treated
	  the per-console one as a filter downstream of the global one.

 include/linux/console.h |  1 +
 kernel/printk/printk.c  | 38 +++---
 2 files changed, 20 insertions(+), 19 deletions(-)

diff --git a/include/linux/console.h b/include/linux/console.h
index b8920a0..a5b5d79 100644
--- a/include/linux/console.h
+++ b/include/linux/console.h
@@ -147,6 +147,7 @@ struct console {
 	int	cflag;
 	void	*data;
 	struct console *next;
+	int	level;
 };
 
 /*
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 512f7c2..3f1675e 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1141,9 +1141,14 @@ module_param(ignore_loglevel, bool, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(ignore_loglevel,
 		 "ignore loglevel setting (prints all kernel messages to the console)");
 
-static bool suppress_message_printing(int level)
+static int effective_loglevel(struct console *con)
 {
-	return (level >= console_loglevel && !ignore_loglevel);
+	return max(console_loglevel, con ? con->level : LOGLEVEL_EMERG);
+}
+
+static bool suppress_message_printing(int level, struct console *con)
+{
+	return (level >= effective_loglevel(con) && !ignore_loglevel);
 }
 
 #ifdef CONFIG_BOOT_PRINTK_DELAY
@@ -1175,7 +1180,7 @@ static void boot_delay_msec(int level)
 	unsigned long timeout;
 
 	if ((boot_delay == 0 || system_state >= SYSTEM_RUNNING)
-		|| suppress_message_printing(level)) {
+		|| suppress_message_printing(level, NULL)) {
 		return;
 	}
 
@@ -1549,7 +1554,7 @@ SYSCALL_DEFINE3(syslog, int, type, char __user *, buf, int, len)
  * The console_lock must be held.
  */
 static void call_console_drivers(const char *ext_text, size_t ext_len,
-				 const char *text, size_t len)
+				 const char *text, size_t len, int level)
 {
 	struct console *con;
 
@@ -1568,6 +1573,8 @@ static void call_console_drivers(const char *ext_text, size_t ext_len,
 		if (!cpu_online(smp_processor_id()) &&
 		    !(con->flags & CON_ANYTIME))
 			continue;
+		if (suppress_message_printing(level, con))
+			continue;
 		if (con->flags & CON_EXTENDED)
 			con->write(con, ext_text, ext_len);
 		else
@@ -1856,10 +1863,9 @@ static ssize_t msg_print_ext_body(char *buf, size_t size,
 				  char *dict, size_t dict_len,
 				  char *text, size_t text_len) { return 0; }
 static void call_console_drivers(const char *ext_text, size_t ext_len,
-				 const char *text, size_t len) {}
+				 const char *text, size_t len, int level) {}
 static size_t msg_print_text(const struct printk_log *msg,
 			     bool syslog, char *buf, size_t size) { ret
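For readers who want to poke at the max(global, console) semantics outside the kernel, here is a minimal userspace sketch. It is not the patch itself: struct console_model, the file-scope globals, and the assert() range check are stand-ins invented for illustration; only the two function bodies follow effective_loglevel() and suppress_message_printing() from the diff above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Numeric values match the kernel's KERN_* levels. */
#define LOGLEVEL_EMERG   0
#define LOGLEVEL_WARNING 4
#define LOGLEVEL_DEBUG   7

/* Hypothetical stand-in for struct console: only the new field matters. */
struct console_model {
	const char *name;
	int level;	/* per-console loglevel, 0 means "not set" */
};

/* Stand-ins for the kernel globals of the same name. */
static int console_loglevel = LOGLEVEL_WARNING;
static bool ignore_loglevel;

/* The larger of the global and the per-console setting wins. */
static int effective_loglevel(const struct console_model *con)
{
	int per_console = con ? con->level : LOGLEVEL_EMERG;

	return console_loglevel > per_console ? console_loglevel : per_console;
}

/* True when a message of the given level must not reach this console. */
static bool suppress_message_printing(int level, const struct console_model *con)
{
	/* Sanity check added for the sketch only, not in the patch. */
	assert(level >= LOGLEVEL_EMERG && level <= LOGLEVEL_DEBUG);

	return level >= effective_loglevel(con) && !ignore_loglevel;
}
```

With the global level at 4, a console whose level is left at 0 suppresses a level-6 (KERN_INFO) message, while a console set to 7 lets it through. That is the "promote the faster console" case from the commit message.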
[PATCH 3/3] printk: Add ability to set loglevel via "console=" cmdline
This extends the "console=" interface to allow setting the per-console
loglevel by adding "/N" to the string, where N is the desired loglevel
expressed as a base 10 integer. Invalid values are silently ignored.

Cc: Petr Mladek <pmla...@suse.com>
Cc: Steven Rostedt <rost...@goodmis.org>
Cc: Sergey Senozhatsky <sergey.senozhat...@gmail.com>
Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  6 ++---
 kernel/printk/console_cmdline.h                 |  1 +
 kernel/printk/printk.c                          | 30 -
 3 files changed, 28 insertions(+), 9 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 0549662..f22b992 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -607,10 +607,10 @@
 		ttyS<n>[,options]
 		ttyUSB0[,options]
 			Use the specified serial port. The options are of
-			the form "bbbbpnf", where "bbbb" is the baud rate,
+			the form "bbbbpnf/l", where "bbbb" is the baud rate,
 			"p" is parity ("n", "o", or "e"), "n" is number of
-			bits, and "f" is flow control ("r" for RTS or
-			omit it). Default is "9600n8".
+			bits, "f" is flow control ("r" for RTS or omit it),
+			and "l" is the loglevel on [0,7]. Default is "9600n8".
 
 			See Documentation/admin-guide/serial-console.rst for more
 			information. See
diff --git a/kernel/printk/console_cmdline.h b/kernel/printk/console_cmdline.h
index 2ca4a8b..269e666 100644
--- a/kernel/printk/console_cmdline.h
+++ b/kernel/printk/console_cmdline.h
@@ -5,6 +5,7 @@ struct console_cmdline
 {
 	char	name[16];		/* Name of the driver */
 	int	index;			/* Minor dev. to use */
+	int	loglevel;		/* Loglevel to use */
 	char	*options;		/* Options for the driver */
 #ifdef CONFIG_A11Y_BRAILLE_CONSOLE
 	char	*brl_options;		/* Options for braille driver */
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 488bda3..4c14cf2 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1892,7 +1892,7 @@ asmlinkage __visible void early_printk(const char *fmt, ...)
 #endif
 
 static int __add_preferred_console(char *name, int idx, char *options,
-				   char *brl_options)
+				   int loglevel, char *brl_options)
 {
 	struct console_cmdline *c;
 	int i;
@@ -1918,6 +1918,7 @@ static int __add_preferred_console(char *name, int idx, char *options,
 	c->options = options;
 	braille_set_options(c, brl_options);
 
+	c->loglevel = loglevel;
 	c->index = idx;
 	return 0;
 }
@@ -1928,8 +1929,8 @@ static int __add_preferred_console(char *name, int idx, char *options,
 static int __init console_setup(char *str)
 {
 	char buf[sizeof(console_cmdline[0].name) + 4]; /* 4 for "ttyS" */
-	char *s, *options, *brl_options = NULL;
-	int idx;
+	char *s, *options, *llevel, *brl_options = NULL;
+	int idx, loglevel = LOGLEVEL_EMERG;
 
 	if (_braille_console_setup(&str, &brl_options))
 		return 1;
@@ -1947,6 +1948,14 @@ static int __init console_setup(char *str)
 	options = strchr(str, ',');
 	if (options)
 		*(options++) = 0;
+
+	llevel = strchr(str, '/');
+	if (llevel) {
+		*(llevel++) = 0;
+		if (kstrtoint(llevel, 10, &loglevel))
+			loglevel = LOGLEVEL_EMERG;
+	}
+
 #ifdef __sparc__
 	if (!strcmp(str, "ttya"))
 		strcpy(buf, "ttyS0");
@@ -1959,7 +1968,7 @@ static int __init console_setup(char *str)
 	idx = simple_strtoul(s, NULL, 10);
 	*s = 0;
 
-	__add_preferred_console(buf, idx, options, brl_options);
+	__add_preferred_console(buf, idx, options, loglevel, brl_options);
 	console_set_on_cmdline = 1;
 	return 1;
 }
@@ -1980,7 +1989,8 @@ __setup("console=", console_setup);
  */
 int add_preferred_console(char *name, int idx, char *options)
 {
-	return __add_preferred_console(name, idx, options, NULL);
+	return __add_preferred_console(name, idx, options, LOGLEVEL_EMERG,
+				       NULL);
 }
 
 bool console_suspend_enabled = true;
@@ -2475,6 +2485,7 @@ void register_console(struct console *newcon)
 	struct console *bcon = NULL;
 	struct console_cmdline *c;
 	static bool
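The "/N" handling can be tried in userspace with a rough model like the one below. This is a sketch under assumptions, not the kernel code: split_console_loglevel() is a made-up name, strtol() stands in for kstrtoint() (the two differ in corner cases such as leading whitespace and a trailing newline), and the sketch only looks at whatever substring it is handed, so it says nothing about where the suffix sits relative to the ",options" part. Like the patch, it falls back to LOGLEVEL_EMERG on a parse failure and does not range-check against [0,7].

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define LOGLEVEL_EMERG 0	/* same numeric value the kernel uses */

/*
 * Hypothetical helper: split "name/N" in place at the first '/' and
 * return N, or LOGLEVEL_EMERG when there is no '/' or when what
 * follows it is not a clean base-10 integer that fits in an int.
 */
static int split_console_loglevel(char *str)
{
	char *llevel, *end;
	long val;

	assert(str != NULL);

	llevel = strchr(str, '/');
	if (!llevel)
		return LOGLEVEL_EMERG;

	*llevel++ = '\0';	/* str now ends where the '/' was */

	errno = 0;
	val = strtol(llevel, &end, 10);
	if (errno || end == llevel || *end != '\0' ||
	    val < INT_MIN || val > INT_MAX)
		return LOGLEVEL_EMERG;	/* invalid values are silently ignored */

	return (int)val;
}
```

Calling it on a writable buffer holding "ttyS0/7" returns 7 and leaves "ttyS0" in the buffer; "ttyS0/abc" returns 0 and still strips the suffix, which mirrors how the patch zeroes the '/' before attempting the conversion.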
[PATCH 2/3] printk: Add /sys/consoles/ interface
This adds a new sysfs interface that contains a directory for each
console registered on the system. Each directory contains a single
"loglevel" file for reading and setting the per-console loglevel.

We can let kobject destruction race with console removal: if it does,
loglevel_{show,store}() will safely fail with -ENODEV. This is a little
weird, but avoids embedding the kobject and therefore needing to totally
refactor the way we handle console struct lifetime.

Cc: Petr Mladek <pmla...@suse.com>
Cc: Steven Rostedt <rost...@goodmis.org>
Cc: Sergey Senozhatsky <sergey.senozhat...@gmail.com>
Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
(V1: https://lkml.org/lkml/2017/4/4/784)

Changes in V2:
	* Honor minimum_console_loglevel when setting loglevels
	* Added entry in Documentation/ABI/testing

 Documentation/ABI/testing/sysfs-consoles | 13 +
 include/linux/console.h                  |  1 +
 kernel/printk/printk.c                   | 88
 3 files changed, 102 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-consoles

diff --git a/Documentation/ABI/testing/sysfs-consoles b/Documentation/ABI/testing/sysfs-consoles
new file mode 100644
index 000..6a1593e
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-consoles
@@ -0,0 +1,13 @@
+What:		/sys/consoles/
+Date:		September 2017
+KernelVersion:	4.15
+Contact:	Calvin Owens <calvinow...@fb.com>
+Description:	The /sys/consoles tree contains a directory for each console
+		configured on the system. These directories contain the
+		following attributes:
+
+		* "loglevel"	Set the per-console loglevel: the kernel uses
+				max(system_loglevel, perconsole_loglevel) when
+				deciding whether to emit a given message. The
+				default is 0, which means max() always yields
+				the system setting in the kernel.printk sysctl.
diff --git a/include/linux/console.h b/include/linux/console.h
index a5b5d79..76840be 100644
--- a/include/linux/console.h
+++ b/include/linux/console.h
@@ -148,6 +148,7 @@ struct console {
 	void	*data;
 	struct	 console *next;
 	int	level;
+	struct kobject *kobj;
 };
 
 /*
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 3f1675e..488bda3 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -105,6 +105,8 @@ enum devkmsg_log_masks {
 
 static unsigned int __read_mostly devkmsg_log = DEVKMSG_LOG_MASK_DEFAULT;
 
+static struct kobject *consoles_dir_kobj;
+
 static int __control_devkmsg(char *str)
 {
 	if (!str)
@@ -2371,6 +2373,82 @@ static int __init keep_bootcon_setup(char *str)
 
 early_param("keep_bootcon", keep_bootcon_setup);
 
+static ssize_t loglevel_show(struct kobject *kobj, struct kobj_attribute *attr,
+			     char *buf)
+{
+	struct console *con;
+	ssize_t ret = -ENODEV;
+
+	console_lock();
+	for_each_console(con) {
+		if (con->kobj == kobj) {
+			ret = sprintf(buf, "%d\n", con->level);
+			break;
+		}
+	}
+	console_unlock();
+
+	return ret;
+}
+
+static ssize_t loglevel_store(struct kobject *kobj, struct kobj_attribute *attr,
+			      const char *buf, size_t count)
+{
+	struct console *con;
+	ssize_t ret;
+	int tmp;
+
+	ret = kstrtoint(buf, 10, &tmp);
+	if (ret < 0)
+		return ret;
+
+	if (tmp < LOGLEVEL_EMERG)
+		return -ERANGE;
+
+	/*
+	 * Mimic the behavior of /dev/kmsg with respect to minimum_loglevel
+	 */
+	if (tmp < minimum_console_loglevel)
+		tmp = minimum_console_loglevel;
+
+	ret = -ENODEV;
+	console_lock();
+	for_each_console(con) {
+		if (con->kobj == kobj) {
+			con->level = tmp;
+			ret = count;
+			break;
+		}
+	}
+	console_unlock();
+
+	return ret;
+}
+
+static const struct kobj_attribute console_loglevel_attr =
+	__ATTR(loglevel, 0644, loglevel_show, loglevel_store);
+
+static void console_register_sysfs(struct console *newcon)
+{
+	/*
+	 * We might be called very early from register_console(): in that case,
+	 * printk_late_init() will take care of this later.
+	 */
+	if (!consoles_dir_kobj)
+		return;
+
+	newcon->kobj = kobject_create_and_add(newcon->name, consoles_dir_kobj);
+	if (WARN_ON(!newcon->kobj))
+		return;
+
+	WARN_ON(sysfs_create_file(newcon->kobj, &console_loglevel_attr.attr));
+}
+
+static void console_unregister_sysfs(struct console *oldcon)
+{
+	kobject_put(oldcon->kobj);
+}
+
 /*
  * The console driver calls this routine during kernel initialization
  * to register the console printin
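The validation and clamping steps in loglevel_store(), together with the max() rule stated in the ABI entry, can be sketched in plain C roughly as follows. This is only an illustration under stated assumptions: the helper names sketch_store_loglevel() and sketch_effective_loglevel() are hypothetical, and the locking and console lookup from the patch are left out entirely.

```c
#include <errno.h>

#define LOGLEVEL_EMERG 0

/*
 * Mirrors the checks in loglevel_store(): reject negative values,
 * then raise anything below the minimum up to the minimum.
 * Returns 0 and writes *stored on success, -ERANGE otherwise.
 */
static int sketch_store_loglevel(int requested, int minimum_console_loglevel,
				 int *stored)
{
	if (requested < LOGLEVEL_EMERG)
		return -ERANGE;

	if (requested < minimum_console_loglevel)
		requested = minimum_console_loglevel;

	*stored = requested;
	return 0;
}

/* The rule quoted in the ABI text: max(system_loglevel, perconsole_loglevel). */
static int sketch_effective_loglevel(int system_loglevel, int perconsole_loglevel)
{
	return system_loglevel > perconsole_loglevel ?
		system_loglevel : perconsole_loglevel;
}
```

With a per-console value of 0 (the default named in the ABI text), sketch_effective_loglevel() always returns the system setting, which matches the description above.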
Re: [RFC][PATCH 1/2] printk: Introduce per-console filtering of messages by loglevel
On Thursday 04/06 at 16:02 +0200, Petr Mladek wrote:
> On Wed 2017-04-05 17:38:19, Calvin Owens wrote:
> > On Wednesday 04/05 at 17:22 +0200, Petr Mladek wrote:
> > > I think about a reasonable behavior. There seems to be three variables
> > > that are related and are in use:
> > >
> > >    console_level
> > >    minimum_console_loglevel
> > >    ignore_loglevel
> > >
> > > The functions seems to be the following:
> > >
> > >   + console_level defines the current maximum level of
> > >     messages that appear on all enabled consoles; it
> > >     allows to filter out less important ones
> > >
> > >   + minimum_console_loglevel defines the minimum
> > >     console_loglevel that might be set by userspace
> > >     via syslog interface; it prevents userspace from
> > >     hiding emergency messages
> > >
> > >   + ignore_loglevel allows to see all messages
> > >     easily; it is used for debugging
> > >
> > > IMPORTANT: console_level is increased in some special
> > > situations to see everything, e.g. in panic(), oops_begin(),
> > > __handle_sysrq().
> > >
> > > I guess that people want to see all messages even on the slow
> > > console during panic(), oops(), with ignore_loglevel. It means
> > > that the new per-console setting must not limit it. Also any
> > > console must not go below minimum_console_level.
> >
> > I can definitely take oops_in_progress and minimum_console_level into
> > account in the drop condition. I can also send a patch to make the sysrq
> > handler reset all the maxlevels to LOGLEVEL_DEBUG if you like.
>
> Please note that you must not call console_lock() in the sysrq
> handler. The function might sleep and it is irq context.
> By other words, you could not manipulate the console structures
> there.

Sure, I'd punt it to process context somehow.

> > > What about doing it the other way and define min_loglevel
> > > for each console. It might be used to make selected consoles
> > > always more verbose (above current console_level) but it
> > > will not limit the more verbose modes.
> >
> > I think it's more intuitive to let the global sysctl behave as it always
> > has, and allow additional filtering of higher levels downstream. I can
> > definitely see why users might find this a bit confusing, but IMHO
> > stacking two "filters" is more intuitive than a "filter" and a "bypass".
>
> I do not have strong opinion here. I like the idea of this patch.
> Sadly, the console setting already is pretty confusing.
>
> I know that many people, including me, have troubles to understand
> the meaning of the 4 numbers in /proc/sys/kernel/printk. They set
>
>    console_loglevel
>    default_message_loglevel
>    minimum_console_loglevel
>    default_console_loglevel
>
> And we are going to add another complexity :-(
>
>
> > How about a read-only "functional_loglevel" attribute for each console
> > that displays:
> >
> > 	max(min(console_level, con->maxlevel), minimum_console_level)
>
> I like this idea and it inspired me. What about creating the following
> structure under /sys
>
>    /sys/consoles//loglevel
>    /minimum_loglevel
>    //loglevel
>    /minimum_loglevel
>    /loglevel
>    /minimum_loglevel
>
> The semantic would be:
>
>    + global loglevel will show the current default console_loglevel,
>      it must be above the global minimum_console_loglevel
>
>    + the per-console loglevel will show the loglevel specific
>      for the given console; it must be above the per-console
>      minimum_loglevel while
>
>    + the per-console minimum_loglevel must be above the global
>      minimum_console_loglevel
>
>    The setting of the global values would affect the per-console
>    values but it must respect the above rules.
>
>    It is still the "filter" and "bypass" logic. But we will just
>    repeat the existing terms and logic. Also note that
>    "ignore_loglevel" and the special modes in sysrq, panic, oops
>    use the "bypass" logic as well.

Okay, I see where you're coming from. Let me play with this a bit, I'll
send something concrete in the next day or two.

> > Would that make the semantics more obvious? I'll obviously also send
> > patches for Documentation once there's consensus about the interface.
>
> Please add also linux-...@vger.kernel.org, especially for the
> patch adding the new toplevel directory under /sys.

Will do.

Thanks,
Calvin

> Best Regards,
> Petr
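To make the three ordering rules in Petr's proposal concrete, here is a small consistency check in plain C. It is a sketch only: the struct and function names are hypothetical, and it assumes "must be above" means "must not be below", which the thread does not spell out.

```c
#include <stdbool.h>

/* Hypothetical pair of values, one instance global and one per console. */
struct sketch_loglevels {
	int loglevel;
	int minimum_loglevel;
};

/*
 * Checks the three rules from the proposal:
 *   1. global loglevel is not below the global minimum
 *   2. per-console loglevel is not below the per-console minimum
 *   3. per-console minimum is not below the global minimum
 */
static bool sketch_loglevels_consistent(const struct sketch_loglevels *global,
					const struct sketch_loglevels *con)
{
	if (global->loglevel < global->minimum_loglevel)
		return false;
	if (con->loglevel < con->minimum_loglevel)
		return false;
	if (con->minimum_loglevel < global->minimum_loglevel)
		return false;
	return true;
}
```

A setter for the global values could run a check like this against every console before committing a change, which is one possible reading of "must respect the above rules".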
Re: [RFC][PATCH 1/2] printk: Introduce per-console filtering of messages by loglevel
On Wednesday 04/05 at 17:22 +0200, Petr Mladek wrote:
> On Wed 2017-04-05 11:16:28, Sergey Senozhatsky wrote:
> > On (04/05/17 11:08), Sergey Senozhatsky wrote:
> > [..]
> > > >  	stop_critical_timings();	/* don't trace print latency */
> > > > -	call_console_drivers(ext_text, ext_len, text, len);
> > > > +	call_console_drivers(ext_text, ext_len, text, len, msg->level);
> > > >  	start_critical_timings();
> > > >  	printk_safe_exit_irqrestore(flags);
> > >
> > > ok, so the idea is quite clear and reasonable.
> > >
> > >
> > > some thoughts,
> > > we have a system-wide suppress_message_printing() loglevel filtering
> > > in console_unlock() loop, which sets a limit on loglevel for all of
> > > the messages - we don't even msg_print_text() if the message has
> > > suppressible loglevel. and this implicitly restricts per-console
> > > maxlevels.
> > >
> > > 	console_unlock()
> > > 	{
> > > 		for (;;) {
> > > 			...
> > > 	skip:
> > >
> > > 			if (suppress_message_printing(msg->level)) // console_loglevel
> > > 				goto skip;
> > >
> > > 			call_console_drivers(msg->level)
> > > 			{
> > > 				if (level > con->maxlevel) // con loglevel
> > > 					continue;
> > > 				...
> > > 			}
> > > 		}
> > > 	}
> > >
> > > this can be slightly confusing. what do you think?

I think it makes sense as long as we're clear about the semantics: if a
message would normally be printed to the console according to the global
settings, this allows you to limit the loglevel a specific console will
print.

Petr suggested the opposite approach, I'll address that below.

> I think about a reasonable behavior. There seems to be three variables
> that are related and are in use:
>
>    console_level
>    minimum_console_loglevel
>    ignore_loglevel
>
> The functions seems to be the following:
>
>   + console_level defines the current maximum level of
>     messages that appear on all enabled consoles; it
>     allows to filter out less important ones
>
>   + minimum_console_loglevel defines the minimum
>     console_loglevel that might be set by userspace
>     via syslog interface; it prevents userspace from
>     hiding emergency messages
>
>   + ignore_loglevel allows to see all messages
>     easily; it is used for debugging
>
> IMPORTANT: console_level is increased in some special
> situations to see everything, e.g. in panic(), oops_begin(),
> __handle_sysrq().
>
> I guess that people want to see all messages even on the slow
> console during panic(), oops(), with ignore_loglevel. It means
> that the new per-console setting must not limit it. Also any
> console must not go below minimum_console_level.

I can definitely take oops_in_progress and minimum_console_level into
account in the drop condition. I can also send a patch to make the sysrq
handler reset all the maxlevels to LOGLEVEL_DEBUG if you like.

> What about doing it the other way and define min_loglevel
> for each console. It might be used to make selected consoles
> always more verbose (above current console_level) but it
> will not limit the more verbose modes.

I think it's more intuitive to let the global sysctl behave as it always
has, and allow additional filtering of higher levels downstream. I can
definitely see why users might find this a bit confusing, but IMHO
stacking two "filters" is more intuitive than a "filter" and a "bypass".

How about a read-only "functional_loglevel" attribute for each console
that displays:

	max(min(console_level, con->maxlevel), minimum_console_level)

Would that make the semantics more obvious? I'll obviously also send
patches for Documentation once there's consensus about the interface.

Thanks,
Calvin

> Best Regards,
> Petr
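The "functional_loglevel" expression above is easy to evaluate by hand, but a tiny helper makes the interaction of the three values explicit. This is an illustrative sketch only; the function name and the plain int parameters are assumptions, and nothing like it appears in the posted patches.

```c
/* Smaller of two ints. */
static int sketch_min(int a, int b)
{
	return a < b ? a : b;
}

/* Larger of two ints. */
static int sketch_max(int a, int b)
{
	return a > b ? a : b;
}

/*
 * max(min(console_level, con->maxlevel), minimum_console_level):
 * the per-console maxlevel can only lower the global level, and the
 * result never drops below the global minimum.
 */
static int sketch_functional_loglevel(int console_level, int con_maxlevel,
				      int minimum_console_level)
{
	return sketch_max(sketch_min(console_level, con_maxlevel),
			  minimum_console_level);
}
```

For instance, with console_level 7, a per-console maxlevel of 4 and a minimum of 1, the attribute would read 4; with console_level 3 the same console would read 3, since the global setting is already the tighter filter.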
Re: [RFC][PATCH 2/2] printk: Add /sys/consoles/${con}/ and maxlevel attribute
On Tuesday 04/04 at 23:30 -0400, Steven Rostedt wrote:
> On Tue, 4 Apr 2017 16:03:20 -0700
> Calvin Owens <calvinow...@fb.com> wrote:
>
> > This does the simplest possible thing: add a directory at the root of
> > sysfs that allows setting the "maxlevel" parameter for each console.
> >
> > We can let kobject destruction race with console removal: if it does,
> > maxlevel_{show,store}() will safely fail with -ENODEV. This is a little
> > weird, but avoids embedding the kobject and therefore needing to totally
> > refactor the way we handle console struct lifetime.
>
> Can you also add a patch that allows this to be set on the kernel
> command line, when the consoles are defined.

Absolutely :)

Thanks,
Calvin

> -- Steve
>
> >
> > Signed-off-by: Calvin Owens <calvinow...@fb.com>
Re: [RFC][PATCH 1/2] printk: Introduce per-console filtering of messages by loglevel
On Tuesday 04/04 at 23:27 -0400, Steven Rostedt wrote:
> On Wed, 5 Apr 2017 11:16:28 +0900
> Sergey Senozhatsky wrote:
>
> > one more thing.
> >
> > this per-console filtering ignores... the "ignore_loglevel" param.
> >
> > early_param("ignore_loglevel", ignore_loglevel_setup);
> > module_param(ignore_loglevel, bool, S_IRUGO | S_IWUSR);
> > MODULE_PARM_DESC(ignore_loglevel,
> > 		 "ignore loglevel setting (prints all kernel messages to the console)");
> >
> >
> > my preference would be preserve "ignore_loglevel" behaviour. if
> > we are forced to 'ignore all loglevel filtering' then we should
> > do so.
>
> Agreed.

Makes sense, I'll add that when I resend.

Thanks,
Calvin

> -- Steve
>
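One way the per-console drop condition could honor ignore_loglevel is sketched below. This is an assumption about how the resend might look, not code from the thread: the function name is hypothetical and the flag is passed in as a parameter rather than read from the module parameter.

```c
#include <stdbool.h>

/*
 * Returns true when a console should skip a message.
 * With ignore_loglevel set, no per-console filtering is applied,
 * which preserves the "print everything" behaviour Sergey asks for.
 */
static bool sketch_console_drops_message(int msg_level, int con_maxlevel,
					 bool ignore_loglevel)
{
	if (ignore_loglevel)
		return false;

	return msg_level > con_maxlevel;
}
```

The same early return could also cover oops_in_progress, which comes up elsewhere in the thread, but that is left out here to keep the sketch to the one point being discussed.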
[RFC][PATCH 1/2] printk: Introduce per-console filtering of messages by loglevel
Not all consoles are created equal: depending on the actual hardware,
the latency of a printk() call can vary dramatically. The worst examples
are serial consoles, where it can spin for tens of milliseconds banging
the UART to emit a message, which can cause application-level problems
when the kernel spews onto the console.

At Facebook we use netconsole to monitor our fleet, but we still have
serial consoles attached on each host for live debugging, and the latter
has caused problems. An obvious solution is to disable the kernel
console output to ttyS0, but this makes live debugging frustrating,
since crashes become silent and opaque to the ttyS0 user. Enabling it on
the fly when needed isn't feasible, since boxes you need to debug via
serial are likely to be borked in ways that make this impossible.

This puts us between a rock and a hard place: we'd love to set
kernel.printk to KERN_INFO and get all the logs. But while netconsole is
fast enough to permit that without perturbing userspace, ttyS0 is not,
and we're forced to limit console logging to KERN_WARNING and higher.

This patch lets us have our cake and eat it too: instead of being forced
to limit every console's verbosity based on the speed of the slowest
one, we can limit each based on its own speed.

A subsequent patch will introduce a simple sysfs interface for changing
this setting.

Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
 include/linux/console.h |  1 +
 kernel/printk/printk.c  | 13 ++---
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/include/linux/console.h b/include/linux/console.h
index 5949d18..764a2c0 100644
--- a/include/linux/console.h
+++ b/include/linux/console.h
@@ -147,6 +147,7 @@ struct console {
 	int	cflag;
 	void	*data;
 	struct	 console *next;
+	int	maxlevel;
 };
 
 /*
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 2984fb0..5393928 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1562,7 +1562,7 @@ SYSCALL_DEFINE3(syslog, int, type, char __user *, buf, int, len)
  * The console_lock must be held.
  */
 static void call_console_drivers(const char *ext_text, size_t ext_len,
-				 const char *text, size_t len)
+				 const char *text, size_t len, int level)
 {
 	struct console *con;
 
@@ -1581,6 +1581,8 @@ static void call_console_drivers(const char *ext_text, size_t ext_len,
 		if (!cpu_online(smp_processor_id()) &&
 		    !(con->flags & CON_ANYTIME))
 			continue;
+		if (level > con->maxlevel)
+			continue;
 		if (con->flags & CON_EXTENDED)
 			con->write(con, ext_text, ext_len);
 		else
@@ -1869,7 +1871,7 @@ static ssize_t msg_print_ext_body(char *buf, size_t size,
 				  char *dict, size_t dict_len,
 				  char *text, size_t text_len) { return 0; }
 static void call_console_drivers(const char *ext_text, size_t ext_len,
-				 const char *text, size_t len) {}
+				 const char *text, size_t len, int level) {}
 static size_t msg_print_text(const struct printk_log *msg,
 			     bool syslog, char *buf, size_t size) { return 0; }
 static bool suppress_message_printing(int level) { return false; }
@@ -2238,7 +2240,7 @@ void console_unlock(void)
 		raw_spin_unlock(&logbuf_lock);
 
 		stop_critical_timings();	/* don't trace print latency */
-		call_console_drivers(ext_text, ext_len, text, len);
+		call_console_drivers(ext_text, ext_len, text, len, msg->level);
 		start_critical_timings();
 		printk_safe_exit_irqrestore(flags);
 
@@ -2504,6 +2506,11 @@ void register_console(struct console *newcon)
 		newcon->flags &= ~CON_PRINTBUFFER;
 
 	/*
+	 * By default, the per-console loglevel filter permits all messages.
+	 */
+	newcon->maxlevel = LOGLEVEL_DEBUG;
+
+	/*
 	 *	Put this console in the list - keep the
 	 *	preferred driver at the head of the list.
 	 */
-- 
2.9.3
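To see the effect of the new `level > con->maxlevel` check without booting a kernel, the loop can be modeled in userspace along these lines. Everything here is a simplified stand-in: struct sketch_console, the `written` counter and the console names are assumptions for illustration, and the real call_console_drivers() performs several other checks before writing.

```c
#include <stddef.h>

#define LOGLEVEL_WARNING 4
#define LOGLEVEL_INFO    6
#define LOGLEVEL_DEBUG   7

/* Simplified stand-in for struct console: just what the filter needs. */
struct sketch_console {
	const char *name;
	int maxlevel;			/* highest loglevel this console accepts */
	unsigned int written;		/* counts messages that got through */
	struct sketch_console *next;
};

/* Walks the list and "writes" to every console whose maxlevel permits it. */
static void sketch_call_console_drivers(struct sketch_console *head, int level)
{
	struct sketch_console *con;

	for (con = head; con; con = con->next) {
		if (level > con->maxlevel)
			continue;	/* too verbose for this console */
		con->written++;
	}
}
```

With a slow serial console capped at LOGLEVEL_WARNING and a netconsole left at LOGLEVEL_DEBUG, an INFO message reaches only the latter, which is the behaviour the commit message describes.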
[RFC][PATCH 1/2] printk: Introduce per-console filtering of messages by loglevel
Not all consoles are created equal: depending on the actual hardware, the latency of a printk() call can vary dramatically. The worst examples are serial consoles, where it can spin for tens of milliseconds banging the UART to emit a message, which can cause application-level problems when the kernel spews onto the console. At Facebook we use netconsole to monitor our fleet, but we still have serial consoles attached on each host for live debugging, and the latter has caused problems. An obvious solution is to disable the kernel console output to ttyS0, but this makes live debugging frustrating, since crashes become silent and opaque to the ttyS0 user. Enabling it on the fly when needed isn't feasible, since boxes you need to debug via serial are likely to be borked in ways that make this impossible. This puts us between a rock and a hard place: we'd love to set kernel.printk to KERN_INFO and get all the logs. But while netconsole is fast enough to permit that without perturbing userspace, ttyS0 is not, and we're forced to limit console logging to KERN_WARNING and higher. This patch lets us have our cake and eat it too: instead of being forced to limit all consoles verbosity based on the speed of the slowest one, we can limit each based on its own speed. A subsequent patch will introduce a simple sysfs interface for changing this setting. 
Signed-off-by: Calvin Owens
---
 include/linux/console.h |  1 +
 kernel/printk/printk.c  | 13 ++++++++++---
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/include/linux/console.h b/include/linux/console.h
index 5949d18..764a2c0 100644
--- a/include/linux/console.h
+++ b/include/linux/console.h
@@ -147,6 +147,7 @@ struct console {
 	int	cflag;
 	void	*data;
 	struct	 console *next;
+	int	maxlevel;
 };

 /*
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 2984fb0..5393928 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1562,7 +1562,7 @@ SYSCALL_DEFINE3(syslog, int, type, char __user *, buf, int, len)
  * The console_lock must be held.
  */
 static void call_console_drivers(const char *ext_text, size_t ext_len,
-				 const char *text, size_t len)
+				 const char *text, size_t len, int level)
 {
 	struct console *con;

@@ -1581,6 +1581,8 @@ static void call_console_drivers(const char *ext_text, size_t ext_len,
 		if (!cpu_online(smp_processor_id()) &&
 		    !(con->flags & CON_ANYTIME))
 			continue;
+		if (level > con->maxlevel)
+			continue;
 		if (con->flags & CON_EXTENDED)
 			con->write(con, ext_text, ext_len);
 		else
@@ -1869,7 +1871,7 @@ static ssize_t msg_print_ext_body(char *buf, size_t size,
 				  char *dict, size_t dict_len,
 				  char *text, size_t text_len) { return 0; }
 static void call_console_drivers(const char *ext_text, size_t ext_len,
-				 const char *text, size_t len) {}
+				 const char *text, size_t len, int level) {}
 static size_t msg_print_text(const struct printk_log *msg,
 			     bool syslog, char *buf, size_t size) { return 0; }
 static bool suppress_message_printing(int level) { return false; }
@@ -2238,7 +2240,7 @@ void console_unlock(void)
 		raw_spin_unlock(&logbuf_lock);

 		stop_critical_timings();	/* don't trace print latency */
-		call_console_drivers(ext_text, ext_len, text, len);
+		call_console_drivers(ext_text, ext_len, text, len, msg->level);
 		start_critical_timings();
 		printk_safe_exit_irqrestore(flags);

@@ -2504,6 +2506,11 @@ void register_console(struct console *newcon)
 		newcon->flags &= ~CON_PRINTBUFFER;

 	/*
+	 * By default, the per-console loglevel filter permits all messages.
+	 */
+	newcon->maxlevel = LOGLEVEL_DEBUG;
+
+	/*
 	 * Put this console in the list - keep the
 	 * preferred driver at the head of the list.
 	 */
-- 
2.9.3
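The filter this patch adds to call_console_drivers() can be tried out in userspace. What follows is only a minimal sketch: the model_* names and the "written" counter are hypothetical stand-ins invented for illustration, not kernel code. LOGLEVEL_DEBUG is given the kernel's value of 7.

```c
#include <stddef.h>

/* Same numeric value as the kernel's LOGLEVEL_DEBUG. */
#define LOGLEVEL_DEBUG 7

/* Hypothetical stand-in for struct console, reduced to what the filter needs. */
struct model_console {
	const char *name;
	int maxlevel;			/* highest loglevel still printed */
	unsigned int written;		/* messages that reached this console */
	struct model_console *next;
};

/* Like register_console() in the patch: a new console permits all levels. */
static void model_register(struct model_console **head,
			   struct model_console *con, const char *name)
{
	con->name = name;
	con->maxlevel = LOGLEVEL_DEBUG;
	con->written = 0;
	con->next = *head;
	*head = con;
}

/* Like the added check: skip consoles whose maxlevel is below the message. */
static void model_call_console_drivers(struct model_console *head, int level)
{
	struct model_console *con;

	for (con = head; con; con = con->next) {
		if (level > con->maxlevel)
			continue;
		con->written++;		/* stands in for con->write() */
	}
}
```

Lower numbers are more severe, so lowering one console's maxlevel to 4 keeps warnings and worse on it while debug chatter still reaches the other consoles.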
[RFC][PATCH 2/2] printk: Add /sys/consoles/${con}/ and maxlevel attribute
This does the simplest possible thing: add a directory at the root of
sysfs that allows setting the "maxlevel" parameter for each console.

We can let kobject destruction race with console removal: if it does,
maxlevel_{show,store}() will safely fail with -ENODEV. This is a little
weird, but avoids embedding the kobject and therefore needing to totally
refactor the way we handle console struct lifetime.

Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
 include/linux/console.h |  1 +
 kernel/printk/printk.c  | 82 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 83 insertions(+)

diff --git a/include/linux/console.h b/include/linux/console.h
index 764a2c0..c76fde0 100644
--- a/include/linux/console.h
+++ b/include/linux/console.h
@@ -148,6 +148,7 @@ struct console {
 	void	*data;
 	struct	 console *next;
 	int	maxlevel;
+	struct kobject *kobj;
 };

 /*
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 5393928..e9d036b 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -105,6 +105,8 @@ enum devkmsg_log_masks {

 static unsigned int __read_mostly devkmsg_log = DEVKMSG_LOG_MASK_DEFAULT;

+static struct kobject *consoles_dir_kobj;
+
 static int __control_devkmsg(char *str)
 {
 	if (!str)
@@ -2386,6 +2388,76 @@ static int __init keep_bootcon_setup(char *str)

 early_param("keep_bootcon", keep_bootcon_setup);

+static ssize_t maxlevel_show(struct kobject *kobj, struct kobj_attribute *attr,
+			     char *buf)
+{
+	struct console *con;
+	ssize_t ret = -ENODEV;
+
+	console_lock();
+	for_each_console(con) {
+		if (con->kobj == kobj) {
+			ret = sprintf(buf, "%d\n", con->maxlevel);
+			break;
+		}
+	}
+	console_unlock();
+
+	return ret;
+}
+
+static ssize_t maxlevel_store(struct kobject *kobj, struct kobj_attribute *attr,
+			      const char *buf, size_t count)
+{
+	struct console *con;
+	ssize_t ret;
+	int tmp;
+
+	ret = kstrtoint(buf, 10, &tmp);
+	if (ret < 0)
+		return ret;
+
+	if (tmp < 0 || tmp > LOGLEVEL_DEBUG)
+		return -ERANGE;
+
+	ret = -ENODEV;
+	console_lock();
+	for_each_console(con) {
+		if (con->kobj == kobj) {
+			con->maxlevel = tmp;
+			ret = count;
+			break;
+		}
+	}
+	console_unlock();
+
+	return ret;
+}
+
+static const struct kobj_attribute console_level_attr =
+	__ATTR(maxlevel, 0644, maxlevel_show, maxlevel_store);
+
+static void console_register_sysfs(struct console *newcon)
+{
+	/*
+	 * We might be called very early from register_console(): in that case,
+	 * printk_late_init() will take care of this later.
+	 */
+	if (!consoles_dir_kobj)
+		return;
+
+	newcon->kobj = kobject_create_and_add(newcon->name, consoles_dir_kobj);
+	if (WARN_ON(!newcon->kobj))
+		return;
+
+	WARN_ON(sysfs_create_file(newcon->kobj, &console_level_attr.attr));
+}
+
+static void console_unregister_sysfs(struct console *oldcon)
+{
+	kobject_put(oldcon->kobj);
+}
+
 /*
  * The console driver calls this routine during kernel initialization
  * to register the console printing procedure with printk() and to
@@ -2509,6 +2581,7 @@ void register_console(struct console *newcon)
 	 * By default, the per-console loglevel filter permits all messages.
 	 */
 	newcon->maxlevel = LOGLEVEL_DEBUG;
+	newcon->kobj = NULL;

 	/*
 	 * Put this console in the list - keep the
@@ -2545,6 +2618,7 @@ void register_console(struct console *newcon)
 		 */
 		exclusive_console = newcon;
 	}
+	console_register_sysfs(newcon);
 	console_unlock();
 	console_sysfs_notify();

@@ -2611,6 +2685,7 @@ int unregister_console(struct console *console)
 		console_drivers->flags |= CON_CONSDEV;

 	console->flags &= ~CON_ENABLED;
+	console_unregister_sysfs(console);
 	console_unlock();
 	console_sysfs_notify();
 	return res;
@@ -2656,6 +2731,13 @@ static int __init printk_late_init(void)
 	ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "printk:online",
 					console_cpu_notify, NULL);
 	WARN_ON(ret < 0);
+
+	consoles_dir_kobj = kobject_create_and_add("consoles", NULL);
+	WARN_ON(!consoles_dir_kobj);
+
+	for_each_console(con)
+		console_register_sysfs(con);
+
 	return 0;
 }
 late_initcall(printk_late_init);
-- 
2.9.3
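The input validation in maxlevel_store() can be sketched in userspace as below. This is an approximation under stated assumptions: model_kstrtoint() is a hypothetical stand-in built on strtol() and only roughly imitates kstrtoint() (for one, strtol() skips leading whitespace, which the kernel helper does not), and the console lookup that yields -ENODEV is left out.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Same numeric value as the kernel's LOGLEVEL_DEBUG. */
#define LOGLEVEL_DEBUG 7

/*
 * Rough stand-in for kstrtoint(buf, 10, &out): the whole string must be a
 * decimal number, optionally followed by a single trailing newline.
 */
static int model_kstrtoint(const char *buf, int *out)
{
	char *end;
	long val;

	errno = 0;
	val = strtol(buf, &end, 10);
	if (end == buf)
		return -EINVAL;		/* no digits at all */
	if (*end == '\n')
		end++;			/* sysfs writes usually end in '\n' */
	if (*end != '\0')
		return -EINVAL;		/* trailing garbage */
	if (errno == ERANGE || val < INT_MIN || val > INT_MAX)
		return -ERANGE;

	*out = (int)val;
	return 0;
}

/* Same order of checks as maxlevel_store(): parse first, then range-check. */
static int model_parse_maxlevel(const char *buf, int *maxlevel)
{
	int tmp;
	int ret;

	ret = model_kstrtoint(buf, &tmp);
	if (ret < 0)
		return ret;

	if (tmp < 0 || tmp > LOGLEVEL_DEBUG)
		return -ERANGE;

	*maxlevel = tmp;		/* only updated on success */
	return 0;
}
```

With this model, "4\n" is accepted, "8" and "-1" are rejected with -ERANGE, and "abc" is rejected with -EINVAL, leaving the previous value untouched in each failure case.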
[PATCH v3] xfs: Honor FALLOC_FL_KEEP_SIZE when punching ends of files
When punching past EOF on XFS, fallocate(mode=PUNCH_HOLE|KEEP_SIZE) will
round the file size up to the nearest multiple of PAGE_SIZE:

  calvinow@vm-disks/generic-xfs-1 ~$ dd if=/dev/urandom of=test bs=2048 count=1
  calvinow@vm-disks/generic-xfs-1 ~$ stat test
    Size: 2048       Blocks: 8          IO Block: 4096   regular file
  calvinow@vm-disks/generic-xfs-1 ~$ fallocate -n -l 2048 -o 2048 -p test
  calvinow@vm-disks/generic-xfs-1 ~$ stat test
    Size: 4096       Blocks: 8          IO Block: 4096   regular file

Commit 3c2bdc912a1cc050 ("xfs: kill xfs_zero_remaining_bytes") replaced
xfs_zero_remaining_bytes() with calls to iomap helpers. The new helpers
don't enforce that [pos,offset) lies strictly on [0,i_size) when being
called from xfs_free_file_space(), so by "leaking" these ranges into
xfs_zero_range() we get this buggy behavior.

Fix this by reintroducing the checks xfs_zero_remaining_bytes() did
against i_size at the bottom of xfs_free_file_space().

Reported-by: Aaron Gao <g...@fb.com>
Fixes: 3c2bdc912a1cc050 ("xfs: kill xfs_zero_remaining_bytes")
Cc: Christoph Hellwig <h...@lst.de>
Cc: Brian Foster <bfos...@redhat.com>
Cc: <sta...@vger.kernel.org> # 4.8+
Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
 fs/xfs/xfs_bmap_util.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 8b75dce..828532c 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1311,8 +1311,16 @@ xfs_free_file_space(

 	/*
 	 * Now that we've unmap all full blocks we'll have to zero out any
 	 * partial block at the beginning and/or end.  xfs_zero_range is
-	 * smart enough to skip any holes, including those we just created.
+	 * smart enough to skip any holes, including those we just created,
+	 * but we must take care not to zero beyond EOF and enlarge i_size.
 	 */
+
+	if (offset >= XFS_ISIZE(ip))
+		return 0;
+
+	if (offset + len > XFS_ISIZE(ip))
+		len = XFS_ISIZE(ip) - offset;
+
 	return xfs_zero_range(ip, offset, len, NULL);
 }
-- 
2.9.3
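The two checks this version adds before the xfs_zero_range() call amount to clamping the range to zero against i_size. A minimal userspace sketch of that arithmetic follows; clamp_zero_range_to_eof() is a hypothetical name, and plain int64_t values stand in for xfs_off_t and XFS_ISIZE(ip).

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the checks added at the bottom of xfs_free_file_space().
 * Returns false when the range starts at or past EOF, meaning there is
 * nothing to zero. Otherwise trims *len so that offset + *len never
 * exceeds isize, and returns true.
 */
static bool clamp_zero_range_to_eof(int64_t isize, int64_t offset, int64_t *len)
{
	if (offset >= isize)
		return false;		/* entirely beyond EOF: skip zeroing */

	if (offset + *len > isize)
		*len = isize - offset;	/* straddles EOF: stop at EOF */

	return true;
}
```

For the reproducer in the commit message (a 2048-byte file, punch at offset 2048 for 2048 bytes) the first check fires, so no zeroing is attempted and nothing is left to push i_size up to 4096.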
Re: [PATCH v2] xfs: Honor FALLOC_FL_KEEP_SIZE when punching ends of files
On 03/21/2017 04:39 AM, Brian Foster wrote:
> On Sun, Mar 19, 2017 at 09:54:51PM -0700, Calvin Owens wrote:
> > When punching past EOF on XFS, fallocate(mode=PUNCH_HOLE|KEEP_SIZE) will
> > round the file size up to the nearest multiple of PAGE_SIZE:
> >
> >   calvinow@vm-disks/generic-xfs-1 ~$ dd if=/dev/urandom of=test bs=2048 count=1
> >   calvinow@vm-disks/generic-xfs-1 ~$ stat test
> >     Size: 2048       Blocks: 8          IO Block: 4096   regular file
> >   calvinow@vm-disks/generic-xfs-1 ~$ fallocate -n -l 2048 -o 2048 -p test
> >   calvinow@vm-disks/generic-xfs-1 ~$ stat test
> >     Size: 4096       Blocks: 8          IO Block: 4096   regular file
> >
> > Commit 3c2bdc912a1cc050 ("xfs: kill xfs_zero_remaining_bytes") replaced
> > xfs_zero_remaining_bytes() with calls to iomap helpers. The new helpers
> > don't enforce that [pos,offset) lies strictly on [0,i_size) when being
> > called from xfs_free_file_space(), so by "leaking" these ranges into
> > xfs_zero_range() we get this buggy behavior.
> >
> > Fix this by reintroducing the checks xfs_zero_remaining_bytes() did
> > against i_size at the bottom of xfs_free_file_space().
> >
> > Reported-by: Aaron Gao <g...@fb.com>
> > Fixes: 3c2bdc912a1cc050 ("xfs: kill xfs_zero_remaining_bytes")
> > Cc: Christoph Hellwig <h...@lst.de>
> > Cc: <sta...@vger.kernel.org> # 4.8+
> > Signed-off-by: Calvin Owens <calvinow...@fb.com>
> > ---
> >  fs/xfs/xfs_bmap_util.c | 11 +++++++++++
> >  1 file changed, 11 insertions(+)
> >
> > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > index 8b75dce..0796ebc 100644
> > --- a/fs/xfs/xfs_bmap_util.c
> > +++ b/fs/xfs/xfs_bmap_util.c
> > @@ -1309,6 +1309,17 @@ xfs_free_file_space(
> >  	}
> >
> >  	/*
> > +	 * Avoid doing I/O beyond eof - it's not necessary
> > +	 * since nothing can read beyond eof.  The space will
> > +	 * be zeroed when the file is extended anyway.
> > +	 */
>
> I'd suggest to update the comment below with this information and move
> the following bits down below it as well.

Will do.

> > +	if (offset >= XFS_ISIZE(ip))
> > +		return 0;
> > +
> > +	if ((offset + len) >= XFS_ISIZE(ip))
> > +		len = XFS_ISIZE(ip) - offset - 1;
> > +
>
> This looks like an off-by-one. Do you mean the following?
>
> 	if (offset + len > XFS_ISIZE(ip))
> 		len = XFS_ISIZE(ip) - offset;

It's not an off-by-one (it's self-consistent), but your way makes more
sense, I'll fix it ;)

Thanks,
Calvin

> Brian
>
> > +	/*
> >  	 * Now that we've unmap all full blocks we'll have to zero out any
> >  	 * partial block at the beginning and/or end.  xfs_zero_range is
> >  	 * smart enough to skip any holes, including those we just created.
> > -- 
> > 2.9.3
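The two clamping forms discussed in this exchange can be compared side by side. This is a hedged sketch, not code from the patch: clamp_len_v2() and clamp_len_v3() are hypothetical names, int64_t stands in for xfs_off_t, and both assume the earlier "offset >= isize" check has already returned.

```c
#include <stdint.h>

/* v2 form: ">=" comparison, and the clamped length stops one byte short. */
static int64_t clamp_len_v2(int64_t isize, int64_t offset, int64_t len)
{
	if ((offset + len) >= isize)
		len = isize - offset - 1;
	return len;
}

/* Form suggested in review (used in v3): clamp exactly to EOF. */
static int64_t clamp_len_v3(int64_t isize, int64_t offset, int64_t len)
{
	if (offset + len > isize)
		len = isize - offset;
	return len;
}
```

For a 4096-byte file and a range starting at offset 1024 that reaches or passes EOF, the v2 form yields 3071 and the v3 form yields 3072, so the two differ by exactly one byte at the end of the file. The v2 form also trims a range that ends exactly at EOF, which the v3 form leaves alone. Whether that last byte matters depends on how the callee treats len; the thread settles on the v3 form either way.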
[PATCH v2] xfs: Honor FALLOC_FL_KEEP_SIZE when punching ends of files
When punching past EOF on XFS, fallocate(mode=PUNCH_HOLE|KEEP_SIZE) will
round the file size up to the nearest multiple of PAGE_SIZE:

  calvinow@vm-disks/generic-xfs-1 ~$ dd if=/dev/urandom of=test bs=2048 count=1
  calvinow@vm-disks/generic-xfs-1 ~$ stat test
    Size: 2048       Blocks: 8          IO Block: 4096   regular file
  calvinow@vm-disks/generic-xfs-1 ~$ fallocate -n -l 2048 -o 2048 -p test
  calvinow@vm-disks/generic-xfs-1 ~$ stat test
    Size: 4096       Blocks: 8          IO Block: 4096   regular file

Commit 3c2bdc912a1cc050 ("xfs: kill xfs_zero_remaining_bytes") replaced
xfs_zero_remaining_bytes() with calls to iomap helpers. The new helpers
don't enforce that [pos,offset) lies strictly on [0,i_size) when being
called from xfs_free_file_space(), so by "leaking" these ranges into
xfs_zero_range() we get this buggy behavior.

Fix this by reintroducing the checks xfs_zero_remaining_bytes() did
against i_size at the bottom of xfs_free_file_space().

Reported-by: Aaron Gao <g...@fb.com>
Fixes: 3c2bdc912a1cc050 ("xfs: kill xfs_zero_remaining_bytes")
Cc: Christoph Hellwig <h...@lst.de>
Cc: <sta...@vger.kernel.org> # 4.8+
Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
 fs/xfs/xfs_bmap_util.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 8b75dce..0796ebc 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1309,6 +1309,17 @@ xfs_free_file_space(
 	}

 	/*
+	 * Avoid doing I/O beyond eof - it's not necessary
+	 * since nothing can read beyond eof.  The space will
+	 * be zeroed when the file is extended anyway.
+	 */
+	if (offset >= XFS_ISIZE(ip))
+		return 0;
+
+	if ((offset + len) >= XFS_ISIZE(ip))
+		len = XFS_ISIZE(ip) - offset - 1;
+
+	/*
 	 * Now that we've unmap all full blocks we'll have to zero out any
 	 * partial block at the beginning and/or end.  xfs_zero_range is
 	 * smart enough to skip any holes, including those we just created.
-- 
2.9.3
[PATCH] xfs: Honor FALLOC_FL_KEEP_SIZE when punching ends of files
Commit 3c2bdc912a1cc050 ("xfs: kill xfs_zero_remaining_bytes") replaced
xfs_zero_remaining_bytes() with calls to iomap helpers.

Unfortunately the new iomap helpers don't enforce that [pos,count) lies
strictly on [0,i_size). This causes fallocate(mode=PUNCH_HOLE|KEEP_SIZE)
calls touching [i_size & ~PAGE_MASK, OFF_T_MAX] to round i_size up to
the nearest multiple of PAGE_SIZE:

  calvinow@vm-disks/generic-xfs-1 ~$ dd if=/dev/urandom of=test bs=2048 count=1
  calvinow@vm-disks/generic-xfs-1 ~$ stat test
    Size: 2048       Blocks: 8          IO Block: 4096   regular file
  calvinow@vm-disks/generic-xfs-1 ~$ fallocate -n -l 2048 -o 2048 -p test
  calvinow@vm-disks/generic-xfs-1 ~$ stat test
    Size: 4096       Blocks: 8          IO Block: 4096   regular file

Fix this by reintroducing the checks xfs_zero_remaining_bytes() did
against i_size into xfs_zero_range().

Reported-by: Aaron Gao <g...@fb.com>
Fixes: 3c2bdc912a1cc050 ("xfs: kill xfs_zero_remaining_bytes")
Cc: Christoph Hellwig <h...@lst.de>
Cc: <sta...@vger.kernel.org> # 4.8+
Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
 fs/xfs/xfs_file.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 35703a8..da7cd27 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -58,6 +58,17 @@ xfs_zero_range(
 	xfs_off_t		count,
 	bool			*did_zero)
 {
+	/*
+	 * Avoid doing I/O beyond eof - it's not necessary
+	 * since nothing can read beyond eof.  The space will
+	 * be zeroed when the file is extended anyway.
+	 */
+	if (pos >= XFS_ISIZE(ip))
+		return 0;
+
+	if ((pos + count) >= XFS_ISIZE(ip))
+		count = XFS_ISIZE(ip) - pos - 1;
+
 	return iomap_zero_range(VFS_I(ip), pos, count, NULL, &xfs_iomap_ops);
 }

-- 
2.9.3
Re: [PATCH] xfs: Honor FALLOC_FL_KEEP_SIZE when punching ends of files
> Fix this by reintroducing the checks xfs_zero_remaining_bytes() did
> against i_size into xfs_zero_range().

Sorry this is wrong: I missed that xfs_zero_range() has another caller
that depends on the behavior I'm changing.

I'll send a v2 with the same hunk at the bottom of xfs_free_file_space()
instead.

Thanks,
Calvin
Re: [PATCH] fs: Assert on module file_operations without an owner
On Friday 10/07 at 17:18 -0400, Calvin Owens wrote:
> On Friday 10/07 at 21:48 +0100, Al Viro wrote:
> > On Fri, Oct 07, 2016 at 01:35:52PM -0700, Calvin Owens wrote:
> > > Omitting the owner field in file_operations declared in modules is an
> > > easy mistake to make, and can result in crashes when the module is
> > > unloaded while userspace is poking the file.
> > >
> > > This patch modifies fops_get() to WARN when it encounters a NULL owner,
> > > since in this case it cannot take a reference on the containing module.
> >
> > NAK.  This is complete crap - we do *NOT* need ->owner on a lot of
> > file_operations.
>
> This isn't a theoretical issue: I have a proprietary module that makes this
> mistake and crashes when poking a chrdev it exposes in userspace races with
> unloading the module.
>
> Of course, the bug is in this silly module. I'm not arguing that it isn't. I
> was hesitant to even mention this because I know waving at something in an OOT
> module is a poor argument for changing anything in the proper kernel.
>
> But what I'm trying to do here is prevent people from making that mistake in
> the future by yelling at them when they do. The implicit ignoring of a NULL
> owner in try_module_get() in fops_get() is not necessarily obvious.

Let's drop this, I should never have sent the patch in the first place.

> > 	* we do not need that on file_operations of a regular file or
> > directory on a normal filesystem, since that filesystem is not going
> > away until the file has been closed - ->f_path.mnt is holding a reference
> > to vfsmount, which is holding a reference to superblock, which is holding
> > a reference to file_system_type, which is holding a reference to _its_
> > ->owner.
> > 	* we do not need that on anything on procfs - module removal is
> > legal while a procfs file is opened; its cleanup will be blocked for the
> > duration of ->read(), ->write(), etc. calls.
>
> I see why this is true, and it's something I considered. But when there is
> zero cost to being explicit and setting ->owner, why not do it?
>
> > If anything, we would be better off with modifications that would get
> > rid of ->owner on file_operations.  It's not trivial to do, but it might
> > be not impossible.

I'll look into this, I'm interested.

Thanks,
Calvin
Re: [PATCH] fs: Assert on module file_operations without an owner
On Friday 10/07 at 21:48 +0100, Al Viro wrote:
> On Fri, Oct 07, 2016 at 01:35:52PM -0700, Calvin Owens wrote:
> > Omitting the owner field in file_operations declared in modules is an
> > easy mistake to make, and can result in crashes when the module is
> > unloaded while userspace is poking the file.
> >
> > This patch modifies fops_get() to WARN when it encounters a NULL owner,
> > since in this case it cannot take a reference on the containing module.
>
> NAK. This is complete crap - we do *NOT* need ->owner on a lot of
> file_operations.

This isn't a theoretical issue: I have a proprietary module that makes
this mistake and crashes when poking a chrdev it exposes in userspace
races with unloading the module.

Of course, the bug is in this silly module. I'm not arguing that it
isn't. I was hesitant to even mention this because I know waving at
something in an OOT module is a poor argument for changing anything in
the proper kernel.

But what I'm trying to do here is prevent people from making that
mistake in the future by yelling at them when they do. The implicit
ignoring of a NULL owner in try_module_get() in fops_get() is not
necessarily obvious.

> * we do not need that on file_operations of a regular file or
> directory on a normal filesystem, since that filesystem is not going
> away until the file has been closed - ->f_path.mnt is holding a reference
> to vfsmount, which is holding a reference to superblock, which is holding
> a reference to file_system_type, which is holding a reference to _its_
> ->owner.
> * we do not need that on anything on procfs - module removal is
> legal while a procfs file is opened; its cleanup will be blocked for the
> duration of ->read(), ->write(), etc. calls.

I see why this is true, and it's something I considered. But when there
is zero cost to being explicit and setting ->owner, why not do it?

> If anything, we would be better off with modifications that would get
> rid of ->owner on file_operations. It's not trivial to do, but it might
> be not impossible.
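The "implicit ignoring of a NULL owner" mentioned above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `module_sketch`, `fops_sketch`, `try_module_get_sketch()` and `fops_get_sketch()` are hypothetical stand-ins, not the kernel's types or functions, and the refcounting is reduced to a plain integer.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct module and struct file_operations. */
struct module_sketch {
	int refcount;
	bool unloading;
};

struct fops_sketch {
	struct module_sketch *owner;
};

/*
 * Models the behaviour discussed in the thread: a NULL module always
 * "succeeds", so the caller ends up holding no reference at all.
 */
bool try_module_get_sketch(struct module_sketch *mod)
{
	if (!mod)
		return true;	/* nothing to pin, so nothing protects the caller */
	if (mod->unloading)
		return false;	/* refuse new references during unload */
	mod->refcount++;
	return true;
}

/* Same shape as the fops_get() macro quoted in the patch. */
const struct fops_sketch *fops_get_sketch(const struct fops_sketch *fops)
{
	return (fops && try_module_get_sketch(fops->owner)) ? fops : NULL;
}
```

In this model an unowned `fops_sketch` is handed back successfully without any refcount being taken, which is the silent case the proposed WARN was meant to surface. As the quoted reply points out, that is harmless where something else already pins the module.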
[PATCH net-next] nfnetlink_log: Use GFP_NOWARN for skb allocation
Since the code explicitly falls back to a smaller allocation when the
large one fails, we shouldn't complain when that happens.

Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
 net/netfilter/nfnetlink_log.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/nfnetlink_log.c b/net/netfilter/nfnetlink_log.c
index eb086a1..7435505 100644
--- a/net/netfilter/nfnetlink_log.c
+++ b/net/netfilter/nfnetlink_log.c
@@ -330,7 +330,7 @@ nfulnl_alloc_skb(struct net *net, u32 peer_portid, unsigned int inst_size,
 	 * message. WARNING: has to be <= 128k due to slab restrictions */
 
 	n = max(inst_size, pkt_size);
-	skb = alloc_skb(n, GFP_ATOMIC);
+	skb = alloc_skb(n, GFP_ATOMIC | __GFP_NOWARN);
 	if (!skb) {
 		if (n > pkt_size) {
 			/* try to allocate only as much as we need for current
-- 
2.9.3
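The fallback pattern this patch relies on (try the larger size, then retry with only what is needed) can be sketched in userspace C. This is a hedged illustration rather than the netfilter code: `alloc_with_fallback()`, the `alloc_fn` hook and `small_only_alloc()` are hypothetical names introduced so that the failure path can be exercised.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical allocator hook so the failure path can be exercised. */
typedef void *(*alloc_fn)(size_t size);

/*
 * Try the larger of the two sizes first. If that fails and the request
 * was bigger than strictly needed, retry with the smaller size. Because
 * a failure of the first attempt is expected and handled, it is not
 * worth a warning.
 */
void *alloc_with_fallback(alloc_fn alloc, size_t preferred, size_t needed,
			  size_t *got)
{
	size_t n = preferred > needed ? preferred : needed;
	void *buf = alloc(n);

	if (!buf && n > needed) {
		n = needed;
		buf = alloc(n);
	}
	*got = buf ? n : 0;
	return buf;
}

/* Demo allocator (assumption): refuses anything above 4096 bytes. */
void *small_only_alloc(size_t size)
{
	return size > 4096 ? NULL : malloc(size);
}
```

With `small_only_alloc`, a call asking for 65536 preferred and 1024 needed ends up with a 1024-byte buffer, mirroring the case where the large skb allocation fails and the smaller one succeeds.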
[PATCH] fs: Assert on module file_operations without an owner
Omitting the owner field in file_operations declared in modules is an
easy mistake to make, and can result in crashes when the module is
unloaded while userspace is poking the file.

This patch modifies fops_get() to WARN when it encounters a NULL owner,
since in this case it cannot take a reference on the containing module.

Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
 include/linux/fs.h | 13 ++++++++++++-
 kernel/module.c    |  1 +
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 901e25d..fafda9e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2081,10 +2081,21 @@ extern struct dentry *mount_pseudo(struct file_system_type *, char *,
 	unsigned long);
 
 /* Alas, no aliases. Too much hassle with bringing module.h everywhere */
-#define fops_get(fops) \
+#define __fops_get(fops) \
 	(((fops) && try_module_get((fops)->owner) ? (fops) : NULL))
 #define fops_put(fops) \
 	do { if (fops) module_put((fops)->owner); } while(0)
+
+#define unowned_fmt "No fops owner at %p in [%s]\n"
+#define fops_unowned(fops) \
+	(is_module_address((unsigned long)(fops)) && !(fops)->owner)
+#define fops_modname(fops) \
+	__module_address((unsigned long)(fops))->name
+#define fops_warn_unowned(fops) \
+	WARN(fops_unowned(fops), unowned_fmt, (fops), fops_modname(fops))
+#define fops_get(fops) \
+	({ fops_warn_unowned(fops); __fops_get(fops); })
+
 /*
  * This one is to be used *ONLY* from ->open() instances.
  * fops must be non-NULL, pinned down *and* module dependencies
diff --git a/kernel/module.c b/kernel/module.c
index 529efae..4443727 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -4181,6 +4181,7 @@ bool is_module_address(unsigned long addr)
 
 	return ret;
 }
+EXPORT_SYMBOL_GPL(is_module_address);
 
 /*
  * __module_address - get the module which contains an address.
-- 
2.9.3
[PATCH v2 net-next] mlx5: Add ndo_poll_controller() implementation
This implements ndo_poll_controller in net_device_ops callbacks for mlx5,
which is necessary to use netconsole with this driver.

Acked-By: Saeed Mahameed <sae...@mellanox.com>
Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
Changes in v2:
  * Only iterate channels to avoid redundant napi_schedule() calls

 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index b58cfe3..7eaf380 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3188,6 +3188,20 @@ static int mlx5e_xdp(struct net_device *dev, struct netdev_xdp *xdp)
 	}
 }
 
+#ifdef CONFIG_NET_POLL_CONTROLLER
+/* Fake "interrupt" called by netpoll (eg netconsole) to send skbs without
+ * reenabling interrupts.
+ */
+static void mlx5e_netpoll(struct net_device *dev)
+{
+	struct mlx5e_priv *priv = netdev_priv(dev);
+	int i;
+
+	for (i = 0; i < priv->params.num_channels; i++)
+		napi_schedule(&priv->channel[i]->napi);
+}
+#endif
+
 static const struct net_device_ops mlx5e_netdev_ops_basic = {
 	.ndo_open                = mlx5e_open,
 	.ndo_stop                = mlx5e_close,
@@ -3208,6 +3222,9 @@ static const struct net_device_ops mlx5e_netdev_ops_basic = {
 #endif
 	.ndo_tx_timeout          = mlx5e_tx_timeout,
 	.ndo_xdp                 = mlx5e_xdp,
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	.ndo_poll_controller     = mlx5e_netpoll,
+#endif
 };
 
 static const struct net_device_ops mlx5e_netdev_ops_sriov = {
@@ -3240,6 +3257,9 @@ static const struct net_device_ops mlx5e_netdev_ops_sriov = {
 	.ndo_get_vf_stats        = mlx5e_get_vf_stats,
 	.ndo_tx_timeout          = mlx5e_tx_timeout,
 	.ndo_xdp                 = mlx5e_xdp,
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	.ndo_poll_controller     = mlx5e_netpoll,
+#endif
 };
 
 static int mlx5e_check_required_hca_cap(struct mlx5_core_dev *mdev)
-- 
2.9.3
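The v2 changelog entry ("only iterate channels to avoid redundant napi_schedule() calls") can be illustrated with a small model. This is a sketch under assumptions, not driver code: `priv_sketch` and `napi_sketch` are hypothetical types, and the queue-to-channel mapping `i % num_channels` is assumed purely for illustration. The only property that matters is that several send queues share one NAPI context per channel.

```c
#define SKETCH_MAX_CHANNELS 8

/* Hypothetical, simplified stand-ins for the driver structures. */
struct napi_sketch {
	int schedule_calls;	/* how often scheduling was requested */
};

struct priv_sketch {
	int num_channels;
	int num_tc;
	struct napi_sketch napi[SKETCH_MAX_CHANNELS];	/* one per channel */
};

/*
 * First-version style: walk every send queue. ASSUMPTION: queue i belongs
 * to channel (i % num_channels), so each channel's NAPI context is hit
 * num_tc times.
 */
int poll_per_queue(struct priv_sketch *priv)
{
	int i, calls = 0;
	int nr_sq = priv->num_channels * priv->num_tc;

	for (i = 0; i < nr_sq; i++) {
		priv->napi[i % priv->num_channels].schedule_calls++;
		calls++;
	}
	return calls;
}

/* v2 style: walk the channels, touching each NAPI context exactly once. */
int poll_per_channel(struct priv_sketch *priv)
{
	int i;

	for (i = 0; i < priv->num_channels; i++)
		priv->napi[i].schedule_calls++;
	return priv->num_channels;
}
```

With 4 channels and 2 traffic classes, the per-queue walk issues 8 scheduling requests for 4 contexts, while the per-channel walk issues 4.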
[PATCH v2] mlx5: Add ndo_poll_controller() implementation
This implements ndo_poll_controller in net_device_ops callback for mlx5,
which is necessary to use netconsole with this driver.

Cc: Saeed Mahameed <sae...@dev.mellanox.co.il>
Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
Changes in v2:
  * Only iterate channels to avoid redundant napi_schedule() calls

 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 2459c7f..830b8d0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2786,6 +2786,20 @@ static void mlx5e_tx_timeout(struct net_device *dev)
 	schedule_work(&priv->tx_timeout_work);
 }
 
+#ifdef CONFIG_NET_POLL_CONTROLLER
+/* Fake "interrupt" called by netpoll (eg netconsole) to send skbs without
+ * reenabling interrupts.
+ */
+static void mlx5e_netpoll(struct net_device *dev)
+{
+	struct mlx5e_priv *priv = netdev_priv(dev);
+	int i;
+
+	for (i = 0; i < priv->params.num_channels; i++)
+		napi_schedule(&priv->channel[i]->napi);
+}
+#endif
+
 static const struct net_device_ops mlx5e_netdev_ops_basic = {
 	.ndo_open                = mlx5e_open,
 	.ndo_stop                = mlx5e_close,
@@ -2805,6 +2819,9 @@ static const struct net_device_ops mlx5e_netdev_ops_basic = {
 	.ndo_rx_flow_steer       = mlx5e_rx_flow_steer,
 #endif
 	.ndo_tx_timeout          = mlx5e_tx_timeout,
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	.ndo_poll_controller     = mlx5e_netpoll,
+#endif
 };
 
 static const struct net_device_ops mlx5e_netdev_ops_sriov = {
@@ -2836,6 +2853,9 @@ static const struct net_device_ops mlx5e_netdev_ops_sriov = {
 	.ndo_set_vf_link_state   = mlx5e_set_vf_link_state,
 	.ndo_get_vf_stats        = mlx5e_get_vf_stats,
 	.ndo_tx_timeout          = mlx5e_tx_timeout,
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	.ndo_poll_controller     = mlx5e_netpoll,
+#endif
 };
 
 static int mlx5e_check_required_hca_cap(struct mlx5_core_dev *mdev)
-- 
2.9.3
[PATCH] mlx5: Add ndo_poll_controller() implementation
This implements ndo_poll_controller in net_device_ops for mlx5, which is
necessary to use netconsole with this driver.

Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 2459c7f..439476f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2786,6 +2786,20 @@ static void mlx5e_tx_timeout(struct net_device *dev)
 	schedule_work(&priv->tx_timeout_work);
 }
 
+#ifdef CONFIG_NET_POLL_CONTROLLER
+/* Fake "interrupt" called by netpoll (eg netconsole) to send skbs without
+ * reenabling interrupts.
+ */
+static void mlx5e_netpoll(struct net_device *dev)
+{
+	struct mlx5e_priv *priv = netdev_priv(dev);
+	int i, nr_sq = priv->params.num_channels * priv->params.num_tc;
+
+	for (i = 0; i < nr_sq; i++)
+		napi_schedule(priv->txq_to_sq_map[i]->cq.napi);
+}
+#endif
+
 static const struct net_device_ops mlx5e_netdev_ops_basic = {
 	.ndo_open                = mlx5e_open,
 	.ndo_stop                = mlx5e_close,
@@ -2805,6 +2819,9 @@ static const struct net_device_ops mlx5e_netdev_ops_basic = {
 	.ndo_rx_flow_steer       = mlx5e_rx_flow_steer,
 #endif
 	.ndo_tx_timeout          = mlx5e_tx_timeout,
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	.ndo_poll_controller     = mlx5e_netpoll,
+#endif
 };
 
 static const struct net_device_ops mlx5e_netdev_ops_sriov = {
@@ -2836,6 +2853,9 @@ static const struct net_device_ops mlx5e_netdev_ops_sriov = {
 	.ndo_set_vf_link_state   = mlx5e_set_vf_link_state,
 	.ndo_get_vf_stats        = mlx5e_get_vf_stats,
 	.ndo_tx_timeout          = mlx5e_tx_timeout,
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	.ndo_poll_controller     = mlx5e_netpoll,
+#endif
 };
 
 static int mlx5e_check_required_hca_cap(struct mlx5_core_dev *mdev)
-- 
2.9.3
[PATCH 2/3] mpt3sas: Eliminate dead sleep_flag code
With the exception of a single call to wait_for_doorbell_int(), all this
conditional sleeping code is dead. So delete it.

Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
 drivers/scsi/mpt3sas/mpt3sas_base.c      | 241 +--
 drivers/scsi/mpt3sas/mpt3sas_base.h      |   6 +-
 drivers/scsi/mpt3sas/mpt3sas_config.c    |   3 +-
 drivers/scsi/mpt3sas/mpt3sas_ctl.c       |  15 +-
 drivers/scsi/mpt3sas/mpt3sas_scsih.c     |  21 +--
 drivers/scsi/mpt3sas/mpt3sas_transport.c |  12 +-
 6 files changed, 120 insertions(+), 178 deletions(-)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.c b/drivers/scsi/mpt3sas/mpt3sas_base.c
index 751f13e..0956183 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_base.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_base.c
@@ -98,7 +98,7 @@ MODULE_PARM_DESC(mpt3sas_fwfault_debug,
 	" enable detection of firmware fault and halt firmware - (default=0)");
 
 static int
-_base_get_ioc_facts(struct MPT3SAS_ADAPTER *ioc, int sleep_flag);
+_base_get_ioc_facts(struct MPT3SAS_ADAPTER *ioc);
 
 /**
  * _scsih_set_fwfault_debug - global setting of ioc->fwfault_debug.
@@ -218,8 +218,7 @@ _base_fault_reset_work(struct work_struct *work)
 	ioc->non_operational_loop = 0;
 
 	if ((doorbell & MPI2_IOC_STATE_MASK) != MPI2_IOC_STATE_OPERATIONAL) {
-		rc = mpt3sas_base_hard_reset_handler(ioc, CAN_SLEEP,
-		    FORCE_BIG_HAMMER);
+		rc = mpt3sas_base_hard_reset_handler(ioc, FORCE_BIG_HAMMER);
 		pr_warn(MPT3SAS_FMT "%s: hard reset: %s\n",
 		    ioc->name, __func__, (rc == 0) ? "success" : "failed");
 		doorbell = mpt3sas_base_get_iocstate(ioc, 0);
@@ -2145,7 +2144,7 @@ mpt3sas_base_map_resources(struct MPT3SAS_ADAPTER *ioc)
 
 	_base_mask_interrupts(ioc);
 
-	r = _base_get_ioc_facts(ioc, CAN_SLEEP);
+	r = _base_get_ioc_facts(ioc);
 	if (r)
 		goto out_fail;
 
@@ -3172,12 +3171,11 @@ _base_release_memory_pools(struct MPT3SAS_ADAPTER *ioc)
 /**
  * _base_allocate_memory_pools - allocate start of day memory pools
  * @ioc: per adapter object
- * @sleep_flag: CAN_SLEEP or NO_SLEEP
  *
  * Returns 0 success, anything else error
  */
 static int
-_base_allocate_memory_pools(struct MPT3SAS_ADAPTER *ioc, int sleep_flag)
+_base_allocate_memory_pools(struct MPT3SAS_ADAPTER *ioc)
 {
 	struct mpt3sas_facts *facts;
 	u16 max_sge_elements;
@@ -3647,29 +3645,25 @@ mpt3sas_base_get_iocstate(struct MPT3SAS_ADAPTER *ioc, int cooked)
  * _base_wait_on_iocstate - waiting on a particular ioc state
  * @ioc_state: controller state { READY, OPERATIONAL, or RESET }
  * @timeout: timeout in second
- * @sleep_flag: CAN_SLEEP or NO_SLEEP
  *
  * Returns 0 for success, non-zero for failure.
  */
 static int
-_base_wait_on_iocstate(struct MPT3SAS_ADAPTER *ioc, u32 ioc_state, int timeout,
-	int sleep_flag)
+_base_wait_on_iocstate(struct MPT3SAS_ADAPTER *ioc, u32 ioc_state, int timeout)
 {
 	u32 count, cntdn;
 	u32 current_state;
 
 	count = 0;
-	cntdn = (sleep_flag == CAN_SLEEP) ? 1000*timeout : 2000*timeout;
+	cntdn = 1000 * timeout;
 	do {
 		current_state = mpt3sas_base_get_iocstate(ioc, 1);
 		if (current_state == ioc_state)
 			return 0;
 		if (count && current_state == MPI2_IOC_STATE_FAULT)
 			break;
-		if (sleep_flag == CAN_SLEEP)
-			usleep_range(1000, 1500);
-		else
-			udelay(500);
+
+		usleep_range(1000, 1500);
 		count++;
 	} while (--cntdn);
 
@@ -3681,24 +3675,22 @@ _base_wait_on_iocstate(struct MPT3SAS_ADAPTER *ioc, u32 ioc_state, int timeout,
  * a write to the doorbell)
  * @ioc: per adapter object
  * @timeout: timeout in second
- * @sleep_flag: CAN_SLEEP or NO_SLEEP
  *
  * Returns 0 for success, non-zero for failure.
  *
  * Notes: MPI2_HIS_IOC2SYS_DB_STATUS - set to one when IOC writes to doorbell.
  */
 static int
-_base_diag_reset(struct MPT3SAS_ADAPTER *ioc, int sleep_flag);
+_base_diag_reset(struct MPT3SAS_ADAPTER *ioc);
 
 static int
-_base_wait_for_doorbell_int(struct MPT3SAS_ADAPTER *ioc, int timeout,
-	int sleep_flag)
+_base_wait_for_doorbell_int(struct MPT3SAS_ADAPTER *ioc, int timeout)
 {
 	u32 cntdn, count;
 	u32 int_status;
 
 	count = 0;
-	cntdn = (sleep_flag == CAN_SLEEP) ? 1000*timeout : 2000*timeout;
+	cntdn = 1000 * timeout;
 	do {
 		int_status = readl(&ioc->chip->HostInterruptStatus);
 		if (int_status & MPI2_HIS_IOC2SYS_DB_STATUS) {
@@ -3707,10 +3699,35 @@ _base_wait_for_doorbell_int(struct MPT3SAS_ADAPTER *ioc, int timeout,
 				ioc->name, __func__, count, timeout));
 			return 0;
 		}
-		if (sleep_flag == CAN_SLE
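Once the sleep_flag branches are gone, the wait loops in this patch reduce to one shape: a countdown of 1000 * timeout iterations with a sleep of roughly one millisecond per pass. A userspace sketch of that shape follows. It is an illustration under assumptions, not the driver code: `wait_on_state()`, the callback types and the demo helpers are hypothetical names, and the guard for a non-positive timeout is an addition of this sketch.

```c
#include <stdbool.h>

/* Hypothetical hooks: a state check and a "sleep about 1 ms" primitive. */
typedef bool (*state_check_fn)(void *ctx);
typedef void (*sleep_fn)(void);

/*
 * Poll until the state is reached or roughly timeout_sec seconds have
 * passed. Returns 0 on success, -1 on timeout.
 */
int wait_on_state(state_check_fn reached, void *ctx, int timeout_sec,
		  sleep_fn sleep_about_1ms)
{
	unsigned int cntdn;

	/* Guard added in this sketch: avoid wrapping the countdown. */
	if (timeout_sec <= 0)
		return reached(ctx) ? 0 : -1;

	cntdn = 1000u * (unsigned int)timeout_sec;
	do {
		if (reached(ctx))
			return 0;
		sleep_about_1ms();
	} while (--cntdn);

	return -1;
}

/* Demo helpers (assumptions): become ready after a fixed number of polls. */
struct poll_counter {
	int polls_until_ready;
	int polls_seen;
};

bool ready_after_n_polls(void *ctx)
{
	struct poll_counter *c = ctx;

	c->polls_seen++;
	return c->polls_until_ready-- <= 0;
}

void no_sleep(void)
{
	/* Stand-in for a ~1 ms sleep so the example finishes instantly. */
}
```

Passing the sleep primitive in as a callback is only a convenience for the sketch. In the patch itself the loop calls usleep_range(1000, 1500) directly.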
[PATCH 1/3] mpt3sas: Eliminate conditional locking in mpt3sas_scsih_issue_tm()
This flag that conditionally acquires the mutex is confusing and prone
to bugginess: refactor it into two separate function calls, and make the
unlocked one complain if it's called outside the mutex.

Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
 drivers/scsi/mpt3sas/mpt3sas_base.h  | 16 +++--
 drivers/scsi/mpt3sas/mpt3sas_ctl.c   |  5 ++-
 drivers/scsi/mpt3sas/mpt3sas_scsih.c | 66 +---
 3 files changed, 38 insertions(+), 49 deletions(-)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.h b/drivers/scsi/mpt3sas/mpt3sas_base.h
index eb7f5b0..f0baafd 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_base.h
+++ b/drivers/scsi/mpt3sas/mpt3sas_base.h
@@ -794,16 +794,6 @@ struct reply_post_struct {
 	dma_addr_t reply_post_free_dma;
 };
 
-/**
- * enum mutex_type - task management mutex type
- * @TM_MUTEX_OFF: mutex is not required becuase calling function is acquiring it
- * @TM_MUTEX_ON: mutex is required
- */
-enum mutex_type {
-	TM_MUTEX_OFF = 0,
-	TM_MUTEX_ON = 1,
-};
-
 typedef void (*MPT3SAS_FLUSH_RUNNING_CMDS)(struct MPT3SAS_ADAPTER *ioc);
 /**
  * struct MPT3SAS_ADAPTER - per adapter struct
@@ -1291,7 +1281,11 @@ void mpt3sas_scsih_reset_handler(struct MPT3SAS_ADAPTER *ioc, int reset_phase);
 
 int mpt3sas_scsih_issue_tm(struct MPT3SAS_ADAPTER *ioc, u16 handle, uint channel,
 	uint id, uint lun, u8 type, u16 smid_task,
-	ulong timeout, enum mutex_type m_type);
+	ulong timeout);
+int mpt3sas_scsih_issue_locked_tm(struct MPT3SAS_ADAPTER *ioc, u16 handle,
+	uint channel, uint id, uint lun, u8 type, u16 smid_task,
+	ulong timeout);
+
 void mpt3sas_scsih_set_tm_flag(struct MPT3SAS_ADAPTER *ioc, u16 handle);
 void mpt3sas_scsih_clear_tm_flag(struct MPT3SAS_ADAPTER *ioc, u16 handle);
 void mpt3sas_expander_remove(struct MPT3SAS_ADAPTER *ioc, u64 sas_address);
diff --git a/drivers/scsi/mpt3sas/mpt3sas_ctl.c b/drivers/scsi/mpt3sas/mpt3sas_ctl.c
index 7d00f09..75ae533 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_ctl.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_ctl.c
@@ -1001,10 +1001,9 @@ _ctl_do_mpt_command(struct MPT3SAS_ADAPTER *ioc, struct mpt3_ioctl_command karg,
 				ioc->name,
 				le16_to_cpu(mpi_request->FunctionDependent1));
 			mpt3sas_halt_firmware(ioc);
-			mpt3sas_scsih_issue_tm(ioc,
+			mpt3sas_scsih_issue_locked_tm(ioc,
 			    le16_to_cpu(mpi_request->FunctionDependent1), 0, 0,
-			    0, MPI2_SCSITASKMGMT_TASKTYPE_TARGET_RESET, 0, 30,
-			    TM_MUTEX_ON);
+			    0, MPI2_SCSITASKMGMT_TASKTYPE_TARGET_RESET, 0, 30);
 		} else
 			mpt3sas_base_hard_reset_handler(ioc, CAN_SLEEP,
 			    FORCE_BIG_HAMMER);
diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
index acabe48..c93a7ba 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
@@ -2201,7 +2201,6 @@ mpt3sas_scsih_clear_tm_flag(struct MPT3SAS_ADAPTER *ioc, u16 handle)
  * @type: MPI2_SCSITASKMGMT_TASKTYPE__XXX (defined in mpi2_init.h)
  * @smid_task: smid assigned to the task
  * @timeout: timeout in seconds
- * @m_type: TM_MUTEX_ON or TM_MUTEX_OFF
  * Context: user
  *
  * A generic API for sending task management requests to firmware.
@@ -2212,8 +2211,7 @@ mpt3sas_scsih_clear_tm_flag(struct MPT3SAS_ADAPTER *ioc, u16 handle)
  */
 int
 mpt3sas_scsih_issue_tm(struct MPT3SAS_ADAPTER *ioc, u16 handle, uint channel,
-	uint id, uint lun, u8 type, u16 smid_task, ulong timeout,
-	enum mutex_type m_type)
+	uint id, uint lun, u8 type, u16 smid_task, ulong timeout)
 {
 	Mpi2SCSITaskManagementRequest_t *mpi_request;
 	Mpi2SCSITaskManagementReply_t *mpi_reply;
@@ -2224,21 +,19 @@ mpt3sas_scsih_issue_tm(struct MPT3SAS_ADAPTER *ioc, u16 handle, uint channel,
 	int rc;
 	u16 msix_task = 0;
 
-	if (m_type == TM_MUTEX_ON)
-		mutex_lock(&ioc->tm_cmds.mutex);
+	lockdep_assert_held(&ioc->tm_cmds.mutex);
+
 	if (ioc->tm_cmds.status != MPT3_CMD_NOT_USED) {
 		pr_info(MPT3SAS_FMT "%s: tm_cmd busy!!!\n",
 		    __func__, ioc->name);
-		rc = FAILED;
-		goto err_out;
+		return FAILED;
 	}
 
 	if (ioc->shost_recovery || ioc->remove_host ||
 	    ioc->pci_error_recovery) {
 		pr_info(MPT3SAS_FMT "%s: host reset in progress!\n",
 		    __func__, ioc->name);
-		rc = FAILED;
-		goto err_out;
+		return FAILED;
 	}
 
 	ioc_state = mpt3sas_base_get_iocstate(ioc, 0);
@@ -2247,8 +2243,7 @@ mpt3sas_scsih_issue_tm(struct MPT3SAS_ADAPTER *ioc, u16 handle, uint channel,
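The refactoring described in this patch (one function that expects the lock to be held, plus a thin wrapper that takes it) can be modelled as below. This is a single-threaded sketch under assumptions: `tm_lock_sketch` is a plain flag rather than a real mutex, all names are hypothetical, and returning an error when the lock is not held stands in for lockdep_assert_held(), which in the kernel only warns.

```c
#include <stdbool.h>

#define TM_SKETCH_FAILED (-1)

/* Hypothetical flag-based lock: enough to show the calling convention. */
struct tm_lock_sketch {
	bool held;
};

struct ioc_sketch {
	struct tm_lock_sketch tm_mutex;
	int tm_issued;
};

/*
 * Unlocked variant: the caller must already hold tm_mutex. The check is
 * a stand-in for lockdep_assert_held(); here it fails the call so that
 * misuse is visible in a test.
 */
int issue_tm_sketch(struct ioc_sketch *ioc)
{
	if (!ioc->tm_mutex.held)
		return TM_SKETCH_FAILED;

	ioc->tm_issued++;
	return 0;
}

/* Locked variant: takes the lock, calls the unlocked one, drops the lock. */
int issue_locked_tm_sketch(struct ioc_sketch *ioc)
{
	int rc;

	ioc->tm_mutex.held = true;	/* models mutex_lock() */
	rc = issue_tm_sketch(ioc);
	ioc->tm_mutex.held = false;	/* models mutex_unlock() */
	return rc;
}
```

Compared with a flag argument that conditionally locks, each call site now states its locking expectation through the function name, and the early-exit paths in the unlocked variant no longer need an unlock label.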
[PATCH 3/3] mpt3sas: Fix warnings exposed by W=1
Trivial non-functional changes for a couple annoying things:

1) Functions local to files are not declared static, which is
   frustrating when reading the code because it's non-obvious at first
   glance what's actually called from other files.

2) Set-but-unused variables abound, presumably to mask -Wunused-result
   errors in the past. None of these are flagged today though (with one
   exception noted below), so remove them.

Fixing (2) exposed the fact that we improperly ignore the return value
of scsi_device_reprobe() in _scsih_reprobe_lun(). Fixing the calling
code to deal with the potential error is non-trivial, so for now just
WARN().

Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
 drivers/scsi/mpt3sas/mpt3sas_base.c      | 18 +++-
 drivers/scsi/mpt3sas/mpt3sas_config.c    |  4 +-
 drivers/scsi/mpt3sas/mpt3sas_ctl.c       | 29 ++---
 drivers/scsi/mpt3sas/mpt3sas_scsih.c     | 70 +++-
 drivers/scsi/mpt3sas/mpt3sas_transport.c | 16 ++--
 5 files changed, 56 insertions(+), 81 deletions(-)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.c b/drivers/scsi/mpt3sas/mpt3sas_base.c
index 0956183..df95d1a 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_base.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_base.c
@@ -2039,7 +2039,7 @@ _base_enable_msix(struct MPT3SAS_ADAPTER *ioc)
  * mpt3sas_base_unmap_resources - free controller resources
  * @ioc: per adapter object
  */
-void
+static void
 mpt3sas_base_unmap_resources(struct MPT3SAS_ADAPTER *ioc)
 {
 	struct pci_dev *pdev = ioc->pdev;
@@ -3884,7 +3884,6 @@ _base_handshake_req_reply_wait(struct MPT3SAS_ADAPTER *ioc, int request_bytes,
 	MPI2DefaultReply_t *default_reply = (MPI2DefaultReply_t *)reply;
 	int i;
 	u8 failed;
-	u16 dummy;
 	__le32 *mfp;
 
 	/* make sure doorbell is not in use */
@@ -3964,7 +3963,7 @@ _base_handshake_req_reply_wait(struct MPT3SAS_ADAPTER *ioc, int request_bytes,
 			return -EFAULT;
 		}
 		if (i >= reply_bytes/2) /* overflow case */
-			dummy = readl(&ioc->chip->Doorbell);
+			readl(&ioc->chip->Doorbell);
 		else
 			reply[i] = le16_to_cpu(readl(&ioc->chip->Doorbell)
 			    & MPI2_DOORBELL_DATA_MASK);
@@ -4009,7 +4008,6 @@ mpt3sas_base_sas_iounit_control(struct MPT3SAS_ADAPTER *ioc,
 {
 	u16 smid;
 	u32 ioc_state;
-	unsigned long timeleft;
 	bool issue_reset = false;
 	int rc;
 	void *request;
@@ -4062,7 +4060,7 @@ mpt3sas_base_sas_iounit_control(struct MPT3SAS_ADAPTER *ioc,
 		ioc->ioc_link_reset_in_progress = 1;
 	init_completion(&ioc->base_cmds.done);
 	mpt3sas_base_put_smid_default(ioc, smid);
-	timeleft = wait_for_completion_timeout(&ioc->base_cmds.done,
+	wait_for_completion_timeout(&ioc->base_cmds.done,
 	    msecs_to_jiffies(1));
 	if ((mpi_request->Operation == MPI2_SAS_OP_PHY_HARD_RESET ||
 	    mpi_request->Operation == MPI2_SAS_OP_PHY_LINK_RESET) &&
@@ -4112,7 +4110,6 @@ mpt3sas_base_scsi_enclosure_processor(struct MPT3SAS_ADAPTER *ioc,
 {
 	u16 smid;
 	u32 ioc_state;
-	unsigned long timeleft;
 	bool issue_reset = false;
 	int rc;
 	void *request;
@@ -4163,7 +4160,7 @@ mpt3sas_base_scsi_enclosure_processor(struct MPT3SAS_ADAPTER *ioc,
 	memcpy(request, mpi_request, sizeof(Mpi2SepReply_t));
 	init_completion(&ioc->base_cmds.done);
 	mpt3sas_base_put_smid_default(ioc, smid);
-	timeleft = wait_for_completion_timeout(&ioc->base_cmds.done,
+	wait_for_completion_timeout(&ioc->base_cmds.done,
 	    msecs_to_jiffies(1));
 	if (!(ioc->base_cmds.status & MPT3_CMD_COMPLETE)) {
 		pr_err(MPT3SAS_FMT "%s: timeout\n",
@@ -4548,7 +4545,6 @@ _base_send_port_enable(struct MPT3SAS_ADAPTER *ioc)
 {
 	Mpi2PortEnableRequest_t *mpi_request;
 	Mpi2PortEnableReply_t *mpi_reply;
-	unsigned long timeleft;
 	int r = 0;
 	u16 smid;
 	u16 ioc_status;
@@ -4576,8 +4572,7 @@ _base_send_port_enable(struct MPT3SAS_ADAPTER *ioc)
 
 	init_completion(&ioc->port_enable_cmds.done);
 	mpt3sas_base_put_smid_default(ioc, smid);
-	timeleft = wait_for_completion_timeout(&ioc->port_enable_cmds.done,
-	    300*HZ);
+	wait_for_completion_timeout(&ioc->port_enable_cmds.done, 300*HZ);
 	if (!(ioc->port_enable_cmds.status & MPT3_CMD_COMPLETE)) {
 		pr_err(MPT3SAS_FMT "%s: timeout\n",
 		    ioc->name, __func__);
@@ -4728,7 +4723,6 @@ static int
 _base_event_notification(struct MPT3SAS_ADAPTER *ioc)
 {
 	Mpi2EventNotificationRequest_t *mpi_request;
-	unsigned long timeleft;
 	u16 smid;
 	int r = 0;
 	int i;
@@ -4760,7 +4754,7 @@ _base_event_notification(struct
[PATCH 1/3] mpt3sas: Eliminate conditional locking in mpt3sas_scsih_issue_tm()
This flag that conditionally acquires the mutex is confusing and prone
to bugginess: refactor it into two separate function calls, and make the
unlocked one complain if it's called outside the mutex.

Signed-off-by: Calvin Owens
---
 drivers/scsi/mpt3sas/mpt3sas_base.h  | 16 +++--
 drivers/scsi/mpt3sas/mpt3sas_ctl.c   |  5 ++-
 drivers/scsi/mpt3sas/mpt3sas_scsih.c | 66 +---
 3 files changed, 38 insertions(+), 49 deletions(-)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.h b/drivers/scsi/mpt3sas/mpt3sas_base.h
index eb7f5b0..f0baafd 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_base.h
+++ b/drivers/scsi/mpt3sas/mpt3sas_base.h
@@ -794,16 +794,6 @@ struct reply_post_struct {
 	dma_addr_t reply_post_free_dma;
 };
 
-/**
- * enum mutex_type - task management mutex type
- * @TM_MUTEX_OFF: mutex is not required becuase calling function is acquiring it
- * @TM_MUTEX_ON: mutex is required
- */
-enum mutex_type {
-	TM_MUTEX_OFF = 0,
-	TM_MUTEX_ON = 1,
-};
-
 typedef void (*MPT3SAS_FLUSH_RUNNING_CMDS)(struct MPT3SAS_ADAPTER *ioc);
 /**
  * struct MPT3SAS_ADAPTER - per adapter struct
@@ -1291,7 +1281,11 @@ void mpt3sas_scsih_reset_handler(struct MPT3SAS_ADAPTER *ioc, int reset_phase);
 
 int mpt3sas_scsih_issue_tm(struct MPT3SAS_ADAPTER *ioc, u16 handle,
 	uint channel, uint id, uint lun, u8 type, u16 smid_task,
-	ulong timeout, enum mutex_type m_type);
+	ulong timeout);
+int mpt3sas_scsih_issue_locked_tm(struct MPT3SAS_ADAPTER *ioc, u16 handle,
+	uint channel, uint id, uint lun, u8 type, u16 smid_task,
+	ulong timeout);
+
 void mpt3sas_scsih_set_tm_flag(struct MPT3SAS_ADAPTER *ioc, u16 handle);
 void mpt3sas_scsih_clear_tm_flag(struct MPT3SAS_ADAPTER *ioc, u16 handle);
 void mpt3sas_expander_remove(struct MPT3SAS_ADAPTER *ioc, u64 sas_address);
diff --git a/drivers/scsi/mpt3sas/mpt3sas_ctl.c b/drivers/scsi/mpt3sas/mpt3sas_ctl.c
index 7d00f09..75ae533 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_ctl.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_ctl.c
@@ -1001,10 +1001,9 @@ _ctl_do_mpt_command(struct MPT3SAS_ADAPTER *ioc, struct mpt3_ioctl_command karg,
 				ioc->name,
 				le16_to_cpu(mpi_request->FunctionDependent1));
 			mpt3sas_halt_firmware(ioc);
-			mpt3sas_scsih_issue_tm(ioc,
+			mpt3sas_scsih_issue_locked_tm(ioc,
 			    le16_to_cpu(mpi_request->FunctionDependent1), 0, 0,
-			    0, MPI2_SCSITASKMGMT_TASKTYPE_TARGET_RESET, 0, 30,
-			    TM_MUTEX_ON);
+			    0, MPI2_SCSITASKMGMT_TASKTYPE_TARGET_RESET, 0, 30);
 		} else
 			mpt3sas_base_hard_reset_handler(ioc, CAN_SLEEP,
 			    FORCE_BIG_HAMMER);
diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
index acabe48..c93a7ba 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
@@ -2201,7 +2201,6 @@ mpt3sas_scsih_clear_tm_flag(struct MPT3SAS_ADAPTER *ioc, u16 handle)
  * @type: MPI2_SCSITASKMGMT_TASKTYPE__XXX (defined in mpi2_init.h)
  * @smid_task: smid assigned to the task
  * @timeout: timeout in seconds
- * @m_type: TM_MUTEX_ON or TM_MUTEX_OFF
  * Context: user
  *
  * A generic API for sending task management requests to firmware.
@@ -2212,8 +2211,7 @@ mpt3sas_scsih_clear_tm_flag(struct MPT3SAS_ADAPTER *ioc, u16 handle)
  */
 int
 mpt3sas_scsih_issue_tm(struct MPT3SAS_ADAPTER *ioc, u16 handle, uint channel,
-	uint id, uint lun, u8 type, u16 smid_task, ulong timeout,
-	enum mutex_type m_type)
+	uint id, uint lun, u8 type, u16 smid_task, ulong timeout)
 {
 	Mpi2SCSITaskManagementRequest_t *mpi_request;
 	Mpi2SCSITaskManagementReply_t *mpi_reply;
@@ -2224,21 +,19 @@ mpt3sas_scsih_issue_tm(struct MPT3SAS_ADAPTER *ioc, u16 handle, uint channel,
 	int rc;
 	u16 msix_task = 0;
 
-	if (m_type == TM_MUTEX_ON)
-		mutex_lock(&ioc->tm_cmds.mutex);
+	lockdep_assert_held(&ioc->tm_cmds.mutex);
+
 	if (ioc->tm_cmds.status != MPT3_CMD_NOT_USED) {
 		pr_info(MPT3SAS_FMT "%s: tm_cmd busy!!!\n",
 		    __func__, ioc->name);
-		rc = FAILED;
-		goto err_out;
+		return FAILED;
 	}
 
 	if (ioc->shost_recovery || ioc->remove_host ||
 	    ioc->pci_error_recovery) {
 		pr_info(MPT3SAS_FMT "%s: host reset in progress!\n",
 		    __func__, ioc->name);
-		rc = FAILED;
-		goto err_out;
+		return FAILED;
 	}
 
 	ioc_state = mpt3sas_base_get_iocstate(ioc, 0);
@@ -2247,8 +2243,7 @@ mpt3sas_scsih_issue_tm(struct MPT3SAS_ADAPTER *ioc, u16 handle, uint channel,
 "unexp
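The locked/unlocked split this patch describes can be sketched in userspace as follows. This is only an illustration under stated assumptions, not the driver code: `struct tm_ctx`, `issue_tm()` and `issue_locked_tm()` are hypothetical names, and the `mutex_held` flag is a crude stand-in for what lockdep_assert_held() checks in the kernel (it does not track which thread owns the lock).

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the adapter's task-management state. */
struct tm_ctx {
	pthread_mutex_t mutex;
	bool mutex_held;	/* debug-only flag; the kernel uses lockdep instead */
	int issued;		/* number of requests issued so far */
};

/* Unlocked variant: the caller must already hold ctx->mutex. */
static int issue_tm(struct tm_ctx *ctx)
{
	/* Rough userspace analogue of lockdep_assert_held(). */
	assert(ctx->mutex_held);
	ctx->issued++;
	return 0;
}

/* Locked variant: take the mutex, call the unlocked variant, release. */
static int issue_locked_tm(struct tm_ctx *ctx)
{
	int rc;

	pthread_mutex_lock(&ctx->mutex);
	ctx->mutex_held = true;
	rc = issue_tm(ctx);
	ctx->mutex_held = false;
	pthread_mutex_unlock(&ctx->mutex);
	return rc;
}
```

A caller that already holds the mutex calls issue_tm() directly; everyone else calls issue_locked_tm(). Compared with a flag argument, a wrong call site fails loudly at the assertion rather than silently skipping or double-taking the lock.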
[PATCH] mpt3sas: Ensure the connector_name string is NUL-terminated
We blindly trust the hardware to give us NUL-terminated strings, which
is a bad idea because it doesn't always do that. For example:

  [ 481.184784] mpt3sas_cm0: enclosure level(0x), connector name( \x3)

In this case, connector_name is four spaces. We got lucky here because
the 2nd byte beyond our character array happens to be a NUL. Fix this by
explicitly writing '\0' to the end of the string to ensure we don't run
off the edge of the world in printk().

Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
 drivers/scsi/mpt3sas/mpt3sas_base.h  |  2 +-
 drivers/scsi/mpt3sas/mpt3sas_scsih.c | 10 ++
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.h b/drivers/scsi/mpt3sas/mpt3sas_base.h
index 892c9be..eb7f5b0 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_base.h
+++ b/drivers/scsi/mpt3sas/mpt3sas_base.h
@@ -478,7 +478,7 @@ struct _sas_device {
 	u8 pfa_led_on;
 	u8 pend_sas_rphy_add;
 	u8 enclosure_level;
-	u8 connector_name[4];
+	u8 connector_name[5];
 	struct kref refcount;
 };
 
diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
index cd91a68..acabe48 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
@@ -5380,8 +5380,9 @@ _scsih_check_device(struct MPT3SAS_ADAPTER *ioc,
 	    MPI2_SAS_DEVICE0_FLAGS_ENCL_LEVEL_VALID) {
 		sas_device->enclosure_level =
 		    le16_to_cpu(sas_device_pg0.EnclosureLevel);
-		memcpy(&sas_device->connector_name[0],
-		    &sas_device_pg0.ConnectorName[0], 4);
+		memcpy(sas_device->connector_name,
+		    sas_device_pg0.ConnectorName, 4);
+		sas_device->connector_name[4] = '\0';
 	} else {
 		sas_device->enclosure_level = 0;
 		sas_device->connector_name[0] = '\0';
@@ -5508,8 +5509,9 @@ _scsih_add_device(struct MPT3SAS_ADAPTER *ioc, u16 handle, u8 phy_num,
 	if (sas_device_pg0.Flags & MPI2_SAS_DEVICE0_FLAGS_ENCL_LEVEL_VALID) {
 		sas_device->enclosure_level =
 		    le16_to_cpu(sas_device_pg0.EnclosureLevel);
-		memcpy(&sas_device->connector_name[0],
-		    &sas_device_pg0.ConnectorName[0], 4);
+		memcpy(sas_device->connector_name,
+		    sas_device_pg0.ConnectorName, 4);
+		sas_device->connector_name[4] = '\0';
 	} else {
 		sas_device->enclosure_level = 0;
 		sas_device->connector_name[0] = '\0';
-- 
2.8.0.rc2
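The pattern in this fix can be shown in isolation with a minimal userspace sketch. It is not the driver code: `CONNECTOR_NAME_LEN`, `struct dev_info` and `copy_connector_name()` are hypothetical names chosen for illustration. The idea is that the destination is one byte larger than the fixed-width hardware field, and the terminator is written unconditionally after the copy.

```c
#include <string.h>

/* Width of the fixed-size hardware field (hypothetical name). */
#define CONNECTOR_NAME_LEN 4

struct dev_info {
	/* One extra byte so a terminator always fits. */
	char connector_name[CONNECTOR_NAME_LEN + 1];
};

/*
 * Copy a fixed-width field that may not be NUL-terminated, then
 * terminate it unconditionally so it is safe to print with "%s".
 */
static void copy_connector_name(struct dev_info *dev, const char *raw)
{
	memcpy(dev->connector_name, raw, CONNECTOR_NAME_LEN);
	dev->connector_name[CONNECTOR_NAME_LEN] = '\0';
}
```

With a raw field of four spaces, as in the log line quoted in the commit message, the copy ends in a terminator regardless of what follows the source field in memory.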
Re: [PATCH] ses: Fix racy cleanup of /sys in remove_dev()
On 06/15/2016 01:24 PM, Calvin Owens wrote:
On Thursday 06/02 at 15:50 -0700, Calvin Owens wrote:
On 05/13/2016 01:28 PM, Calvin Owens wrote:

Currently we free the resources backing the enclosure device before we
call device_unregister(). This is racy: during rmmod of low-level SCSI
drivers that hook into enclosure, we end up with a small window of time
during which writing to /sys can OOPS. Example trace with mpt3sas:

Ping? Any thoughts?

Squinting at this more it still seems racy, but a narrow race is surely
better than just blatantly freeing everything while the file is still
exposed in /sys?

Is there a better way you'd prefer I accomplish this?

(I have boxes that OOPS all the time from monitoring code reading the
/sys files, with this patch I haven't seen a single one.)

Thanks,
Calvin

Ping? Thoughts, comments?

general protection fault: [#1] SMP KASAN
Modules linked in: mpt3sas(-) <...>
RIP: [] ses_get_page2_descriptor.isra.6+0x38/0x220 [ses]
Call Trace:
 [] ses_set_fault+0xf4/0x400 [ses]
 [] set_component_fault+0xa9/0xf0 [enclosure]
 [] dev_attr_store+0x3c/0x70
 [] sysfs_kf_write+0x115/0x180
 [] kernfs_fop_write+0x275/0x3a0
 [] __vfs_write+0xe0/0x3e0
 [] vfs_write+0x13f/0x4a0
 [] SyS_write+0x111/0x230
 [] entry_SYSCALL_64_fastpath+0x13/0x94

Fortunately the solution is extremely simple: call device_unregister()
before we free the resources, and the race no longer exists. The driver
core holds a reference over ->remove_dev(), so AFAICT this is safe.
Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
 drivers/scsi/ses.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/ses.c b/drivers/scsi/ses.c
index 53ef1cb..0e8601a 100644
--- a/drivers/scsi/ses.c
+++ b/drivers/scsi/ses.c
@@ -778,6 +778,8 @@ static void ses_intf_remove_enclosure(struct scsi_device *sdev)
 	if (!edev)
 		return;
 
+	enclosure_unregister(edev);
+
 	ses_dev = edev->scratch;
 	edev->scratch = NULL;
 
@@ -789,7 +791,6 @@ static void ses_intf_remove_enclosure(struct scsi_device *sdev)
 	kfree(edev->component[0].scratch);
 
 	put_device(&edev->edev);
-	enclosure_unregister(edev);
 }
 
 static void ses_intf_remove(struct device *cdev,
Re: [BUG] Slab corruption during XFS writeback under memory pressure
On 07/18/2016 07:05 PM, Calvin Owens wrote:
On 07/17/2016 11:02 PM, Dave Chinner wrote:
On Sun, Jul 17, 2016 at 10:00:03AM +1000, Dave Chinner wrote:
On Fri, Jul 15, 2016 at 05:18:02PM -0700, Calvin Owens wrote:

Hello all,

I've found a nasty source of slab corruption. Based on seeing similar
symptoms on boxes at Facebook, I suspect it's been around since at least
3.10.

It only reproduces under memory pressure so far as I can tell: the issue
seems to be that XFS reclaims pages from buffers that are still in use
by scsi/block. I'm not sure which side the bug lies on, but I've only
observed it with XFS.

[]

But this indicates that the page is under writeback at this point, so
that tends to indicate that the above freeing was incorrect.

Hmmm - it's clear we've got direct reclaim involved here, and the
suspicion of a dirty page that has had its bufferheads cleared. Are
there any other warnings in the log from XFS prior to kasan throwing the
error?

Can you try the patch below?

Thanks for getting this out so quickly :)

So far so good: I booted Linus' tree as of this morning and reproduced
the ASAN splat. After applying your patch I haven't triggered it.

I'm a bit wary since it was hard to trigger reliably in the first
place... so I lined up a few dozen boxes to run the test case overnight.
I'll confirm in the morning (-0700) they look good.

All right, my testcase ran 2099 times overnight without triggering
anything.

For the overnight tests, I booted the boxes with "mem=" to artificially
limit RAM, which makes my repro *much* more reliable (I feel silly for
not thinking of that in the first place). With that setup, I hit the
ASAN splat 21 times in 98 runs on vanilla 4.7-rc7.

So I'm sold.

Tested-by: Calvin Owens <calvinow...@fb.com>

Again, really appreciate the quick response :)

Thanks,
Calvin

Thanks,
Calvin

-Dave.
Re: [BUG] Slab corruption during XFS writeback under memory pressure
On 07/17/2016 11:02 PM, Dave Chinner wrote:
On Sun, Jul 17, 2016 at 10:00:03AM +1000, Dave Chinner wrote:
On Fri, Jul 15, 2016 at 05:18:02PM -0700, Calvin Owens wrote:

Hello all,

I've found a nasty source of slab corruption. Based on seeing similar
symptoms on boxes at Facebook, I suspect it's been around since at least
3.10.

It only reproduces under memory pressure so far as I can tell: the issue
seems to be that XFS reclaims pages from buffers that are still in use
by scsi/block. I'm not sure which side the bug lies on, but I've only
observed it with XFS.

[]

But this indicates that the page is under writeback at this point, so
that tends to indicate that the above freeing was incorrect.

Hmmm - it's clear we've got direct reclaim involved here, and the
suspicion of a dirty page that has had its bufferheads cleared. Are
there any other warnings in the log from XFS prior to kasan throwing the
error?

Can you try the patch below?

Thanks for getting this out so quickly :)

So far so good: I booted Linus' tree as of this morning and reproduced
the ASAN splat. After applying your patch I haven't triggered it.

I'm a bit wary since it was hard to trigger reliably in the first
place... so I lined up a few dozen boxes to run the test case overnight.
I'll confirm in the morning (-0700) they look good.

Thanks,
Calvin

-Dave.
[BUG] Slab corruption during XFS writeback under memory pressure
Hello all,

I've found a nasty source of slab corruption. Based on seeing similar
symptoms on boxes at Facebook, I suspect it's been around since at least
3.10.

It only reproduces under memory pressure so far as I can tell: the issue
seems to be that XFS reclaims pages from buffers that are still in use
by scsi/block. I'm not sure which side the bug lies on, but I've only
observed it with XFS.

[67203.776421] ==
[67203.792521] BUG: KASAN: use-after-free in xfs_destroy_ioend+0x3bf/0x4c0 at addr 8804cf466288
[67203.812036] Read of size 8 by task python2.7/22913
[67203.822713] =
[67203.840917] BUG buffer_head (Not tainted): kasan: bad access detected
[67203.855253] -
[67203.855253]
[67203.876727] Disabling lock debugging due to kernel taint
[67203.888575] INFO: Allocated in 0x8804cf465d40 age=18437180719206552994 cpu=2191548261 pid=-1
[67203.908139] alloc_buffer_head+0x22/0xd0
[67203.916903] ___slab_alloc+0x4e0/0x520
[67203.925286] __slab_alloc+0x43/0x70
[67203.933087] kmem_cache_alloc+0x228/0x2c0
[67203.942042] alloc_buffer_head+0x22/0xd0
[67203.950782] alloc_page_buffers+0xa9/0x1f0
[67203.959936] create_empty_buffers+0x30/0x420
[67203.969495] create_page_buffers+0x120/0x1b0
[67203.979029] __block_write_begin+0x16b/0x1010
[67203.988756] xfs_vm_write_begin+0x55/0x1b0
[67203.997884] generic_perform_write+0x288/0x510
[67204.007771] xfs_file_buffered_aio_write+0x316/0x780
[67204.018811] xfs_file_write_iter+0x26f/0x6c0
[67204.028313] __vfs_write+0x2a0/0x620
[67204.036276] vfs_write+0x159/0x4c0
[67204.043855] SyS_write+0xd2/0x1b0
[67204.051245] INFO: Freed in 0x103fc80ec age=18446651500051355200 cpu=2165122683 pid=-1
[67204.068634] free_buffer_head+0x41/0x90
[67204.077175] __slab_free+0x1ed/0x340
[67204.085138] kmem_cache_free+0x270/0x300
[67204.093867] free_buffer_head+0x41/0x90
[67204.102422] try_to_free_buffers+0x171/0x240
[67204.111925] xfs_vm_releasepage+0xcb/0x3b0
[67204.121101] try_to_release_page+0x106/0x190
[67204.130602] shrink_page_list+0x118e/0x1a10
[67204.139910] shrink_inactive_list+0x42c/0xdf0
[67204.149600] shrink_zone_memcg+0xa09/0xfa0
[67204.158715] shrink_zone+0x2c3/0xbc0
[67204.166679] do_try_to_free_pages+0x42a/0x12f0
[67204.176562] try_to_free_pages+0x1a3/0x5d0
[67204.185709] __alloc_pages_nodemask+0xbeb/0x20d0
[67204.195979] alloc_pages_vma+0x11b/0x5e0
[67204.204709] handle_mm_fault+0x2c27/0x47d0
[67204.213823] INFO: Slab 0xea00133d1900 objects=37 used=14 fp=0x8804cf464530 flags=0x20004080
[67204.235439] INFO: Object 0x8804cf466260 @offset=8800 fp=0x
[67204.235439]
[67204.455817] CPU: 1 PID: 22913 Comm: python2.7 Tainted: GB 4.7.0-rc7-calvinowens-1468357363-1-gcaa3dc6 #1
[67204.480313] Hardware name: Wiwynn HoneyBadger/PantherPlus, BIOS HBM6.71 02/03/2016
[67204.497509] 88075e99f480 88075ec87a30 81e8b8e4 8804cf464000
[67204.514224] 8804cf466260 88075ec87a60 8153a995 88075e99f480
[67204.530924] ea00133d1900 8804cf466260 dc00 88075ec87a88
[67204.547624] Call Trace:
[67204.553086][] dump_stack+0x68/0x94
[67204.565946] [] print_trailer+0x115/0x1a0
[67204.578334] [] object_err+0x34/0x40
[67204.589762] [] kasan_report_error+0x217/0x530
[67204.616847] [] __asan_report_load8_noabort+0x43/0x50
[67204.645085] [] xfs_destroy_ioend+0x3bf/0x4c0
[67204.658243] [] xfs_end_bio+0x154/0x220
[67204.685362] [] bio_endio+0x158/0x1b0
[67204.696983] [] blk_update_request+0x18b/0xb80
[67204.710334] [] scsi_end_request+0x97/0x5a0
[67204.723108] [] scsi_io_completion+0x438/0x1690
[67204.807293] [] scsi_finish_command+0x375/0x4e0
[67204.820838] [] scsi_softirq_done+0x280/0x340
[67204.848884] [] blk_done_softirq+0x1ff/0x360
[67204.875074] [] __do_softirq+0x22d/0x8d7
[67204.887270] [] irq_exit+0x15c/0x190
[67204.898697] [] smp_apic_timer_interrupt+0x83/0xa0
[67204.912815] [] apic_timer_interrupt+0x89/0x90
[67205.029113] ==

Another ASAN trace:

[10856.599645] ==
[10856.614109] BUG: KASAN: use-after-free in xfs_destroy_ioend+0x3b5/0x4c0 at addr 88006be5db90
[10856.631696] Read of size 8 by task kworker/13:1/314
[10856.641464] =
[10856.657836] BUG buffer_head (Tainted: GB ): kasan: bad access detected
[10856.673158] -
[10856.673158]
[10856.692477] INFO: Allocated in 0x88006be5c378 age=18445973393378446689 cpu=2191548517 pid=-1
[10856.710062] alloc_buffer_head+0x22/0xd0
[10856.717928] ___slab_alloc+0x4e0/0x520
[BUG] Slab corruption during XFS writeback under memory pressure
Hello all,

I've found a nasty source of slab corruption. Based on seeing similar symptoms on boxes at Facebook, I suspect it's been around since at least 3.10.

It only reproduces under memory pressure so far as I can tell: the issue seems to be that XFS reclaims pages from buffers that are still in use by scsi/block. I'm not sure which side the bug lies on, but I've only observed it with XFS.

[67203.776421] ==
[67203.792521] BUG: KASAN: use-after-free in xfs_destroy_ioend+0x3bf/0x4c0 at addr 8804cf466288
[67203.812036] Read of size 8 by task python2.7/22913
[67203.822713] =
[67203.840917] BUG buffer_head (Not tainted): kasan: bad access detected
[67203.855253] -
[67203.855253]
[67203.876727] Disabling lock debugging due to kernel taint
[67203.888575] INFO: Allocated in 0x8804cf465d40 age=18437180719206552994 cpu=2191548261 pid=-1
[67203.908139] alloc_buffer_head+0x22/0xd0
[67203.916903] ___slab_alloc+0x4e0/0x520
[67203.925286] __slab_alloc+0x43/0x70
[67203.933087] kmem_cache_alloc+0x228/0x2c0
[67203.942042] alloc_buffer_head+0x22/0xd0
[67203.950782] alloc_page_buffers+0xa9/0x1f0
[67203.959936] create_empty_buffers+0x30/0x420
[67203.969495] create_page_buffers+0x120/0x1b0
[67203.979029] __block_write_begin+0x16b/0x1010
[67203.988756] xfs_vm_write_begin+0x55/0x1b0
[67203.997884] generic_perform_write+0x288/0x510
[67204.007771] xfs_file_buffered_aio_write+0x316/0x780
[67204.018811] xfs_file_write_iter+0x26f/0x6c0
[67204.028313] __vfs_write+0x2a0/0x620
[67204.036276] vfs_write+0x159/0x4c0
[67204.043855] SyS_write+0xd2/0x1b0
[67204.051245] INFO: Freed in 0x103fc80ec age=18446651500051355200 cpu=2165122683 pid=-1
[67204.068634] free_buffer_head+0x41/0x90
[67204.077175] __slab_free+0x1ed/0x340
[67204.085138] kmem_cache_free+0x270/0x300
[67204.093867] free_buffer_head+0x41/0x90
[67204.102422] try_to_free_buffers+0x171/0x240
[67204.111925] xfs_vm_releasepage+0xcb/0x3b0
[67204.121101] try_to_release_page+0x106/0x190
[67204.130602] shrink_page_list+0x118e/0x1a10
[67204.139910] shrink_inactive_list+0x42c/0xdf0
[67204.149600] shrink_zone_memcg+0xa09/0xfa0
[67204.158715] shrink_zone+0x2c3/0xbc0
[67204.166679] do_try_to_free_pages+0x42a/0x12f0
[67204.176562] try_to_free_pages+0x1a3/0x5d0
[67204.185709] __alloc_pages_nodemask+0xbeb/0x20d0
[67204.195979] alloc_pages_vma+0x11b/0x5e0
[67204.204709] handle_mm_fault+0x2c27/0x47d0
[67204.213823] INFO: Slab 0xea00133d1900 objects=37 used=14 fp=0x8804cf464530 flags=0x20004080
[67204.235439] INFO: Object 0x8804cf466260 @offset=8800 fp=0x
[67204.235439]
[67204.455817] CPU: 1 PID: 22913 Comm: python2.7 Tainted: GB 4.7.0-rc7-calvinowens-1468357363-1-gcaa3dc6 #1
[67204.480313] Hardware name: Wiwynn HoneyBadger/PantherPlus, BIOS HBM6.71 02/03/2016
[67204.497509] 88075e99f480 88075ec87a30 81e8b8e4 8804cf464000
[67204.514224] 8804cf466260 88075ec87a60 8153a995 88075e99f480
[67204.530924] ea00133d1900 8804cf466260 dc00 88075ec87a88
[67204.547624] Call Trace:
[67204.553086] [] dump_stack+0x68/0x94
[67204.565946] [] print_trailer+0x115/0x1a0
[67204.578334] [] object_err+0x34/0x40
[67204.589762] [] kasan_report_error+0x217/0x530
[67204.616847] [] __asan_report_load8_noabort+0x43/0x50
[67204.645085] [] xfs_destroy_ioend+0x3bf/0x4c0
[67204.658243] [] xfs_end_bio+0x154/0x220
[67204.685362] [] bio_endio+0x158/0x1b0
[67204.696983] [] blk_update_request+0x18b/0xb80
[67204.710334] [] scsi_end_request+0x97/0x5a0
[67204.723108] [] scsi_io_completion+0x438/0x1690
[67204.807293] [] scsi_finish_command+0x375/0x4e0
[67204.820838] [] scsi_softirq_done+0x280/0x340
[67204.848884] [] blk_done_softirq+0x1ff/0x360
[67204.875074] [] __do_softirq+0x22d/0x8d7
[67204.887270] [] irq_exit+0x15c/0x190
[67204.898697] [] smp_apic_timer_interrupt+0x83/0xa0
[67204.912815] [] apic_timer_interrupt+0x89/0x90
[67205.029113] ==

Another ASAN trace:

[10856.599645] ==
[10856.614109] BUG: KASAN: use-after-free in xfs_destroy_ioend+0x3b5/0x4c0 at addr 88006be5db90
[10856.631696] Read of size 8 by task kworker/13:1/314
[10856.641464] =
[10856.657836] BUG buffer_head (Tainted: GB ): kasan: bad access detected
[10856.673158] -
[10856.673158]
[10856.692477] INFO: Allocated in 0x88006be5c378 age=18445973393378446689 cpu=2191548517 pid=-1
[10856.710062] alloc_buffer_head+0x22/0xd0
[10856.717928] ___slab_alloc+0x4e0/0x520
Re: slab-out-of-bounds in rpc/nfs
On Friday 06/17 at 09:38 -0400, Benjamin Coddington wrote:
> On 16 Jun 2016, at 13:52, Calvin Owens wrote:
>
> > On Tuesday 03/08 at 11:37 +0100, Dmitry Vyukov wrote:
> > > On Tue, Mar 8, 2016 at 11:27 AM, Benjamin Coddington
> > > <bcodd...@redhat.com> wrote:
> > > > Adding linux-...@vger.kernel.org ..
> > > >
> > > > On Mon, 7 Mar 2016, Alexei Starovoitov wrote:
> > > >
> > > > > seeing on ton of these errors on net-next with kasan on.
> > > > > Likely old bug though.
> > > > >
> > > > > [ 373.705691] BUG: KASAN: slab-out-of-bounds in memcpy+0x28/0x40 at addr 8811ada62cb0
> > > > > [ 373.707137] Write of size 28 by task bash/7059
> > > > > [ 373.708177] =
> > > > > [ 373.709711] BUG kmalloc-4096 (Tainted: GW ): kasan: bad access detected
> > > > > [ 373.711185] -
> > > > > [ 373.711185]
> > > > > [ 373.721461] INFO: Allocated in rpc_malloc+0x58/0xd0 age=21 cpu=5 pid=7059
> > > > > [ 373.727158] ___slab_alloc+0x4e2/0x500
> > > > > [ 373.728469] __slab_alloc+0x43/0x70
> > > > > [ 373.729222] __kmalloc+0x286/0x350
> > > > > [ 373.729978] rpc_malloc+0x58/0xd0
> > > > > [ 373.730590] call_allocate+0x333/0x690
> > > > > [ 373.731428] __rpc_execute+0x187/0xad0
> > > > > [ 373.734395] rpc_execute+0xe1/0x2c0
> > > > > [ 373.735020] rpc_run_task+0x1ce/0x250
> > > > > [ 373.735706] rpc_call_sync+0x93/0x150
> > > > > [ 373.736387] nfs3_rpc_wrapper.constprop.12+0x9b/0x240
> > > > > [ 373.742818] nfs3_proc_readdir+0x230/0x390
> > > > > [ 373.750157] nfs_readdir_xdr_to_array+0x501/0x9b0
> > > > > [ 373.753520] nfs_readdir_filler+0x68/0x160
> > > > > [ 373.758455] do_read_cache_page+0x8c/0x3c0
> > > > > [ 373.761745] read_cache_page+0x46/0x70
> > > > > [ 373.763269] nfs_readdir+0x420/0x1380
> > > > > [ 373.764078] INFO: Freed in rpc_free+0x41/0x70 age=64 cpu=5 pid=7059
> > > > > [ 373.765335] __slab_free+0x175/0x280
> > > > > [ 373.766106] kfree+0x25c/0x2a0
> > > > > [ 373.766809] rpc_free+0x41/0x70
> > > > > [ 373.767629] xprt_release+0x2c5/0x8f0
> > > > > [ 373.768430] rpc_release_resources_task+0x14/0x80
> > > > > [ 373.769403] __rpc_execute+0x547/0xad0
> > > > > [ 373.770249] rpc_execute+0xe1/0x2c0
> > > > > [ 373.770995] rpc_run_task+0x1ce/0x250
> > > > > [ 373.771786] rpc_call_sync+0x93/0x150
> > > > > [ 373.772672] nfs3_rpc_wrapper.constprop.12+0x9b/0x240
> > > > > [ 373.773704] nfs3_proc_access+0x1f1/0x330
> > > > > [ 373.774544] nfs_do_access+0x94f/0x12d0
> > > > > [ 373.775572] nfs_permission+0x469/0x580
> > > > > [ 373.776465] __inode_permission+0x151/0x230
> > > > > [ 373.780764] inode_permission+0x21/0xf0
> > > > > [ 373.791392] may_open+0x14b/0x260
> > >
> > > The report misses the most interesting part -- the out-of-bounds
> > > access stack. It should be at the bottom of the report. If you still
> > > have the full report, please post it.
> >
> > I'm triggering this as well on 4.7-rc3. I can reproduce it as far back as 4.0,
> > can't easily test any further back because that's when KASAN was merged.
> >
> > Logs and Kconfig follow. I can trigger this 100% of the time.
>
> Hi Calvin, how are you triggering this? I would guess this is getdents or a
> readdir that's been signaled before the server replies..

Unfortunately my current repro is "boot a specific server type at Facebook", I'll drill down and see if I can get a minimal repro to send along.

Thanks,
Calvin
Re: [PATCH] ses: Fix racy cleanup of /sys in remove_dev()
On Thursday 06/02 at 15:50 -0700, Calvin Owens wrote:
> On 05/13/2016 01:28 PM, Calvin Owens wrote:
> > Currently we free the resources backing the enclosure device before we
> > call device_unregister(). This is racy: during rmmod of low-level SCSI
> > drivers that hook into enclosure, we end up with a small window of time
> > during which writing to /sys can OOPS. Example trace with mpt3sas:
>
> Ping?

Any thoughts? Squinting at this more, it still seems racy, but a narrow race is surely better than just blatantly freeing everything while the file is still exposed in /sys?

Is there a better way you'd prefer I accomplish this? (I have boxes that OOPS all the time from monitoring code reading the /sys files; with this patch I haven't seen a single one.)

Thanks,
Calvin

> >general protection fault: [#1] SMP KASAN
> >Modules linked in: mpt3sas(-) <...>
> >RIP: [] ses_get_page2_descriptor.isra.6+0x38/0x220 [ses]
> >Call Trace:
> > [] ses_set_fault+0xf4/0x400 [ses]
> > [] set_component_fault+0xa9/0xf0 [enclosure]
> > [] dev_attr_store+0x3c/0x70
> > [] sysfs_kf_write+0x115/0x180
> > [] kernfs_fop_write+0x275/0x3a0
> > [] __vfs_write+0xe0/0x3e0
> > [] vfs_write+0x13f/0x4a0
> > [] SyS_write+0x111/0x230
> > [] entry_SYSCALL_64_fastpath+0x13/0x94
> >
> > Fortunately the solution is extremely simple: call device_unregister()
> > before we free the resources, and the race no longer exists. The driver
> > core holds a reference over ->remove_dev(), so AFAICT this is safe.
> >
> > Signed-off-by: Calvin Owens <calvinow...@fb.com>
> > ---
> >  drivers/scsi/ses.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/scsi/ses.c b/drivers/scsi/ses.c
> > index 53ef1cb..0e8601a 100644
> > --- a/drivers/scsi/ses.c
> > +++ b/drivers/scsi/ses.c
> > @@ -778,6 +778,8 @@ static void ses_intf_remove_enclosure(struct scsi_device *sdev)
> >  	if (!edev)
> >  		return;
> >
> > +	enclosure_unregister(edev);
> > +
> >  	ses_dev = edev->scratch;
> >  	edev->scratch = NULL;
> >
> > @@ -789,7 +791,6 @@ static void ses_intf_remove_enclosure(struct scsi_device *sdev)
> >  	kfree(edev->component[0].scratch);
> >
> >  	put_device(&edev->edev);
> > -	enclosure_unregister(edev);
> >  }
> >
> >  static void ses_intf_remove(struct device *cdev,
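The ordering argument in the patch can be modelled in userspace. What follows is a toy sketch, not kernel code: struct toy_device, toy_register(), toy_store() and the two teardown helpers are hypothetical names made up for illustration, and the single-threaded "racing write" merely stands in for a /sys write that lands in the middle of teardown. It only shows why dropping visibility first closes the window described above; it does not model a handler that is already running when the unregister starts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Outcome of a simulated write to the device's attribute. */
enum write_result {
	WRITE_OK = 0,
	WRITE_REJECTED = -1,   /* attribute no longer reachable */
	WRITE_HIT_FREED = -2,  /* handler ran after the backing data was freed */
};

/* Hypothetical stand-in for the state hung off edev->scratch. */
struct toy_backing {
	int page2[4];
};

/* Hypothetical stand-in for a device exposed through /sys. */
struct toy_device {
	bool registered;             /* true while the attribute is visible */
	struct toy_backing *scratch; /* data the attribute handler dereferences */
};

void toy_register(struct toy_device *dev)
{
	dev->scratch = calloc(1, sizeof(*dev->scratch));
	assert(dev->scratch != NULL);
	dev->registered = true;
}

/* Models a store handler: reachable only while the device is registered. */
enum write_result toy_store(struct toy_device *dev, int value)
{
	if (!dev->registered)
		return WRITE_REJECTED;
	if (dev->scratch == NULL)
		return WRITE_HIT_FREED; /* in the kernel this is where the OOPS would be */
	dev->scratch->page2[0] = value;
	return WRITE_OK;
}

/* Old ordering: free the backing data, then unregister. */
enum write_result teardown_free_first(struct toy_device *dev)
{
	enum write_result racing_write;

	free(dev->scratch);
	dev->scratch = NULL;
	racing_write = toy_store(dev, 1); /* write lands inside the window */
	dev->registered = false;
	return racing_write;
}

/* Patched ordering: unregister, then free the backing data. */
enum write_result teardown_unregister_first(struct toy_device *dev)
{
	enum write_result racing_write;

	dev->registered = false;
	racing_write = toy_store(dev, 1); /* same write, now rejected up front */
	free(dev->scratch);
	dev->scratch = NULL;
	return racing_write;
}
```

In this model a write arriving mid-teardown reaches freed state only with the free-first ordering; with the unregister-first ordering it is turned away before the handler touches anything.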
Re: [PATCH] ses: Fix racy cleanup of /sys in remove_dev()
On 05/13/2016 01:28 PM, Calvin Owens wrote:
> Currently we free the resources backing the enclosure device before we
> call device_unregister(). This is racy: during rmmod of low-level SCSI
> drivers that hook into enclosure, we end up with a small window of time
> during which writing to /sys can OOPS. Example trace with mpt3sas:

Ping?

> general protection fault: [#1] SMP KASAN
> Modules linked in: mpt3sas(-) <...>
> RIP: [] ses_get_page2_descriptor.isra.6+0x38/0x220 [ses]
> Call Trace:
>  [] ses_set_fault+0xf4/0x400 [ses]
>  [] set_component_fault+0xa9/0xf0 [enclosure]
>  [] dev_attr_store+0x3c/0x70
>  [] sysfs_kf_write+0x115/0x180
>  [] kernfs_fop_write+0x275/0x3a0
>  [] __vfs_write+0xe0/0x3e0
>  [] vfs_write+0x13f/0x4a0
>  [] SyS_write+0x111/0x230
>  [] entry_SYSCALL_64_fastpath+0x13/0x94
>
> Fortunately the solution is extremely simple: call device_unregister()
> before we free the resources, and the race no longer exists. The driver
> core holds a reference over ->remove_dev(), so AFAICT this is safe.
>
> Signed-off-by: Calvin Owens <calvinow...@fb.com>
> ---
>  drivers/scsi/ses.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/scsi/ses.c b/drivers/scsi/ses.c
> index 53ef1cb..0e8601a 100644
> --- a/drivers/scsi/ses.c
> +++ b/drivers/scsi/ses.c
> @@ -778,6 +778,8 @@ static void ses_intf_remove_enclosure(struct scsi_device *sdev)
>  	if (!edev)
>  		return;
>
> +	enclosure_unregister(edev);
> +
>  	ses_dev = edev->scratch;
>  	edev->scratch = NULL;
>
> @@ -789,7 +791,6 @@ static void ses_intf_remove_enclosure(struct scsi_device *sdev)
>  	kfree(edev->component[0].scratch);
>
>  	put_device(&edev->edev);
> -	enclosure_unregister(edev);
>  }
>
>  static void ses_intf_remove(struct device *cdev,