Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-26 Thread Calvin Owens
On Wednesday 03/27 at 00:24 +0900, Masami Hiramatsu wrote:
> On Tue, 26 Mar 2024 14:46:10 +
> Mark Rutland  wrote:
> 
> > Hi Masami,
> > 
> > On Mon, Mar 25, 2024 at 11:56:32AM +0900, Masami Hiramatsu wrote:
> > > Hi Jarkko,
> > > 
> > > On Sun, 24 Mar 2024 01:29:08 +0200
> > > Jarkko Sakkinen  wrote:
> > > 
> > > > Tracing with kprobes while running a monolithic kernel is currently
> > > > impossible due to the kernel module allocator dependency.
> > > > 
> > > > Address the issue by allowing architectures to implement module_alloc()
> > > > and module_memfree() independent of the module subsystem. An arch tree
> > > > can signal this by setting HAVE_KPROBES_ALLOC in its Kconfig file.
> > > > 
> > > > Realize the feature on RISC-V by separating allocator to module_alloc.c
> > > > and implementing module_memfree().
> > > 
> > > Even so, this involves changes to the arch-independent part, so it should
> > > be solved in a generic way. Did you check Calvin's thread?
> > > 
> > > https://lore.kernel.org/all/cover.1709676663.git.jcalvinow...@gmail.com/
> > > 
> > > I think we'd better introduce `alloc_execmem()`,
> > > CONFIG_HAVE_ALLOC_EXECMEM and CONFIG_ALLOC_EXECMEM first:
> > > 
> > >   config HAVE_ALLOC_EXECMEM
> > >           bool
> > > 
> > >   config ALLOC_EXECMEM
> > >           bool "Executable trampoline memory allocation"
> > >           depends on MODULES || HAVE_ALLOC_EXECMEM
> > > 
> > > And define a fallback macro to module_alloc() like this:
> > > 
> > > #ifndef CONFIG_HAVE_ALLOC_EXECMEM
> > > #define alloc_execmem(size, gfp)  module_alloc(size)
> > > #endif
> > 
> > Please can we *not* do this? I think this is abstracting at the wrong level (as
> > I mentioned on the prior execmem proposals).
> > 
> > Different executable allocations can have different requirements. For example,
> > on arm64 modules need to be within 2G of the kernel image, but the kprobes XOL
> > areas can be anywhere in the kernel VA space.
> > 
> > Forcing those behind the same interface makes things *harder* for architectures
> > and/or makes the common code more complicated (if that ends up having to track
> > all those different requirements). From my PoV it'd be much better to have
> > separate kprobes_alloc_*() functions for kprobes which an architecture can then
> > choose to implement using a common library if it wants to.
> > 
> > I took a look at doing that using the core ifdeffery fixups from Jarkko's v6,
> > and it looks pretty clean to me (and works in testing on arm64):
> > 
> >   https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kprobes/without-modules
> > 
> > Could we please start with that approach, with kprobe-specific alloc/free code
> > provided by the architecture?

Heh, I also noticed that dead !RWX branch in arm64 patch_map(), I was
about to send a patch to remove it.

> OK, as far as I can read the code, this method also works and is neat
> (and minimally intrusive). I actually found that exposing CONFIG_ALLOC_EXECMEM
> to the user does not help; it should be an internal change, so hiding this change
> from the user is the better choice. Then there is no reason to introduce the new
> alloc_execmem; just expanding kprobe_alloc_insn_page() is reasonable.

I'm happy with this, it solves the first half of my problem. But I want
eBPF to work in the !MODULES case too.

I think Mark's approach can work for bpf as well, without needing to
touch module_alloc() at all? So I might be able to drop that first patch
entirely.

https://lore.kernel.org/all/a6b162aed1e6fea7f565ef9dd0204d6f2284bcce.1709676663.git.jcalvinow...@gmail.com/

Thanks,
Calvin

> Mark, can you send this series here, so that others can review/test it?
> 
> Thank you!
> 
> 
> > 
> > Thanks,
> > Mark.
> 
> 
> -- 
> Masami Hiramatsu (Google) 
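
The point made in the quoted discussion, that different kinds of executable
allocation can carry different placement constraints, can be sketched in plain
userspace C. This is only an illustration: struct exec_policy, placement_ok()
and the example addresses are hypothetical and are not kernel interfaces, and
the 2G figure is simply taken from the message above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SZ_2G (UINT64_C(1) << 31)

/* Hypothetical per-user placement policy; not a kernel structure. */
struct exec_policy {
	bool limited;		/* true if the address must stay near 'anchor' */
	uint64_t anchor;	/* e.g. where the kernel image lives */
	uint64_t max_dist;	/* allowed distance from 'anchor', in bytes */
};

/* Return true if 'addr' satisfies the placement constraint of 'p'. */
static bool placement_ok(const struct exec_policy *p, uint64_t addr)
{
	uint64_t dist;

	assert(p != NULL);

	if (!p->limited)
		return true;	/* e.g. kprobes XOL slots: anywhere is fine */

	dist = addr > p->anchor ? addr - p->anchor : p->anchor - addr;
	return dist < p->max_dist;
}
```

With a "modules" policy limited to SZ_2G around the anchor, an address 6G away
is rejected, while an unlimited "kprobes" policy accepts the same address. A
single shared allocator would have to carry this kind of per-caller policy,
which is the extra complexity the quoted message argues against.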



Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-25 Thread Calvin Owens
On Monday 03/25 at 11:56 +0900, Masami Hiramatsu wrote:
> Hi Jarkko,
> 
> On Sun, 24 Mar 2024 01:29:08 +0200
> Jarkko Sakkinen  wrote:
> 
> > Tracing with kprobes while running a monolithic kernel is currently
> > impossible due to the kernel module allocator dependency.
> > 
> > Address the issue by allowing architectures to implement module_alloc()
> > and module_memfree() independent of the module subsystem. An arch tree
> > can signal this by setting HAVE_KPROBES_ALLOC in its Kconfig file.
> > 
> > Realize the feature on RISC-V by separating allocator to module_alloc.c
> > and implementing module_memfree().
> 
> Even so, this involves changes to the arch-independent part, so it should
> be solved in a generic way. Did you check Calvin's thread?
> 
> https://lore.kernel.org/all/cover.1709676663.git.jcalvinow...@gmail.com/

FYI, I should have v2 of that series out later this week.

Thanks,
Calvin

> I think we'd better introduce `alloc_execmem()`,
> CONFIG_HAVE_ALLOC_EXECMEM and CONFIG_ALLOC_EXECMEM first:
> 
>   config HAVE_ALLOC_EXECMEM
>           bool
> 
>   config ALLOC_EXECMEM
>           bool "Executable trampoline memory allocation"
>           depends on MODULES || HAVE_ALLOC_EXECMEM
> 
> And define a fallback macro to module_alloc() like this:
> 
> #ifndef CONFIG_HAVE_ALLOC_EXECMEM
> #define alloc_execmem(size, gfp)  module_alloc(size)
> #endif
> 
> Then, introduce a new dependency for kprobes:
> 
>   config KPROBES
>           bool "Kprobes"
>           select ALLOC_EXECMEM
> 
> and update kprobes to use alloc_execmem and remove the module-related
> code from it.
> 
> You should also consider using IS_ENABLED(CONFIG_MODULES) in the code to
> avoid using #ifdefs.
> 
> Finally, you can add the RISC-V implementation of HAVE_ALLOC_EXECMEM in the
> next patch.
> 
> Thank you,
> 
> 
> > 
> > Link: https://www.sochub.fi # for power on testing new SoC's with a minimal stack
> > Link: https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ # continuation
> > Signed-off-by: Jarkko Sakkinen 
> > ---
> > v2:
> > - Better late than never right? :-)
> > - Focus only on RISC-V for now to make the patch more digestible. This
> >   is the arch where I use the patch on a daily basis to help with QA.
> > - Introduce HAVE_KPROBES_ALLOC flag to help with more gradual migration.
> > ---
> >  arch/Kconfig |  8 +++-
> >  arch/riscv/Kconfig   |  1 +
> >  arch/riscv/kernel/Makefile   |  5 +
> >  arch/riscv/kernel/module.c   | 11 ---
> >  arch/riscv/kernel/module_alloc.c | 28 
> >  kernel/kprobes.c | 10 ++
> >  kernel/trace/trace_kprobe.c  | 18 --
> >  7 files changed, 67 insertions(+), 14 deletions(-)
> >  create mode 100644 arch/riscv/kernel/module_alloc.c
> > 
> > diff --git a/arch/Kconfig b/arch/Kconfig
> > index a5af0edd3eb8..c931f1de98a7 100644
> > --- a/arch/Kconfig
> > +++ b/arch/Kconfig
> > @@ -52,7 +52,7 @@ config GENERIC_ENTRY
> >  
> >  config KPROBES
> > bool "Kprobes"
> > -   depends on MODULES
> > +   depends on MODULES || HAVE_KPROBES_ALLOC
> > depends on HAVE_KPROBES
> > select KALLSYMS
> > select TASKS_RCU if PREEMPTION
> > @@ -215,6 +215,12 @@ config HAVE_OPTPROBES
> >  config HAVE_KPROBES_ON_FTRACE
> > bool
> >  
> > +config HAVE_KPROBES_ALLOC
> > +   bool
> > +   help
> > + Architectures that select this option are capable of allocating memory
> > + for kprobes without the kernel module allocator.
> > +
> >  config ARCH_CORRECT_STACKTRACE_ON_KRETPROBE
> > bool
> > help
> > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> > index e3142ce531a0..4f1b925e83d8 100644
> > --- a/arch/riscv/Kconfig
> > +++ b/arch/riscv/Kconfig
> > @@ -132,6 +132,7 @@ config RISCV
> > select HAVE_KPROBES if !XIP_KERNEL
> > select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL
> > select HAVE_KRETPROBES if !XIP_KERNEL
> > +   select HAVE_KPROBES_ALLOC if !XIP_KERNEL
> > # https://github.com/ClangBuiltLinux/linux/issues/1881
> > select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD
> > select HAVE_MOVE_PMD
> > diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> > index 604d6bf7e476..46318194bce1 100644
> > --- a/arch/riscv/kernel/Makefile
> > +++ b/arch/riscv/kernel/Makefile
> > @@ -73,6 +73,11 @@ obj-$(CONFIG_SMP)+= cpu_ops.o
> >  
> >  obj-$(CONFIG_RISCV_BOOT_SPINWAIT) += cpu_ops_spinwait.o
> >  obj-$(CONFIG_MODULES)  += module.o
> > +ifeq ($(CONFIG_MODULES),y)
> > +obj-y  += module_alloc.o
> > +else
> > +obj-$(CONFIG_KPROBES)  += module_alloc.o
> > +endif
> >  obj-$(CONFIG_MODULE_SECTIONS)  += module-sections.o
> >  
> >  obj-$(CONFIG_CPU_PM)   += suspend_entry.o suspend.o
> > diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
> > index 5e5a82644451..cc324b450f2e 100644
> > --- a/arch/riscv/kernel/module.c
> > 
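
The fallback proposed in the quoted text can be tried out in userspace. The
following is a minimal sketch under stated assumptions: module_alloc() and
module_memfree() are replaced by malloc()/free() stand-ins, gfp_t and
GFP_KERNEL are placeholders, and free_execmem() and insn_page_roundtrip() are
hypothetical names that do not appear in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef unsigned int gfp_t;	/* userspace stand-in for the kernel type */
#define GFP_KERNEL 0u		/* placeholder value, only so this compiles */

/* Userspace stand-ins; the real functions are provided by the kernel. */
static void *module_alloc(unsigned long size) { return malloc(size); }
static void module_memfree(void *region) { free(region); }

/*
 * An architecture with its own allocator would define
 * CONFIG_HAVE_ALLOC_EXECMEM and supply alloc_execmem() itself.
 * Everyone else falls back to the module allocator.
 */
#ifndef CONFIG_HAVE_ALLOC_EXECMEM
#define alloc_execmem(size, gfp)	module_alloc(size)	/* gfp is dropped */
#define free_execmem(region)		module_memfree(region)	/* hypothetical */
#endif

/* Caller code names only the execmem interface, never module_alloc(). */
static int insn_page_roundtrip(unsigned long size)
{
	void *page;

	assert(size > 0);

	page = alloc_execmem(size, GFP_KERNEL);
	if (!page)
		return -1;
	free_execmem(page);
	return 0;
}
```

Note that the fallback macro silently discards the gfp argument, so callers
cannot rely on it having any effect unless the architecture provides its own
implementation.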

Re: [RFC][PATCH 2/4] bpf: Allow BPF_JIT with CONFIG_MODULES=n

2024-03-08 Thread Calvin Owens
On Thursday 03/07 at 22:09 +, Christophe Leroy wrote:
> 
> 
> On 06/03/2024 at 21:05, Calvin Owens wrote:
> > 
> > No BPF code has to change, except in struct_ops (for module refs).
> > 
> > This conflicts with bpf-next because of this (relevant) series:
> > 
> >  
> > https://lore.kernel.org/all/20240119225005.668602-1-thinker...@gmail.com/
> > 
> > If something like this is merged down the road, it can go through
> > bpf-next at leisure once the module_alloc change is in: it's a one-way
> > dependency.
> > 
> > Signed-off-by: Calvin Owens 
> > ---
> >   kernel/bpf/Kconfig  |  2 +-
> >   kernel/bpf/bpf_struct_ops.c | 28 
> >   2 files changed, 25 insertions(+), 5 deletions(-)
> > 
> > diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig
> > index 6a906ff93006..77df483a8925 100644
> > --- a/kernel/bpf/Kconfig
> > +++ b/kernel/bpf/Kconfig
> > @@ -42,7 +42,7 @@ config BPF_JIT
> >  bool "Enable BPF Just In Time compiler"
> >  depends on BPF
> >  depends on HAVE_CBPF_JIT || HAVE_EBPF_JIT
> > -   depends on MODULES
> > +   select MODULE_ALLOC
> >  help
> >    BPF programs are normally handled by a BPF interpreter. This option
> >    allows the kernel to generate native code when a program is loaded
> > diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> > index 02068bd0e4d9..fbf08a1bb00c 100644
> > --- a/kernel/bpf/bpf_struct_ops.c
> > +++ b/kernel/bpf/bpf_struct_ops.c
> > @@ -108,11 +108,30 @@ const struct bpf_prog_ops bpf_struct_ops_prog_ops = {
> >   #endif
> >   };
> > 
> > +#if IS_ENABLED(CONFIG_MODULES)
> 
> Can you avoid ifdefs as much as possible ?

Similar to the other one, this was just a misguided attempt to avoid
triggering -Wunused, I'll clean it up.

This particular patch will look very different when rebased on bpf-next.

> >   static const struct btf_type *module_type;
> > 
> > +static int bpf_struct_module_type_init(struct btf *btf)
> > +{
> > +   s32 module_id;
> 
> Could be:
> 
>   if (!IS_ENABLED(CONFIG_MODULES))
>   return 0;
> 
> > +
> > +   module_id = btf_find_by_name_kind(btf, "module", BTF_KIND_STRUCT);
> > +   if (module_id < 0)
> > +   return 1;
> > +
> > +   module_type = btf_type_by_id(btf, module_id);
> > +   return 0;
> > +}
> > +#else
> > +static int bpf_struct_module_type_init(struct btf *btf)
> > +{
> > +   return 0;
> > +}
> > +#endif
> > +
> >   void bpf_struct_ops_init(struct btf *btf, struct bpf_verifier_log *log)
> >   {
> > -   s32 type_id, value_id, module_id;
> > +   s32 type_id, value_id;
> >  const struct btf_member *member;
> >  struct bpf_struct_ops *st_ops;
> >  const struct btf_type *t;
> > @@ -125,12 +144,10 @@ void bpf_struct_ops_init(struct btf *btf, struct bpf_verifier_log *log)
> >   #include "bpf_struct_ops_types.h"
> >   #undef BPF_STRUCT_OPS_TYPE
> > 
> > -   module_id = btf_find_by_name_kind(btf, "module", BTF_KIND_STRUCT);
> > -   if (module_id < 0) {
> > +   if (bpf_struct_module_type_init(btf)) {
> >  pr_warn("Cannot find struct module in btf_vmlinux\n");
> >  return;
> >  }
> > -   module_type = btf_type_by_id(btf, module_id);
> > 
> >  for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
> >  st_ops = bpf_struct_ops[i];
> > @@ -433,12 +450,15 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
> > 
> >  moff = __btf_member_bit_offset(t, member) / 8;
> >  ptype = btf_type_resolve_ptr(btf_vmlinux, member->type, NULL);
> > +
> > +#if IS_ENABLED(CONFIG_MODULES)
> 
> Can't see anything depending on CONFIG_MODULES here, can you instead do:
> 
>   if (IS_ENABLED(CONFIG_MODULES) && ptype == module_type) {
> 
> >  if (ptype == module_type) {
> >  if (*(void **)(udata + moff))
> >  goto reset_unlock;
> >  *(void **)(kdata + moff) = BPF_MODULE_OWNER;
> >  continue;
> >  }
> > +#endif
> > 
> >  err = st_ops->init_member(t, member, kdata, udata);
> >  if (err < 0)
> > --
> > 2.43.0
> > 
> > 
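
The review suggestion above, an early return guarded by IS_ENABLED() instead
of an #if/#else pair, can be illustrated outside the kernel. The macro below is
a simplified re-creation that only handles bool options (the kernel's version
in include/linux/kconfig.h also covers =m); find_module_type_id(),
module_type_init() and lookup_calls are hypothetical names for this sketch.

```c
#include <assert.h>

/* Simplified IS_ENABLED(): 1 if the option is defined to 1, else 0. */
#define ARG_PLACEHOLDER_1 0,
#define TAKE_SECOND_ARG(ignored, val, ...) val
#define IS_DEFINED_3(arg1_or_junk) TAKE_SECOND_ARG(arg1_or_junk 1, 0, 0)
#define IS_DEFINED_2(val) IS_DEFINED_3(ARG_PLACEHOLDER_##val)
#define IS_ENABLED(option) IS_DEFINED_2(option)

#define CONFIG_BPF_JIT 1	/* pretend this option is enabled */
/* CONFIG_MODULES is deliberately left undefined: pretend MODULES=n. */

/* The macro yields an integer constant, so it works at compile time too. */
static_assert(IS_ENABLED(CONFIG_BPF_JIT) == 1, "defined option reads as 1");
static_assert(IS_ENABLED(CONFIG_MODULES) == 0, "undefined option reads as 0");

static int lookup_calls;	/* counts how often the lookup really ran */

/* Hypothetical stand-in for looking up "struct module" in BTF. */
static int find_module_type_id(void)
{
	lookup_calls++;
	return 42;
}

/* One definition for both configurations; no #if/#else pair is needed. */
static int module_type_init(void)
{
	if (!IS_ENABLED(CONFIG_MODULES))
		return 0;	/* the rest is still compiled and type-checked */

	return find_module_type_id() < 0;
}
```

Because the condition is a compile-time constant, the compiler can discard the
unreachable lookup, yet the code stays visible to the compiler in every
configuration, which is what avoids the -Wunused workarounds mentioned in the
reply.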



Re: [RFC][PATCH 3/4] kprobes: Allow kprobes with CONFIG_MODULES=n

2024-03-08 Thread Calvin Owens
On Thursday 03/07 at 22:16 +, Christophe Leroy wrote:
> 
> 
> On 06/03/2024 at 21:05, Calvin Owens wrote:
> > 
> > If something like this is merged down the road, it can go in at leisure
> > once the module_alloc change is in: it's a one-way dependency.
> 
> Too many #ifdefs; please reorganise things to avoid that, and avoid
> changing prototypes based on CONFIG_MODULES.
> 
> A few other comments below.

TBH the ugliness here was just me trying not to trigger -Wunused, but
that was silly: as you point out below, it's unnecessary. I'll clean it
up.

> > 
> > Signed-off-by: Calvin Owens 
> > ---
> >   arch/Kconfig|  2 +-
> >   kernel/kprobes.c| 22 ++
> >   kernel/trace/trace_kprobe.c | 11 +++
> >   3 files changed, 34 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/Kconfig b/arch/Kconfig
> > index cfc24ced16dd..e60ce984d095 100644
> > --- a/arch/Kconfig
> > +++ b/arch/Kconfig
> > @@ -52,8 +52,8 @@ config GENERIC_ENTRY
> > 
> >   config KPROBES
> >  bool "Kprobes"
> > -   depends on MODULES
> >  depends on HAVE_KPROBES
> > +   select MODULE_ALLOC
> >  select KALLSYMS
> >  select TASKS_RCU if PREEMPTION
> >  help
> > diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> > index 9d9095e81792..194270e17d57 100644
> > --- a/kernel/kprobes.c
> > +++ b/kernel/kprobes.c
> > @@ -1556,8 +1556,12 @@ static bool is_cfi_preamble_symbol(unsigned long addr)
> >  str_has_prefix("__pfx_", symbuf);
> >   }
> > 
> > +#if IS_ENABLED(CONFIG_MODULES)
> >   static int check_kprobe_address_safe(struct kprobe *p,
> >   struct module **probed_mod)
> > +#else
> > +static int check_kprobe_address_safe(struct kprobe *p)
> > +#endif
> 
> It's a bit ugly to have to change the prototype; why not just keep probed_mod
> at all times?
> 
> When CONFIG_MODULES is not selected, __module_text_address() returns
> NULL, so it should work without that many #ifdefs.
> 
> >   {
> >  int ret;
> > 
> > @@ -1580,6 +1584,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
> >  goto out;
> >  }
> > 
> > +#if IS_ENABLED(CONFIG_MODULES)
> >  /* Check if 'p' is probing a module. */
> >  *probed_mod = __module_text_address((unsigned long) p->addr);
> >  if (*probed_mod) {
> > @@ -1603,6 +1608,8 @@ static int check_kprobe_address_safe(struct kprobe *p,
> >  ret = -ENOENT;
> >  }
> >  }
> > +#endif
> > +
> >   out:
> >  preempt_enable();
> >  jump_label_unlock();
> > @@ -1614,7 +1621,9 @@ int register_kprobe(struct kprobe *p)
> >   {
> >  int ret;
> >  struct kprobe *old_p;
> > +#if IS_ENABLED(CONFIG_MODULES)
> >  struct module *probed_mod;
> > +#endif
> >  kprobe_opcode_t *addr;
> >  bool on_func_entry;
> > 
> > @@ -1633,7 +1642,11 @@ int register_kprobe(struct kprobe *p)
> >  p->nmissed = 0;
> >  INIT_LIST_HEAD(&p->list);
> > 
> > +#if IS_ENABLED(CONFIG_MODULES)
> >  ret = check_kprobe_address_safe(p, &probed_mod);
> > +#else
> > +   ret = check_kprobe_address_safe(p);
> > +#endif
> >  if (ret)
> >  return ret;
> > 
> > @@ -1676,8 +1689,10 @@ int register_kprobe(struct kprobe *p)
> >   out:
> >  mutex_unlock(&kprobe_mutex);
> > 
> > +#if IS_ENABLED(CONFIG_MODULES)
> >  if (probed_mod)
> >  module_put(probed_mod);
> > +#endif
> > 
> >  return ret;
> >   }
> > @@ -2482,6 +2497,7 @@ int kprobe_add_area_blacklist(unsigned long start, unsigned long end)
> >  return 0;
> >   }
> > 
> > +#if IS_ENABLED(CONFIG_MODULES)
> >   /* Remove all symbols in given area from kprobe blacklist */
> >   static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end)
> >   {
> > @@ -2499,6 +2515,7 @@ static void kprobe_remove_ksym_blacklist(unsigned long entry)
> >   {
> >  kprobe_remove_area_blacklist(entry, entry + 1);
> >   }
>
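
The review comment above, keeping one prototype and relying on the module
lookup returning NULL when module support is compiled out, can be sketched as
follows. This is a userspace illustration, not the kernel code: struct module
is reduced to a counter, module_text_address() is a stand-in for the !MODULES
stub, and check_address_safe() and register_probe() are hypothetical
simplifications of the functions in the diff.

```c
#include <assert.h>
#include <stddef.h>

struct module {
	int refcount;	/* simplified stand-in for module reference counting */
};

/*
 * Stand-in for the stub used when module support is compiled out:
 * no address can belong to a module, so the lookup always fails.
 */
static struct module *module_text_address(unsigned long addr)
{
	(void)addr;
	return NULL;
}

/* Same prototype in every configuration; *probed_mod is always written. */
static int check_address_safe(unsigned long addr, struct module **probed_mod)
{
	assert(probed_mod != NULL);

	*probed_mod = module_text_address(addr);
	if (*probed_mod)
		(*probed_mod)->refcount++;	/* pin the module while probing */
	return 0;
}

static int register_probe(unsigned long addr)
{
	struct module *probed_mod;
	int ret;

	ret = check_address_safe(addr, &probed_mod);
	if (ret)
		return ret;

	/* ... the probe would be armed here ... */

	if (probed_mod)		/* never true without modules; no #ifdef needed */
		probed_mod->refcount--;
	return 0;
}
```

Since the stub always returns NULL, the module-only branches are dead code in
the !MODULES case and the caller does not need a second prototype or a
conditionally declared local variable.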

Re: [RFC][PATCH 3/4] kprobes: Allow kprobes with CONFIG_MODULES=n

2024-03-08 Thread Calvin Owens
On Friday 03/08 at 11:46 +0900, Masami Hiramatsu wrote:
> On Wed,  6 Mar 2024 12:05:10 -0800
> Calvin Owens  wrote:
> 
> > If something like this is merged down the road, it can go in at leisure
> > once the module_alloc change is in: it's a one-way dependency.
> > 
> > Signed-off-by: Calvin Owens 
> > ---
> >  arch/Kconfig|  2 +-
> >  kernel/kprobes.c| 22 ++
> >  kernel/trace/trace_kprobe.c | 11 +++
> >  3 files changed, 34 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/Kconfig b/arch/Kconfig
> > index cfc24ced16dd..e60ce984d095 100644
> > --- a/arch/Kconfig
> > +++ b/arch/Kconfig
> > @@ -52,8 +52,8 @@ config GENERIC_ENTRY
> >  
> >  config KPROBES
> > bool "Kprobes"
> > -   depends on MODULES
> > depends on HAVE_KPROBES
> > +   select MODULE_ALLOC
> 
> OK, if we use EXEC_ALLOC,
> 
> config EXEC_ALLOC
>   depends on HAVE_EXEC_ALLOC
> 
> And 
> 
>   config KPROBES
>   bool "Kprobes"
>   depends on MODULES || EXEC_ALLOC
>   select EXEC_ALLOC if HAVE_EXEC_ALLOC
> 
> then kprobes can be enabled when either modules or exec_alloc is
> supported.
> (A new arch does not need to implement exec_alloc.)
> 
> Maybe we also need something like
> 
> #ifdef CONFIG_EXEC_ALLOC
> #define module_alloc(size) exec_alloc(size)
> #endif
> 
> in kprobes.h, or just add a patch replacing module_alloc with exec_alloc.
> 
> Thank you,

The example was helpful, thanks. I see what you mean with
HAVE_EXEC_ALLOC, I'll implement it like that in the next version.

> > select KALLSYMS
> > select TASKS_RCU if PREEMPTION
> > help
> > diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> > index 9d9095e81792..194270e17d57 100644
> > --- a/kernel/kprobes.c
> > +++ b/kernel/kprobes.c
> > @@ -1556,8 +1556,12 @@ static bool is_cfi_preamble_symbol(unsigned long addr)
> > str_has_prefix("__pfx_", symbuf);
> >  }
> >  
> > +#if IS_ENABLED(CONFIG_MODULES)
> >  static int check_kprobe_address_safe(struct kprobe *p,
> >  struct module **probed_mod)
> > +#else
> > +static int check_kprobe_address_safe(struct kprobe *p)
> > +#endif
> >  {
> > int ret;
> >  
> > @@ -1580,6 +1584,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
> > goto out;
> > }
> >  
> > +#if IS_ENABLED(CONFIG_MODULES)
> > /* Check if 'p' is probing a module. */
> > *probed_mod = __module_text_address((unsigned long) p->addr);
> > if (*probed_mod) {
> > @@ -1603,6 +1608,8 @@ static int check_kprobe_address_safe(struct kprobe *p,
> > ret = -ENOENT;
> > }
> > }
> > +#endif
> > +
> >  out:
> > preempt_enable();
> > jump_label_unlock();
> > @@ -1614,7 +1621,9 @@ int register_kprobe(struct kprobe *p)
> >  {
> > int ret;
> > struct kprobe *old_p;
> > +#if IS_ENABLED(CONFIG_MODULES)
> > struct module *probed_mod;
> > +#endif
> > kprobe_opcode_t *addr;
> > bool on_func_entry;
> >  
> > @@ -1633,7 +1642,11 @@ int register_kprobe(struct kprobe *p)
> > p->nmissed = 0;
> > INIT_LIST_HEAD(&p->list);
> >  
> > +#if IS_ENABLED(CONFIG_MODULES)
> > ret = check_kprobe_address_safe(p, &probed_mod);
> > +#else
> > +   ret = check_kprobe_address_safe(p);
> > +#endif
> > if (ret)
> > return ret;
> >  
> > @@ -1676,8 +1689,10 @@ int register_kprobe(struct kprobe *p)
> >  out:
> > mutex_unlock(&kprobe_mutex);
> >  
> > +#if IS_ENABLED(CONFIG_MODULES)
> > if (probed_mod)
> > module_put(probed_mod);
> > +#endif
> >  
> > return ret;
> >  }
> > @@ -2482,6 +2497,7 @@ int kprobe_add_area_blacklist(unsigned long start, unsigned long end)
> > return 0;
> >  }
> >  
> > +#if IS_ENABLED(CONFIG_MODULES)
> >  /* Remove all symbols in given area from kprobe blacklist */
> >  static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end)
> >  {
> > @@ -2499,6 +2515,7 @@ static void kprobe_remove_ksym_blacklist(unsigned long entry)
> >  {
> > kprobe_remove_area_blacklist(entry, entry + 1);
> >  }
> > +#endif
> >  
> >  int __weak arch_kprobe_get_kallsym(unsigned int *symnum, unsigned long 
> > *value,

Re: [RFC][PATCH 1/4] module: mm: Make module_alloc() generally available

2024-03-08 Thread Calvin Owens
On Thursday 03/07 at 14:43 +, Christophe Leroy wrote:
> Hi Calvin,
> 
> On 06/03/2024 at 21:05, Calvin Owens wrote:
> > 
> > Both BPF_JIT and KPROBES depend on CONFIG_MODULES, but only require
> > module_alloc() itself, which can be easily separated into a standalone
> > allocator for executable kernel memory.
> 
> Easily maybe, but not as easily as you think, see below.
> 
> > 
> > Thomas Gleixner sent a patch to do that for x86 as part of a larger
> > series a couple years ago:
> > 
> >  https://lore.kernel.org/all/20220716230953.442937...@linutronix.de/
> > 
> > I've simply extended that approach to the whole kernel.
> > 
> > Signed-off-by: Calvin Owens 
> > ---
> >   arch/Kconfig |   2 +-
> >   arch/arm/kernel/module.c |  35 -
> >   arch/arm/mm/Makefile |   2 +
> >   arch/arm/mm/module_alloc.c   |  40 ++
> >   arch/arm64/kernel/module.c   | 127 --
> >   arch/arm64/mm/Makefile   |   1 +
> >   arch/arm64/mm/module_alloc.c | 130 +++
> >   arch/loongarch/kernel/module.c   |   6 --
> >   arch/loongarch/mm/Makefile   |   2 +
> >   arch/loongarch/mm/module_alloc.c |  10 +++
> >   arch/mips/kernel/module.c|  10 ---
> >   arch/mips/mm/Makefile|   2 +
> >   arch/mips/mm/module_alloc.c  |  13 
> >   arch/nios2/kernel/module.c   |  20 -
> >   arch/nios2/mm/Makefile   |   2 +
> >   arch/nios2/mm/module_alloc.c |  22 ++
> >   arch/parisc/kernel/module.c  |  12 ---
> >   arch/parisc/mm/Makefile  |   1 +
> >   arch/parisc/mm/module_alloc.c|  15 
> >   arch/powerpc/kernel/module.c |  36 -
> >   arch/powerpc/mm/Makefile |   1 +
> >   arch/powerpc/mm/module_alloc.c   |  41 ++
> 
> Several powerpc changes are missing to make it work. You must audit every
> use of CONFIG_MODULES inside powerpc. Here are a few examples:
> 
> Function get_patch_pfn() to enable text code patching.
> 
> arch/powerpc/Kconfig: select KASAN_VMALLOC if KASAN && MODULES
> 
> arch/powerpc/include/asm/kasan.h:
> 
> #if defined(CONFIG_MODULES) && defined(CONFIG_PPC32)
> #define KASAN_KERN_START  ALIGN_DOWN(PAGE_OFFSET - SZ_256M, SZ_256M)
> #else
> #define KASAN_KERN_START  PAGE_OFFSET
> #endif
> 
> arch/powerpc/kernel/head_8xx.S and arch/powerpc/kernel/head_book3s_32.S: 
> InstructionTLBMiss interrupt handler must know that there is executable 
> kernel text outside kernel core.
> 
> Function is_module_segment(), used to identify the segments used for module
> text and to set the NX (NoExec) MMU flag on non-module segments.

Thanks Christophe, I'll fix that up.

I'm sure there are many other issues like this in the arch stuff here,
I'm going to run them all through QEMU to catch everything I can before
the next respin.

> >   arch/riscv/kernel/module.c   |  11 ---
> >   arch/riscv/mm/Makefile   |   1 +
> >   arch/riscv/mm/module_alloc.c |  17 
> >   arch/s390/kernel/module.c|  37 -
> >   arch/s390/mm/Makefile|   1 +
> >   arch/s390/mm/module_alloc.c  |  42 ++
> >   arch/sparc/kernel/module.c   |  31 
> >   arch/sparc/mm/Makefile   |   2 +
> >   arch/sparc/mm/module_alloc.c |  31 
> >   arch/x86/kernel/ftrace.c |   2 +-
> >   arch/x86/kernel/module.c |  56 -
> >   arch/x86/mm/Makefile |   2 +
> >   arch/x86/mm/module_alloc.c   |  59 ++
> >   fs/proc/kcore.c  |   2 +-
> >   kernel/module/Kconfig|   1 +
> >   kernel/module/main.c |  17 
> >   mm/Kconfig   |   3 +
> >   mm/Makefile  |   1 +
> >   mm/module_alloc.c|  21 +
> >   mm/vmalloc.c |   2 +-
> >   42 files changed, 467 insertions(+), 402 deletions(-)
> 
> ...
> 
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index ffc3a2ba3a8c..92bfb5ae2e95 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -1261,6 +1261,9 @@ config LOCK_MM_AND_FIND_VMA
> >   config IOMMU_MM_DATA
> >  bool
> > 
> > +config MODULE_ALLOC
> > +   def_bool n
> > +
> 
> I'd call it something other than CONFIG_MODULE_ALLOC, as you want to use
> it when CONFIG_MODULES is not selected.
> 
> Something like CONFIG_EXECMEM_ALLOC or CONFIG_DYNAMIC_EXECMEM ?
> 
> 
> 
> Christophe
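
The KASAN_KERN_START definition quoted above is a small piece of address
arithmetic, and working it through shows why it depends on executable text
living outside the core kernel. The sketch below is userspace C under stated
assumptions: the PAGE_OFFSET value used in the usage note (0xc0000000) is an
assumed example, ALIGN_DOWN() is a power-of-two-only simplification, and
kasan_kern_start() is a hypothetical helper that turns the #if condition into a
parameter.

```c
#include <assert.h>

#define SZ_256M			0x10000000UL
/* Power-of-two-only simplification of the kernel's ALIGN_DOWN(). */
#define ALIGN_DOWN(x, a)	((x) & ~((a) - 1))

static_assert((SZ_256M & (SZ_256M - 1)) == 0,
	      "the mask trick only works for power-of-two alignments");

/*
 * Mirror of the quoted #if: when executable text may sit below
 * PAGE_OFFSET, the covered region starts 256M lower, rounded down
 * to a 256M boundary. Otherwise it starts at PAGE_OFFSET.
 */
static unsigned long kasan_kern_start(unsigned long page_offset,
				      int text_below_page_offset)
{
	if (text_below_page_offset)
		return ALIGN_DOWN(page_offset - SZ_256M, SZ_256M);
	return page_offset;
}
```

For an assumed PAGE_OFFSET of 0xc0000000 this gives 0xb0000000 with the
condition set and 0xc0000000 without it. If an executable allocator becomes
available with CONFIG_MODULES=n, a condition keyed only on CONFIG_MODULES would
pick the second value and leave that 256M window uncovered, which appears to be
the kind of breakage the audit request is about.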



Re: [RFC][PATCH 1/4] module: mm: Make module_alloc() generally available

2024-03-08 Thread Calvin Owens
On Friday 03/08 at 11:16 +0900, Masami Hiramatsu wrote:
> Hi Calvin,
> 
> On Wed,  6 Mar 2024 12:05:08 -0800
> Calvin Owens  wrote:
> 
> > Both BPF_JIT and KPROBES depend on CONFIG_MODULES, but only require
> > module_alloc() itself, which can be easily separated into a standalone
> > allocator for executable kernel memory.
> 
> Thanks for your work!
> As Luis pointed out, it is better to use a different name because this
> is not only for modules and does not depend on CONFIG_MODULES.
> 
> > 
> > Thomas Gleixner sent a patch to do that for x86 as part of a larger
> > series a couple years ago:
> > 
> > https://lore.kernel.org/all/20220716230953.442937...@linutronix.de/
> > 
> > I've simply extended that approach to the whole kernel.
> 
> I would like to see a series of patches, one per architecture, so that
> architecture maintainers can carefully check and test this feature.
> 
> What about introducing CONFIG_HAVE_EXEC_ALLOC and enabling it on
> each architecture? Then you can start with a small set of major architectures
> and expand it later.

Thanks Masami. That makes sense to me, I'll do it.

I'm also working on getting the other architectures running in QEMU, so
hopefully I'll be able to iron out more of the arch problems on my own
before the next respin.

> Thank you,
> 
> > 
> > Signed-off-by: Calvin Owens 
> > ---
> >  arch/Kconfig |   2 +-
> >  arch/arm/kernel/module.c |  35 -
> >  arch/arm/mm/Makefile |   2 +
> >  arch/arm/mm/module_alloc.c   |  40 ++
> >  arch/arm64/kernel/module.c   | 127 --
> >  arch/arm64/mm/Makefile   |   1 +
> >  arch/arm64/mm/module_alloc.c | 130 +++
> >  arch/loongarch/kernel/module.c   |   6 --
> >  arch/loongarch/mm/Makefile   |   2 +
> >  arch/loongarch/mm/module_alloc.c |  10 +++
> >  arch/mips/kernel/module.c|  10 ---
> >  arch/mips/mm/Makefile|   2 +
> >  arch/mips/mm/module_alloc.c  |  13 
> >  arch/nios2/kernel/module.c   |  20 -
> >  arch/nios2/mm/Makefile   |   2 +
> >  arch/nios2/mm/module_alloc.c |  22 ++
> >  arch/parisc/kernel/module.c  |  12 ---
> >  arch/parisc/mm/Makefile  |   1 +
> >  arch/parisc/mm/module_alloc.c|  15 
> >  arch/powerpc/kernel/module.c |  36 -
> >  arch/powerpc/mm/Makefile |   1 +
> >  arch/powerpc/mm/module_alloc.c   |  41 ++
> >  arch/riscv/kernel/module.c   |  11 ---
> >  arch/riscv/mm/Makefile   |   1 +
> >  arch/riscv/mm/module_alloc.c |  17 
> >  arch/s390/kernel/module.c|  37 -
> >  arch/s390/mm/Makefile|   1 +
> >  arch/s390/mm/module_alloc.c  |  42 ++
> >  arch/sparc/kernel/module.c   |  31 
> >  arch/sparc/mm/Makefile   |   2 +
> >  arch/sparc/mm/module_alloc.c |  31 
> >  arch/x86/kernel/ftrace.c |   2 +-
> >  arch/x86/kernel/module.c |  56 -
> >  arch/x86/mm/Makefile |   2 +
> >  arch/x86/mm/module_alloc.c   |  59 ++
> >  fs/proc/kcore.c  |   2 +-
> >  kernel/module/Kconfig|   1 +
> >  kernel/module/main.c |  17 
> >  mm/Kconfig   |   3 +
> >  mm/Makefile  |   1 +
> >  mm/module_alloc.c|  21 +
> >  mm/vmalloc.c |   2 +-
> >  42 files changed, 467 insertions(+), 402 deletions(-)
> >  create mode 100644 arch/arm/mm/module_alloc.c
> >  create mode 100644 arch/arm64/mm/module_alloc.c
> >  create mode 100644 arch/loongarch/mm/module_alloc.c
> >  create mode 100644 arch/mips/mm/module_alloc.c
> >  create mode 100644 arch/nios2/mm/module_alloc.c
> >  create mode 100644 arch/parisc/mm/module_alloc.c
> >  create mode 100644 arch/powerpc/mm/module_alloc.c
> >  create mode 100644 arch/riscv/mm/module_alloc.c
> >  create mode 100644 arch/s390/mm/module_alloc.c
> >  create mode 100644 arch/sparc/mm/module_alloc.c
> >  create mode 100644 arch/x86/mm/module_alloc.c
> >  create mode 100644 mm/module_alloc.c
> > 
> > diff --git a/arch/Kconfig b/arch/Kconfig
> > index a5af0edd3eb8..cfc24ced16dd 100644
> > --- a/arch/Kconfig
> > +++ b/arch/Kconfig
> > @@ -1305,7 +1305,7 @@ config ARCH_HAS_STRICT_MODULE_RWX
> >  
> >  config STRICT_MODULE_RWX
> > bool "Set loadable kernel module data as NX and text as RO" if 
> > ARCH_OPTION

Re: [RFC][PATCH 3/4] kprobes: Allow kprobes with CONFIG_MODULES=n

2024-03-08 Thread Calvin Owens
On Thursday 03/07 at 09:22 +0200, Mike Rapoport wrote:
> On Wed, Mar 06, 2024 at 12:05:10PM -0800, Calvin Owens wrote:
> > If something like this is merged down the road, it can go in at leisure
> > once the module_alloc change is in: it's a one-way dependency.
> > 
> > Signed-off-by: Calvin Owens 
> > ---
> >  arch/Kconfig|  2 +-
> >  kernel/kprobes.c| 22 ++
> >  kernel/trace/trace_kprobe.c | 11 +++
> >  3 files changed, 34 insertions(+), 1 deletion(-)
> 
> When I did this in my last execmem posting, I think I got slightly less
> ugly ifdeffery; you may want to take a look at that:
> 
> https://lore.kernel.org/all/20230918072955.2507221-13-r...@kernel.org

Thanks Mike, I definitely agree. I'm annoyed at myself for not finding
your patches, I spent some time looking for prior work and I really
don't know how I missed it...

> > diff --git a/arch/Kconfig b/arch/Kconfig
> > index cfc24ced16dd..e60ce984d095 100644
> > --- a/arch/Kconfig
> > +++ b/arch/Kconfig
> > @@ -52,8 +52,8 @@ config GENERIC_ENTRY
> >  
> >  config KPROBES
> > bool "Kprobes"
> > -   depends on MODULES
> > depends on HAVE_KPROBES
> > +   select MODULE_ALLOC
> > select KALLSYMS
> > select TASKS_RCU if PREEMPTION
> > help
> > diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> > index 9d9095e81792..194270e17d57 100644
> > --- a/kernel/kprobes.c
> > +++ b/kernel/kprobes.c
> > @@ -1556,8 +1556,12 @@ static bool is_cfi_preamble_symbol(unsigned long addr)
> > str_has_prefix("__pfx_", symbuf);
> >  }
> >  
> > +#if IS_ENABLED(CONFIG_MODULES)
> >  static int check_kprobe_address_safe(struct kprobe *p,
> >  struct module **probed_mod)
> > +#else
> > +static int check_kprobe_address_safe(struct kprobe *p)
> > +#endif
> >  {
> > int ret;
> >  
> > @@ -1580,6 +1584,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
> > goto out;
> > }
> >  
> > +#if IS_ENABLED(CONFIG_MODULES)
> 
> Plain #ifdef will do here and below. IS_ENABLED is for usage within the
> code, like
> 
>   if (IS_ENABLED(CONFIG_MODULES))
>   ;
> 
> > /* Check if 'p' is probing a module. */
> > *probed_mod = __module_text_address((unsigned long) p->addr);
> > if (*probed_mod) {
> 
> -- 
> Sincerely yours,
> Mike.



Re: [RFC][PATCH 0/4] Make bpf_jit and kprobes work with CONFIG_MODULES=n

2024-03-08 Thread Calvin Owens
On Thursday 03/07 at 18:55 -0800, Luis Chamberlain wrote:
> On Thu, Mar 7, 2024 at 6:50 PM Masami Hiramatsu  wrote:
> >
> > On Wed, 6 Mar 2024 17:58:14 -0800
> > Song Liu  wrote:
> >
> > > Hi Calvin,
> > >
> > > It is great to hear from you! :)
> > >
> > > On Wed, Mar 6, 2024 at 3:23 PM Calvin Owens wrote:
> > > >
> > > > On Wednesday 03/06 at 13:34 -0800, Luis Chamberlain wrote:
> > > > > On Wed, Mar 06, 2024 at 12:05:07PM -0800, Calvin Owens wrote:
> > > > > > Hello all,
> > > > > >
> > > > > > This patchset makes it possible to use bpftrace with kprobes on kernels
> > > > > > built without loadable module support.
> > > > >
> > > > > > This is a step in the right direction for another reason: clearly the
> > > > > > module_alloc() is not about modules, and we have special reasons for
> > > > > > it now beyond modules. The effort to share a generalized huge page for
> > > > > > these things is also another reason for some of this but that is more
> > > > > > long term.
> > > > > >
> > > > > > I'm all for minor changes here so to avoid regressions but it seems a
> > > > > > rename is in order -- if we're going to do all this we might as well do
> > > > > > it now. And for that I'd just like to ask you to paint the bikeshed with
> > > > > > Song Liu as he's been the one slowly making way to help us get there
> > > > > > with the "module: replace module_layout with module_memory",
> > > > > > and Mike Rapoport as he's had some follow up attempts [0]. As I see
> > > > > > it, the EXECMEM stuff would be what we use instead then. Mike kept the
> > > > > > module_alloc() and the execmem was just a wrapper but your move of the
> > > > > > arch stuff makes sense as well and I think would complement his series
> > > > > > nicely.
> > > >
> > > > I apologize for missing that. I think these are the four most recent
> > > > versions of the different series referenced from that LWN link:
> > > >
> > > > >   a) https://lore.kernel.org/all/20230918072955.2507221-1-r...@kernel.org/
> > > > >   b) https://lore.kernel.org/all/20230526051529.3387103-1-s...@kernel.org/
> > > > >   c) https://lore.kernel.org/all/20221107223921.3451913-1-s...@kernel.org/
> > > > >   d) https://lore.kernel.org/all/20201120202426.18009-1-rick.p.edgeco...@intel.com/
> > > >
> > > > Song and Mike, please correct me if I'm wrong, but I think what I've
> > > > done here (see [1], sorry for not adding you initially) is compatible
> > > > with everything both of you have recently proposed above. How do you
> > > > feel about this as a first step?
> > >
> > > I agree that the work here is compatible with other efforts. I have no
> > > objection to making this the first step.
> > >
> > > >
> > > > For naming, execmem_alloc() seems reasonable to me? I have no strong
> > > > feelings at all, I'll just use that going forward unless somebody else
> > > > expresses an opinion.
> > >
> > > I am not good at naming things. No objection from me to "execmem_alloc".
> >
> > > Hm, it sounds good to me too. I think we should add a patch which just
> > > renames module_alloc/module_memfree to execmem_alloc/free first.
> 
> I think that would be cleaner, yes. Leaving the possible move to a
> secondary patch and placing the testing more on the later part.

Makes sense to me.



Re: [RFC][PATCH 0/4] Make bpf_jit and kprobes work with CONFIG_MODULES=n

2024-03-06 Thread Calvin Owens
On Wednesday 03/06 at 13:34 -0800, Luis Chamberlain wrote:
> On Wed, Mar 06, 2024 at 12:05:07PM -0800, Calvin Owens wrote:
> > Hello all,
> > 
> > This patchset makes it possible to use bpftrace with kprobes on kernels
> > built without loadable module support.
> 
> This is a step in the right direction for another reason: clearly the
> module_alloc() is not about modules, and we have special reasons for it
> now beyond modules. The effort to share a generalized huge page for
> these things is also another reason for some of this but that is more
> long term.
> 
> I'm all for minor changes here so to avoid regressions but it seems a
> rename is in order -- if we're going to do all this we might as well do
> it now. And for that I'd just like to ask you to paint the bikeshed with
> Song Liu as he's been the one slowly making way to help us get there
> with the "module: replace module_layout with module_memory",
> and Mike Rapoport as he's had some follow up attempts [0]. As I see it,
> the EXECMEM stuff would be what we use instead then. Mike kept the
> module_alloc() and the execmem was just a wrapper but your move of the
> arch stuff makes sense as well and I think would complement his series
> nicely.

I apologize for missing that. I think these are the four most recent
versions of the different series referenced from that LWN link:

  a) https://lore.kernel.org/all/20230918072955.2507221-1-r...@kernel.org/
  b) https://lore.kernel.org/all/20230526051529.3387103-1-s...@kernel.org/
  c) https://lore.kernel.org/all/20221107223921.3451913-1-s...@kernel.org/
  d) https://lore.kernel.org/all/20201120202426.18009-1-rick.p.edgeco...@intel.com/

Song and Mike, please correct me if I'm wrong, but I think what I've
done here (see [1], sorry for not adding you initially) is compatible
with everything both of you have recently proposed above. How do you
feel about this as a first step?

For naming, execmem_alloc() seems reasonable to me? I have no strong
feelings at all, I'll just use that going forward unless somebody else
expresses an opinion.

[1] https://lore.kernel.org/lkml/cover.1709676663.git.jcalvinow...@gmail.com/T/#m337096e158a5f771d0c7c2fb15a3b80a4443226a

> If you're gonna split code up to move to another place, it'd be nice
> if you can add copyright headers as was done with the kernel/module.c
> split into kernel/module/*.c

Silly question: should it be the same copyright header as the original
corresponding module.c, or a new one? I tried to preserve the license
header because I wasn't sure what to do about it.

Thanks,
Calvin

> Can we start with some small basic stuff we can all agree on?
> 
> [0] https://lwn.net/Articles/944857/
> 
>   Luis



[RFC][PATCH 4/4] selftests/bpf: Support testing the !MODULES case

2024-03-06 Thread Calvin Owens
This symlinks bpf_testmod into the main source, so it can be built-in
for running selftests in the new !MODULES case.

To be clear, no changes to the existing selftests are required: this
only exists to enable testing the new case which was not previously
possible. I'm sure somebody will be able to suggest a less ugly way I
can do this...

Signed-off-by: Calvin Owens 
---
 include/trace/events/bpf_testmod.h|  1 +
 kernel/bpf/Kconfig|  9 ++
 kernel/bpf/Makefile   |  2 ++
 kernel/bpf/bpf_testmod/Makefile   |  1 +
 kernel/bpf/bpf_testmod/bpf_testmod.c  |  1 +
 kernel/bpf/bpf_testmod/bpf_testmod.h  |  1 +
 kernel/bpf/bpf_testmod/bpf_testmod_kfunc.h|  1 +
 net/bpf/test_run.c|  2 ++
 tools/testing/selftests/bpf/Makefile  | 28 +--
 .../selftests/bpf/bpf_testmod/Makefile|  2 +-
 .../bpf/bpf_testmod/bpf_testmod-events.h  |  6 
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   |  4 +++
 .../bpf/bpf_testmod/bpf_testmod_kfunc.h   |  2 ++
 tools/testing/selftests/bpf/config|  5 
 tools/testing/selftests/bpf/config.mods   |  5 
 tools/testing/selftests/bpf/config.nomods |  1 +
 .../selftests/bpf/progs/btf_type_tag_percpu.c |  2 ++
 .../selftests/bpf/progs/btf_type_tag_user.c   |  2 ++
 tools/testing/selftests/bpf/progs/core_kern.c |  2 ++
 .../selftests/bpf/progs/iters_testmod_seq.c   |  2 ++
 .../bpf/progs/test_core_reloc_module.c|  2 ++
 .../selftests/bpf/progs/test_ldsx_insn.c  |  2 ++
 .../selftests/bpf/progs/test_module_attach.c  |  3 ++
 .../selftests/bpf/progs/tracing_struct.c  |  2 ++
 tools/testing/selftests/bpf/testing_helpers.c | 14 ++
 tools/testing/selftests/bpf/vmtest.sh | 24 ++--
 26 files changed, 110 insertions(+), 16 deletions(-)
 create mode 120000 include/trace/events/bpf_testmod.h
 create mode 100644 kernel/bpf/bpf_testmod/Makefile
 create mode 120000 kernel/bpf/bpf_testmod/bpf_testmod.c
 create mode 120000 kernel/bpf/bpf_testmod/bpf_testmod.h
 create mode 120000 kernel/bpf/bpf_testmod/bpf_testmod_kfunc.h
 create mode 100644 tools/testing/selftests/bpf/config.mods
 create mode 100644 tools/testing/selftests/bpf/config.nomods

diff --git a/include/trace/events/bpf_testmod.h b/include/trace/events/bpf_testmod.h
new file mode 120000
index 000000000000..ae237a90d381
--- /dev/null
+++ b/include/trace/events/bpf_testmod.h
@@ -0,0 +1 @@
+../../../tools/testing/selftests/bpf/bpf_testmod/bpf_testmod-events.h
\ No newline at end of file
diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig
index 77df483a8925..d5ba795182e5 100644
--- a/kernel/bpf/Kconfig
+++ b/kernel/bpf/Kconfig
@@ -100,4 +100,13 @@ config BPF_LSM
 
  If you are unsure how to answer this question, answer N.
 
+config BPF_TEST_MODULE
+   bool "Build the module for BPF selftests as a built-in"
+   depends on BPF_SYSCALL
+   depends on BPF_JIT
+   depends on !MODULES
+   default n
+   help
+ This allows most of the bpf selftests to run without modules.
+
 endmenu # "BPF subsystem"
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index f526b7573e97..04b3e50ff940 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -46,3 +46,5 @@ obj-$(CONFIG_BPF_PRELOAD) += preload/
 obj-$(CONFIG_BPF_SYSCALL) += relo_core.o
 $(obj)/relo_core.o: $(srctree)/tools/lib/bpf/relo_core.c FORCE
$(call if_changed_rule,cc_o_c)
+
+obj-$(CONFIG_BPF_TEST_MODULE) += bpf_testmod/
diff --git a/kernel/bpf/bpf_testmod/Makefile b/kernel/bpf/bpf_testmod/Makefile
new file mode 100644
index 000000000000..55a73fd8443e
--- /dev/null
+++ b/kernel/bpf/bpf_testmod/Makefile
@@ -0,0 +1 @@
+obj-y += bpf_testmod.o
diff --git a/kernel/bpf/bpf_testmod/bpf_testmod.c b/kernel/bpf/bpf_testmod/bpf_testmod.c
new file mode 120000
index 000000000000..ca3baca5d9c4
--- /dev/null
+++ b/kernel/bpf/bpf_testmod/bpf_testmod.c
@@ -0,0 +1 @@
+../../../tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
\ No newline at end of file
diff --git a/kernel/bpf/bpf_testmod/bpf_testmod.h b/kernel/bpf/bpf_testmod/bpf_testmod.h
new file mode 120000
index 000000000000..f8d3df98b6a5
--- /dev/null
+++ b/kernel/bpf/bpf_testmod/bpf_testmod.h
@@ -0,0 +1 @@
+../../../tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
\ No newline at end of file
diff --git a/kernel/bpf/bpf_testmod/bpf_testmod_kfunc.h b/kernel/bpf/bpf_testmod/bpf_testmod_kfunc.h
new file mode 120000
index 000000000000..fdf42f5eaeb0
--- /dev/null
+++ b/kernel/bpf/bpf_testmod/bpf_testmod_kfunc.h
@@ -0,0 +1 @@
+../../../tools/testing/selftests/bpf/bpf_testmod/bpf_testmod_kfunc.h
\ No newline at end of file
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index dfd919374017..33029c91bf92 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -573,10 +573,12 @@ __bpf_kfunc int bpf_modify_return_test2(int a, int *b, short c, int d,
 

[RFC][PATCH 3/4] kprobes: Allow kprobes with CONFIG_MODULES=n

2024-03-06 Thread Calvin Owens
If something like this is merged down the road, it can go in at leisure
once the module_alloc change is in: it's a one-way dependency.

Signed-off-by: Calvin Owens 
---
 arch/Kconfig|  2 +-
 kernel/kprobes.c| 22 ++
 kernel/trace/trace_kprobe.c | 11 +++
 3 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index cfc24ced16dd..e60ce984d095 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -52,8 +52,8 @@ config GENERIC_ENTRY
 
 config KPROBES
bool "Kprobes"
-   depends on MODULES
depends on HAVE_KPROBES
+   select MODULE_ALLOC
select KALLSYMS
select TASKS_RCU if PREEMPTION
help
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 9d9095e81792..194270e17d57 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1556,8 +1556,12 @@ static bool is_cfi_preamble_symbol(unsigned long addr)
str_has_prefix("__pfx_", symbuf);
 }
 
+#if IS_ENABLED(CONFIG_MODULES)
 static int check_kprobe_address_safe(struct kprobe *p,
 struct module **probed_mod)
+#else
+static int check_kprobe_address_safe(struct kprobe *p)
+#endif
 {
int ret;
 
@@ -1580,6 +1584,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
goto out;
}
 
+#if IS_ENABLED(CONFIG_MODULES)
/* Check if 'p' is probing a module. */
*probed_mod = __module_text_address((unsigned long) p->addr);
if (*probed_mod) {
@@ -1603,6 +1608,8 @@ static int check_kprobe_address_safe(struct kprobe *p,
ret = -ENOENT;
}
}
+#endif
+
 out:
preempt_enable();
jump_label_unlock();
@@ -1614,7 +1621,9 @@ int register_kprobe(struct kprobe *p)
 {
int ret;
struct kprobe *old_p;
+#if IS_ENABLED(CONFIG_MODULES)
struct module *probed_mod;
+#endif
kprobe_opcode_t *addr;
bool on_func_entry;
 
@@ -1633,7 +1642,11 @@ int register_kprobe(struct kprobe *p)
p->nmissed = 0;
INIT_LIST_HEAD(&p->list);
 
+#if IS_ENABLED(CONFIG_MODULES)
ret = check_kprobe_address_safe(p, &probed_mod);
+#else
+   ret = check_kprobe_address_safe(p);
+#endif
if (ret)
return ret;
 
@@ -1676,8 +1689,10 @@ int register_kprobe(struct kprobe *p)
 out:
mutex_unlock(&kprobe_mutex);
 
+#if IS_ENABLED(CONFIG_MODULES)
if (probed_mod)
module_put(probed_mod);
+#endif
 
return ret;
 }
@@ -2482,6 +2497,7 @@ int kprobe_add_area_blacklist(unsigned long start, unsigned long end)
return 0;
 }
 
+#if IS_ENABLED(CONFIG_MODULES)
 /* Remove all symbols in given area from kprobe blacklist */
 static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end)
 {
@@ -2499,6 +2515,7 @@ static void kprobe_remove_ksym_blacklist(unsigned long entry)
 {
kprobe_remove_area_blacklist(entry, entry + 1);
 }
+#endif
 
 int __weak arch_kprobe_get_kallsym(unsigned int *symnum, unsigned long *value,
   char *type, char *sym)
@@ -2564,6 +2581,7 @@ static int __init populate_kprobe_blacklist(unsigned long *start,
return ret ? : arch_populate_kprobe_blacklist();
 }
 
+#if IS_ENABLED(CONFIG_MODULES)
 static void add_module_kprobe_blacklist(struct module *mod)
 {
unsigned long start, end;
@@ -2665,6 +2683,7 @@ static struct notifier_block kprobe_module_nb = {
.notifier_call = kprobes_module_callback,
.priority = 0
 };
+#endif /* IS_ENABLED(CONFIG_MODULES) */
 
 void kprobe_free_init_mem(void)
 {
@@ -2724,8 +2743,11 @@ static int __init init_kprobes(void)
err = arch_init_kprobes();
if (!err)
err = register_die_notifier(&kprobe_exceptions_nb);
+
+#if IS_ENABLED(CONFIG_MODULES)
if (!err)
err = register_module_notifier(&kprobe_module_nb);
+#endif
 
kprobes_initialized = (err == 0);
kprobe_sysctls_init();
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index c4c6e0e0068b..dd4598f775b9 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -102,6 +102,7 @@ static nokprobe_inline bool trace_kprobe_has_gone(struct trace_kprobe *tk)
return kprobe_gone(&tk->rp.kp);
 }
 
+#if IS_ENABLED(CONFIG_MODULES)
 static nokprobe_inline bool trace_kprobe_within_module(struct trace_kprobe *tk,
 struct module *mod)
 {
@@ -129,6 +130,12 @@ static nokprobe_inline bool trace_kprobe_module_exist(struct trace_kprobe *tk)
 
return ret;
 }
+#else
+static nokprobe_inline bool trace_kprobe_module_exist(struct trace_kprobe *tk)
+{
+   return true;
+}
+#endif
 
 static bool trace_kprobe_is_busy(struct dyn_event *ev)
 {
@@ -670,6 +677,7 @@ static int register_trace_kprobe(struct trace_kprobe *tk)
return ret;
 }
 
+#if IS_ENABLED(CONFIG_MODULES)
 /* Module notifier call ba
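
The #if/#else pairs at the call sites in the kprobes patch above could
perhaps be reduced with the usual kernel convention of no-op stubs for the
disabled case, the same idea the patch already applies to
trace_kprobe_module_exist(). The following is only a user-space sketch of
that convention under stated assumptions: every name ending in _sketch is
hypothetical, and CONFIG_MODULES_SKETCH is left undefined to model a
CONFIG_MODULES=n build.

```c
#include <stddef.h>

struct module_sketch {
	int refcount;
};

#ifdef CONFIG_MODULES_SKETCH
/* A real build would provide these elsewhere. */
struct module_sketch *module_text_address_sketch(unsigned long addr);
void module_put_sketch(struct module_sketch *mod);
#else
/* Stubs: with no modules, no address can belong to one. */
static inline struct module_sketch *module_text_address_sketch(unsigned long addr)
{
	(void)addr;
	return NULL;
}

static inline void module_put_sketch(struct module_sketch *mod)
{
	(void)mod;
}
#endif

/* One signature for both configurations; *probed_mod is simply NULL
 * when the stub is in use. */
int check_address_safe_sketch(unsigned long addr,
			      struct module_sketch **probed_mod)
{
	*probed_mod = module_text_address_sketch(addr);
	return 0;
}

int register_probe_sketch(unsigned long addr)
{
	struct module_sketch *probed_mod;
	int ret;

	ret = check_address_safe_sketch(addr, &probed_mod);
	if (ret)
		return ret;

	/* ... the probe would be armed here ... */

	if (probed_mod)
		module_put_sketch(probed_mod);
	return 0;
}
```

With stubs like these the caller needs no conditional compilation at all;
whether that is preferable to the explicit #if blocks in the RFC is a
matter for review.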

[RFC][PATCH 1/4] module: mm: Make module_alloc() generally available

2024-03-06 Thread Calvin Owens
Both BPF_JIT and KPROBES depend on CONFIG_MODULES, but only require
module_alloc() itself, which can be easily separated into a standalone
allocator for executable kernel memory.

Thomas Gleixner sent a patch to do that for x86 as part of a larger
series a couple years ago:

https://lore.kernel.org/all/20220716230953.442937...@linutronix.de/

I've simply extended that approach to the whole kernel.

Signed-off-by: Calvin Owens 
---
 arch/Kconfig |   2 +-
 arch/arm/kernel/module.c |  35 -
 arch/arm/mm/Makefile |   2 +
 arch/arm/mm/module_alloc.c   |  40 ++
 arch/arm64/kernel/module.c   | 127 --
 arch/arm64/mm/Makefile   |   1 +
 arch/arm64/mm/module_alloc.c | 130 +++
 arch/loongarch/kernel/module.c   |   6 --
 arch/loongarch/mm/Makefile   |   2 +
 arch/loongarch/mm/module_alloc.c |  10 +++
 arch/mips/kernel/module.c|  10 ---
 arch/mips/mm/Makefile|   2 +
 arch/mips/mm/module_alloc.c  |  13 
 arch/nios2/kernel/module.c   |  20 -
 arch/nios2/mm/Makefile   |   2 +
 arch/nios2/mm/module_alloc.c |  22 ++
 arch/parisc/kernel/module.c  |  12 ---
 arch/parisc/mm/Makefile  |   1 +
 arch/parisc/mm/module_alloc.c|  15 
 arch/powerpc/kernel/module.c |  36 -
 arch/powerpc/mm/Makefile |   1 +
 arch/powerpc/mm/module_alloc.c   |  41 ++
 arch/riscv/kernel/module.c   |  11 ---
 arch/riscv/mm/Makefile   |   1 +
 arch/riscv/mm/module_alloc.c |  17 
 arch/s390/kernel/module.c|  37 -
 arch/s390/mm/Makefile|   1 +
 arch/s390/mm/module_alloc.c  |  42 ++
 arch/sparc/kernel/module.c   |  31 
 arch/sparc/mm/Makefile   |   2 +
 arch/sparc/mm/module_alloc.c |  31 
 arch/x86/kernel/ftrace.c |   2 +-
 arch/x86/kernel/module.c |  56 -
 arch/x86/mm/Makefile |   2 +
 arch/x86/mm/module_alloc.c   |  59 ++
 fs/proc/kcore.c  |   2 +-
 kernel/module/Kconfig|   1 +
 kernel/module/main.c |  17 
 mm/Kconfig   |   3 +
 mm/Makefile  |   1 +
 mm/module_alloc.c|  21 +
 mm/vmalloc.c |   2 +-
 42 files changed, 467 insertions(+), 402 deletions(-)
 create mode 100644 arch/arm/mm/module_alloc.c
 create mode 100644 arch/arm64/mm/module_alloc.c
 create mode 100644 arch/loongarch/mm/module_alloc.c
 create mode 100644 arch/mips/mm/module_alloc.c
 create mode 100644 arch/nios2/mm/module_alloc.c
 create mode 100644 arch/parisc/mm/module_alloc.c
 create mode 100644 arch/powerpc/mm/module_alloc.c
 create mode 100644 arch/riscv/mm/module_alloc.c
 create mode 100644 arch/s390/mm/module_alloc.c
 create mode 100644 arch/sparc/mm/module_alloc.c
 create mode 100644 arch/x86/mm/module_alloc.c
 create mode 100644 mm/module_alloc.c

diff --git a/arch/Kconfig b/arch/Kconfig
index a5af0edd3eb8..cfc24ced16dd 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1305,7 +1305,7 @@ config ARCH_HAS_STRICT_MODULE_RWX
 
 config STRICT_MODULE_RWX
bool "Set loadable kernel module data as NX and text as RO" if ARCH_OPTIONAL_KERNEL_RWX
-   depends on ARCH_HAS_STRICT_MODULE_RWX && MODULES
+   depends on ARCH_HAS_STRICT_MODULE_RWX && MODULE_ALLOC
default !ARCH_OPTIONAL_KERNEL_RWX || ARCH_OPTIONAL_KERNEL_RWX_DEFAULT
help
  If this is set, module text and rodata memory will be made read-only,
diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c
index e74d84f58b77..1c8798732d12 100644
--- a/arch/arm/kernel/module.c
+++ b/arch/arm/kernel/module.c
@@ -4,15 +4,12 @@
  *
  *  Copyright (C) 2002 Russell King.
  *  Modified for nommu by Hyok S. Choi
- *
- * Module allocation method suggested by Andi Kleen.
  */
 #include 
 #include 
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -22,38 +19,6 @@
 #include 
 #include 
 
-#ifdef CONFIG_XIP_KERNEL
-/*
- * The XIP kernel text is mapped in the module area for modules and
- * some other stuff to work without any indirect relocations.
- * MODULES_VADDR is redefined here and not in asm/memory.h to avoid
- * recompiling the whole kernel when CONFIG_XIP_KERNEL is turned on/off.
- */
-#undef MODULES_VADDR
-#define MODULES_VADDR  (((unsigned long)_exiprom + ~PMD_MASK) & PMD_MASK)
-#endif
-
-#ifdef CONFIG_MMU
-void *module_alloc(unsigned long size)
-{
-   gfp_t gfp_mask = GFP_KERNEL;
-   void *p;
-
-   /* Silence the initial allocation */
-   if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS))
-   gfp_mask |= __GFP_NOWARN;
-
-   p = __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
-   gfp_mask, PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
-   __builtin_return_addres

[RFC][PATCH 2/4] bpf: Allow BPF_JIT with CONFIG_MODULES=n

2024-03-06 Thread Calvin Owens
No BPF code has to change, except in struct_ops (for module refs).

This conflicts with bpf-next because of this (relevant) series:

https://lore.kernel.org/all/20240119225005.668602-1-thinker...@gmail.com/

If something like this is merged down the road, it can go through
bpf-next at leisure once the module_alloc change is in: it's a one-way
dependency.

Signed-off-by: Calvin Owens 
---
 kernel/bpf/Kconfig  |  2 +-
 kernel/bpf/bpf_struct_ops.c | 28 
 2 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig
index 6a906ff93006..77df483a8925 100644
--- a/kernel/bpf/Kconfig
+++ b/kernel/bpf/Kconfig
@@ -42,7 +42,7 @@ config BPF_JIT
bool "Enable BPF Just In Time compiler"
depends on BPF
depends on HAVE_CBPF_JIT || HAVE_EBPF_JIT
-   depends on MODULES
+   select MODULE_ALLOC
help
  BPF programs are normally handled by a BPF interpreter. This option
  allows the kernel to generate native code when a program is loaded
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 02068bd0e4d9..fbf08a1bb00c 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -108,11 +108,30 @@ const struct bpf_prog_ops bpf_struct_ops_prog_ops = {
 #endif
 };
 
+#if IS_ENABLED(CONFIG_MODULES)
 static const struct btf_type *module_type;
 
+static int bpf_struct_module_type_init(struct btf *btf)
+{
+   s32 module_id;
+
+   module_id = btf_find_by_name_kind(btf, "module", BTF_KIND_STRUCT);
+   if (module_id < 0)
+   return 1;
+
+   module_type = btf_type_by_id(btf, module_id);
+   return 0;
+}
+#else
+static int bpf_struct_module_type_init(struct btf *btf)
+{
+   return 0;
+}
+#endif
+
 void bpf_struct_ops_init(struct btf *btf, struct bpf_verifier_log *log)
 {
-   s32 type_id, value_id, module_id;
+   s32 type_id, value_id;
const struct btf_member *member;
struct bpf_struct_ops *st_ops;
const struct btf_type *t;
@@ -125,12 +144,10 @@ void bpf_struct_ops_init(struct btf *btf, struct bpf_verifier_log *log)
 #include "bpf_struct_ops_types.h"
 #undef BPF_STRUCT_OPS_TYPE
 
-   module_id = btf_find_by_name_kind(btf, "module", BTF_KIND_STRUCT);
-   if (module_id < 0) {
+   if (bpf_struct_module_type_init(btf)) {
pr_warn("Cannot find struct module in btf_vmlinux\n");
return;
}
-   module_type = btf_type_by_id(btf, module_id);
 
for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
st_ops = bpf_struct_ops[i];
@@ -433,12 +450,15 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
 
moff = __btf_member_bit_offset(t, member) / 8;
ptype = btf_type_resolve_ptr(btf_vmlinux, member->type, NULL);
+
+#if IS_ENABLED(CONFIG_MODULES)
if (ptype == module_type) {
if (*(void **)(udata + moff))
goto reset_unlock;
*(void **)(kdata + moff) = BPF_MODULE_OWNER;
continue;
}
+#endif
 
err = st_ops->init_member(t, member, kdata, udata);
if (err < 0)
-- 
2.43.0




[RFC][PATCH 0/4] Make bpf_jit and kprobes work with CONFIG_MODULES=n

2024-03-06 Thread Calvin Owens
Hello all,

This patchset makes it possible to use bpftrace with kprobes on kernels
built without loadable module support.

On a Raspberry Pi 4b, this saves about 700KB of memory where BPF is
needed but loadable module support is not. These two kernels had
identical configurations, except CONFIG_MODULES was off in the second:

   - Linux version 6.8.0-rc7
   - Memory: 3330672K/4050944K available (16576K kernel code, 2390K rwdata,
   - 12364K rodata, 5632K init, 675K bss, 195984K reserved, 524288K cma-reserved)
   + Linux version 6.8.0-rc7-3-g2af01251ca21
   + Memory: 3331400K/4050944K available (16512K kernel code, 2384K rwdata,
   + 11728K rodata, 5632K init, 673K bss, 195256K reserved, 524288K cma-reserved)
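
As a quick check of the "about 700KB" figure, the differences between the
two Memory: lines can be computed directly. The sketch below is only an
illustration and not part of the patchset; the struct and function names
are made up, and the numbers are copied from the two boot logs above.

```c
/* Figures from the two "Memory:" lines, in K as printed by the kernel. */
struct mem_line_sketch {
	long available, code, rwdata, rodata, init, bss, reserved;
};

static const struct mem_line_sketch with_modules = {
	3330672, 16576, 2390, 12364, 5632, 675, 195984
};
static const struct mem_line_sketch without_modules = {
	3331400, 16512, 2384, 11728, 5632, 673, 195256
};

/* Extra memory available when modules are disabled. */
long available_gain_k(void)
{
	return without_modules.available - with_modules.available;
}

/* Drop in "reserved", which should mirror the gain above. */
long reserved_drop_k(void)
{
	return with_modules.reserved - without_modules.reserved;
}

/* Shrink of the kernel image sections: 64 + 6 + 636 + 0 + 2. */
long image_shrink_k(void)
{
	return (with_modules.code - without_modules.code) +
	       (with_modules.rwdata - without_modules.rwdata) +
	       (with_modules.rodata - without_modules.rodata) +
	       (with_modules.init - without_modules.init) +
	       (with_modules.bss - without_modules.bss);
}
```

The available and reserved figures both move by 728K, and the listed
sections account for 708K of that, most of it (636K) in rodata.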

I don't intend to present an exhaustive list of !MODULES usecases, since
I'm sure there are many I'm not aware of. Performance is a common one,
the primary justification being that static text is mapped on hugepages
and module text is not. Security is another, since rootkits are much
harder to implement without modules.

The first patch is the interesting one: it moves module_alloc() into its
own file with its own Kconfig option, so it can be utilized even when
loadable module support is disabled. I got the idea from an unmerged
patch from a few years ago I found on lkml (see [1/4] for details). I
think this also has value in its own right, since I suspect there are
potential users beyond bpf; hopefully we will hear from some.

Patches 2-3 are proofs of concept to demonstrate the first patch is
sufficient to achieve my goal (full ebpf functionality without modules).

Patch 4 adds a new "-n" argument to vmtest.sh to run the BPF selftests
without modules, so the prior three patches can be rigorously tested.

If something like the first patch were to eventually be merged, the rest
could go through the normal bpf-next process as I clean them up: I've
only based them on Linus' tree and combined them into a series here to
introduce the idea.

If you prefer to fetch the patches via git:

  [1/4] https://github.com/jcalvinowens/linux.git work/module-alloc
 +[2/4]+[3/4] https://github.com/jcalvinowens/linux.git work/nomodule-bpf
 +[4/4] https://github.com/jcalvinowens/linux.git testing/nomodule-bpf-ci

In addition to the automated BPF selftests, I've lightly tested this on
my laptop (x86_64), a Raspberry Pi 4b (arm64), and a Raspberry Pi Zero W
(arm). The other architectures have only been compile tested.

I didn't want to spam all the arch maintainers with what I expect will
be a discussion mostly about modules and bpf, so I've left them off this
first submission. I will be sure to add them on future submissions of
the first patch. Of course, feedback on the arch bits is welcome here.

In addition to feedback on the patches themselves, I'm interested in
hearing from anybody else who might find this functionality useful.

Thanks,
Calvin


Calvin Owens (4):
  module: mm: Make module_alloc() generally available
  bpf: Allow BPF_JIT with CONFIG_MODULES=n
  kprobes: Allow kprobes with CONFIG_MODULES=n
  selftests/bpf: Support testing the !MODULES case

 arch/Kconfig  |   4 +-
 arch/arm/kernel/module.c  |  35 -
 arch/arm/mm/Makefile  |   2 +
 arch/arm/mm/module_alloc.c|  40 ++
 arch/arm64/kernel/module.c| 127 -
 arch/arm64/mm/Makefile|   1 +
 arch/arm64/mm/module_alloc.c  | 130 ++
 arch/loongarch/kernel/module.c|   6 -
 arch/loongarch/mm/Makefile|   2 +
 arch/loongarch/mm/module_alloc.c  |  10 ++
 arch/mips/kernel/module.c |  10 --
 arch/mips/mm/Makefile |   2 +
 arch/mips/mm/module_alloc.c   |  13 ++
 arch/nios2/kernel/module.c|  20 ---
 arch/nios2/mm/Makefile|   2 +
 arch/nios2/mm/module_alloc.c  |  22 +++
 arch/parisc/kernel/module.c   |  12 --
 arch/parisc/mm/Makefile   |   1 +
 arch/parisc/mm/module_alloc.c |  15 ++
 arch/powerpc/kernel/module.c  |  36 -
 arch/powerpc/mm/Makefile  |   1 +
 arch/powerpc/mm/module_alloc.c|  41 ++
 arch/riscv/kernel/module.c|  11 --
 arch/riscv/mm/Makefile|   1 +
 arch/riscv/mm/module_alloc.c  |  17 +++
 arch/s390/kernel/module.c |  37 -
 arch/s390/mm/Makefile |   1 +
 arch/s390/mm/module_alloc.c   |  42 ++
 arch/sparc/kernel/module.c|  31 -
 arch/sparc/mm/Makefile|   2 +
 arch/sparc/mm/module_alloc.c  |  31 +
 arch/x86/kernel/ftrace.c  |   2 +-
 arch/x86/kerne

Re: [PATCH 3/4] printk: Add consoles to a virtual "console" bus

2019-03-12 Thread Calvin Owens
On Monday 03/11 at 14:33 +0100, Petr Mladek wrote:
> On Fri 2019-03-01 16:48:19, Calvin Owens wrote:
> > This patch embeds a device struct in the console struct, and registers
> > them on a "console" bus so we can expose attributes in sysfs.
> > 
> > Currently, most drivers declare static console structs, and that is
> > incompatible with the dev refcount model. So we end up needing to patch
> > all of the console drivers to:
> > 
> > 1. Dynamically allocate the console struct using a new helper
> > 2. Handle the allocation in (1) possibly failing
> > 3. Dispose of (1) with put_device()
> > 
> > Early console structures must still be static, since they're required
> > before we're able to allocate memory. The least ugly way I can come up
> > with to handle this is an "is_static" flag in the structure which makes
> > the gets and puts NOPs, and is checked in ->release() to catch mistakes.
> > 
> > diff --git a/drivers/char/lp.c b/drivers/char/lp.c
> > index 5c8d780637bd..e09cb192a469 100644
> > --- a/drivers/char/lp.c
> > +++ b/drivers/char/lp.c
> > @@ -857,12 +857,12 @@ static void lp_console_write(struct console *co, const char *s,
> > parport_release(dev);
> >  }
> >  
> > -static struct console lpcons = {
> > -   .name   = "lp",
> > +static const struct console_operations lp_cons_ops = {
> > .write  = lp_console_write,
> > -   .flags  = CON_PRINTBUFFER,
> >  };
> >  
> > +static struct console *lpcons;
> 
> I have got the following compilation error (see below):
> 
>   CC  drivers/char/lp.o
> drivers/char/lp.c: In function ‘lp_register’:
> drivers/char/lp.c:925:2: error: ‘lpcons’ undeclared (first use in this function)
>   lpcons = allocate_console_dfl(&lp_cons_ops, "lp", NULL);
>   ^
> drivers/char/lp.c:925:2: note: each undeclared identifier is reported only once for each function it appears in
> In file included from drivers/char/lp.c:125:0:
> drivers/char/lp.c:925:33: error: ‘lp_cons_ops’ undeclared (first use in this function)

D'oh, will fix.
 
> 
> >  #endif /* console on line printer */
> >  
> >  /* --- initialisation code - */
> > @@ -921,6 +921,11 @@ static int lp_register(int nr, struct parport *port)
> >   &ppdev_cb, nr);
> > if (lp_table[nr].dev == NULL)
> > return 1;
> > +
> > +   lpcons = allocate_console_dfl(&lp_cons_ops, "lp", NULL);
> > +   if (!lpcons)
> > +   return -ENOMEM;
> 
> This should be done inside #ifdef CONFIG_LP_CONSOLE
> to avoid the above compilation error.
> 
> > +
> > lp_table[nr].flags |= LP_EXIST;
> >  
> > if (reset)
> 
> [...]
> > diff --git a/include/linux/console.h b/include/linux/console.h
> > index 3c27a4a29b8c..382591683033 100644
> > --- a/include/linux/console.h
> > +++ b/include/linux/console.h
> > @@ -142,20 +143,28 @@ static inline int con_debug_leave(void)
> >  #define CON_BRL        (32) /* Used for a braille device */
> >  #define CON_EXTENDED   (64) /* Use the extended output format a la /dev/kmsg */
> >  
> > -struct console {
> > -   char    name[16];
> > +struct console;
> > +
> > +struct console_operations {
> > void    (*write)(struct console *, const char *, unsigned);
> > int     (*read)(struct console *, char *, unsigned);
> > struct tty_driver *(*device)(struct console *, int *);
> > void    (*unblank)(void);
> > int     (*setup)(struct console *, char *);
> > int     (*match)(struct console *, char *name, int idx, char *options);
> > +};
> > +
> > +struct console {
> > +   char    name[16];
> > short   flags;
> > short   index;
> > int     cflag;
> > void    *data;
> > struct   console *next;
> > int     level;
> > +   const struct console_operations *ops;
> > +   struct device dev;
> > +   int is_static;
> >  };
> >  
> >  /*
> > @@ -167,6 +176,29 @@ struct console {
> >  extern int console_set_on_cmdline;
> >  extern struct console *early_console;
> >  
> > +extern struct console *allocate_console(const struct console_operations *ops,
> > +   const char *name, short flags,
> > +   short index, void *data);
> > +
> > +#define allocate_console_dfl(ops, name, data) \
> &g

Re: [PATCH 1/4] printk: Introduce per-console loglevel setting

2019-03-12 Thread Calvin Owens
On Friday 03/08 at 12:10 +0900, Sergey Senozhatsky wrote:
> On (03/01/19 16:48), Calvin Owens wrote:
> [..]
> > msg = log_from_idx(console_idx);
> > -   if (suppress_message_printing(msg->level)) {
> > -   /*
> > -* Skip record we have buffered and already printed
> > -* directly to the console when we received it, and
> > -* record that has level above the console loglevel.
> > -*/
> > -   console_idx = log_next(console_idx);
> > -   console_seq++;
> > -   goto skip;
> > -   }
> >  
> > /* Output to all consoles once old messages replayed. */
> > if (unlikely(exclusive_console &&
> > @@ -2405,7 +2402,7 @@ void console_unlock(void)
> > console_lock_spinning_enable();
> >  
> > stop_critical_timings();/* don't trace print latency */
> > -   call_console_drivers(ext_text, ext_len, text, len);
> > +   call_console_drivers(ext_text, ext_len, text, len, msg->level);
> > start_critical_timings();
> 
> So it seems that now we always format the text and ext message (if
> needed) and only then check if there is at least one console we can
> print that message on.
> 
> Can we iterate the consoles first and check if msg is worth
> the effort (per console suppress_message_printing()) and only
> if it is do all the formatting and call console drivers?

Makes sense, will do.

Thanks,
Calvin
 
>   -ss
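
Sergey's suggestion, checking whether any console would print a message
before spending time formatting it, can be sketched in plain C as below.
This is an illustration rather than the printk code: struct console_sketch,
its level field and both functions are hypothetical, and the suppression
rule (a message is printed only when its level is numerically lower than
the console's loglevel) is an assumption modelled on the existing global
check.

```c
#include <stdbool.h>
#include <stddef.h>

struct console_sketch {
	int level;			/* hypothetical per-console loglevel */
	struct console_sketch *next;
};

/* Assumed rule: suppress when the message level is not below the
 * console's own loglevel. */
static bool console_suppresses_sketch(const struct console_sketch *con,
				      int msg_level)
{
	return msg_level >= con->level;
}

/* Walk the list first; formatting is only worth doing if this
 * returns true for at least one console. */
bool any_console_wants_sketch(const struct console_sketch *head,
			      int msg_level)
{
	const struct console_sketch *con;

	for (con = head; con; con = con->next)
		if (!console_suppresses_sketch(con, msg_level))
			return true;
	return false;
}
```

A caller would skip the record, as the removed hunk did, whenever
any_console_wants_sketch() returns false, and format the text otherwise.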


Re: [PATCH 3/4] printk: Add consoles to a virtual "console" bus

2019-03-12 Thread Calvin Owens
On Friday 03/08 at 16:53 +0100, Petr Mladek wrote:
> On Fri 2019-03-01 16:48:19, Calvin Owens wrote:
> > This patch embeds a device struct in the console struct, and registers
> > them on a "console" bus so we can expose attributes in sysfs.
> > 
> > Early console structures must still be static, since they're required
> > before we're able to allocate memory. The least ugly way I can come up
> > with to handle this is an "is_static" flag in the structure which makes
> > the gets and puts NOPs, and is checked in ->release() to catch mistakes.
> 
> I wonder if it might get detected by is_kernel_inittext().

I don't think inittext() in particular would work, since these actually need
to exist forever if you pass "earlyprintk=[...],keep" so they aren't __init.

But I bet you're right that we could catch the static case without needing
the explicit flag, something like is_module_address() (but it would also need
to work for the built-in case). I'll see if I can get this to work.

Thanks,
Calvin
 
> Best Regards,
> Petr


Re: [PATCH 3/4] printk: Add consoles to a virtual "console" bus

2019-03-12 Thread Calvin Owens
On Friday 03/08 at 17:34 +0100, Greg Kroah-Hartman wrote:
> On Fri, Mar 08, 2019 at 04:58:14PM +0100, Petr Mladek wrote:
> > On Fri 2019-03-08 03:56:19, John Ogness wrote:
> > > On 2019-03-02, Calvin Owens  wrote:
> > > > This patch embeds a device struct in the console struct, and registers
> > > > them on a "console" bus so we can expose attributes in sysfs.
> > > 
> > > I expect that "class" would be more appropriate than "bus". These
> > > devices really are grouped together based on their function and not the
> > > medium by which they are accessed.
> > 
> > Good point. "class" looks better to me as well.
> > 
> > Greg, any opinion, where to put the entries for struct console ?
> 
> Hang them off of the device that the console belongs to?
> 
> Classes and busses are almost identical except:
>   - busses is the binding of a driver to a device (usb, pci, etc.)
>   - classes are usually userspace interactions to a device (input,
> tty, etc.)
> 
> So this sounds like a class to me.

Sounds good, will make it a class.
 
> If you want me to review this, I'll be glad to do so once 5.1-rc1 is
> out...

Yeah, I realized after sending this the timing was pretty terrible, I'll
wait for 5.1-rc1 before rebasing/resending.

Thanks,
Calvin
 
> thanks,
> 
> greg k-h


Re: [PATCH] tpm: Make timeout logic simpler and more robust

2019-03-12 Thread Calvin Owens
On Tuesday 03/12 at 13:04 -0400, Mimi Zohar wrote:
> On Mon, 2019-03-11 at 16:54 -0700, Calvin Owens wrote:
> > We're having lots of problems with TPM commands timing out, and we're
> > seeing these problems across lots of different hardware (both v1/v2).
> > 
> > I instrumented the driver to collect latency data, but I wasn't able to
> > find any specific timeout to fix: it seems like many of them are too
> > aggressive. So I tried replacing all the timeout logic with a single
> > universal long timeout, and found that makes our TPMs 100% reliable.
> > 
> > Given that this timeout logic is very complex, problematic, and appears
> > to serve no real purpose, I propose simply deleting all of it.
> 
> Normally before sending such a massive change like this, included in
> the bug report or patch description, there would be some indication as
> to which kernel introduced a regression.  Has this always been a
> problem? Is this something new? How new?

Honestly we've always had problems with flakiness from these devices,
but it seems to have regressed sometime between 4.11 and 4.16.

I wish I had a better answer for you: we need on the order of a hundred
machines to see the difference, and setting up these 100+ machine tests
is unfortunately involved enough that e.g. bisecting it just isn't
feasible :/

What I can say for sure is that this patch makes everything much better
for us. If there's anything in particular you'd like me to test, I have
an army of machines I'm happy to put to use, let me know :)

Thanks,
Calvin
 
> Mimi
> 
> > 
> > Signed-off-by: Calvin Owens 
> > ---
> >  drivers/char/tpm/st33zp24/st33zp24.c |  28 +-
> >  drivers/char/tpm/tpm-interface.c |  41 +--
> >  drivers/char/tpm/tpm-sysfs.c |  34 ---
> >  drivers/char/tpm/tpm.h   |  60 +---
> >  drivers/char/tpm/tpm1-cmd.c  | 423 ++-
> >  drivers/char/tpm/tpm2-cmd.c  | 120 
> >  drivers/char/tpm/tpm_crb.c   |  20 +-
> >  drivers/char/tpm/tpm_i2c_atmel.c |   6 -
> >  drivers/char/tpm/tpm_i2c_infineon.c  |  33 +--
> >  drivers/char/tpm/tpm_i2c_nuvoton.c   |  42 +--
> >  drivers/char/tpm/tpm_nsc.c   |   6 +-
> >  drivers/char/tpm/tpm_tis_core.c  |  96 +-
> >  drivers/char/tpm/xen-tpmfront.c  |  17 +-
> >  13 files changed, 108 insertions(+), 818 deletions(-)
> > 
> > diff --git a/drivers/char/tpm/st33zp24/st33zp24.c 
> > b/drivers/char/tpm/st33zp24/st33zp24.c
> > index 64dc560859f2..433b9a72f0ef 100644
> > --- a/drivers/char/tpm/st33zp24/st33zp24.c
> > +++ b/drivers/char/tpm/st33zp24/st33zp24.c
> > @@ -154,13 +154,13 @@ static int request_locality(struct tpm_chip *chip)
> > if (ret < 0)
> > return ret;
> > 
> > -   stop = jiffies + chip->timeout_a;
> > +   stop = jiffies + TPM_UNIVERSAL_TIMEOUT_JIFFIES;
> > 
> > /* Request locality is usually effective after the request */
> > do {
> > if (check_locality(chip))
> > return tpm_dev->locality;
> > -   msleep(TPM_TIMEOUT);
> > +   msleep(TPM_TIMEOUT_POLL_MS);
> > } while (time_before(jiffies, stop));
> > 
> > /* could not get locality */
> > @@ -193,7 +193,7 @@ static int get_burstcount(struct tpm_chip *chip)
> > int burstcnt, status;
> > u8 temp;
> > 
> > -   stop = jiffies + chip->timeout_d;
> > +   stop = jiffies + TPM_UNIVERSAL_TIMEOUT_JIFFIES;
> > do {
> > status = tpm_dev->ops->recv(tpm_dev->phy_id, TPM_STS + 1,
> > , 1);
> > @@ -209,7 +209,7 @@ static int get_burstcount(struct tpm_chip *chip)
> > burstcnt |= temp << 8;
> > if (burstcnt)
> > return burstcnt;
> > -   msleep(TPM_TIMEOUT);
> > +   msleep(TPM_TIMEOUT_POLL_MS);
> > } while (time_before(jiffies, stop));
> > return -EBUSY;
> >  } /* get_burstcount() */
> > @@ -248,11 +248,11 @@ static bool wait_for_tpm_stat_cond(struct tpm_chip 
> > *chip, u8 mask,
> >   * @param: check_cancel, does the command can be cancelled ?
> >   * @return: the tpm status, 0 if success, -ETIME if timeout is reached.
> >   */
> > -static int wait_for_stat(struct tpm_chip *chip, u8 mask, unsigned long 
> > timeout,
> > +static int wait_for_stat(struct tpm_chip *chip, u8 mask,
> > wait_queue_head_t *queue, bool check_cancel)
> >  {
> > struct st33zp24_dev *tpm_dev = dev_get_drvdata(>dev);
> > -   unsigned long st

Re: [PATCH] tpm: Make timeout logic simpler and more robust

2019-03-12 Thread Calvin Owens
On Tuesday 03/12 at 17:39 +0200, Jarkko Sakkinen wrote:
> On Tue, Mar 12, 2019 at 07:42:46AM -0700, James Bottomley wrote:
> > On Tue, 2019-03-12 at 14:50 +0200, Jarkko Sakkinen wrote:
> > > On Mon, Mar 11, 2019 at 05:27:43PM -0700, James Bottomley wrote:
> > > > On Mon, 2019-03-11 at 16:54 -0700, Calvin Owens wrote:
> > > > > We're having lots of problems with TPM commands timing out, and
> > > > > we're seeing these problems across lots of different hardware
> > > > > (both v1/v2).
> > > > > 
> > > > > I instrumented the driver to collect latency data, but I wasn't
> > > > > able to find any specific timeout to fix: it seems like many of
> > > > > them are too aggressive. So I tried replacing all the timeout
> > > > > logic with a single universal long timeout, and found that makes
> > > > > our TPMs 100% reliable.
> > > > > 
> > > > > Given that this timeout logic is very complex, problematic, and
> > > > > appears to serve no real purpose, I propose simply deleting all
> > > > > of it.
> > > > 
> > > > "no real purpose" is a bit strong given that all these timeouts are
> > > > standards mandated.  The purpose stated by the standards is that
> > > > there needs to be a way of differentiating the TPM crashed from the
> > > > TPM is taking a very long time to respond.  For a normally
> > > > functioning TPM it looks complex and unnecessary, but for a
> > > > malfunctioning one it's a lifesaver.
> > > 
> > > Standards should be only followed when they make practical sense and
> > > ignored when not. The range is only up to 2s anyway.
> > 
> > I don't disagree ... and I'm certainly not going to defend the TCG
> > because I do think the complexity of some of its standards contributed
> > to the lack of use of TPM 1.2.
> > 
> > However, I am saying we should root cause this problem rather than take
> > a blind shot at the apparent timeout complexity.  My timeout
> > instability is definitely related to the polling adjustments, so it's
> > > not unreasonable to think Facebook's might be as well.
> 
> Yeah, referring to my review comment, I think the very first thing
> that should be done is to split patch into two. Then we can probably
> give better feedback.

Absolutely, will do.

Thanks,
Calvin
 
> /Jarkko


Re: [PATCH] tpm: Make timeout logic simpler and more robust

2019-03-12 Thread Calvin Owens
On Monday 03/11 at 17:27 -0700, James Bottomley wrote:
> On Mon, 2019-03-11 at 16:54 -0700, Calvin Owens wrote:
> > We're having lots of problems with TPM commands timing out, and we're
> > seeing these problems across lots of different hardware (both v1/v2).
> > 
> > I instrumented the driver to collect latency data, but I wasn't able
> > to find any specific timeout to fix: it seems like many of them are
> > too aggressive. So I tried replacing all the timeout logic with a
> > single universal long timeout, and found that makes our TPMs 100%
> > reliable.
> > 
> > Given that this timeout logic is very complex, problematic, and
> > appears to serve no real purpose, I propose simply deleting all of
> > it.
> 
> "no real purpose" is a bit strong given that all these timeouts are
> standards mandated.  

Sure, in fairness I said "appears to" ;)

We tested this on roughly a hundred machines with a variety of hardware,
they were flaky before and essentially perfectly reliable after this
patch. So that's where I'm coming from here.

> The purpose stated by the standards is that there needs to be a way of
> differentiating the TPM crashed from the TPM is taking a very long
> time to respond.  For a normally functioning TPM it looks complex and
> unnecessary, but for a malfunctioning one it's a lifesaver.

Does getting -EWHATEVER some 2-3 seconds more quickly really make much
of a difference? That's all we're talking about changing here, right?

> Could you first check it's not a problem we introduced with our polling
> changes?  My nuvoton still doesn't work properly with the default poll
> timings but it works flawlessly if I use the patch below.  I think my
> nuvoton is a bit out of spec (it's a very early model that was software
> upgraded from 1.2 to 2.0) because no-one else on the list seems to see
> the problems I see, but perhaps you are.

I did consider the polling changes. My thinking was that, since the poll
loops I was seeing time out are all gated on time_before(), it would
only potentially change how much the final poll overruns the target
jiffies, and wasn't as likely to help as changing the timeouts
themselves.
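
To make that concrete, here is a small userspace simulation of the loop shape used in the driver (check, sleep, then compare against the deadline). The fake clock, the simulated device, and all names are hypothetical; it only illustrates that a longer poll interval changes how far the last sleep overruns the deadline, while a device that needs longer than the timeout fails either way.

```c
#include <stdbool.h>

/* Hypothetical simulated clock, in milliseconds, standing in for jiffies. */
static unsigned long fake_now_ms;

static void fake_sleep_ms(unsigned long ms)
{
	fake_now_ms += ms;
}

/* Same wraparound-safe comparison idea as the kernel's time_before(). */
static bool before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Simulated device that becomes ready at a fixed point in time. */
static unsigned long device_ready_at_ms;

static bool device_ready(void)
{
	return !before(fake_now_ms, device_ready_at_ms);
}

/*
 * Poll until the device is ready or @timeout_ms elapses.
 * Returns 0 on success, -1 on timeout.
 */
static int poll_device(unsigned long timeout_ms, unsigned long poll_ms)
{
	unsigned long stop = fake_now_ms + timeout_ms;

	do {
		if (device_ready())
			return 0;
		fake_sleep_ms(poll_ms);
	} while (before(fake_now_ms, stop));

	return -1;
}
```

With a device that needs 900ms, a 750ms timeout fails whether the poll interval is 1ms or 20ms; only a longer timeout makes the call succeed.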

The theory about poking it too aggressively making it fall off the bus
definitely makes sense, but the success of this "universal timeout"
approach suggests to me that the timeouts themselves are the root
problem with the flakiness we're seeing in production.

Thanks,
Calvin
 
> James
> 
> ---
> 
> From 249d60a9fafa8638433e545b50dab6987346cb26 Mon Sep 17 00:00:00 2001
> From: James Bottomley 
> Date: Wed, 11 Jul 2018 10:11:14 -0700
> Subject: [PATCH] tpm.h: increase poll timings to fix tpm_tis regression
> 
> tpm_tis regressed recently to the point where the TPM being driven by
> it falls off the bus and cannot be contacted after some hours of use.
> This is the failure trace:
> 
> jejb@jarvis:~> dmesg|grep tpm
> [3.282605] tpm_tis MSFT0101:00: 2.0 TPM (device-id 0xFE, rev-id 2)
> [14566.626614] tpm tpm0: Operation Timed out
> [14566.626621] tpm tpm0: tpm2_load_context: failed with a system error -62
> [14568.626607] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> [14570.626594] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> [14570.626605] tpm tpm0: tpm2_load_context: failed with a system error -62
> [14572.626526] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> [14577.710441] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> ...
> 
> The problem is caused by a change that caused us to poke the TPM far
> more often to see if it's ready.  Apparently something about the bus
> it's on and the TPM means that it crashes or falls off the bus if you
> poke it too often and once this happens, only a reboot will recover
> it.
> 
> The fix I've come up with is to adjust the timings so the TPM no
> longer falls off the bus.  Obviously, this fix works for my Nuvoton
> NPCT6xxx but that's the only TPM I've tested it with.
> 
> Fixes: 424eaf910c32 tpm: reduce polling time to usecs for even finer 
> granularity
> Signed-off-by: James Bottomley 
> 
> diff --git a/drivers/char/tpm/tpm.h b/drivers/char/tpm/tpm.h
> index 4b104245afed..a6c806d98950 100644
> --- a/drivers/char/tpm/tpm.h
> +++ b/drivers/char/tpm/tpm.h
> @@ -64,8 +64,8 @@ enum tpm_timeout {
>   TPM_TIMEOUT_RETRY = 100, /* msecs */
>   TPM_TIMEOUT_RANGE_US = 300, /* usecs */
>   TPM_TIMEOUT_POLL = 1,   /* msecs */
> - TPM_TIMEOUT_USECS_MIN = 100,  /* usecs */
> - TPM_TIMEOUT_USECS_MAX = 500  /* usecs */
> + TPM_TIMEOUT_USECS_MIN = 750,  /* usecs */
> + TPM_TIMEOUT_USECS_MAX = 1000,  /* usecs */
>  };
>  
>  /* TPM addresses */


[PATCH] tpm: Make timeout logic simpler and more robust

2019-03-11 Thread Calvin Owens
We're having lots of problems with TPM commands timing out, and we're
seeing these problems across lots of different hardware (both v1/v2).

I instrumented the driver to collect latency data, but I wasn't able to
find any specific timeout to fix: it seems like many of them are too
aggressive. So I tried replacing all the timeout logic with a single
universal long timeout, and found that makes our TPMs 100% reliable.

Given that this timeout logic is very complex, problematic, and appears
to serve no real purpose, I propose simply deleting all of it.

Signed-off-by: Calvin Owens 
---
 drivers/char/tpm/st33zp24/st33zp24.c |  28 +-
 drivers/char/tpm/tpm-interface.c |  41 +--
 drivers/char/tpm/tpm-sysfs.c |  34 ---
 drivers/char/tpm/tpm.h   |  60 +---
 drivers/char/tpm/tpm1-cmd.c  | 423 ++-
 drivers/char/tpm/tpm2-cmd.c  | 120 
 drivers/char/tpm/tpm_crb.c   |  20 +-
 drivers/char/tpm/tpm_i2c_atmel.c |   6 -
 drivers/char/tpm/tpm_i2c_infineon.c  |  33 +--
 drivers/char/tpm/tpm_i2c_nuvoton.c   |  42 +--
 drivers/char/tpm/tpm_nsc.c   |   6 +-
 drivers/char/tpm/tpm_tis_core.c  |  96 +-
 drivers/char/tpm/xen-tpmfront.c  |  17 +-
 13 files changed, 108 insertions(+), 818 deletions(-)

diff --git a/drivers/char/tpm/st33zp24/st33zp24.c 
b/drivers/char/tpm/st33zp24/st33zp24.c
index 64dc560859f2..433b9a72f0ef 100644
--- a/drivers/char/tpm/st33zp24/st33zp24.c
+++ b/drivers/char/tpm/st33zp24/st33zp24.c
@@ -154,13 +154,13 @@ static int request_locality(struct tpm_chip *chip)
if (ret < 0)
return ret;
 
-   stop = jiffies + chip->timeout_a;
+   stop = jiffies + TPM_UNIVERSAL_TIMEOUT_JIFFIES;
 
/* Request locality is usually effective after the request */
do {
if (check_locality(chip))
return tpm_dev->locality;
-   msleep(TPM_TIMEOUT);
+   msleep(TPM_TIMEOUT_POLL_MS);
} while (time_before(jiffies, stop));
 
/* could not get locality */
@@ -193,7 +193,7 @@ static int get_burstcount(struct tpm_chip *chip)
int burstcnt, status;
u8 temp;
 
-   stop = jiffies + chip->timeout_d;
+   stop = jiffies + TPM_UNIVERSAL_TIMEOUT_JIFFIES;
do {
status = tpm_dev->ops->recv(tpm_dev->phy_id, TPM_STS + 1,
&temp, 1);
@@ -209,7 +209,7 @@ static int get_burstcount(struct tpm_chip *chip)
burstcnt |= temp << 8;
if (burstcnt)
return burstcnt;
-   msleep(TPM_TIMEOUT);
+   msleep(TPM_TIMEOUT_POLL_MS);
} while (time_before(jiffies, stop));
return -EBUSY;
 } /* get_burstcount() */
@@ -248,11 +248,11 @@ static bool wait_for_tpm_stat_cond(struct tpm_chip *chip, 
u8 mask,
  * @param: check_cancel, does the command can be cancelled ?
  * @return: the tpm status, 0 if success, -ETIME if timeout is reached.
  */
-static int wait_for_stat(struct tpm_chip *chip, u8 mask, unsigned long timeout,
+static int wait_for_stat(struct tpm_chip *chip, u8 mask,
wait_queue_head_t *queue, bool check_cancel)
 {
struct st33zp24_dev *tpm_dev = dev_get_drvdata(&chip->dev);
-   unsigned long stop;
+   unsigned long stop, timeout;
int ret = 0;
bool canceled = false;
bool condition;
@@ -264,7 +264,7 @@ static int wait_for_stat(struct tpm_chip *chip, u8 mask, 
unsigned long timeout,
if ((status & mask) == mask)
return 0;
 
-   stop = jiffies + timeout;
+   stop = jiffies + TPM_UNIVERSAL_TIMEOUT_JIFFIES;
 
if (chip->flags & TPM_CHIP_FLAG_IRQ) {
cur_intrs = tpm_dev->intrs;
@@ -296,7 +296,7 @@ static int wait_for_stat(struct tpm_chip *chip, u8 mask, 
unsigned long timeout,
 
} else {
do {
-   msleep(TPM_TIMEOUT);
+   msleep(TPM_TIMEOUT_POLL_MS);
status = chip->ops->status(chip);
if ((status & mask) == mask)
return 0;
@@ -321,7 +321,6 @@ static int recv_data(struct tpm_chip *chip, u8 *buf, size_t 
count)
while (size < count &&
   wait_for_stat(chip,
 TPM_STS_DATA_AVAIL | TPM_STS_VALID,
-chip->timeout_c,
 &tpm_dev->read_queue, true) == 0) {
burstcnt = get_burstcount(chip);
if (burstcnt < 0)
@@ -384,7 +383,7 @@ static int st33zp24_send(struct tpm_chip *chip, unsigned 
char *buf,
if ((status & TPM_STS_COMMAND_READY) == 0) {
st33zp24_cancel(chip);
if (wait_for_stat
-   (chip, TPM_STS_COMMAND_READY, chip->timeout_b,
+   (chip, TPM_STS_COMMAND_READY,

Re: [PATCH 4/4] printk: Add a device attribute for the per-console loglevel

2019-03-04 Thread Calvin Owens
On Monday 03/04 at 17:06 +0900, Sergey Senozhatsky wrote:
> On (03/01/19 16:48), Calvin Owens wrote:
> > +static struct attribute *console_sysfs_attrs[] = {
> > +   &dev_attr_loglevel.attr,
> > +   NULL,
> > +};
> > +ATTRIBUTE_GROUPS(console_sysfs);
> > +
> >  static struct bus_type console_subsys = {
> > .name = "console",
> > +   .dev_groups = console_sysfs_groups,
> >  };
> 
> Do we really need to change this dynamically? Console options are
> traditionally static (boot param or DT). Can we also be happy with
> the static per-console loglevel?

It really does need to be runtime configurable: there are a lot of use cases
that enables, like turning the fast console up to KERN_DEBUG on a pile of
machines you want to take a closer look at. The 'kernel.printk' global
loglevel is also already changeable at runtime, and since that setting
interacts with this one it would be strange if only the former were able
to be changed.

I also want to add more attribute knobs related to extended consoles,
so the plumbing to get things exposed in sysfs is worth it for me.

Thanks,
Calvin


[PATCH 2/4] printk: Add ability to set loglevel via "console=" cmdline

2019-03-01 Thread Calvin Owens
This extends the "console=" interface to allow setting the per-console
loglevel by adding "/N" to the string, where N is the desired loglevel
expressed as a base 10 integer. Invalid values are silently ignored.
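
As a rough illustration of that rule, here is a userspace sketch that cuts a "/N" suffix off a string and falls back to the default when N does not parse. The helper name is hypothetical, and strtol() stands in for the kernel's kstrtoint(); where exactly the suffix sits relative to the other console options is left to the patch itself.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define LOGLEVEL_EMERG 0	/* same value as the kernel's LOGLEVEL_EMERG */

/*
 * Hypothetical helper: cut a "/N" suffix off @spec in place and return N.
 * Returns LOGLEVEL_EMERG when there is no suffix or it is not a base 10 int.
 */
static int split_console_loglevel(char *spec)
{
	char *llevel = strchr(spec, '/');
	char *end;
	long val;

	if (!llevel)
		return LOGLEVEL_EMERG;

	*llevel++ = '\0';	/* terminate the part before the slash */

	errno = 0;
	val = strtol(llevel, &end, 10);
	if (errno || end == llevel || *end != '\0' ||
	    val < INT_MIN || val > INT_MAX)
		return LOGLEVEL_EMERG;	/* invalid values are silently ignored */

	return (int)val;
}
```

So "ttyS0/7" yields 7 and leaves "ttyS0" behind, while "ttyS0/verbose" yields the default.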

Signed-off-by: Calvin Owens 
---
 .../admin-guide/kernel-parameters.txt |  6 ++--
 kernel/printk/console_cmdline.h   |  1 +
 kernel/printk/printk.c| 30 +++
 3 files changed, 28 insertions(+), 9 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 858b6c0b9a15..afada61dcbce 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -612,10 +612,10 @@
ttyS[,options]
ttyUSB0[,options]
Use the specified serial port.  The options are of
-   the form "pnf", where "" is the baud rate,
+   the form "pnf/l", where "" is the baud rate,
"p" is parity ("n", "o", or "e"), "n" is number of
-   bits, and "f" is flow control ("r" for RTS or
-   omit it).  Default is "9600n8".
+   bits, "f" is flow control ("r" for RTS or omit it),
+   and "l" is the loglevel on [0,7]. Default is "9600n8".
 
See Documentation/admin-guide/serial-console.rst for 
more
information.  See
diff --git a/kernel/printk/console_cmdline.h b/kernel/printk/console_cmdline.h
index 11f19c466af5..fbf9b539366e 100644
--- a/kernel/printk/console_cmdline.h
+++ b/kernel/printk/console_cmdline.h
@@ -6,6 +6,7 @@ struct console_cmdline
 {
charname[16];   /* Name of the driver   */
int index;  /* Minor dev. to use*/
+   int loglevel;   /* Loglevel to use */
char*options;   /* Options for the driver   */
 #ifdef CONFIG_A11Y_BRAILLE_CONSOLE
char*brl_options;   /* Options for braille driver */
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 6ead14f8c2bc..2e0eb89f046c 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2057,7 +2057,7 @@ asmlinkage __visible void early_printk(const char *fmt, 
...)
 #endif
 
 static int __add_preferred_console(char *name, int idx, char *options,
-  char *brl_options)
+  int loglevel, char *brl_options)
 {
struct console_cmdline *c;
int i;
@@ -2083,6 +2083,7 @@ static int __add_preferred_console(char *name, int idx, 
char *options,
c->options = options;
braille_set_options(c, brl_options);
 
+   c->loglevel = loglevel;
c->index = idx;
return 0;
 }
@@ -2104,8 +2105,8 @@ __setup("console_msg_format=", console_msg_format_setup);
 static int __init console_setup(char *str)
 {
char buf[sizeof(console_cmdline[0].name) + 4]; /* 4 for "ttyS" */
-   char *s, *options, *brl_options = NULL;
-   int idx;
+   char *s, *options, *llevel, *brl_options = NULL;
+   int idx, loglevel = LOGLEVEL_EMERG;
 
if (_braille_console_setup(&str, &brl_options))
return 1;
@@ -2123,6 +2124,14 @@ static int __init console_setup(char *str)
options = strchr(str, ',');
if (options)
*(options++) = 0;
+
+   llevel = strchr(str, '/');
+   if (llevel) {
+   *(llevel++) = 0;
+   if (kstrtoint(llevel, 10, &loglevel))
+   loglevel = LOGLEVEL_EMERG;
+   }
+
 #ifdef __sparc__
if (!strcmp(str, "ttya"))
strcpy(buf, "ttyS0");
@@ -2135,7 +2144,7 @@ static int __init console_setup(char *str)
idx = simple_strtoul(s, NULL, 10);
*s = 0;
 
-   __add_preferred_console(buf, idx, options, brl_options);
+   __add_preferred_console(buf, idx, options, loglevel, brl_options);
console_set_on_cmdline = 1;
return 1;
 }
@@ -2156,7 +2165,8 @@ __setup("console=", console_setup);
  */
 int add_preferred_console(char *name, int idx, char *options)
 {
-   return __add_preferred_console(name, idx, options, NULL);
+   return __add_preferred_console(name, idx, options, LOGLEVEL_EMERG,
+  NULL);
 }
 
 bool console_suspend_enabled = true;
@@ -2574,6 +2584,7 @@ void register_console(struct console *newcon)
struct console *bcon = NULL;
struct console_cmdline *c;
static bool has_preferred;
+   bool cmdline_exists = false;
 
 

[PATCH 1/4] printk: Introduce per-console loglevel setting

2019-03-01 Thread Calvin Owens
Not all consoles are created equal: depending on the actual hardware,
the latency of a printk() call can vary dramatically. The worst examples
are serial consoles, where it can spin for tens of milliseconds banging
the UART to emit a message, which can cause application-level problems
when the kernel spews onto the console.

At Facebook we use netconsole to monitor our fleet, but we still have
serial consoles attached on each host for live debugging, and the latter
has caused problems. An obvious solution is to disable the kernel
console output to ttyS0, but this makes live debugging frustrating,
since crashes become silent and opaque to the ttyS0 user. Enabling it on
the fly when needed isn't feasible, since boxes you need to debug via
serial are likely to be borked in ways that make this impossible.

That puts us between a rock and a hard place: we'd love to set
kernel.printk to KERN_INFO and get all the logs. But while netconsole is
fast enough to permit that without perturbing userspace, ttyS0 is not,
and we're forced to limit console logging to KERN_WARNING and higher.

This patch introduces a new per-console loglevel setting, and changes
console_unlock() to use max(global_level, per_console_level) when
deciding whether or not to emit a given log message.

This lets us have our cake and eat it too: instead of being forced to
limit all consoles' verbosity based on the speed of the slowest one, we
can "promote" the faster console while still using a conservative system
loglevel setting to avoid disturbing applications.
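
A minimal userspace sketch of that decision, leaving out ignore_loglevel: the struct and function names are stand-ins for illustration, and the level values mirror the kernel's LOGLEVEL_* constants.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins with the same values as the kernel's LOGLEVEL_* constants. */
#define LOGLEVEL_EMERG   0
#define LOGLEVEL_WARNING 4
#define LOGLEVEL_INFO    6

/* Hypothetical reduced console: just a name and its own loglevel. */
struct console_sketch {
	const char *name;
	int level;	/* per-console loglevel; LOGLEVEL_EMERG means "not set" */
};

/* The larger of the global and per-console settings wins. */
static int effective_loglevel(int global_level, const struct console_sketch *con)
{
	int con_level = con ? con->level : LOGLEVEL_EMERG;

	return global_level > con_level ? global_level : con_level;
}

/* A message is suppressed when its level is not below the effective one. */
static bool suppress_message(int msg_level, int global_level,
			     const struct console_sketch *con)
{
	return msg_level >= effective_loglevel(global_level, con);
}
```

With a global level of 5, a KERN_INFO message is dropped on a console left at the default but printed on one whose own level is 7; KERN_WARNING still reaches both.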

Signed-off-by: Calvin Owens 
---
 include/linux/console.h |  1 +
 kernel/printk/printk.c  | 36 +++-
 2 files changed, 20 insertions(+), 17 deletions(-)

diff --git a/include/linux/console.h b/include/linux/console.h
index ec9bdb3d7bab..3c27a4a29b8c 100644
--- a/include/linux/console.h
+++ b/include/linux/console.h
@@ -155,6 +155,7 @@ struct console {
int cflag;
void*data;
struct   console *next;
+   int level;
 };
 
 /*
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index d3d170374ceb..6ead14f8c2bc 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1164,9 +1164,14 @@ module_param(ignore_loglevel, bool, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(ignore_loglevel,
 "ignore loglevel setting (prints all kernel messages to the 
console)");
 
-static bool suppress_message_printing(int level)
+static int effective_loglevel(struct console *con)
 {
-   return (level >= console_loglevel && !ignore_loglevel);
+   return max(console_loglevel, con ? con->level : LOGLEVEL_EMERG);
+}
+
+static bool suppress_message_printing(int level, struct console *con)
+{
+   return (level >= effective_loglevel(con) && !ignore_loglevel);
 }
 
 #ifdef CONFIG_BOOT_PRINTK_DELAY
@@ -1198,7 +1203,7 @@ static void boot_delay_msec(int level)
unsigned long timeout;
 
if ((boot_delay == 0 || system_state >= SYSTEM_RUNNING)
-   || suppress_message_printing(level)) {
+   || suppress_message_printing(level, NULL)) {
return;
}
 
@@ -1712,7 +1717,7 @@ static int console_trylock_spinning(void)
  * The console_lock must be held.
  */
 static void call_console_drivers(const char *ext_text, size_t ext_len,
-const char *text, size_t len)
+const char *text, size_t len, int level)
 {
struct console *con;
 
@@ -1731,6 +1736,8 @@ static void call_console_drivers(const char *ext_text, 
size_t ext_len,
if (!cpu_online(smp_processor_id()) &&
!(con->flags & CON_ANYTIME))
continue;
+   if (suppress_message_printing(level, con))
+   continue;
if (con->flags & CON_EXTENDED)
con->write(con, ext_text, ext_len);
else
@@ -2022,7 +2029,7 @@ static ssize_t msg_print_ext_body(char *buf, size_t size,
 static void console_lock_spinning_enable(void) { }
 static int console_lock_spinning_disable_and_check(void) { return 0; }
 static void call_console_drivers(const char *ext_text, size_t ext_len,
-const char *text, size_t len) {}
+const char *text, size_t len, int level) {}
 static size_t msg_print_text(const struct printk_log *msg, bool syslog,
 bool time, char *buf, size_t size) { return 0; }
 static bool suppress_message_printing(int level) { return false; }
@@ -2358,21 +2365,11 @@ void console_unlock(void)
} else {
len = 0;
}
-skip:
+
if (console_seq == log_next_seq)
break;
 
msg = log_from_idx(console_idx);
-   if (suppress_message_printing(msg->level)) {
- 

[PATCH 4/4] printk: Add a device attribute for the per-console loglevel

2019-03-01 Thread Calvin Owens
Signed-off-by: Calvin Owens 
---
 kernel/printk/printk.c | 40 
 1 file changed, 40 insertions(+)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 67e1e993ab80..e7e602fa2d0b 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2560,8 +2560,48 @@ static int __init keep_bootcon_setup(char *str)
 
 early_param("keep_bootcon", keep_bootcon_setup);
 
+static ssize_t loglevel_show(struct device *dev, struct device_attribute *attr,
+char *buf)
+{
+   struct console *con = container_of(dev, struct console, dev);
+   return sprintf(buf, "%d\n", con->level);
+}
+
+static ssize_t loglevel_store(struct device *dev, struct device_attribute 
*attr,
+ const char *buf, size_t count)
+{
+   struct console *con = container_of(dev, struct console, dev);
+   ssize_t ret;
+   int tmp;
+
+   ret = kstrtoint(buf, 10, &tmp);
+   if (ret < 0)
+   return ret;
+
+   if (tmp < LOGLEVEL_EMERG)
+   return -ERANGE;
+
+   /*
+* Mimic the behavior of /dev/kmsg with respect to minimum_loglevel.
+*/
+   if (tmp < minimum_console_loglevel)
+   tmp = minimum_console_loglevel;
+
+   con->level = tmp;
+   return ret;
+}
+
+static DEVICE_ATTR_RW(loglevel);
+
+static struct attribute *console_sysfs_attrs[] = {
+   &dev_attr_loglevel.attr,
+   NULL,
+};
+ATTRIBUTE_GROUPS(console_sysfs);
+
 static struct bus_type console_subsys = {
.name = "console",
+   .dev_groups = console_sysfs_groups,
 };
 
 static void console_release(struct device *dev)
-- 
2.17.1



[RFC][PATCH 0/4] Per-console loglevel support, console device bus

2019-03-01 Thread Calvin Owens
Hello all,

This is an extremely overdue refresh of this series:

https://lkml.org/lkml/2017/9/28/770

The big change here is the 3rd patch, which actually wires up the console
drivers to support embedding a device structure, so we can place them on
a "console" bus and expose attributes in sysfs.

I left the very long list of driver maintainers off this first submission,
once there's agreement on the core idea here I'll add them.

Thanks,
Calvin


Calvin Owens (4):
  printk: Introduce per-console loglevel setting
  printk: Add ability to set loglevel via "console=" cmdline
  printk: Add consoles to a virtual "console" bus
  printk: Add a device attribute for the per-console loglevel

 131 files changed, 1859 insertions(+), 1061 deletions(-)

-- 
2.17.1



[PATCH] bnxt_en: Fix sources of spurious netpoll warnings

2017-12-08 Thread Calvin Owens
After applying 2270bc5da3497945 ("bnxt_en: Fix netpoll handling") and
903649e718f80da2 ("bnxt_en: Improve -ENOMEM logic in NAPI poll loop."),
we still see the following WARN fire:

  [ cut here ]
  WARNING: CPU: 0 PID: 1875170 at net/core/netpoll.c:165 
netpoll_poll_dev+0x15a/0x160
  bnxt_poll+0x0/0xd0 exceeded budget in poll
  
  Call Trace:
   [] dump_stack+0x4d/0x70
   [] __warn+0xd3/0xf0
   [] warn_slowpath_fmt+0x4f/0x60
   [] netpoll_poll_dev+0x15a/0x160
   [] netpoll_send_skb_on_dev+0x168/0x250
   [] netpoll_send_udp+0x2dc/0x440
   [] write_ext_msg+0x20e/0x250
   [] call_console_drivers.constprop.23+0xa5/0x110
   [] console_unlock+0x339/0x5b0
   [] vprintk_emit+0x2c8/0x450
   [] vprintk_default+0x1f/0x30
   [] printk+0x48/0x50
   [] edac_raw_mc_handle_error+0x563/0x5c0 [edac_core]
   [] edac_mc_handle_error+0x42b/0x6e0 [edac_core]
   [] sbridge_mce_output_error+0x410/0x10d0 [sb_edac]
   [] sbridge_check_error+0xac/0x130 [sb_edac]
   [] edac_mc_workq_function+0x3c/0x90 [edac_core]
   [] process_one_work+0x19b/0x480
   [] worker_thread+0x6a/0x520
   [] kthread+0xe4/0x100
   [] ret_from_fork+0x22/0x40

This happens because we increment rx_pkts on -ENOMEM and -EIO, resulting
in rx_pkts > 0. Fix this by only bumping rx_pkts if we were actually
given a non-zero budget.
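
A much simplified userspace sketch of the idea, not the driver's actual loop: each completion is reduced to a return code, and the function name and array interface are hypothetical. It only shows why counting an allocation failure as work must depend on the budget, since netpoll polls with a budget of zero.

```c
#include <errno.h>

/*
 * Hypothetical, much simplified poll loop. @rc holds one return code per
 * completion: a non-negative packet count, or a negative errno.
 * Returns the amount of work reported back to the caller.
 */
static int poll_work_sketch(const int *rc, int n, int budget)
{
	int rx_pkts = 0;
	int i;

	for (i = 0; i < n; i++) {
		if (rc[i] >= 0)
			rx_pkts += rc[i];
		else if (rc[i] == -ENOMEM && budget)
			rx_pkts++;	/* count the failure so the loop cannot spin forever */
		else if (rc[i] == -EBUSY)
			break;		/* partial completion */

		if (rx_pkts && rx_pkts == budget)
			break;
	}
	return rx_pkts;
}
```

With a zero budget, a run of -ENOMEM completions now reports zero work, so the caller never sees work greater than the budget it handed in.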

Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index c5c38d4..f38160f 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -1883,7 +1883,7 @@ static int bnxt_poll_work(struct bnxt *bp, struct 
bnxt_napi *bnapi, int budget)
 * here forever if we consistently cannot allocate
 * buffers.
 */
-   else if (rc == -ENOMEM)
+   else if (rc == -ENOMEM && budget)
rx_pkts++;
else if (rc == -EBUSY)  /* partial completion */
break;
@@ -1969,7 +1969,7 @@ static int bnxt_poll_nitroa0(struct napi_struct *napi, 
int budget)
cpu_to_le32(RX_CMPL_ERRORS_CRC_ERROR);
 
rc = bnxt_rx_pkt(bp, bnapi, &raw_cons, &event);
-   if (likely(rc == -EIO))
+   if (likely(rc == -EIO) && budget)
rx_pkts++;
else if (rc == -EBUSY)  /* partial completion */
break;
-- 
2.9.5



Re: [PATCH 2/3] printk: Add /sys/consoles/ interface

2017-11-08 Thread Calvin Owens

On 11/03/2017 07:32 AM, Greg Kroah-Hartman wrote:

On Fri, Nov 03, 2017 at 03:21:14PM +0100, Petr Mladek wrote:

On Thu 2017-09-28 17:43:56, Calvin Owens wrote:

This adds a new sysfs interface that contains a directory for each
console registered on the system. Each directory contains a single
"loglevel" file for reading and setting the per-console loglevel.

We can let kobject destruction race with console removal: if it does,
loglevel_{show,store}() will safely fail with -ENODEV. This is a little
weird, but avoids embedding the kobject and therefore needing to totally
refactor the way we handle console struct lifetime.


It looks like a sane approach. It might be worth a comment in the code.
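
The "fail with -ENODEV" behaviour described above can be sketched in
userspace C. This is a simplified model, not the patch itself: struct
console_model, loglevel_show_model(), and the comparison on a stored kobject
pointer are assumptions made for illustration. The idea is that the show
handler never trusts a console pointer kept in the kobject; it walks the
registered consoles (under console_lock() in the kernel) and gives up if the
console has already gone away.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct console_model {
	const void *kobj;		/* stands in for the struct kobject pointer */
	int level;
	struct console_model *next;
};

/*
 * Find the console that owns kobj by walking the list. In the kernel this
 * walk would run with console_lock() held; the model omits locking.
 */
static int loglevel_show_model(const struct console_model *head,
			       const void *kobj, int *level)
{
	const struct console_model *con;

	assert(level != NULL);

	for (con = head; con; con = con->next) {
		if (con->kobj == kobj) {
			*level = con->level;
			return 0;
		}
	}

	return -ENODEV;	/* console was unregistered before we looked */
}
```

Because the lookup either finds a live console or returns -ENODEV, the
kobject can outlive the console without the handler touching freed memory.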



  Documentation/ABI/testing/sysfs-consoles | 13 +
  include/linux/console.h  |  1 +
  kernel/printk/printk.c   | 88 
  3 files changed, 102 insertions(+)
  create mode 100644 Documentation/ABI/testing/sysfs-consoles

diff --git a/Documentation/ABI/testing/sysfs-consoles b/Documentation/ABI/testing/sysfs-consoles
new file mode 100644
index 0000000..6a1593e
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-consoles
@@ -0,0 +1,13 @@
+What:  /sys/consoles/


Eeek, what!


I rather add Greg in CC. I am not 100% sure that the top level
directory is the right thing to do.


Neither do I.


Sure. This is a placeholder I chose arbitrarily pending some real input on
the location; sorry I didn't make that clear.


Alternative might be to hide this under /sys/kernel/consoles/.


No no no.


+Date:  September 2017
+KernelVersion: 4.15
+Contact:       Calvin Owens <calvinow...@fb.com>
+Description:   The /sys/consoles tree contains a directory for each console
+   configured on the system. These directories contain the
+   following attributes:
+
+   * "loglevel"  Set the per-console loglevel: the kernel uses
+   max(system_loglevel, perconsole_loglevel) when
+   deciding whether to emit a given message. The
+   default is 0, which means max() always yields
+   the system setting in the kernel.printk sysctl.


I would call the attribute "min_loglevel". The name "loglevel" should
be reserved for the really used loglevel that depends also on the
global loglevel value.



diff --git a/include/linux/console.h b/include/linux/console.h
index a5b5d79..76840be 100644
--- a/include/linux/console.h
+++ b/include/linux/console.h
@@ -148,6 +148,7 @@ struct console {
  	void	*data;
  	struct	 console *next;
  	int	level;
+	struct kobject *kobj;


Why are you using "raw" kobjects and not a "real" struct device?  This
is a device, use that interface instead please.

If you need a console 'bus' to place them on, fine, but the virtual bus
is probably best and simpler to use.


The problem is that the console corresponds to no actual device (this is what
Petr was getting at in the other mail). A console *may* be associated with a
real TTY device, but this isn't universally true (for example, see
netconsole_ext).

Embedding a device struct in the console structure is problematic for the same
reason embedding a raw kobject is: we'd need to rewrite all the code to deal
with the new refcount/release semantics.

While that's certainly possible, it ends up being a much bigger thorny change.
If we deal with the "get()/deregister()" race in a safe way, it becomes very
simple.

(If it were as trivial as replacing kfrees with puts and adding release
callbacks, that'd be the obvious way to go, but of course it doesn't end up
being that nice...)


That is if you _really_ feel you need sysfs interaction with the console
layer (hint, I am not yet convinced...)


How would you expose this setting if not via sysfs? All I care about is having
the setting, how exactly userspace pokes it is not at all important :)


  };
  
  /*

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 3f1675e..488bda3 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -105,6 +105,8 @@ enum devkmsg_log_masks {
  
  static unsigned int __read_mostly devkmsg_log = DEVKMSG_LOG_MASK_DEFAULT;
  
+static struct kobject *consoles_dir_kobj;


  static int __control_devkmsg(char *str)
  {
if (!str)
@@ -2371,6 +2373,82 @@ static int __init keep_bootcon_setup(char *str)
  
  early_param("keep_bootcon", keep_bootcon_setup);
  
+static ssize_t loglevel_show(struct kobject *kobj, struct kobj_attribute *attr,
+			     char *buf)
+{
+   struct console *con;
+   ssize_t ret = -ENODEV;
+


This might deserve a comment. Something like:

/*
 * Find the related struct console in a safe way. The kobject
 * destruction is asynchronous.
 */

+	console_lock();
+	for_each_c

Re: [PATCH 1/3] printk: Introduce per-console loglevel setting

2017-10-20 Thread Calvin Owens

On 10/20/2017 01:05 AM, Petr Mladek wrote:

On Thu 2017-10-19 16:40:45, Calvin Owens wrote:

On 09/28/2017 05:43 PM, Calvin Owens wrote:

Not all consoles are created equal: depending on the actual hardware,
the latency of a printk() call can vary dramatically. The worst examples
are serial consoles, where it can spin for tens of milliseconds banging
the UART to emit a message, which can cause application-level problems
when the kernel spews onto the console.


Any thoughts on this series? Happy to resend again, but if there are no
objections I'd love to see it merged sooner rather than later :)

Happy to resend too, just let me know.


There is no need to resend the patch. It is on my radar and I am
going to look at it.

Please, be patient, you hit conference, illness, after vacation
season. We do not want to unnecessarily delay it but it is
not a trivial change that might be accepted within minutes.


No worries, just wanted to make sure it hadn't been missed :)

Thanks,
Calvin


Best Regards,
Petr


Re: [PATCH 1/3] printk: Introduce per-console loglevel setting

2017-10-19 Thread Calvin Owens

On 09/28/2017 05:43 PM, Calvin Owens wrote:

Not all consoles are created equal: depending on the actual hardware,
the latency of a printk() call can vary dramatically. The worst examples
are serial consoles, where it can spin for tens of milliseconds banging
the UART to emit a message, which can cause application-level problems
when the kernel spews onto the console.


Any thoughts on this series? Happy to resend again, but if there are no
objections I'd love to see it merged sooner rather than later :)

Happy to resend too, just let me know.

Thanks,
Calvin


At Facebook we use netconsole to monitor our fleet, but we still have
serial consoles attached on each host for live debugging, and the latter
has caused problems. An obvious solution is to disable the kernel
console output to ttyS0, but this makes live debugging frustrating,
since crashes become silent and opaque to the ttyS0 user. Enabling it on
the fly when needed isn't feasible, since boxes you need to debug via
serial are likely to be borked in ways that make this impossible.

That puts us between a rock and a hard place: we'd love to set
kernel.printk to KERN_INFO and get all the logs. But while netconsole is
fast enough to permit that without perturbing userspace, ttyS0 is not,
and we're forced to limit console logging to KERN_WARNING and higher.

This patch introduces a new per-console loglevel setting, and changes
console_unlock() to use max(global_level, per_console_level) when
deciding whether or not to emit a given log message.

This lets us have our cake and eat it too: instead of being forced to
limit all consoles' verbosity based on the speed of the slowest one, we
can "promote" the faster console while still using a conservative system
loglevel setting to avoid disturbing applications.

Cc: Petr Mladek <pmla...@suse.com>
Cc: Steven Rostedt <rost...@goodmis.org>
Cc: Sergey Senozhatsky <sergey.senozhat...@gmail.com>
Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
(V1: https://lkml.org/lkml/2017/4/4/783)

Changes in V2:
* Honor the ignore_loglevel setting in all cases
* Change semantics to use max(global, console) as the loglevel
  for a console, instead of the previous patch where we treated
  the per-console one as a filter downstream of the global one.

  include/linux/console.h |  1 +
  kernel/printk/printk.c  | 38 +++---
  2 files changed, 20 insertions(+), 19 deletions(-)

diff --git a/include/linux/console.h b/include/linux/console.h
index b8920a0..a5b5d79 100644
--- a/include/linux/console.h
+++ b/include/linux/console.h
@@ -147,6 +147,7 @@ struct console {
  	int	cflag;
  	void	*data;
  	struct	 console *next;
+	int	level;
  };
  
  /*

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 512f7c2..3f1675e 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1141,9 +1141,14 @@ module_param(ignore_loglevel, bool, S_IRUGO | S_IWUSR);
  MODULE_PARM_DESC(ignore_loglevel,
  		 "ignore loglevel setting (prints all kernel messages to the console)");
  
-static bool suppress_message_printing(int level)
+static int effective_loglevel(struct console *con)
  {
-	return (level >= console_loglevel && !ignore_loglevel);
+	return max(console_loglevel, con ? con->level : LOGLEVEL_EMERG);
+}
+
+static bool suppress_message_printing(int level, struct console *con)
+{
+	return (level >= effective_loglevel(con) && !ignore_loglevel);
  }
  
  #ifdef CONFIG_BOOT_PRINTK_DELAY
@@ -1175,7 +1180,7 @@ static void boot_delay_msec(int level)
  	unsigned long timeout;
  
  	if ((boot_delay == 0 || system_state >= SYSTEM_RUNNING)
-		|| suppress_message_printing(level)) {
+		|| suppress_message_printing(level, NULL)) {
  		return;
  	}
  
@@ -1549,7 +1554,7 @@ SYSCALL_DEFINE3(syslog, int, type, char __user *, buf, int, len)
   * The console_lock must be held.
   */
  static void call_console_drivers(const char *ext_text, size_t ext_len,
-				 const char *text, size_t len)
+				 const char *text, size_t len, int level)
  {
  	struct console *con;
  
@@ -1568,6 +1573,8 @@ static void call_console_drivers(const char *ext_text, size_t ext_len,
  		if (!cpu_online(smp_processor_id()) &&
  		    !(con->flags & CON_ANYTIME))
  			continue;
+		if (suppress_message_printing(level, con))
+			continue;
  		if (con->flags & CON_EXTENDED)
  			con->write(con, ext_text, ext_len);
  		else
@@ -1856,10 +1863,9 @@ static ssize_t msg_print_ext_body(char *buf, size_t size,
  char *dict, size_t dict_len,
  char *text, size_

[PATCH 1/3] printk: Introduce per-console loglevel setting

2017-09-28 Thread Calvin Owens
Not all consoles are created equal: depending on the actual hardware,
the latency of a printk() call can vary dramatically. The worst examples
are serial consoles, where it can spin for tens of milliseconds banging
the UART to emit a message, which can cause application-level problems
when the kernel spews onto the console.

At Facebook we use netconsole to monitor our fleet, but we still have
serial consoles attached on each host for live debugging, and the latter
has caused problems. An obvious solution is to disable the kernel
console output to ttyS0, but this makes live debugging frustrating,
since crashes become silent and opaque to the ttyS0 user. Enabling it on
the fly when needed isn't feasible, since boxes you need to debug via
serial are likely to be borked in ways that make this impossible.

That puts us between a rock and a hard place: we'd love to set
kernel.printk to KERN_INFO and get all the logs. But while netconsole is
fast enough to permit that without perturbing userspace, ttyS0 is not,
and we're forced to limit console logging to KERN_WARNING and higher.

This patch introduces a new per-console loglevel setting, and changes
console_unlock() to use max(global_level, per_console_level) when
deciding whether or not to emit a given log message.

This lets us have our cake and eat it too: instead of being forced to
limit all consoles' verbosity based on the speed of the slowest one, we
can "promote" the faster console while still using a conservative system
loglevel setting to avoid disturbing applications.
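
The max(global_level, per_console_level) rule can be tried out in a small
userspace C model. The function names follow the patch, but struct
console_model, the explicit global_level and ignore_loglevel parameters, and
the assertion on the message level are assumptions added so the sketch
stands alone; it is not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { LOGLEVEL_EMERG = 0 };

struct console_model {
	int level;	/* per-console loglevel; 0 defers to the global setting */
};

/* A console's effective loglevel is the larger of the global and its own. */
static int effective_loglevel(int global_level, const struct console_model *con)
{
	int per_console = con ? con->level : LOGLEVEL_EMERG;

	return global_level > per_console ? global_level : per_console;
}

/* A message is printed only if its level is below the effective loglevel. */
static bool suppress_message_printing(int msg_level, int global_level,
				      const struct console_model *con,
				      bool ignore_loglevel)
{
	assert(msg_level >= 0 && msg_level <= 7);

	return msg_level >= effective_loglevel(global_level, con) &&
	       !ignore_loglevel;
}
```

For example, with the global loglevel at 4, a console left at the default 0
drops a level 6 message, while a console whose level is raised to 7 prints
it. That is the "promotion" of the faster console described above.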

Cc: Petr Mladek <pmla...@suse.com>
Cc: Steven Rostedt <rost...@goodmis.org>
Cc: Sergey Senozhatsky <sergey.senozhat...@gmail.com>
Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
(V1: https://lkml.org/lkml/2017/4/4/783)

Changes in V2:
* Honor the ignore_loglevel setting in all cases
* Change semantics to use max(global, console) as the loglevel
  for a console, instead of the previous patch where we treated
  the per-console one as a filter downstream of the global one.

 include/linux/console.h |  1 +
 kernel/printk/printk.c  | 38 +++---
 2 files changed, 20 insertions(+), 19 deletions(-)

diff --git a/include/linux/console.h b/include/linux/console.h
index b8920a0..a5b5d79 100644
--- a/include/linux/console.h
+++ b/include/linux/console.h
@@ -147,6 +147,7 @@ struct console {
 	int	cflag;
 	void	*data;
 	struct	 console *next;
+	int	level;
 };
 
 /*
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 512f7c2..3f1675e 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1141,9 +1141,14 @@ module_param(ignore_loglevel, bool, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(ignore_loglevel,
 		 "ignore loglevel setting (prints all kernel messages to the console)");
 
-static bool suppress_message_printing(int level)
+static int effective_loglevel(struct console *con)
 {
-	return (level >= console_loglevel && !ignore_loglevel);
+	return max(console_loglevel, con ? con->level : LOGLEVEL_EMERG);
+}
+
+static bool suppress_message_printing(int level, struct console *con)
+{
+	return (level >= effective_loglevel(con) && !ignore_loglevel);
 }
 
 #ifdef CONFIG_BOOT_PRINTK_DELAY
@@ -1175,7 +1180,7 @@ static void boot_delay_msec(int level)
 	unsigned long timeout;
 
 	if ((boot_delay == 0 || system_state >= SYSTEM_RUNNING)
-		|| suppress_message_printing(level)) {
+		|| suppress_message_printing(level, NULL)) {
 		return;
 	}
 
@@ -1549,7 +1554,7 @@ SYSCALL_DEFINE3(syslog, int, type, char __user *, buf, int, len)
  * The console_lock must be held.
  */
 static void call_console_drivers(const char *ext_text, size_t ext_len,
-				 const char *text, size_t len)
+				 const char *text, size_t len, int level)
 {
 	struct console *con;
 
@@ -1568,6 +1573,8 @@ static void call_console_drivers(const char *ext_text, size_t ext_len,
 		if (!cpu_online(smp_processor_id()) &&
 		    !(con->flags & CON_ANYTIME))
 			continue;
+		if (suppress_message_printing(level, con))
+			continue;
 		if (con->flags & CON_EXTENDED)
 			con->write(con, ext_text, ext_len);
 		else
@@ -1856,10 +1863,9 @@ static ssize_t msg_print_ext_body(char *buf, size_t size,
 				  char *dict, size_t dict_len,
 				  char *text, size_t text_len) { return 0; }
 static void call_console_drivers(const char *ext_text, size_t ext_len,
-				 const char *text, size_t len) {}
+				 const char *text, size_t len, int level) {}
 static size_t msg_print_te

[PATCH 3/3] printk: Add ability to set loglevel via "console=" cmdline

2017-09-28 Thread Calvin Owens
This extends the "console=" interface to allow setting the per-console
loglevel by adding "/N" to the string, where N is the desired loglevel
expressed as a base 10 integer. Invalid values are silently ignored.
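
As a rough illustration of the "/N" suffix, here is a userspace C sketch of
one way such a string could be parsed. It is not the code from this patch:
console_arg_loglevel() and LOGLEVEL_UNSET are hypothetical names, and
treating 0 as "unset" and rejecting values outside [0,7] are assumptions
based on the description and the documentation change below.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define LOGLEVEL_UNSET 0	/* assumption: 0 defers to the global loglevel */

/*
 * Return the loglevel given after the last '/' in a console= argument, or
 * LOGLEVEL_UNSET if there is none or it is not a base 10 integer in [0,7].
 */
static int console_arg_loglevel(const char *arg)
{
	const char *slash;
	char *end;
	long val;

	assert(arg != NULL);

	slash = strrchr(arg, '/');
	if (!slash || !isdigit((unsigned char)slash[1]))
		return LOGLEVEL_UNSET;

	errno = 0;
	val = strtol(slash + 1, &end, 10);
	if (errno || *end != '\0' || val < 0 || val > 7)
		return LOGLEVEL_UNSET;	/* invalid values are silently ignored */

	return (int)val;
}
```

Under these assumptions "ttyS0,9600n8/4" yields 4, while "ttyS0,9600n8",
"ttyS0,9600n8/9" and "ttyS0,9600n8/x" all fall back to the unset value.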

Cc: Petr Mladek <pmla...@suse.com>
Cc: Steven Rostedt <rost...@goodmis.org>
Cc: Sergey Senozhatsky <sergey.senozhat...@gmail.com>
Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  6 ++---
 kernel/printk/console_cmdline.h |  1 +
 kernel/printk/printk.c  | 30 -
 3 files changed, 28 insertions(+), 9 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 0549662..f22b992 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -607,10 +607,10 @@
ttyS[,options]
ttyUSB0[,options]
Use the specified serial port.  The options are of
-   the form "pnf", where "" is the baud rate,
+   the form "pnf/l", where "" is the baud rate,
"p" is parity ("n", "o", or "e"), "n" is number of
-   bits, and "f" is flow control ("r" for RTS or
-   omit it).  Default is "9600n8".
+   bits, "f" is flow control ("r" for RTS or omit it),
+   and "l" is the loglevel on [0,7]. Default is "9600n8".
 
See Documentation/admin-guide/serial-console.rst for 
more
information.  See
diff --git a/kernel/printk/console_cmdline.h b/kernel/printk/console_cmdline.h
index 2ca4a8b..269e666 100644
--- a/kernel/printk/console_cmdline.h
+++ b/kernel/printk/console_cmdline.h
@@ -5,6 +5,7 @@ struct console_cmdline
 {
charname[16];   /* Name of the driver   */
int index;  /* Minor dev. to use*/
+   int loglevel;   /* Loglevel to use */
char*options;   /* Options for the driver   */
 #ifdef CONFIG_A11Y_BRAILLE_CONSOLE
char*brl_options;   /* Options for braille driver */
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 488bda3..4c14cf2 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1892,7 +1892,7 @@ asmlinkage __visible void early_printk(const char *fmt, 
...)
 #endif
 
 static int __add_preferred_console(char *name, int idx, char *options,
-  char *brl_options)
+  int loglevel, char *brl_options)
 {
struct console_cmdline *c;
int i;
@@ -1918,6 +1918,7 @@ static int __add_preferred_console(char *name, int idx, 
char *options,
c->options = options;
braille_set_options(c, brl_options);
 
+   c->loglevel = loglevel;
c->index = idx;
return 0;
 }
@@ -1928,8 +1929,8 @@ static int __add_preferred_console(char *name, int idx, 
char *options,
 static int __init console_setup(char *str)
 {
char buf[sizeof(console_cmdline[0].name) + 4]; /* 4 for "ttyS" */
-   char *s, *options, *brl_options = NULL;
-   int idx;
+   char *s, *options, *llevel, *brl_options = NULL;
+   int idx, loglevel = LOGLEVEL_EMERG;
 
if (_braille_console_setup(&str, &brl_options))
return 1;
@@ -1947,6 +1948,14 @@ static int __init console_setup(char *str)
options = strchr(str, ',');
if (options)
*(options++) = 0;
+
+   llevel = strchr(str, '/');
+   if (llevel) {
+   *(llevel++) = 0;
+   if (kstrtoint(llevel, 10, &loglevel))
+   loglevel = LOGLEVEL_EMERG;
+   }
+
 #ifdef __sparc__
if (!strcmp(str, "ttya"))
strcpy(buf, "ttyS0");
@@ -1959,7 +1968,7 @@ static int __init console_setup(char *str)
idx = simple_strtoul(s, NULL, 10);
*s = 0;
 
-   __add_preferred_console(buf, idx, options, brl_options);
+   __add_preferred_console(buf, idx, options, loglevel, brl_options);
console_set_on_cmdline = 1;
return 1;
 }
@@ -1980,7 +1989,8 @@ __setup("console=", console_setup);
  */
 int add_preferred_console(char *name, int idx, char *options)
 {
-   return __add_preferred_console(name, idx, options, NULL);
+   return __add_preferred_console(name, idx, options, LOGLEVEL_EMERG,
+  NULL);
 }
 
 bool console_suspend_enabled = true;
@@ -2475,6 +2485,7 @@ void register_console(struct console *newcon)
stru

[PATCH 2/3] printk: Add /sys/consoles/ interface

2017-09-28 Thread Calvin Owens
This adds a new sysfs interface that contains a directory for each
console registered on the system. Each directory contains a single
"loglevel" file for reading and setting the per-console loglevel.

We can let kobject destruction race with console removal: if it does,
loglevel_{show,store}() will safely fail with -ENODEV. This is a little
weird, but avoids embedding the kobject and therefore needing to totally
refactor the way we handle console struct lifetime.
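
As a rough userspace sketch of how the per-console value is meant to combine with the system loglevel (the max() rule from the ABI entry in this patch), assuming the globals are passed in as plain parameters. The struct v2_console type and the function names here are hypothetical stand-ins, not kernel symbols.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_LOGLEVEL_EMERG 0	/* stand-in for LOGLEVEL_EMERG */

/* Hypothetical reduced console: only the field the filter needs. */
struct v2_console {
	int level;	/* per-console loglevel, 0 means "use system setting" */
};

/* The larger of the system loglevel and the per-console loglevel wins. */
static int sketch_effective_loglevel(int console_loglevel,
				     const struct v2_console *con)
{
	int per_console = con ? con->level : SKETCH_LOGLEVEL_EMERG;

	return console_loglevel > per_console ? console_loglevel : per_console;
}

/* A message is dropped when its level is at or above the effective level. */
static bool sketch_suppress(int level, int console_loglevel,
			    bool ignore_loglevel,
			    const struct v2_console *con)
{
	return level >= sketch_effective_loglevel(console_loglevel, con) &&
	       !ignore_loglevel;
}
```

With a system loglevel of 4, a console left at 0 drops a level 5 message, while a console raised to 7 still prints it.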

Cc: Petr Mladek <pmla...@suse.com>
Cc: Steven Rostedt <rost...@goodmis.org>
Cc: Sergey Senozhatsky <sergey.senozhat...@gmail.com>
Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
(V1: https://lkml.org/lkml/2017/4/4/784)

Changes in V2:
* Honor minimum_console_loglevel when setting loglevels
* Added entry in Documentation/ABI/testing

 Documentation/ABI/testing/sysfs-consoles | 13 +
 include/linux/console.h  |  1 +
 kernel/printk/printk.c   | 88 
 3 files changed, 102 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-consoles

diff --git a/Documentation/ABI/testing/sysfs-consoles 
b/Documentation/ABI/testing/sysfs-consoles
new file mode 100644
index 000..6a1593e
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-consoles
@@ -0,0 +1,13 @@
+What:  /sys/consoles/
+Date:  September 2017
+KernelVersion: 4.15
+Contact:   Calvin Owens <calvinow...@fb.com>
+Description:   The /sys/consoles tree contains a directory for each console
+   configured on the system. These directories contain the
+   following attributes:
+
+   * "loglevel"Set the per-console loglevel: the kernel uses
+   max(system_loglevel, perconsole_loglevel) when
+   deciding whether to emit a given message. The
+   default is 0, which means max() always yields
+   the system setting in the kernel.printk sysctl.
diff --git a/include/linux/console.h b/include/linux/console.h
index a5b5d79..76840be 100644
--- a/include/linux/console.h
+++ b/include/linux/console.h
@@ -148,6 +148,7 @@ struct console {
void*data;
struct   console *next;
int level;
+   struct kobject *kobj;
 };
 
 /*
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 3f1675e..488bda3 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -105,6 +105,8 @@ enum devkmsg_log_masks {
 
 static unsigned int __read_mostly devkmsg_log = DEVKMSG_LOG_MASK_DEFAULT;
 
+static struct kobject *consoles_dir_kobj;
+
 static int __control_devkmsg(char *str)
 {
if (!str)
@@ -2371,6 +2373,82 @@ static int __init keep_bootcon_setup(char *str)
 
 early_param("keep_bootcon", keep_bootcon_setup);
 
+static ssize_t loglevel_show(struct kobject *kobj, struct kobj_attribute *attr,
+char *buf)
+{
+   struct console *con;
+   ssize_t ret = -ENODEV;
+
+   console_lock();
+   for_each_console(con) {
+   if (con->kobj == kobj) {
+   ret = sprintf(buf, "%d\n", con->level);
+   break;
+   }
+   }
+   console_unlock();
+
+   return ret;
+}
+
+static ssize_t loglevel_store(struct kobject *kobj, struct kobj_attribute 
*attr,
+ const char *buf, size_t count)
+{
+   struct console *con;
+   ssize_t ret;
+   int tmp;
+
+   ret = kstrtoint(buf, 10, &tmp);
+   if (ret < 0)
+   return ret;
+
+   if (tmp < LOGLEVEL_EMERG)
+   return -ERANGE;
+
+   /*
+* Mimic the behavior of /dev/kmsg with respect to minimum_loglevel
+*/
+   if (tmp < minimum_console_loglevel)
+   tmp = minimum_console_loglevel;
+
+   ret = -ENODEV;
+   console_lock();
+   for_each_console(con) {
+   if (con->kobj == kobj) {
+   con->level = tmp;
+   ret = count;
+   break;
+   }
+   }
+   console_unlock();
+
+   return ret;
+}
+
+static const struct kobj_attribute console_loglevel_attr =
+   __ATTR(loglevel, 0644, loglevel_show, loglevel_store);
+
+static void console_register_sysfs(struct console *newcon)
+{
+   /*
+* We might be called very early from register_console(): in that case,
+* printk_late_init() will take care of this later.
+*/
+   if (!consoles_dir_kobj)
+   return;
+
+   newcon->kobj = kobject_create_and_add(newcon->name, consoles_dir_kobj);
+   if (WARN_ON(!newcon->kobj))
+   return;
+
+   WARN_ON(sysfs_create_file(newcon->kobj, &console_loglevel_attr.attr));
+}
+
+static void console_unregister_sysfs(struct console *oldcon)
+{
+   kobject_put(oldcon->kobj);
+}

Re: [RFC][PATCH 1/2] printk: Introduce per-console filtering of messages by loglevel

2017-04-06 Thread Calvin Owens
On Thursday 04/06 at 16:02 +0200, Petr Mladek wrote:
> On Wed 2017-04-05 17:38:19, Calvin Owens wrote:
> > On Wednesday 04/05 at 17:22 +0200, Petr Mladek wrote:
> > > I think about a reasonable behavior. There seems to be three variables
> > > that are related and are in use:
> > > 
> > >  console_level
> > >  minimum_console_loglevel
> > >  ignore_loglevel
> > > 
> > > The functions seems to be the following:
> > > 
> > >   + console_level defines the current maximum level of
> > > messages that appear on all enabled consoles; it
> > > allows to filter out less important ones
> > > 
> > >   + minimum_console_loglevel defines the minimum
> > > console_loglevel that might be set by userspace
> > > via syslog interface; it prevents userspace from
> > > hiding emergency messages
> > > 
> > >   + ignore_loglevel allows to see all messages
> > > easily; it is used for debugging
> > > 
> > > IMPORTANT: console_level is increased in some special
> > > situations to see everything, e.g. in panic(), oops_begin(),
> > > __handle_sysrq().
> > > 
> > > I guess that people want to see all messages even on the slow
> > > console during panic(), oops(), with ignore_loglevel. It means
> > > that the new per-console setting must not limit it. Also any
> > > console must not go below minimum_console_level.
> > 
> > I can definitely take oops_in_progress and minimum_console_level into
> > account in the drop condition. I can also send a patch to make the sysrq
> > handler reset all the maxlevels to LOGLEVEL_DEBUG if you like.
> 
> Please note that you must not call console_lock() in the sysrq
> handler. The function might sleep and it is irq context.
> By other words, you could not manipulate the console structures
> there.

Sure, I'd punt it to process context somehow.
 
> > > What about doing it the other way and define min_loglevel
> > > for each console. It might be used to make selected consoles
> > > always more verbose (above current console_level) but it
> > > will not limit the more verbose modes.
> > 
> > I think it's more intuitive to let the global sysctl behave as it always
> > has, and allow additional filtering of higher levels downstream. I can
> > definitely see why users might find this a bit confusing, but IMHO
> > stacking two "filters" is more intuitive than a "filter" and a "bypass".
> 
> I do not have strong opinion here. I like the idea of this patch.
> Sadly, the console setting already is pretty confusing.
> 
> I know that many people, including me, have troubles to understand
> the meaning of the 4 numbers in /proc/sys/kernel/printk. They set
> 
>   console_loglevel
>   default_message_loglevel
>   minimum_console_loglevel
>   default_console_loglevel
> 
> And we are going to add another complexity :-(
> 
> 
> > How about a read-only "functional_loglevel" attribute for each console
> > that displays:
> > 
> > max(min(console_level, con->maxlevel), minimum_console_level)
> 
> I like this idea and it inspired me. What about creating the following
> structure under /sys
> 
>   /sys/consoles//loglevel
>  /minimum_loglevel
>  //loglevel
>  /minimum_loglevel
>  /loglevel
>  /minimum_loglevel
> 
> The semantic would be:
> 
>+ global loglevel will show the current default console_loglevel,
>  it must be above the global minimum_console_loglevel
> 
>+ the per-console loglevel will show the loglevel specific
>  for the given console; it must be above the per-console
>  minimum_loglevel while
> 
>+ the per-console minimum_loglevel must be above the global
>  minimum_console_loglevel
> 
>The setting of the global values would affect the per-console
>values but it must respect the above rules.
> 
>It is still the "filter" and "bypass" logic. But we will just
>repeat the existing terms and logic. Also note that
>"ignore_loglevel" and the special modes in sysrq, panic, oops
>use the "bypass" logic as well.

Okay, I see where you're coming from. Let me play with this a bit, I'll
send something concrete in the next day or two.
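
To make the quoted rules concrete, here is a minimal sketch of the ordering constraints Petr lists, written as plain userspace C. It assumes "must be above" means "must not be lower than", which is an interpretation rather than something the thread states. The struct level_cfg type and apply_level_rules() are hypothetical names.

```c
/* Hypothetical pair of values, used for both the global and per-console case. */
struct level_cfg {
	int loglevel;
	int minimum_loglevel;
};

/*
 * Raise per-console values until they satisfy:
 *   global minimum <= per-console minimum <= per-console loglevel
 */
static void apply_level_rules(const struct level_cfg *global,
			      struct level_cfg *con)
{
	if (con->minimum_loglevel < global->minimum_loglevel)
		con->minimum_loglevel = global->minimum_loglevel;

	if (con->loglevel < con->minimum_loglevel)
		con->loglevel = con->minimum_loglevel;
}
```

Under these assumptions a console configured with 0/0 against a global minimum of 1 ends up at 1/1, and a console already at 5/3 is left alone.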

> > Would that make the semantics more obvious? I'll obviously also send
> > patches for Documentation once there's consensus about the interface.
> 
> Please add also linux-...@vger.kernel.org, especially for the
> patch adding the new toplevel directory under /sys.

Will do.

Thanks,
Calvin

> Best Regards,
> Petr


Re: [RFC][PATCH 1/2] printk: Introduce per-console filtering of messages by loglevel

2017-04-05 Thread Calvin Owens
On Wednesday 04/05 at 17:22 +0200, Petr Mladek wrote:
> On Wed 2017-04-05 11:16:28, Sergey Senozhatsky wrote:
> > On (04/05/17 11:08), Sergey Senozhatsky wrote:
> > [..]
> > > > stop_critical_timings();/* don't trace print 
> > > > latency */
> > > > -   call_console_drivers(ext_text, ext_len, text, len);
> > > > +   call_console_drivers(ext_text, ext_len, text, len, 
> > > > msg->level);
> > > > start_critical_timings();
> > > > printk_safe_exit_irqrestore(flags);
> > > 
> > > ok, so the idea is quite clear and reasonable.
> > > 
> > > 
> > > some thoughts,
> > > we have a system-wide suppress_message_printing() loglevel filtering
> > > in console_unlock() loop, which sets a limit on loglevel for all of
> > > the messages - we don't even msg_print_text() if the message has
> > > suppressible loglevel. and this implicitly restricts per-console
> > > maxlevels.
> > > 
> > > console_unlock()
> > > {
> > >   for (;;) {
> > >   ...
> > > skip:
> > > 
> > >   if (suppress_message_printing(msg->level))  // 
> > > console_loglevel
> > >   goto skip;
> > > 
> > >   call_console_drivers(msg->level)
> > >   {
> > >   if (level > con->maxlevel)  // con loglevel
> > >   continue;
> > >   ...
> > >   }
> > >   }
> > > }
> > > 
> > > this can be slightly confusing. what do you think?

I think it makes sense as long as we're clear about the semantics: if a
message would normally be printed to the console according to the global
settings, this allows you to limit the loglevel a specific console will
print.
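
The two stacked checks can be sketched as follows. This is a simplified userspace model of the flow in Sergey's pseudo-code, not the patch itself: the struct rfc_filter and struct rfc_console types and the function names are hypothetical, and the kernel globals are passed in explicitly.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the global settings. */
struct rfc_filter {
	int console_loglevel;	/* global threshold */
	bool ignore_loglevel;	/* debugging bypass for the global check */
};

/* Hypothetical reduced console carrying only the RFC's maxlevel. */
struct rfc_console {
	int maxlevel;
};

/* Stage 1: the existing system-wide check. */
static bool globally_suppressed(const struct rfc_filter *st, int level)
{
	return level >= st->console_loglevel && !st->ignore_loglevel;
}

/* Stage 2: only messages that survive stage 1 reach the per-console cap. */
static bool console_prints(const struct rfc_filter *st,
			   const struct rfc_console *con, int level)
{
	if (globally_suppressed(st, level))
		return false;

	return level <= con->maxlevel;
}
```

With console_loglevel at 7 and a console capped at maxlevel 3, a level 2 message is printed, a level 4 message passes the global check but is dropped by the console, and a level 7 message never gets past the global check.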

Petr suggested the opposite approach, I'll address that below.
 
> I think about a reasonable behavior. There seems to be three variables
> that are related and are in use:
> 
>  console_level
>  minimum_console_loglevel
>  ignore_loglevel
> 
> The functions seems to be the following:
> 
>   + console_level defines the current maximum level of
> messages that appear on all enabled consoles; it
> allows to filter out less important ones
> 
>   + minimum_console_loglevel defines the minimum
> console_loglevel that might be set by userspace
> via syslog interface; it prevents userspace from
> hiding emergency messages
> 
>   + ignore_loglevel allows to see all messages
> easily; it is used for debugging
> 
> IMPORTANT: console_level is increased in some special
> situations to see everything, e.g. in panic(), oops_begin(),
> __handle_sysrq().
> 
> I guess that people want to see all messages even on the slow
> console during panic(), oops(), with ignore_loglevel. It means
> that the new per-console setting must not limit it. Also any
> console must not go below minimum_console_level.

I can definitely take oops_in_progress and minimum_console_level into
account in the drop condition. I can also send a patch to make the sysrq
handler reset all the maxlevels to LOGLEVEL_DEBUG if you like.

> What about doing it the other way and define min_loglevel
> for each console. It might be used to make selected consoles
> always more verbose (above current console_level) but it
> will not limit the more verbose modes.

I think it's more intuitive to let the global sysctl behave as it always
has, and allow additional filtering of higher levels downstream. I can
definitely see why users might find this a bit confusing, but IMHO
stacking two "filters" is more intuitive than a "filter" and a "bypass".

How about a read-only "functional_loglevel" attribute for each console
that displays:

max(min(console_level, con->maxlevel), minimum_console_level)

Would that make the semantics more obvious? I'll obviously also send
patches for Documentation once there's consensus about the interface.
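
For reference, the expression above works out like this in plain C. The function name functional_loglevel() is only the attribute name proposed here, and the parameters are ordinary integers standing in for the kernel variables.

```c
/* max(min(console_level, maxlevel), minimum_console_level) */
static int functional_loglevel(int console_level, int con_maxlevel,
			       int minimum_console_level)
{
	/* The console can only narrow what the global level allows... */
	int capped = console_level < con_maxlevel ? console_level : con_maxlevel;

	/* ...but never below the global minimum. */
	return capped > minimum_console_level ? capped : minimum_console_level;
}
```

So with console_level 7 and minimum_console_level 1, a console with maxlevel 4 reports 4, and a console with maxlevel 0 reports 1.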

Thanks,
Calvin

> Best Regards,
> Petr


Re: [RFC][PATCH 1/2] printk: Introduce per-console filtering of messages by loglevel

2017-04-05 Thread Calvin Owens
On Wednesday 04/05 at 17:22 +0200, Petr Mladek wrote:
> On Wed 2017-04-05 11:16:28, Sergey Senozhatsky wrote:
> > On (04/05/17 11:08), Sergey Senozhatsky wrote:
> > [..]
> > > > stop_critical_timings();/* don't trace print 
> > > > latency */
> > > > -   call_console_drivers(ext_text, ext_len, text, len);
> > > > +   call_console_drivers(ext_text, ext_len, text, len, 
> > > > msg->level);
> > > > start_critical_timings();
> > > > printk_safe_exit_irqrestore(flags);
> > > 
> > > ok, so the idea is quite clear and reasonable.
> > > 
> > > 
> > > some thoughts,
> > > we have a system-wide suppress_message_printing() loglevel filtering
> > > in console_unlock() loop, which sets a limit on loglevel for all of
> > > the messages - we don't even msg_print_text() if the message has
> > > suppressible loglevel. and this implicitly restricts per-console
> > > maxlevels.
> > > 
> > > console_unlock()
> > > {
> > >   for (;;) {
> > >   ...
> > > skip:
> > > 
> > >   if (suppress_message_printing(msg->level))  // 
> > > console_loglevel
> > >   goto skip;
> > > 
> > >   call_console_drivers(msg->level)
> > >   {
> > >   if (level > con->maxlevel)  // con loglevel
> > >   continue;
> > >   ...
> > >   }
> > >   }
> > > }
> > > 
> > > this can be slightly confusing. what do you think?

I think it makes sense as long as we're clear about the semantics: if a
message would normally be printed to the console according to the global
settings, this allows you to limit the loglevel a specific console will
print.

Petr suggested the opposite approach, I'll address that below.
 
> I think about a reasonable behavior. There seems to be three variables
> that are related and are in use:
> 
>  console_level
>  minimum_console_loglevel
>  ignore_loglevel
> 
> The functions seems to be the following:
> 
>   + console_level defines the current maximum level of
> messages that appear on all enabled consoles; it
> allows to filter out less important ones
> 
>   + minimum_console_loglevel defines the minimum
> console_loglevel that might be set by userspace
> via syslog interface; it prevents userspace from
> hiding emergency messages
> 
>   + ignore_loglevel allows to see all messages
> easily; it is used for debugging
> 
> IMPORTANT: console_level is increased in some special
> situations to see everything, e.g. in panic(), oops_begin(),
> __handle_sysrq().
> 
> I guess that people want to see all messages even on the slow
> console during panic(), oops(), with ignore_loglevel. It means
> that the new per-console setting must not limit it. Also any
> console must not go below minimum_console_level.

I can definitely take oops_in_progress and minimum_console_level into
account in the drop condition. I can also send a patch to make the sysrq
handler reset all the maxlevels to LOGLEVEL_DEBUG if you like.

> What about doing it the other way and define min_loglevel
> for each console. It might be used to make selected consoles
> always more verbose (above current console_level) but it
> will not limit the more verbose modes.

I think it's more intuitive to let the global sysctl behave as it always
has, and allow additional filtering of higher levels downstream. I can
definitely see why users might find this a bit confusing, but IMHO
stacking two "filters" is more intuitive than a "filter" and a "bypass".

How about a read-only "functional_loglevel" attribute for each console
that displays:

max(min(console_level, con->maxlevel), minimum_console_level)

Would that make the semantics more obvious? I'll obviously also send
patches for Documentation once there's consensus about the interface.

Thanks,
Calvin

> Best Regards,
> Petr
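
The "functional_loglevel" expression proposed above can be tried out in
userspace. This is only a sketch under assumptions: the helper name and the
plain int parameters are hypothetical, and it evaluates the formula exactly
as written in the mail rather than anything taken from printk.c.

```c
#include <assert.h>

/* Kernel loglevels run from 0 (KERN_EMERG) to 7 (KERN_DEBUG). */
#define LOGLEVEL_EMERG 0
#define LOGLEVEL_DEBUG 7

static int min_int(int a, int b) { return a < b ? a : b; }
static int max_int(int a, int b) { return a > b ? a : b; }

/*
 * Hypothetical helper: the value a read-only "functional_loglevel"
 * attribute would report for one console, i.e.
 *   max(min(console_level, con->maxlevel), minimum_console_level)
 */
static int functional_loglevel(int console_level, int maxlevel,
			       int minimum_console_level)
{
	/* maxlevel_store() in patch 2/2 only accepts 0..LOGLEVEL_DEBUG. */
	assert(maxlevel >= LOGLEVEL_EMERG && maxlevel <= LOGLEVEL_DEBUG);

	return max_int(min_int(console_level, maxlevel),
		       minimum_console_level);
}
```

With a global level of 7 and a per-console maxlevel of 4 the result is 4;
with a global level of 3 the per-console setting no longer matters and the
result is 3. The floor only takes effect when both other values fall below
minimum_console_level.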


Re: [RFC][PATCH 2/2] printk: Add /sys/consoles/${con}/ and maxlevel attribute

2017-04-05 Thread Calvin Owens
On Tuesday 04/04 at 23:30 -0400, Steven Rostedt wrote:
> On Tue, 4 Apr 2017 16:03:20 -0700
> Calvin Owens <calvinow...@fb.com> wrote:
> 
> > This does the simplest possible thing: add a directory at the root of
> > sysfs that allows setting the "maxlevel" parameter for each console.
> > 
> > We can let kobject destruction race with console removal: if it does,
> > maxlevel_{show,store}() will safely fail with -ENODEV. This is a little
> > weird, but avoids embedding the kobject and therefore needing to totally
> > refactor the way we handle console struct lifetime.
> 
> Can you also add a patch that allows this to be set on the kernel
> command line, when the consoles are defined?

Absolutely :)

Thanks,
Calvin

 
> -- Steve
> 
> > 
> > Signed-off-by: Calvin Owens <calvinow...@fb.com>


Re: [RFC][PATCH 1/2] printk: Introduce per-console filtering of messages by loglevel

2017-04-05 Thread Calvin Owens
On Tuesday 04/04 at 23:27 -0400, Steven Rostedt wrote:
> On Wed, 5 Apr 2017 11:16:28 +0900
> Sergey Senozhatsky  wrote:
> 
> 
> > one more thing.
> > 
> > this per-console filtering ignores... the "ignore_loglevel" param.
> > 
> > early_param("ignore_loglevel", ignore_loglevel_setup);
> > module_param(ignore_loglevel, bool, S_IRUGO | S_IWUSR);
> > MODULE_PARM_DESC(ignore_loglevel,
> >  "ignore loglevel setting (prints all kernel messages to the console)");
> > 
> > 
> > my preference would be to preserve "ignore_loglevel" behaviour. if
> > we are forced to 'ignore all loglevel filtering' then we should
> > do so.
> 
> Agreed.

Makes sense, I'll add that when I resend.

Thanks,
Calvin
 
> -- Steve
> 
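
One way the per-console check could honor ignore_loglevel, as requested in
this exchange, is to test the bypass before comparing levels. The sketch
below is an illustration only: struct console_sketch and
console_should_drop() are hypothetical names, and the resent patch may well
be structured differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the one field of struct console used here. */
struct console_sketch {
	int maxlevel;
};

/*
 * Returns true when a message of the given level should be skipped for
 * this console. ignore_loglevel wins over the per-console filter, so
 * "ignore all loglevel filtering" keeps its meaning.
 */
static bool console_should_drop(const struct console_sketch *con, int level,
				bool ignore_loglevel)
{
	assert(con != NULL);

	if (ignore_loglevel)
		return false;

	return level > con->maxlevel;
}
```

For a console with maxlevel 4, a level 6 message is dropped unless
ignore_loglevel is set, and a level 4 message is always passed through.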


[RFC][PATCH 1/2] printk: Introduce per-console filtering of messages by loglevel

2017-04-04 Thread Calvin Owens
Not all consoles are created equal: depending on the actual hardware,
the latency of a printk() call can vary dramatically. The worst examples
are serial consoles, where it can spin for tens of milliseconds banging
the UART to emit a message, which can cause application-level problems
when the kernel spews onto the console.

At Facebook we use netconsole to monitor our fleet, but we still have
serial consoles attached on each host for live debugging, and the latter
has caused problems. An obvious solution is to disable the kernel
console output to ttyS0, but this makes live debugging frustrating,
since crashes become silent and opaque to the ttyS0 user. Enabling it on
the fly when needed isn't feasible, since boxes you need to debug via
serial are likely to be borked in ways that make this impossible.

This puts us between a rock and a hard place: we'd love to set
kernel.printk to KERN_INFO and get all the logs. But while netconsole is
fast enough to permit that without perturbing userspace, ttyS0 is not,
and we're forced to limit console logging to KERN_WARNING and higher.

This patch lets us have our cake and eat it too: instead of being forced
to limit all consoles' verbosity based on the speed of the slowest one,
we can limit each based on its own speed. A subsequent patch will
introduce a simple sysfs interface for changing this setting.

Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
 include/linux/console.h |  1 +
 kernel/printk/printk.c  | 13 ++++++++++---
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/include/linux/console.h b/include/linux/console.h
index 5949d18..764a2c0 100644
--- a/include/linux/console.h
+++ b/include/linux/console.h
@@ -147,6 +147,7 @@ struct console {
 	int	cflag;
 	void	*data;
 	struct	 console *next;
+	int maxlevel;
 };
 
 /*
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 2984fb0..5393928 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1562,7 +1562,7 @@ SYSCALL_DEFINE3(syslog, int, type, char __user *, buf, int, len)
  * The console_lock must be held.
  */
 static void call_console_drivers(const char *ext_text, size_t ext_len,
-const char *text, size_t len)
+const char *text, size_t len, int level)
 {
struct console *con;
 
@@ -1581,6 +1581,8 @@ static void call_console_drivers(const char *ext_text, size_t ext_len,
if (!cpu_online(smp_processor_id()) &&
!(con->flags & CON_ANYTIME))
continue;
+   if (level > con->maxlevel)
+   continue;
if (con->flags & CON_EXTENDED)
con->write(con, ext_text, ext_len);
else
@@ -1869,7 +1871,7 @@ static ssize_t msg_print_ext_body(char *buf, size_t size,
  char *dict, size_t dict_len,
  char *text, size_t text_len) { return 0; }
 static void call_console_drivers(const char *ext_text, size_t ext_len,
-const char *text, size_t len) {}
+const char *text, size_t len, int level) {}
 static size_t msg_print_text(const struct printk_log *msg,
 bool syslog, char *buf, size_t size) { return 0; }
 static bool suppress_message_printing(int level) { return false; }
@@ -2238,7 +2240,7 @@ void console_unlock(void)
 		raw_spin_unlock(&logbuf_lock);
 
 		stop_critical_timings();	/* don't trace print latency */
-		call_console_drivers(ext_text, ext_len, text, len);
+		call_console_drivers(ext_text, ext_len, text, len, msg->level);
 		start_critical_timings();
 		printk_safe_exit_irqrestore(flags);
 
@@ -2504,6 +2506,11 @@ void register_console(struct console *newcon)
newcon->flags &= ~CON_PRINTBUFFER;
 
/*
+* By default, the per-console loglevel filter permits all messages.
+*/
+   newcon->maxlevel = LOGLEVEL_DEBUG;
+
+   /*
 *  Put this console in the list - keep the
 *  preferred driver at the head of the list.
 */
-- 
2.9.3
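
The effect of the new check in call_console_drivers() can be modeled in
userspace. This is a minimal sketch, assuming a hypothetical struct
console_sketch with a write counter in place of a real con->write()
callback; only the "level > maxlevel" comparison is taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

#define LOGLEVEL_WARNING 4
#define LOGLEVEL_INFO    6
#define LOGLEVEL_DEBUG   7

/* Hypothetical model of a console: a name, its filter, and a counter. */
struct console_sketch {
	const char *name;
	int maxlevel;
	unsigned int written;
};

/*
 * Offer one message to every console. A console skips the message when
 * its level is numerically above that console's maxlevel, which is the
 * comparison the patch adds.
 */
static void emit_to_consoles(struct console_sketch *cons, size_t n, int level)
{
	assert(cons != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (level > cons[i].maxlevel)
			continue;
		cons[i].written++;	/* stands in for con->write() */
	}
}
```

With a fast console at LOGLEVEL_DEBUG and a slow one at LOGLEVEL_WARNING,
a KERN_INFO message reaches only the fast console, while a KERN_WARNING
message reaches both. That is the split between netconsole and ttyS0
described in the commit message.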



[RFC][PATCH 2/2] printk: Add /sys/consoles/${con}/ and maxlevel attribute

2017-04-04 Thread Calvin Owens
This does the simplest possible thing: add a directory at the root of
sysfs that allows setting the "maxlevel" parameter for each console.

We can let kobject destruction race with console removal: if it does,
maxlevel_{show,store}() will safely fail with -ENODEV. This is a little
weird, but avoids embedding the kobject and therefore needing to totally
refactor the way we handle console struct lifetime.

Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
 include/linux/console.h |  1 +
 kernel/printk/printk.c  | 82 +
 2 files changed, 83 insertions(+)

diff --git a/include/linux/console.h b/include/linux/console.h
index 764a2c0..c76fde0 100644
--- a/include/linux/console.h
+++ b/include/linux/console.h
@@ -148,6 +148,7 @@ struct console {
 	void	*data;
 	struct	 console *next;
 	int maxlevel;
+	struct kobject *kobj;
 };
 
 /*
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 5393928..e9d036b 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -105,6 +105,8 @@ enum devkmsg_log_masks {
 
 static unsigned int __read_mostly devkmsg_log = DEVKMSG_LOG_MASK_DEFAULT;
 
+static struct kobject *consoles_dir_kobj;
+
 static int __control_devkmsg(char *str)
 {
if (!str)
@@ -2386,6 +2388,76 @@ static int __init keep_bootcon_setup(char *str)
 
 early_param("keep_bootcon", keep_bootcon_setup);
 
+static ssize_t maxlevel_show(struct kobject *kobj, struct kobj_attribute *attr,
+			     char *buf)
+{
+	struct console *con;
+	ssize_t ret = -ENODEV;
+
+	console_lock();
+	for_each_console(con) {
+		if (con->kobj == kobj) {
+			ret = sprintf(buf, "%d\n", con->maxlevel);
+			break;
+		}
+	}
+	console_unlock();
+
+	return ret;
+}
+
+static ssize_t maxlevel_store(struct kobject *kobj, struct kobj_attribute *attr,
+			      const char *buf, size_t count)
+{
+	struct console *con;
+	ssize_t ret;
+	int tmp;
+
+	ret = kstrtoint(buf, 10, &tmp);
+	if (ret < 0)
+		return ret;
+
+	if (tmp < 0 || tmp > LOGLEVEL_DEBUG)
+		return -ERANGE;
+
+	ret = -ENODEV;
+	console_lock();
+	for_each_console(con) {
+		if (con->kobj == kobj) {
+			con->maxlevel = tmp;
+			ret = count;
+			break;
+		}
+	}
+	console_unlock();
+
+	return ret;
+}
+
+static const struct kobj_attribute console_level_attr =
+	__ATTR(maxlevel, 0644, maxlevel_show, maxlevel_store);
+
+static void console_register_sysfs(struct console *newcon)
+{
+	/*
+	 * We might be called very early from register_console(): in that case,
+	 * printk_late_init() will take care of this later.
+	 */
+	if (!consoles_dir_kobj)
+		return;
+
+	newcon->kobj = kobject_create_and_add(newcon->name, consoles_dir_kobj);
+	if (WARN_ON(!newcon->kobj))
+		return;
+
+	WARN_ON(sysfs_create_file(newcon->kobj, &console_level_attr.attr));
+}
+
+static void console_unregister_sysfs(struct console *oldcon)
+{
+	kobject_put(oldcon->kobj);
+}
+
 /*
  * The console driver calls this routine during kernel initialization
  * to register the console printing procedure with printk() and to
@@ -2509,6 +2581,7 @@ void register_console(struct console *newcon)
 * By default, the per-console loglevel filter permits all messages.
 */
newcon->maxlevel = LOGLEVEL_DEBUG;
+   newcon->kobj = NULL;
 
/*
 *  Put this console in the list - keep the
@@ -2545,6 +2618,7 @@ void register_console(struct console *newcon)
 */
exclusive_console = newcon;
}
+   console_register_sysfs(newcon);
console_unlock();
console_sysfs_notify();
 
@@ -2611,6 +2685,7 @@ int unregister_console(struct console *console)
console_drivers->flags |= CON_CONSDEV;
 
console->flags &= ~CON_ENABLED;
+   console_unregister_sysfs(console);
console_unlock();
console_sysfs_notify();
return res;
@@ -2656,6 +2731,13 @@ static int __init printk_late_init(void)
ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "printk:online",
console_cpu_notify, NULL);
WARN_ON(ret < 0);
+
+   consoles_dir_kobj = kobject_create_and_add("consoles", NULL);
+   WARN_ON(!consoles_dir_kobj);
+
+   for_each_console(con)
+   console_register_sysfs(con);
+
return 0;
 }
 late_initcall(printk_late_init);
-- 
2.9.3
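
From userspace, the attribute would be driven by writing a decimal level to
/sys/consoles/${con}/maxlevel. The helper below is a sketch under
assumptions: write_console_maxlevel() is a hypothetical name, and the path
exists only on a kernel carrying this RFC. The range check mirrors the one
in maxlevel_store(), so an out-of-range value is rejected before the file is
touched.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: write a loglevel (0..7) to a console's maxlevel
 * attribute. Returns 0 on success or a negative errno value.
 */
static int write_console_maxlevel(const char *attr_path, int level)
{
	FILE *f;
	int ret = 0;

	assert(attr_path != NULL);

	/* Same bounds as maxlevel_store(): 0..LOGLEVEL_DEBUG (7). */
	if (level < 0 || level > 7)
		return -ERANGE;

	f = fopen(attr_path, "w");
	if (!f)
		return -errno;

	if (fprintf(f, "%d\n", level) < 0)
		ret = -EIO;

	/* sysfs reports store() errors when the buffered data is flushed. */
	if (fclose(f) != 0 && ret == 0)
		ret = -errno;

	return ret;
}
```

For example, write_console_maxlevel("/sys/consoles/ttyS0/maxlevel", 4)
would limit ttyS0 to KERN_WARNING and more severe messages, assuming a
console named ttyS0 is registered.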



[PATCH v3] xfs: Honor FALLOC_FL_KEEP_SIZE when punching ends of files

2017-03-30 Thread Calvin Owens
When punching past EOF on XFS, fallocate(mode=PUNCH_HOLE|KEEP_SIZE) will
round the file size up to the nearest multiple of PAGE_SIZE:

  calvinow@vm-disks/generic-xfs-1 ~$ dd if=/dev/urandom of=test bs=2048 count=1
  calvinow@vm-disks/generic-xfs-1 ~$ stat test
    Size: 2048  Blocks: 8  IO Block: 4096   regular file
  calvinow@vm-disks/generic-xfs-1 ~$ fallocate -n -l 2048 -o 2048 -p test
  calvinow@vm-disks/generic-xfs-1 ~$ stat test
    Size: 4096  Blocks: 8  IO Block: 4096   regular file

Commit 3c2bdc912a1cc050 ("xfs: kill xfs_zero_remaining_bytes") replaced
xfs_zero_remaining_bytes() with calls to iomap helpers. The new helpers
don't enforce that [pos,offset) lies strictly on [0,i_size) when being
called from xfs_free_file_space(), so by "leaking" these ranges into
xfs_zero_range() we get this buggy behavior.

Fix this by reintroducing the checks xfs_zero_remaining_bytes() did
against i_size at the bottom of xfs_free_file_space().

Reported-by: Aaron Gao <g...@fb.com>
Fixes: 3c2bdc912a1cc050 ("xfs: kill xfs_zero_remaining_bytes")
Cc: Christoph Hellwig <h...@lst.de>
Cc: Brian Foster <bfos...@redhat.com>
Cc: <sta...@vger.kernel.org> # 4.8+
Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
 fs/xfs/xfs_bmap_util.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 8b75dce..828532c 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1311,8 +1311,16 @@ xfs_free_file_space(
/*
 * Now that we've unmap all full blocks we'll have to zero out any
 * partial block at the beginning and/or end.  xfs_zero_range is
-* smart enough to skip any holes, including those we just created.
+* smart enough to skip any holes, including those we just created,
+* but we must take care not to zero beyond EOF and enlarge i_size.
 */
+
+   if (offset >= XFS_ISIZE(ip))
+   return 0;
+
+   if (offset + len > XFS_ISIZE(ip))
+   len = XFS_ISIZE(ip) - offset;
+
return xfs_zero_range(ip, offset, len, NULL);
 }
 
-- 
2.9.3
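
The two checks added in this version amount to clamping a half-open byte
range [offset, offset + len) to [0, i_size). The sketch below reproduces
that arithmetic in userspace; clamp_zero_range() and the long long
parameters are hypothetical stand-ins for the inline code and
XFS_ISIZE(ip).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper mirroring the v3 checks. Returns false when the
 * range starts at or beyond EOF (nothing to zero); otherwise trims *len
 * so that the range ends at EOF at the latest.
 */
static bool clamp_zero_range(long long offset, long long *len,
			     long long isize)
{
	assert(len != NULL);
	assert(offset >= 0 && *len >= 0 && isize >= 0);

	if (offset >= isize)
		return false;

	if (offset + *len > isize)
		*len = isize - offset;

	return true;
}
```

For the reproducer in the commit message (i_size 2048, punch at offset
2048), the first check fires and no zeroing is attempted, so i_size stays
at 2048. A punch of 4096 bytes at offset 1024 is trimmed to 1024 bytes.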



Re: [PATCH v2] xfs: Honor FALLOC_FL_KEEP_SIZE when punching ends of files

2017-03-21 Thread Calvin Owens

On 03/21/2017 04:39 AM, Brian Foster wrote:
> On Sun, Mar 19, 2017 at 09:54:51PM -0700, Calvin Owens wrote:
> > When punching past EOF on XFS, fallocate(mode=PUNCH_HOLE|KEEP_SIZE) will
> > round the file size up to the nearest multiple of PAGE_SIZE:
> > 
> >   calvinow@vm-disks/generic-xfs-1 ~$ dd if=/dev/urandom of=test bs=2048 count=1
> >   calvinow@vm-disks/generic-xfs-1 ~$ stat test
> >     Size: 2048  Blocks: 8  IO Block: 4096   regular file
> >   calvinow@vm-disks/generic-xfs-1 ~$ fallocate -n -l 2048 -o 2048 -p test
> >   calvinow@vm-disks/generic-xfs-1 ~$ stat test
> >     Size: 4096  Blocks: 8  IO Block: 4096   regular file
> > 
> > Commit 3c2bdc912a1cc050 ("xfs: kill xfs_zero_remaining_bytes") replaced
> > xfs_zero_remaining_bytes() with calls to iomap helpers. The new helpers
> > don't enforce that [pos,offset) lies strictly on [0,i_size) when being
> > called from xfs_free_file_space(), so by "leaking" these ranges into
> > xfs_zero_range() we get this buggy behavior.
> > 
> > Fix this by reintroducing the checks xfs_zero_remaining_bytes() did
> > against i_size at the bottom of xfs_free_file_space().
> > 
> > Reported-by: Aaron Gao <g...@fb.com>
> > Fixes: 3c2bdc912a1cc050 ("xfs: kill xfs_zero_remaining_bytes")
> > Cc: Christoph Hellwig <h...@lst.de>
> > Cc: <sta...@vger.kernel.org> # 4.8+
> > Signed-off-by: Calvin Owens <calvinow...@fb.com>
> > ---
> >  fs/xfs/xfs_bmap_util.c | 11 +++++++++++
> >  1 file changed, 11 insertions(+)
> > 
> > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > index 8b75dce..0796ebc 100644
> > --- a/fs/xfs/xfs_bmap_util.c
> > +++ b/fs/xfs/xfs_bmap_util.c
> > @@ -1309,6 +1309,17 @@ xfs_free_file_space(
> > }
> > 
> > /*
> > +* Avoid doing I/O beyond eof - it's not necessary
> > +* since nothing can read beyond eof.  The space will
> > +* be zeroed when the file is extended anyway.
> > +*/
> 
> I'd suggest to update the comment below with this information and move
> the following bits down below it as well.

Will do.

> > +   if (offset >= XFS_ISIZE(ip))
> > +   return 0;
> > +
> > +   if ((offset + len) >= XFS_ISIZE(ip))
> > +   len = XFS_ISIZE(ip) - offset - 1;
> > +
> 
> This looks like an off-by-one. Do you mean the following?
> 
> 	if (offset + len > XFS_ISIZE(ip))
> 		len = XFS_ISIZE(ip) - offset;

It's not an off-by-one (it's self-consistent), but your way makes more
sense, I'll fix it ;)

Thanks,
Calvin

> Brian
> 
> > +   /*
> >  * Now that we've unmap all full blocks we'll have to zero out any
> >  * partial block at the beginning and/or end.  xfs_zero_range is
> >  * smart enough to skip any holes, including those we just created.
> > --
> > 2.9.3




[PATCH v2] xfs: Honor FALLOC_FL_KEEP_SIZE when punching ends of files

2017-03-19 Thread Calvin Owens
When punching past EOF on XFS, fallocate(mode=PUNCH_HOLE|KEEP_SIZE) will
round the file size up to the nearest multiple of PAGE_SIZE:

  calvinow@vm-disks/generic-xfs-1 ~$ dd if=/dev/urandom of=test bs=2048 count=1
  calvinow@vm-disks/generic-xfs-1 ~$ stat test
    Size: 2048  Blocks: 8  IO Block: 4096   regular file
  calvinow@vm-disks/generic-xfs-1 ~$ fallocate -n -l 2048 -o 2048 -p test
  calvinow@vm-disks/generic-xfs-1 ~$ stat test
    Size: 4096  Blocks: 8  IO Block: 4096   regular file

Commit 3c2bdc912a1cc050 ("xfs: kill xfs_zero_remaining_bytes") replaced
xfs_zero_remaining_bytes() with calls to iomap helpers. The new helpers
don't enforce that [pos,offset) lies strictly on [0,i_size) when being
called from xfs_free_file_space(), so by "leaking" these ranges into
xfs_zero_range() we get this buggy behavior.

Fix this by reintroducing the checks xfs_zero_remaining_bytes() did
against i_size at the bottom of xfs_free_file_space().

Reported-by: Aaron Gao <g...@fb.com>
Fixes: 3c2bdc912a1cc050 ("xfs: kill xfs_zero_remaining_bytes")
Cc: Christoph Hellwig <h...@lst.de>
Cc: <sta...@vger.kernel.org> # 4.8+
Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
 fs/xfs/xfs_bmap_util.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 8b75dce..0796ebc 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1309,6 +1309,17 @@ xfs_free_file_space(
}
 
/*
+* Avoid doing I/O beyond eof - it's not necessary
+* since nothing can read beyond eof.  The space will
+* be zeroed when the file is extended anyway.
+*/
+   if (offset >= XFS_ISIZE(ip))
+   return 0;
+
+   if ((offset + len) >= XFS_ISIZE(ip))
+   len = XFS_ISIZE(ip) - offset - 1;
+
+   /*
 * Now that we've unmap all full blocks we'll have to zero out any
 * partial block at the beginning and/or end.  xfs_zero_range is
 * smart enough to skip any holes, including those we just created.
-- 
2.9.3



[PATCH] xfs: Honor FALLOC_FL_KEEP_SIZE when punching ends of files

2017-03-17 Thread Calvin Owens
Commit 3c2bdc912a1cc050 ("xfs: kill xfs_zero_remaining_bytes") replaced
xfs_zero_remaining_bytes() with calls to iomap helpers.

Unfortunately the new iomap helpers don't enforce that [pos,count) lies
strictly on [0,i_size). This causes fallocate(mode=PUNCH_HOLE|KEEP_SIZE)
calls touching [i_size & ~PAGE_MASK, OFF_T_MAX] to round i_size up to
the nearest multiple of PAGE_SIZE:

  calvinow@vm-disks/generic-xfs-1 ~$ dd if=/dev/urandom of=test bs=2048 count=1
  calvinow@vm-disks/generic-xfs-1 ~$ stat test
Size: 2048Blocks: 8  IO Block: 4096   regular file
  calvinow@vm-disks/generic-xfs-1 ~$ fallocate -n -l 2048 -o 2048 -p test
  calvinow@vm-disks/generic-xfs-1 ~$ stat test
Size: 4096Blocks: 8  IO Block: 4096   regular file

Fix this by reintroducing the checks xfs_zero_remaining_bytes() did
against i_size into xfs_zero_range().

Reported-by: Aaron Gao <g...@fb.com>
Fixes: 3c2bdc912a1cc050 ("xfs: kill xfs_zero_remaining_bytes")
Cc: Christoph Hellwig <h...@lst.de>
Cc: <sta...@vger.kernel.org> # 4.8+
Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
 fs/xfs/xfs_file.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 35703a8..da7cd27 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -58,6 +58,17 @@ xfs_zero_range(
 	xfs_off_t		count,
 	bool			*did_zero)
 {
+	/*
+	 * Avoid doing I/O beyond eof - it's not necessary
+	 * since nothing can read beyond eof.  The space will
+	 * be zeroed when the file is extended anyway.
+	 */
+	if (pos >= XFS_ISIZE(ip))
+		return 0;
+
+	if ((pos + count) >= XFS_ISIZE(ip))
+		count = XFS_ISIZE(ip) - pos - 1;
+
 	return iomap_zero_range(VFS_I(ip), pos, count, NULL, &xfs_iomap_ops);
 }
 
-- 
2.9.3



Re: [PATCH] xfs: Honor FALLOC_FL_KEEP_SIZE when punching ends of files

2017-03-17 Thread Calvin Owens
> Fix this by reintroducing the checks xfs_zero_remaining_bytes() did
> against i_size into xfs_zero_range().

Sorry, this is wrong: I missed that xfs_zero_range() has another caller that
depends on the behavior I'm changing. I'll send a v2 with the same hunk at
the bottom of xfs_free_file_space() instead.

Thanks,
Calvin

Re: [PATCH] fs: Assert on module file_operations without an owner

2016-10-07 Thread Calvin Owens
On Friday 10/07 at 17:18 -0400, Calvin Owens wrote:
> On Friday 10/07 at 21:48 +0100, Al Viro wrote:
> > On Fri, Oct 07, 2016 at 01:35:52PM -0700, Calvin Owens wrote:
> > > Omitting the owner field in file_operations declared in modules is an
> > > easy mistake to make, and can result in crashes when the module is
> > > unloaded while userspace is poking the file.
> > > 
> > > This patch modifies fops_get() to WARN when it encounters a NULL owner,
> > > since in this case it cannot take a reference on the containing module.
> > 
> > NAK.  This is complete crap - we do *NOT* need ->owner on a lot of
> > file_operations.
> 
> This isn't a theoretical issue: I have a proprietary module that makes this
> mistake and crashes when poking a chrdev it exposes in userspace races with
> unloading the module.
> 
> Of course, the bug is in this silly module. I'm not arguing that it isn't. I
> was hesitant to even mention this because I know waving at something in an OOT
> module is a poor argument for changing anything in the proper kernel.
> 
> But what I'm trying to do here is prevent people from making that mistake in
> the future by yelling at them when they do. The implicit ignoring of a NULL
> owner in try_module_get() in fops_get() is not necessarily obvious.

Let's drop this, I should never have sent the patch in the first place.

> > * we do not need that on file_operations of a regular file or
> > directory on a normal filesystem, since that filesystem is not going
> > away until the file has been closed - ->f_path.mnt is holding a reference
> > to vfsmount, which is holding a reference to superblock, which is holding
> > a reference to file_system_type, which is holding a reference to _its_
> > ->owner.
> > * we do not need that on anything on procfs - module removal is
> > legal while a procfs file is opened; its cleanup will be blocked for the
> > duration of ->read(), ->write(), etc. calls.
> 
> I see why this is true, and it's something I considered. But when there is
> zero cost to being explicit and setting ->owner, why not do it?
> 
> > If anything, we would be better off with modifications that would get
> > rid of ->owner on file_operations.  It's not trivial to do, but it might
> > be not impossible.

I'll look into this, I'm interested.

Thanks,
Calvin

> 


Re: [PATCH] fs: Assert on module file_operations without an owner

2016-10-07 Thread Calvin Owens
On Friday 10/07 at 21:48 +0100, Al Viro wrote:
> On Fri, Oct 07, 2016 at 01:35:52PM -0700, Calvin Owens wrote:
> > Omitting the owner field in file_operations declared in modules is an
> > easy mistake to make, and can result in crashes when the module is
> > unloaded while userspace is poking the file.
> > 
> > This patch modifies fops_get() to WARN when it encounters a NULL owner,
> > since in this case it cannot take a reference on the containing module.
> 
> NAK.  This is complete crap - we do *NOT* need ->owner on a lot of
> file_operations.

This isn't a theoretical issue: I have a proprietary module that makes this
mistake and crashes when poking a chrdev it exposes in userspace races with
unloading the module.

Of course, the bug is in this silly module. I'm not arguing that it isn't. I
was hesitant to even mention this because I know waving at something in an OOT
module is a poor argument for changing anything in the proper kernel.

But what I'm trying to do here is prevent people from making that mistake in
the future by yelling at them when they do. The implicit ignoring of a NULL
owner in try_module_get() in fops_get() is not necessarily obvious.

>   * we do not need that on file_operations of a regular file or
> directory on a normal filesystem, since that filesystem is not going
> away until the file has been closed - ->f_path.mnt is holding a reference
> to vfsmount, which is holding a reference to superblock, which is holding
> a reference to file_system_type, which is holding a reference to _its_
> ->owner.
>   * we do not need that on anything on procfs - module removal is
> legal while a procfs file is opened; its cleanup will be blocked for the
> duration of ->read(), ->write(), etc. calls.

I see why this is true, and it's something I considered. But when there is
zero cost to being explicit and setting ->owner, why not do it?

> If anything, we would be better off with modifications that would get
> rid of ->owner on file_operations.  It's not trivial to do, but it might
> be not impossible.




[PATCH net-next] nfnetlink_log: Use GFP_NOWARN for skb allocation

2016-10-07 Thread Calvin Owens
Since the code explicitly falls back to a smaller allocation when the
large one fails, we shouldn't complain when that happens.

Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
 net/netfilter/nfnetlink_log.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/nfnetlink_log.c b/net/netfilter/nfnetlink_log.c
index eb086a1..7435505 100644
--- a/net/netfilter/nfnetlink_log.c
+++ b/net/netfilter/nfnetlink_log.c
@@ -330,7 +330,7 @@ nfulnl_alloc_skb(struct net *net, u32 peer_portid, unsigned int inst_size,
 	 * message.  WARNING: has to be <= 128k due to slab restrictions */
 
 	n = max(inst_size, pkt_size);
-	skb = alloc_skb(n, GFP_ATOMIC);
+	skb = alloc_skb(n, GFP_ATOMIC | __GFP_NOWARN);
 	if (!skb) {
 		if (n > pkt_size) {
 			/* try to allocate only as much as we need for current
-- 
2.9.3
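
The fallback this patch relies on can be sketched in userspace roughly as follows. This is only an illustration of the "try big, then retry small" pattern: `try_alloc()`, `alloc_with_fallback()` and the `limit` parameter are hypothetical, with `limit` simulating an allocator that refuses large requests under memory pressure.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical allocator: refuses anything larger than 'limit' to
 * simulate a large atomic allocation failing under memory pressure.
 */
static void *try_alloc(size_t n, size_t limit)
{
	return n <= limit ? malloc(n) : NULL;
}

/*
 * Model of the pattern in nfulnl_alloc_skb(): first ask for the larger
 * of the two sizes, and if that fails retry with just what the current
 * packet needs.  '*got' reports the size actually obtained (0 on failure).
 */
static void *alloc_with_fallback(size_t inst_size, size_t pkt_size,
				 size_t limit, size_t *got)
{
	size_t n = inst_size > pkt_size ? inst_size : pkt_size;
	void *buf = try_alloc(n, limit);

	if (!buf && n > pkt_size) {
		/* The first failure is expected and recoverable. */
		n = pkt_size;
		buf = try_alloc(n, limit);
	}

	*got = buf ? n : 0;
	return buf;
}
```

Because the first failure is handled, a warning for it would only add noise; that is the argument for passing __GFP_NOWARN on the first attempt.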



[PATCH] fs: Assert on module file_operations without an owner

2016-10-07 Thread Calvin Owens
Omitting the owner field in file_operations declared in modules is an
easy mistake to make, and can result in crashes when the module is
unloaded while userspace is poking the file.

This patch modifies fops_get() to WARN when it encounters a NULL owner,
since in this case it cannot take a reference on the containing module.

Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
 include/linux/fs.h | 13 ++++++++++++-
 kernel/module.c    |  1 +
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 901e25d..fafda9e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2081,10 +2081,21 @@ extern struct dentry *mount_pseudo(struct file_system_type *, char *,
 	unsigned long);
 
 /* Alas, no aliases. Too much hassle with bringing module.h everywhere */
-#define fops_get(fops) \
+#define __fops_get(fops) \
 	(((fops) && try_module_get((fops)->owner) ? (fops) : NULL))
 #define fops_put(fops) \
 	do { if (fops) module_put((fops)->owner); } while(0)
+
+#define unowned_fmt "No fops owner at %p in [%s]\n"
+#define fops_unowned(fops) \
+	(is_module_address((unsigned long)(fops)) && !(fops)->owner)
+#define fops_modname(fops) \
+	__module_address((unsigned long)(fops))->name
+#define fops_warn_unowned(fops) \
+	WARN(fops_unowned(fops), unowned_fmt, (fops), fops_modname(fops))
+#define fops_get(fops) \
+	({ fops_warn_unowned(fops); __fops_get(fops); })
+
 /*
  * This one is to be used *ONLY* from ->open() instances.
  * fops must be non-NULL, pinned down *and* module dependencies
diff --git a/kernel/module.c b/kernel/module.c
index 529efae..4443727 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -4181,6 +4181,7 @@ bool is_module_address(unsigned long addr)
 
 	return ret;
 }
+EXPORT_SYMBOL_GPL(is_module_address);
 
 /*
  * __module_address - get the module which contains an address.
-- 
2.9.3
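
To make the failure mode concrete, here is a small userspace model of why a NULL owner goes unnoticed. The `model_` functions and the simplified `struct module` (a bare refcount plus a `going` flag) are hypothetical stand-ins, not the kernel definitions; the one behavior carried over from the kernel is that try_module_get() on a NULL module reports success without taking any reference.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified, hypothetical stand-ins for the kernel structures. */
struct module {
	int refcnt;
	bool going;	/* models a module that is being unloaded */
};

struct file_operations {
	struct module *owner;
};

/* Like try_module_get(): a NULL module "succeeds" and pins nothing. */
static bool model_try_module_get(struct module *m)
{
	if (!m)
		return true;
	if (m->going)
		return false;
	m->refcnt++;
	return true;
}

/* Same shape as the fops_get() macro in include/linux/fs.h. */
static const struct file_operations *
model_fops_get(const struct file_operations *fops)
{
	return (fops && model_try_module_get(fops->owner)) ? fops : NULL;
}
```

In this model, fops with a NULL owner are handed back exactly like properly owned ones, but no reference is taken on any module. If the fops actually live in a module, nothing prevents that module from being unloaded while they are in use.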



[PATCH v2 net-next] mlx5: Add ndo_poll_controller() implementation

2016-09-28 Thread Calvin Owens
This implements ndo_poll_controller in net_device_ops callbacks for mlx5,
which is necessary to use netconsole with this driver.

Acked-By: Saeed Mahameed <sae...@mellanox.com>
Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
Changes in v2:
* Only iterate channels to avoid redundant napi_schedule() calls

 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index b58cfe3..7eaf380 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3188,6 +3188,20 @@ static int mlx5e_xdp(struct net_device *dev, struct netdev_xdp *xdp)
 	}
 }
 
+#ifdef CONFIG_NET_POLL_CONTROLLER
+/* Fake "interrupt" called by netpoll (eg netconsole) to send skbs without
+ * reenabling interrupts.
+ */
+static void mlx5e_netpoll(struct net_device *dev)
+{
+	struct mlx5e_priv *priv = netdev_priv(dev);
+	int i;
+
+	for (i = 0; i < priv->params.num_channels; i++)
+		napi_schedule(&priv->channel[i]->napi);
+}
+#endif
+
 static const struct net_device_ops mlx5e_netdev_ops_basic = {
 	.ndo_open                = mlx5e_open,
 	.ndo_stop                = mlx5e_close,
@@ -3208,6 +3222,9 @@ static const struct net_device_ops mlx5e_netdev_ops_basic = {
 #endif
 	.ndo_tx_timeout          = mlx5e_tx_timeout,
 	.ndo_xdp                 = mlx5e_xdp,
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	.ndo_poll_controller     = mlx5e_netpoll,
+#endif
 };
 
 static const struct net_device_ops mlx5e_netdev_ops_sriov = {
@@ -3240,6 +3257,9 @@ static const struct net_device_ops mlx5e_netdev_ops_sriov = {
 	.ndo_get_vf_stats        = mlx5e_get_vf_stats,
 	.ndo_tx_timeout          = mlx5e_tx_timeout,
 	.ndo_xdp                 = mlx5e_xdp,
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	.ndo_poll_controller     = mlx5e_netpoll,
+#endif
 };
 
 static int mlx5e_check_required_hca_cap(struct mlx5_core_dev *mdev)
-- 
2.9.3
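
The v2 changelog entry ("only iterate channels") can be illustrated with a toy userspace model. It assumes that every send queue's completion queue shares the NAPI context of its channel, which is what the changelog implies but which is not shown in the patch itself. All names below are hypothetical, and the `i % num_channels` mapping from queue index to channel is a simplification chosen for the sketch.

```c
#include <stddef.h>

#define MAX_CHANNELS 8

/* Toy NAPI context that only counts how often it was scheduled. */
struct napi {
	int scheduled;
};

struct channel {
	struct napi napi;
};

struct priv {
	int num_channels;
	int num_tc;
	struct channel channel[MAX_CHANNELS];
};

static void model_napi_schedule(struct napi *n)
{
	n->scheduled++;
}

/* First posting: one call per send queue (num_channels * num_tc). */
static int poll_per_sq(struct priv *p)
{
	int i, calls = 0;

	for (i = 0; i < p->num_channels * p->num_tc; i++) {
		/* Assumed mapping: queue i belongs to channel i % num_channels. */
		model_napi_schedule(&p->channel[i % p->num_channels].napi);
		calls++;
	}
	return calls;
}

/* v2: one call per channel. */
static int poll_per_channel(struct priv *p)
{
	int i, calls = 0;

	for (i = 0; i < p->num_channels; i++) {
		model_napi_schedule(&p->channel[i].napi);
		calls++;
	}
	return calls;
}
```

With 4 channels and 2 traffic classes, the per-queue loop makes 8 calls and hits each channel's context twice, while the per-channel loop makes 4 calls and hits each context once.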



[PATCH v2] mlx5: Add ndo_poll_controller() implementation

2016-09-27 Thread Calvin Owens
This implements ndo_poll_controller in net_device_ops callback for mlx5,
which is necessary to use netconsole with this driver.

Cc: Saeed Mahameed <sae...@dev.mellanox.co.il>
Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
Changes in v2:
* Only iterate channels to avoid redundant napi_schedule() calls

 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 2459c7f..830b8d0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2786,6 +2786,20 @@ static void mlx5e_tx_timeout(struct net_device *dev)
 		schedule_work(&priv->tx_timeout_work);
 }
 
+#ifdef CONFIG_NET_POLL_CONTROLLER
+/* Fake "interrupt" called by netpoll (eg netconsole) to send skbs without
+ * reenabling interrupts.
+ */
+static void mlx5e_netpoll(struct net_device *dev)
+{
+	struct mlx5e_priv *priv = netdev_priv(dev);
+	int i;
+
+	for (i = 0; i < priv->params.num_channels; i++)
+		napi_schedule(&priv->channel[i]->napi);
+}
+#endif
+
 static const struct net_device_ops mlx5e_netdev_ops_basic = {
 	.ndo_open                = mlx5e_open,
 	.ndo_stop                = mlx5e_close,
@@ -2805,6 +2819,9 @@ static const struct net_device_ops mlx5e_netdev_ops_basic = {
 	.ndo_rx_flow_steer       = mlx5e_rx_flow_steer,
 #endif
 	.ndo_tx_timeout          = mlx5e_tx_timeout,
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	.ndo_poll_controller     = mlx5e_netpoll,
+#endif
 };
 
 static const struct net_device_ops mlx5e_netdev_ops_sriov = {
@@ -2836,6 +2853,9 @@ static const struct net_device_ops mlx5e_netdev_ops_sriov = {
 	.ndo_set_vf_link_state   = mlx5e_set_vf_link_state,
 	.ndo_get_vf_stats        = mlx5e_get_vf_stats,
 	.ndo_tx_timeout          = mlx5e_tx_timeout,
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	.ndo_poll_controller     = mlx5e_netpoll,
+#endif
 };
 
 static int mlx5e_check_required_hca_cap(struct mlx5_core_dev *mdev)
-- 
2.9.3



[PATCH] mlx5: Add ndo_poll_controller() implementation

2016-09-23 Thread Calvin Owens
This implements ndo_poll_controller in net_device_ops for mlx5, which is
necessary to use netconsole with this driver.

Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 2459c7f..439476f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2786,6 +2786,20 @@ static void mlx5e_tx_timeout(struct net_device *dev)
 		schedule_work(&priv->tx_timeout_work);
 }
 
+#ifdef CONFIG_NET_POLL_CONTROLLER
+/* Fake "interrupt" called by netpoll (eg netconsole) to send skbs without
+ * reenabling interrupts.
+ */
+static void mlx5e_netpoll(struct net_device *dev)
+{
+	struct mlx5e_priv *priv = netdev_priv(dev);
+	int i, nr_sq = priv->params.num_channels * priv->params.num_tc;
+
+	for (i = 0; i < nr_sq; i++)
+		napi_schedule(priv->txq_to_sq_map[i]->cq.napi);
+}
+#endif
+
 static const struct net_device_ops mlx5e_netdev_ops_basic = {
 	.ndo_open                = mlx5e_open,
 	.ndo_stop                = mlx5e_close,
@@ -2805,6 +2819,9 @@ static const struct net_device_ops mlx5e_netdev_ops_basic = {
 	.ndo_rx_flow_steer       = mlx5e_rx_flow_steer,
 #endif
 	.ndo_tx_timeout          = mlx5e_tx_timeout,
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	.ndo_poll_controller     = mlx5e_netpoll,
+#endif
 };
 
 static const struct net_device_ops mlx5e_netdev_ops_sriov = {
@@ -2836,6 +2853,9 @@ static const struct net_device_ops mlx5e_netdev_ops_sriov = {
 	.ndo_set_vf_link_state   = mlx5e_set_vf_link_state,
 	.ndo_get_vf_stats        = mlx5e_get_vf_stats,
 	.ndo_tx_timeout          = mlx5e_tx_timeout,
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	.ndo_poll_controller     = mlx5e_netpoll,
+#endif
 };
 
 static int mlx5e_check_required_hca_cap(struct mlx5_core_dev *mdev)
-- 
2.9.3



[PATCH 2/3] mpt3sas: Eliminate dead sleep_flag code

2016-07-28 Thread Calvin Owens
With the exception of a single call to wait_for_doorbell_int(), all
this conditional sleeping code is dead. So delete it.

Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
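
The wait loops touched by this patch size their countdown from the delay per iteration: roughly 1 ms when sleeping (1000 iterations per second of timeout) and 500 us when busy-waiting (2000 iterations per second). A minimal sketch of that arithmetic, assuming hypothetical helper names that do not exist in the driver:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: loop iterations needed to cover timeout_s seconds. */
static uint32_t poll_iterations(uint32_t timeout_s, bool can_sleep)
{
	/* ~1000 us per sleeping iteration, 500 us per busy-wait iteration */
	return can_sleep ? 1000u * timeout_s : 2000u * timeout_s;
}

/* Where every caller passes the "can sleep" value, the flag collapses away. */
static uint32_t poll_iterations_sleeping(uint32_t timeout_s)
{
	return 1000u * timeout_s;
}
```

For a 10 second timeout the sleeping variant runs 10000 iterations, which is what `cntdn = 1000 * timeout` in the diff computes.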
 drivers/scsi/mpt3sas/mpt3sas_base.c  | 241 +--
 drivers/scsi/mpt3sas/mpt3sas_base.h  |   6 +-
 drivers/scsi/mpt3sas/mpt3sas_config.c|   3 +-
 drivers/scsi/mpt3sas/mpt3sas_ctl.c   |  15 +-
 drivers/scsi/mpt3sas/mpt3sas_scsih.c |  21 +--
 drivers/scsi/mpt3sas/mpt3sas_transport.c |  12 +-
 6 files changed, 120 insertions(+), 178 deletions(-)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.c 
b/drivers/scsi/mpt3sas/mpt3sas_base.c
index 751f13e..0956183 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_base.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_base.c
@@ -98,7 +98,7 @@ MODULE_PARM_DESC(mpt3sas_fwfault_debug,
" enable detection of firmware fault and halt firmware - (default=0)");
 
 static int
-_base_get_ioc_facts(struct MPT3SAS_ADAPTER *ioc, int sleep_flag);
+_base_get_ioc_facts(struct MPT3SAS_ADAPTER *ioc);
 
 /**
  * _scsih_set_fwfault_debug - global setting of ioc->fwfault_debug.
@@ -218,8 +218,7 @@ _base_fault_reset_work(struct work_struct *work)
ioc->non_operational_loop = 0;
 
if ((doorbell & MPI2_IOC_STATE_MASK) != MPI2_IOC_STATE_OPERATIONAL) {
-   rc = mpt3sas_base_hard_reset_handler(ioc, CAN_SLEEP,
-   FORCE_BIG_HAMMER);
+   rc = mpt3sas_base_hard_reset_handler(ioc, FORCE_BIG_HAMMER);
pr_warn(MPT3SAS_FMT "%s: hard reset: %s\n", ioc->name,
__func__, (rc == 0) ? "success" : "failed");
doorbell = mpt3sas_base_get_iocstate(ioc, 0);
@@ -2145,7 +2144,7 @@ mpt3sas_base_map_resources(struct MPT3SAS_ADAPTER *ioc)
 
_base_mask_interrupts(ioc);
 
-   r = _base_get_ioc_facts(ioc, CAN_SLEEP);
+   r = _base_get_ioc_facts(ioc);
if (r)
goto out_fail;
 
@@ -3172,12 +3171,11 @@ _base_release_memory_pools(struct MPT3SAS_ADAPTER *ioc)
 /**
  * _base_allocate_memory_pools - allocate start of day memory pools
  * @ioc: per adapter object
- * @sleep_flag: CAN_SLEEP or NO_SLEEP
  *
  * Returns 0 success, anything else error
  */
 static int
-_base_allocate_memory_pools(struct MPT3SAS_ADAPTER *ioc,  int sleep_flag)
+_base_allocate_memory_pools(struct MPT3SAS_ADAPTER *ioc)
 {
struct mpt3sas_facts *facts;
u16 max_sge_elements;
@@ -3647,29 +3645,25 @@ mpt3sas_base_get_iocstate(struct MPT3SAS_ADAPTER *ioc, 
int cooked)
  * _base_wait_on_iocstate - waiting on a particular ioc state
  * @ioc_state: controller state { READY, OPERATIONAL, or RESET }
  * @timeout: timeout in second
- * @sleep_flag: CAN_SLEEP or NO_SLEEP
  *
  * Returns 0 for success, non-zero for failure.
  */
 static int
-_base_wait_on_iocstate(struct MPT3SAS_ADAPTER *ioc, u32 ioc_state, int timeout,
-   int sleep_flag)
+_base_wait_on_iocstate(struct MPT3SAS_ADAPTER *ioc, u32 ioc_state, int timeout)
 {
u32 count, cntdn;
u32 current_state;
 
count = 0;
-   cntdn = (sleep_flag == CAN_SLEEP) ? 1000*timeout : 2000*timeout;
+   cntdn = 1000 * timeout;
do {
current_state = mpt3sas_base_get_iocstate(ioc, 1);
if (current_state == ioc_state)
return 0;
if (count && current_state == MPI2_IOC_STATE_FAULT)
break;
-   if (sleep_flag == CAN_SLEEP)
-   usleep_range(1000, 1500);
-   else
-   udelay(500);
+
+   usleep_range(1000, 1500);
count++;
} while (--cntdn);
 
@@ -3681,24 +3675,22 @@ _base_wait_on_iocstate(struct MPT3SAS_ADAPTER *ioc, u32 
ioc_state, int timeout,
  * a write to the doorbell)
  * @ioc: per adapter object
  * @timeout: timeout in second
- * @sleep_flag: CAN_SLEEP or NO_SLEEP
  *
  * Returns 0 for success, non-zero for failure.
  *
  * Notes: MPI2_HIS_IOC2SYS_DB_STATUS - set to one when IOC writes to doorbell.
  */
 static int
-_base_diag_reset(struct MPT3SAS_ADAPTER *ioc, int sleep_flag);
+_base_diag_reset(struct MPT3SAS_ADAPTER *ioc);
 
 static int
-_base_wait_for_doorbell_int(struct MPT3SAS_ADAPTER *ioc, int timeout,
-   int sleep_flag)
+_base_wait_for_doorbell_int(struct MPT3SAS_ADAPTER *ioc, int timeout)
 {
u32 cntdn, count;
u32 int_status;
 
count = 0;
-   cntdn = (sleep_flag == CAN_SLEEP) ? 1000*timeout : 2000*timeout;
+   cntdn = 1000 * timeout;
do {
int_status = readl(&ioc->chip->HostInterruptStatus);
if (int_status & MPI2_HIS_IOC2SYS_DB_STATUS) {
@@ -3707,10 +3699,35 @@ _base_wait_for_doorbell_int(struct MPT3SAS_ADAPTER 
*ioc, int timeout,
ioc->name, __func__, count, timeout));
return 0;
}
-   

[PATCH 1/3] mpt3sas: Eliminate conditional locking in mpt3sas_scsih_issue_tm()

2016-07-28 Thread Calvin Owens
This flag that conditionally acquires the mutex is confusing and prone
to bugginess: refactor it into two separate function calls, and make
the unlocked one complain if it's called outside the mutex.

Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
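
The refactoring described above splits one function with a "take the lock?" flag into a core routine that requires the lock and a wrapper that takes it. A minimal userspace sketch of that pattern, using pthreads. The names tm_state, issue_tm() and issue_locked_tm() are hypothetical, and the `held` flag is only a crude stand-in for lockdep_assert_held(), which has no userspace equivalent here.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the adapter's task-management state. */
struct tm_state {
	pthread_mutex_t mutex;
	bool held;	/* true while mutex is held; crude lockdep substitute */
	int issued;	/* number of requests issued so far */
};

/* Core routine: the caller must already hold st->mutex. */
static int issue_tm(struct tm_state *st)
{
	assert(st->held);	/* plays the role of lockdep_assert_held() */
	st->issued++;
	return 0;
}

/* Wrapper for callers that do not hold the mutex yet. */
static int issue_locked_tm(struct tm_state *st)
{
	int rc;

	pthread_mutex_lock(&st->mutex);
	st->held = true;
	rc = issue_tm(st);
	st->held = false;
	pthread_mutex_unlock(&st->mutex);

	return rc;
}
```

A caller that already holds the mutex calls issue_tm() directly; everyone else calls issue_locked_tm(). Calling issue_tm() without the lock trips the assertion instead of silently racing.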
 drivers/scsi/mpt3sas/mpt3sas_base.h  | 16 +++--
 drivers/scsi/mpt3sas/mpt3sas_ctl.c   |  5 ++-
 drivers/scsi/mpt3sas/mpt3sas_scsih.c | 66 +---
 3 files changed, 38 insertions(+), 49 deletions(-)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.h 
b/drivers/scsi/mpt3sas/mpt3sas_base.h
index eb7f5b0..f0baafd 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_base.h
+++ b/drivers/scsi/mpt3sas/mpt3sas_base.h
@@ -794,16 +794,6 @@ struct reply_post_struct {
dma_addr_t  reply_post_free_dma;
 };
 
-/**
- * enum mutex_type - task management mutex type
- * @TM_MUTEX_OFF: mutex is not required becuase calling function is acquiring 
it
- * @TM_MUTEX_ON: mutex is required
- */
-enum mutex_type {
-   TM_MUTEX_OFF = 0,
-   TM_MUTEX_ON = 1,
-};
-
 typedef void (*MPT3SAS_FLUSH_RUNNING_CMDS)(struct MPT3SAS_ADAPTER *ioc);
 /**
  * struct MPT3SAS_ADAPTER - per adapter struct
@@ -1291,7 +1281,11 @@ void mpt3sas_scsih_reset_handler(struct MPT3SAS_ADAPTER 
*ioc, int reset_phase);
 
 int mpt3sas_scsih_issue_tm(struct MPT3SAS_ADAPTER *ioc, u16 handle,
uint channel, uint id, uint lun, u8 type, u16 smid_task,
-   ulong timeout, enum mutex_type m_type);
+   ulong timeout);
+int mpt3sas_scsih_issue_locked_tm(struct MPT3SAS_ADAPTER *ioc, u16 handle,
+   uint channel, uint id, uint lun, u8 type, u16 smid_task,
+   ulong timeout);
+
 void mpt3sas_scsih_set_tm_flag(struct MPT3SAS_ADAPTER *ioc, u16 handle);
 void mpt3sas_scsih_clear_tm_flag(struct MPT3SAS_ADAPTER *ioc, u16 handle);
 void mpt3sas_expander_remove(struct MPT3SAS_ADAPTER *ioc, u64 sas_address);
diff --git a/drivers/scsi/mpt3sas/mpt3sas_ctl.c 
b/drivers/scsi/mpt3sas/mpt3sas_ctl.c
index 7d00f09..75ae533 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_ctl.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_ctl.c
@@ -1001,10 +1001,9 @@ _ctl_do_mpt_command(struct MPT3SAS_ADAPTER *ioc, struct 
mpt3_ioctl_command karg,
ioc->name,
le16_to_cpu(mpi_request->FunctionDependent1));
mpt3sas_halt_firmware(ioc);
-   mpt3sas_scsih_issue_tm(ioc,
+   mpt3sas_scsih_issue_locked_tm(ioc,
le16_to_cpu(mpi_request->FunctionDependent1), 0, 0,
-   0, MPI2_SCSITASKMGMT_TASKTYPE_TARGET_RESET, 0, 30,
-   TM_MUTEX_ON);
+   0, MPI2_SCSITASKMGMT_TASKTYPE_TARGET_RESET, 0, 30);
} else
mpt3sas_base_hard_reset_handler(ioc, CAN_SLEEP,
FORCE_BIG_HAMMER);
diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c 
b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
index acabe48..c93a7ba 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
@@ -2201,7 +2201,6 @@ mpt3sas_scsih_clear_tm_flag(struct MPT3SAS_ADAPTER *ioc, 
u16 handle)
  * @type: MPI2_SCSITASKMGMT_TASKTYPE__XXX (defined in mpi2_init.h)
  * @smid_task: smid assigned to the task
  * @timeout: timeout in seconds
- * @m_type: TM_MUTEX_ON or TM_MUTEX_OFF
  * Context: user
  *
  * A generic API for sending task management requests to firmware.
@@ -2212,8 +2211,7 @@ mpt3sas_scsih_clear_tm_flag(struct MPT3SAS_ADAPTER *ioc, 
u16 handle)
  */
 int
 mpt3sas_scsih_issue_tm(struct MPT3SAS_ADAPTER *ioc, u16 handle, uint channel,
-   uint id, uint lun, u8 type, u16 smid_task, ulong timeout,
-   enum mutex_type m_type)
+   uint id, uint lun, u8 type, u16 smid_task, ulong timeout)
 {
Mpi2SCSITaskManagementRequest_t *mpi_request;
Mpi2SCSITaskManagementReply_t *mpi_reply;
@@ -2224,21 +2222,19 @@ mpt3sas_scsih_issue_tm(struct MPT3SAS_ADAPTER *ioc, u16 
handle, uint channel,
int rc;
u16 msix_task = 0;
 
-   if (m_type == TM_MUTEX_ON)
-   mutex_lock(&ioc->tm_cmds.mutex);
+   lockdep_assert_held(&ioc->tm_cmds.mutex);
+
if (ioc->tm_cmds.status != MPT3_CMD_NOT_USED) {
pr_info(MPT3SAS_FMT "%s: tm_cmd busy!!!\n",
__func__, ioc->name);
-   rc = FAILED;
-   goto err_out;
+   return FAILED;
}
 
if (ioc->shost_recovery || ioc->remove_host ||
ioc->pci_error_recovery) {
pr_info(MPT3SAS_FMT "%s: host reset in progress!\n",
__func__, ioc->name);
-   rc = FAILED;
-   goto err_out;
+   return FAILED;
}
 
ioc_state = mpt3sas_base_get_iocstate(ioc, 0);
@@ -2247,8 +2243,7 @@ mpt3sas_scsih_issue_tm(struct MPT3SAS_ADAPTER *ioc, u16 
handle, uint channel,
   

[PATCH 3/3] mpt3sas: Fix warnings exposed by W=1

2016-07-28 Thread Calvin Owens
Trivial non-functional changes for a couple annoying things:

  1) Functions local to files are not declared static, which is
  frustrating when reading the code because it's non-obvious at first
  glance what's actually called from other files.

  2) Set-but-unused variables abound, presumably to mask -Wunused-result
  errors in the past. None of these are flagged today though (with one
  exception noted below), so remove them.

Fixing (2) exposed the fact that we improperly ignore the return value of
scsi_device_reprobe() in _scsih_reprobe_lun(). Fixing the calling code to
deal with the potential error is non-trivial, so for now just WARN().

Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
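
Point (2) and the scsi_device_reprobe() note can be illustrated outside the kernel. The sketch below assumes GCC or Clang, whose warn_unused_result attribute is what the kernel's __must_check expands to. fake_reprobe() and reprobe_lun() are hypothetical stand-ins, and the fprintf() is only a userspace analogue of WARN().

```c
#include <stdio.h>

/* Hypothetical stand-in for scsi_device_reprobe(); nonzero means failure. */
__attribute__((warn_unused_result))
static int fake_reprobe(int should_fail)
{
	return should_fail ? -1 : 0;
}

static int warn_count;	/* lets a caller observe how often we warned */

/* Consume the result instead of parking it in a set-but-unused variable. */
static void reprobe_lun(int should_fail)
{
	int rc = fake_reprobe(should_fail);

	if (rc) {	/* userspace analogue of WARN(rc, ...) */
		warn_count++;
		fprintf(stderr, "reprobe failed: %d\n", rc);
	}
}
```

Assigning the result to a variable that is never read would also silence the compiler, which is the masking pattern the commit message describes; checking it at least makes the failure visible.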
 drivers/scsi/mpt3sas/mpt3sas_base.c  | 18 +++-
 drivers/scsi/mpt3sas/mpt3sas_config.c|  4 +-
 drivers/scsi/mpt3sas/mpt3sas_ctl.c   | 29 ++---
 drivers/scsi/mpt3sas/mpt3sas_scsih.c | 70 +++-
 drivers/scsi/mpt3sas/mpt3sas_transport.c | 16 ++--
 5 files changed, 56 insertions(+), 81 deletions(-)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.c 
b/drivers/scsi/mpt3sas/mpt3sas_base.c
index 0956183..df95d1a 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_base.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_base.c
@@ -2039,7 +2039,7 @@ _base_enable_msix(struct MPT3SAS_ADAPTER *ioc)
  * mpt3sas_base_unmap_resources - free controller resources
  * @ioc: per adapter object
  */
-void
+static void
 mpt3sas_base_unmap_resources(struct MPT3SAS_ADAPTER *ioc)
 {
struct pci_dev *pdev = ioc->pdev;
@@ -3884,7 +3884,6 @@ _base_handshake_req_reply_wait(struct MPT3SAS_ADAPTER 
*ioc, int request_bytes,
MPI2DefaultReply_t *default_reply = (MPI2DefaultReply_t *)reply;
int i;
u8 failed;
-   u16 dummy;
__le32 *mfp;
 
/* make sure doorbell is not in use */
@@ -3964,7 +3963,7 @@ _base_handshake_req_reply_wait(struct MPT3SAS_ADAPTER 
*ioc, int request_bytes,
return -EFAULT;
}
if (i >=  reply_bytes/2) /* overflow case */
-   dummy = readl(&ioc->chip->Doorbell);
+   readl(&ioc->chip->Doorbell);
else
reply[i] = le16_to_cpu(readl(&ioc->chip->Doorbell)
& MPI2_DOORBELL_DATA_MASK);
@@ -4009,7 +4008,6 @@ mpt3sas_base_sas_iounit_control(struct MPT3SAS_ADAPTER 
*ioc,
 {
u16 smid;
u32 ioc_state;
-   unsigned long timeleft;
bool issue_reset = false;
int rc;
void *request;
@@ -4062,7 +4060,7 @@ mpt3sas_base_sas_iounit_control(struct MPT3SAS_ADAPTER 
*ioc,
ioc->ioc_link_reset_in_progress = 1;
init_completion(&ioc->base_cmds.done);
mpt3sas_base_put_smid_default(ioc, smid);
-   timeleft = wait_for_completion_timeout(&ioc->base_cmds.done,
+   wait_for_completion_timeout(&ioc->base_cmds.done,
msecs_to_jiffies(10000));
if ((mpi_request->Operation == MPI2_SAS_OP_PHY_HARD_RESET ||
mpi_request->Operation == MPI2_SAS_OP_PHY_LINK_RESET) &&
@@ -4112,7 +4110,6 @@ mpt3sas_base_scsi_enclosure_processor(struct 
MPT3SAS_ADAPTER *ioc,
 {
u16 smid;
u32 ioc_state;
-   unsigned long timeleft;
bool issue_reset = false;
int rc;
void *request;
@@ -4163,7 +4160,7 @@ mpt3sas_base_scsi_enclosure_processor(struct 
MPT3SAS_ADAPTER *ioc,
memcpy(request, mpi_request, sizeof(Mpi2SepReply_t));
init_completion(&ioc->base_cmds.done);
mpt3sas_base_put_smid_default(ioc, smid);
-   timeleft = wait_for_completion_timeout(&ioc->base_cmds.done,
+   wait_for_completion_timeout(&ioc->base_cmds.done,
msecs_to_jiffies(10000));
if (!(ioc->base_cmds.status & MPT3_CMD_COMPLETE)) {
pr_err(MPT3SAS_FMT "%s: timeout\n",
@@ -4548,7 +4545,6 @@ _base_send_port_enable(struct MPT3SAS_ADAPTER *ioc)
 {
Mpi2PortEnableRequest_t *mpi_request;
Mpi2PortEnableReply_t *mpi_reply;
-   unsigned long timeleft;
int r = 0;
u16 smid;
u16 ioc_status;
@@ -4576,8 +4572,7 @@ _base_send_port_enable(struct MPT3SAS_ADAPTER *ioc)
 
init_completion(&ioc->port_enable_cmds.done);
mpt3sas_base_put_smid_default(ioc, smid);
-   timeleft = wait_for_completion_timeout(&ioc->port_enable_cmds.done,
-   300*HZ);
+   wait_for_completion_timeout(&ioc->port_enable_cmds.done, 300*HZ);
if (!(ioc->port_enable_cmds.status & MPT3_CMD_COMPLETE)) {
pr_err(MPT3SAS_FMT "%s: timeout\n",
ioc->name, __func__);
@@ -4728,7 +4723,6 @@ static int
 _base_event_notification(struct MPT3SAS_ADAPTER *ioc)
 {
Mpi2EventNotificationRequest_t *mpi_request;
-   unsigned long timeleft;
u16 smid;
int r = 0;
int i;
@@ -4760,7 +4754,7 @@ _base_event_notification(struct 

[PATCH] mpt3sas: Ensure the connector_name string is NUL-terminated

2016-07-27 Thread Calvin Owens
We blindly trust the hardware to give us NUL-terminated strings, which
is a bad idea because it doesn't always do that. For example:

  [  481.184784] mpt3sas_cm0:   enclosure level(0x), connector name( 
\x3)

In this case, connector_name is four spaces. We got lucky here because
the 2nd byte beyond our character array happens to be a NUL. Fix this
by explicitly writing '\0' to the end of the string to ensure we don't
run off the edge of the world in printk().

Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
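
The fix amounts to copying a fixed-width field that may not be terminated into a buffer one byte larger, then writing the terminator unconditionally. A minimal userspace sketch, assuming a hypothetical helper name (the patch open-codes this at both call sites):

```c
#include <string.h>

#define CONNECTOR_NAME_LEN 4	/* width of the firmware field, per the patch */

/* Hypothetical helper: copy a fixed-width field that may lack a NUL. */
static void copy_connector_name(char dst[CONNECTOR_NAME_LEN + 1],
				const unsigned char src[CONNECTOR_NAME_LEN])
{
	memcpy(dst, src, CONNECTOR_NAME_LEN);
	dst[CONNECTOR_NAME_LEN] = '\0';	/* always terminate, whatever src holds */
}
```

With a source of four spaces, as in the log line above, the destination holds a four-character string and printing it with %s stops at the fifth byte instead of reading past the array.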
 drivers/scsi/mpt3sas/mpt3sas_base.h  |  2 +-
 drivers/scsi/mpt3sas/mpt3sas_scsih.c | 10 ++
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.h 
b/drivers/scsi/mpt3sas/mpt3sas_base.h
index 892c9be..eb7f5b0 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_base.h
+++ b/drivers/scsi/mpt3sas/mpt3sas_base.h
@@ -478,7 +478,7 @@ struct _sas_device {
u8  pfa_led_on;
u8  pend_sas_rphy_add;
u8  enclosure_level;
-   u8  connector_name[4];
+   u8  connector_name[5];
struct kref refcount;
 };
 
diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c 
b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
index cd91a68..acabe48 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
@@ -5380,8 +5380,9 @@ _scsih_check_device(struct MPT3SAS_ADAPTER *ioc,
 MPI2_SAS_DEVICE0_FLAGS_ENCL_LEVEL_VALID) {
sas_device->enclosure_level =
le16_to_cpu(sas_device_pg0.EnclosureLevel);
-   memcpy(&sas_device->connector_name[0],
-   &sas_device_pg0.ConnectorName[0], 4);
+   memcpy(sas_device->connector_name,
+   sas_device_pg0.ConnectorName, 4);
+   sas_device->connector_name[4] = '\0';
} else {
sas_device->enclosure_level = 0;
sas_device->connector_name[0] = '\0';
@@ -5508,8 +5509,9 @@ _scsih_add_device(struct MPT3SAS_ADAPTER *ioc, u16 
handle, u8 phy_num,
if (sas_device_pg0.Flags & MPI2_SAS_DEVICE0_FLAGS_ENCL_LEVEL_VALID) {
sas_device->enclosure_level =
le16_to_cpu(sas_device_pg0.EnclosureLevel);
-   memcpy(&sas_device->connector_name[0],
-   &sas_device_pg0.ConnectorName[0], 4);
+   memcpy(sas_device->connector_name,
+   sas_device_pg0.ConnectorName, 4);
+   sas_device->connector_name[4] = '\0';
} else {
sas_device->enclosure_level = 0;
sas_device->connector_name[0] = '\0';
-- 
2.8.0.rc2



Re: [PATCH] ses: Fix racy cleanup of /sys in remove_dev()

2016-07-27 Thread Calvin Owens

On 06/15/2016 01:24 PM, Calvin Owens wrote:
> On Thursday 06/02 at 15:50 -0700, Calvin Owens wrote:
> > On 05/13/2016 01:28 PM, Calvin Owens wrote:
> > > Currently we free the resources backing the enclosure device before we
> > > call device_unregister(). This is racy: during rmmod of low-level SCSI
> > > drivers that hook into enclosure, we end up with a small window of time
> > > during which writing to /sys can OOPS. Example trace with mpt3sas:
> >
> > Ping?
>
> Any thoughts? Squinting at this more it still seems racy, but a narrow race
> is surely better than just blatantly freeing everything while the file is
> still exposed in /sys? Is there a better way you'd prefer I accomplish this?
>
> (I have boxes that OOPS all the time from monitoring code reading the /sys
> files, with this patch I haven't seen a single one.)
>
> Thanks,
> Calvin

Ping? Thoughts, comments?

> > >    general protection fault:  [#1] SMP KASAN
> > >    Modules linked in: mpt3sas(-) <...>
> > >    RIP: [] ses_get_page2_descriptor.isra.6+0x38/0x220 [ses]
> > >    Call Trace:
> > > [] ses_set_fault+0xf4/0x400 [ses]
> > > [] set_component_fault+0xa9/0xf0 [enclosure]
> > > [] dev_attr_store+0x3c/0x70
> > > [] sysfs_kf_write+0x115/0x180
> > > [] kernfs_fop_write+0x275/0x3a0
> > > [] __vfs_write+0xe0/0x3e0
> > > [] vfs_write+0x13f/0x4a0
> > > [] SyS_write+0x111/0x230
> > > [] entry_SYSCALL_64_fastpath+0x13/0x94
> > >
> > > Fortunately the solution is extremely simple: call device_unregister()
> > > before we free the resources, and the race no longer exists. The driver
> > > core holds a reference over ->remove_dev(), so AFAICT this is safe.
> > >
> > > Signed-off-by: Calvin Owens <calvinow...@fb.com>
> > > ---
> > >   drivers/scsi/ses.c | 3 ++-
> > >   1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/scsi/ses.c b/drivers/scsi/ses.c
> > > index 53ef1cb..0e8601a 100644
> > > --- a/drivers/scsi/ses.c
> > > +++ b/drivers/scsi/ses.c
> > > @@ -778,6 +778,8 @@ static void ses_intf_remove_enclosure(struct scsi_device *sdev)
> > > if (!edev)
> > > return;
> > >
> > > +   enclosure_unregister(edev);
> > > +
> > > ses_dev = edev->scratch;
> > > edev->scratch = NULL;
> > >
> > > @@ -789,7 +791,6 @@ static void ses_intf_remove_enclosure(struct scsi_device *sdev)
> > > kfree(edev->component[0].scratch);
> > >
> > > put_device(&edev->edev);
> > > -   enclosure_unregister(edev);
> > >   }
> > >
> > >   static void ses_intf_remove(struct device *cdev,
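
The ordering change in the patch above can be sketched in userspace as
follows. This is only an illustration of "unpublish first, free second";
every name here (edev_sketch, published, the sketch_* helpers) is
hypothetical, and the single global pointer merely stands in for what a /sys
write can reach. It is single-threaded and does not model a handler that is
already running when teardown starts, which is the narrower race the thread
still worries about.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for an enclosure device with driver-private data. */
struct edev_sketch {
	int *scratch;	/* private data that the store handler dereferences */
};

/* Hypothetical "registry": the device a /sys write can currently reach. */
static struct edev_sketch *published;

static void sketch_register(struct edev_sketch *e)
{
	published = e;
}

static void sketch_unregister(struct edev_sketch *e)
{
	if (published == e)
		published = NULL;
}

/* Models a /sys store handler: 0 on success, -1 if the device is gone. */
static int sketch_store(int value)
{
	struct edev_sketch *e = published;

	if (!e)
		return -1;
	*e->scratch = value;
	return 0;
}

/* Teardown in the order the patch proposes: unpublish, then free. */
static void sketch_remove(struct edev_sketch *e)
{
	assert(e != NULL);
	sketch_unregister(e);
	free(e->scratch);
	e->scratch = NULL;
}
```

With the two steps of sketch_remove() swapped, sketch_store() could still
find the device through the registry after its scratch data had been freed,
which is the shape of the OOPS in the trace.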







Re: [BUG] Slab corruption during XFS writeback under memory pressure

2016-07-19 Thread Calvin Owens

On 07/18/2016 07:05 PM, Calvin Owens wrote:
> On 07/17/2016 11:02 PM, Dave Chinner wrote:
> > On Sun, Jul 17, 2016 at 10:00:03AM +1000, Dave Chinner wrote:
> > > On Fri, Jul 15, 2016 at 05:18:02PM -0700, Calvin Owens wrote:
> > > > Hello all,
> > > >
> > > > I've found a nasty source of slab corruption. Based on seeing similar symptoms
> > > > on boxes at Facebook, I suspect it's been around since at least 3.10.
> > > >
> > > > It only reproduces under memory pressure so far as I can tell: the issue seems
> > > > to be that XFS reclaims pages from buffers that are still in use by
> > > > scsi/block. I'm not sure which side the bug lies on, but I've only observed it
> > > > with XFS.
> > > >
> > > > []
> > >
> > > But this indicates that the page is under writeback at this point,
> > > so that tends to indicate that the above freeing was incorrect.
> > >
> > > Hmmm - it's clear we've got direct reclaim involved here, and the
> > > suspicion of a dirty page that has had its bufferheads cleared.
> > > Are there any other warnings in the log from XFS prior to kasan
> > > throwing the error?
> >
> > Can you try the patch below?
>
> Thanks for getting this out so quickly :)
>
> So far so good: I booted Linus' tree as of this morning and reproduced the ASAN
> splat. After applying your patch I haven't triggered it.
>
> I'm a bit wary since it was hard to trigger reliably in the first place... so I
> lined up a few dozen boxes to run the test case overnight. I'll confirm in the
> morning (-0700) they look good.

All right, my testcase ran 2099 times overnight without triggering anything.

For the overnight tests, I booted the boxes with "mem=" to artificially limit
RAM, which makes my repro *much* more reliable (I feel silly for not thinking
of that in the first place). With that setup, I hit the ASAN splat 21 times in
98 runs on vanilla 4.7-rc7. So I'm sold.

Tested-by: Calvin Owens <calvinow...@fb.com>

Again, really appreciate the quick response :)

Thanks,
Calvin

> Thanks,
> Calvin
>
> > -Dave.
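
To put the two counts above side by side: if each run on the patched kernel
still failed independently at the vanilla rate of 21 in 98, the chance of
2099 clean runs in a row would be (77/98)^2099. The helper below is a small
sketch of that arithmetic; the function name is hypothetical, and the
independence of runs is an assumption, not something the thread establishes.

```c
#include <assert.h>

/*
 * Probability of `runs` consecutive clean runs when each run fails
 * independently with probability fails/total. Plain repeated
 * multiplication, so no math library is needed.
 */
static double prob_all_clean(unsigned fails, unsigned total, unsigned runs)
{
	double p_clean, p = 1.0;
	unsigned i;

	assert(total > 0 && fails <= total);
	p_clean = (double)(total - fails) / (double)total;
	for (i = 0; i < runs; i++)
		p *= p_clean;
	return p;
}
```

prob_all_clean(21, 98, 2099) comes out on the order of 1e-220, so under that
assumption an unchanged failure rate is not a plausible explanation for the
clean overnight run.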




Re: [BUG] Slab corruption during XFS writeback under memory pressure

2016-07-18 Thread Calvin Owens

On 07/17/2016 11:02 PM, Dave Chinner wrote:
> On Sun, Jul 17, 2016 at 10:00:03AM +1000, Dave Chinner wrote:
> > On Fri, Jul 15, 2016 at 05:18:02PM -0700, Calvin Owens wrote:
> > > Hello all,
> > >
> > > I've found a nasty source of slab corruption. Based on seeing similar symptoms
> > > on boxes at Facebook, I suspect it's been around since at least 3.10.
> > >
> > > It only reproduces under memory pressure so far as I can tell: the issue seems
> > > to be that XFS reclaims pages from buffers that are still in use by
> > > scsi/block. I'm not sure which side the bug lies on, but I've only observed it
> > > with XFS.
> > >
> > > []
> >
> > But this indicates that the page is under writeback at this point,
> > so that tends to indicate that the above freeing was incorrect.
> >
> > Hmmm - it's clear we've got direct reclaim involved here, and the
> > suspicion of a dirty page that has had its bufferheads cleared.
> > Are there any other warnings in the log from XFS prior to kasan
> > throwing the error?
>
> Can you try the patch below?

Thanks for getting this out so quickly :)

So far so good: I booted Linus' tree as of this morning and reproduced the ASAN
splat. After applying your patch I haven't triggered it.

I'm a bit wary since it was hard to trigger reliably in the first place... so I
lined up a few dozen boxes to run the test case overnight. I'll confirm in the
morning (-0700) they look good.

Thanks,
Calvin

> -Dave.
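
The invariant described in the quoted reply (a page under writeback must not
have its buffers freed, because I/O completion will still touch them) can be
sketched in userspace as below. This is not the patch under test, which does
not appear in this part of the archive; page_sketch and the sketch_* helpers
are hypothetical names used only to show the check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a page with attached buffers. */
struct page_sketch {
	bool writeback;	/* I/O in flight; completion will touch buffers */
	int *buffers;
};

/* Returns true if the buffers were freed, false if the page was skipped. */
static bool sketch_release_page(struct page_sketch *pg)
{
	assert(pg != NULL);
	if (pg->writeback)
		return false;	/* completion has not run yet: leave it alone */
	free(pg->buffers);
	pg->buffers = NULL;
	return true;
}

/* Models I/O completion: touches the buffers, then ends writeback. */
static void sketch_end_io(struct page_sketch *pg)
{
	assert(pg != NULL && pg->buffers != NULL);
	pg->buffers[0] = 1;
	pg->writeback = false;
}
```

If sketch_release_page() skipped the writeback check, sketch_end_io() would
write into freed memory, which is the kind of access the KASAN
use-after-free report in xfs_destroy_ioend() flags.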


[BUG] Slab corruption during XFS writeback under memory pressure

2016-07-15 Thread Calvin Owens

Hello all,

I've found a nasty source of slab corruption. Based on seeing similar symptoms
on boxes at Facebook, I suspect it's been around since at least 3.10.

It only reproduces under memory pressure so far as I can tell: the issue seems
to be that XFS reclaims pages from buffers that are still in use by
scsi/block. I'm not sure which side the bug lies on, but I've only observed it
with XFS.

[67203.776421] ==
[67203.792521] BUG: KASAN: use-after-free in xfs_destroy_ioend+0x3bf/0x4c0 at addr 8804cf466288
[67203.812036] Read of size 8 by task python2.7/22913
[67203.822713] =
[67203.840917] BUG buffer_head (Not tainted): kasan: bad access detected
[67203.855253] -
[67203.855253]
[67203.876727] Disabling lock debugging due to kernel taint
[67203.888575] INFO: Allocated in 0x8804cf465d40 age=18437180719206552994 cpu=2191548261 pid=-1
[67203.908139]  alloc_buffer_head+0x22/0xd0
[67203.916903]  ___slab_alloc+0x4e0/0x520
[67203.925286]  __slab_alloc+0x43/0x70
[67203.933087]  kmem_cache_alloc+0x228/0x2c0
[67203.942042]  alloc_buffer_head+0x22/0xd0
[67203.950782]  alloc_page_buffers+0xa9/0x1f0
[67203.959936]  create_empty_buffers+0x30/0x420
[67203.969495]  create_page_buffers+0x120/0x1b0
[67203.979029]  __block_write_begin+0x16b/0x1010
[67203.988756]  xfs_vm_write_begin+0x55/0x1b0
[67203.997884]  generic_perform_write+0x288/0x510
[67204.007771]  xfs_file_buffered_aio_write+0x316/0x780
[67204.018811]  xfs_file_write_iter+0x26f/0x6c0
[67204.028313]  __vfs_write+0x2a0/0x620
[67204.036276]  vfs_write+0x159/0x4c0
[67204.043855]  SyS_write+0xd2/0x1b0
[67204.051245] INFO: Freed in 0x103fc80ec age=18446651500051355200 cpu=2165122683 pid=-1
[67204.068634]  free_buffer_head+0x41/0x90
[67204.077175]  __slab_free+0x1ed/0x340
[67204.085138]  kmem_cache_free+0x270/0x300
[67204.093867]  free_buffer_head+0x41/0x90
[67204.102422]  try_to_free_buffers+0x171/0x240
[67204.111925]  xfs_vm_releasepage+0xcb/0x3b0
[67204.121101]  try_to_release_page+0x106/0x190
[67204.130602]  shrink_page_list+0x118e/0x1a10
[67204.139910]  shrink_inactive_list+0x42c/0xdf0
[67204.149600]  shrink_zone_memcg+0xa09/0xfa0
[67204.158715]  shrink_zone+0x2c3/0xbc0
[67204.166679]  do_try_to_free_pages+0x42a/0x12f0
[67204.176562]  try_to_free_pages+0x1a3/0x5d0
[67204.185709]  __alloc_pages_nodemask+0xbeb/0x20d0
[67204.195979]  alloc_pages_vma+0x11b/0x5e0
[67204.204709]  handle_mm_fault+0x2c27/0x47d0
[67204.213823] INFO: Slab 0xea00133d1900 objects=37 used=14 fp=0x8804cf464530 flags=0x20004080
[67204.235439] INFO: Object 0x8804cf466260 @offset=8800 fp=0x
[67204.235439]
[67204.455817] CPU: 1 PID: 22913 Comm: python2.7 Tainted: GB   4.7.0-rc7-calvinowens-1468357363-1-gcaa3dc6 #1
[67204.480313] Hardware name: Wiwynn   HoneyBadger/PantherPlus, BIOS HBM6.71 02/03/2016
[67204.497509]  88075e99f480 88075ec87a30 81e8b8e4 8804cf464000
[67204.514224]  8804cf466260 88075ec87a60 8153a995 88075e99f480
[67204.530924]  ea00133d1900 8804cf466260 dc00 88075ec87a88
[67204.547624] Call Trace:
[67204.553086]  [] dump_stack+0x68/0x94
[67204.565946]  [] print_trailer+0x115/0x1a0
[67204.578334]  [] object_err+0x34/0x40
[67204.589762]  [] kasan_report_error+0x217/0x530
[67204.616847]  [] __asan_report_load8_noabort+0x43/0x50
[67204.645085]  [] xfs_destroy_ioend+0x3bf/0x4c0
[67204.658243]  [] xfs_end_bio+0x154/0x220
[67204.685362]  [] bio_endio+0x158/0x1b0
[67204.696983]  [] blk_update_request+0x18b/0xb80
[67204.710334]  [] scsi_end_request+0x97/0x5a0
[67204.723108]  [] scsi_io_completion+0x438/0x1690
[67204.807293]  [] scsi_finish_command+0x375/0x4e0
[67204.820838]  [] scsi_softirq_done+0x280/0x340
[67204.848884]  [] blk_done_softirq+0x1ff/0x360
[67204.875074]  [] __do_softirq+0x22d/0x8d7
[67204.887270]  [] irq_exit+0x15c/0x190
[67204.898697]  [] smp_apic_timer_interrupt+0x83/0xa0
[67204.912815]  [] apic_timer_interrupt+0x89/0x90
[67205.029113] ==

Another ASAN trace:

[10856.599645] ==
[10856.614109] BUG: KASAN: use-after-free in xfs_destroy_ioend+0x3b5/0x4c0 at addr 88006be5db90
[10856.631696] Read of size 8 by task kworker/13:1/314
[10856.641464] =
[10856.657836] BUG buffer_head (Tainted: GB  ): kasan: bad access detected
[10856.673158] -
[10856.673158]
[10856.692477] INFO: Allocated in 0x88006be5c378 age=18445973393378446689 cpu=2191548517 pid=-1
[10856.710062]  alloc_buffer_head+0x22/0xd0
[10856.717928]  ___slab_alloc+0x4e0/0x520

Re: slab-out-of-bounds in rpc/nfs

2016-06-17 Thread Calvin Owens
On Friday 06/17 at 09:38 -0400, Benjamin Coddington wrote:
> On 16 Jun 2016, at 13:52, Calvin Owens wrote:
> 
> > On Tuesday 03/08 at 11:37 +0100, Dmitry Vyukov wrote:
> > > On Tue, Mar 8, 2016 at 11:27 AM, Benjamin Coddington
> > > <bcodd...@redhat.com> wrote:
> > > > Adding linux-...@vger.kernel.org ..
> > > > 
> > > > On Mon, 7 Mar 2016, Alexei Starovoitov wrote:
> > > > 
> > > > > seeing a ton of these errors on net-next with kasan on.
> > > > > Likely old bug though.
> > > > > 
> > > > > [  373.705691] BUG: KASAN: slab-out-of-bounds in
> > > > > memcpy+0x28/0x40 at
> > > > > addr 8811ada62cb0
> > > > > [  373.707137] Write of size 28 by task bash/7059
> > > > > [  373.708177] 
> > > > > =
> > > > > [  373.709711] BUG kmalloc-4096 (Tainted: GW  ): kasan:
> > > > > bad access detected
> > > > > [  373.711185] 
> > > > > -
> > > > > [  373.711185]
> > > > > [  373.721461] INFO: Allocated in rpc_malloc+0x58/0xd0
> > > > > age=21 cpu=5 pid=7059
> > > > > [  373.727158] ___slab_alloc+0x4e2/0x500
> > > > > [  373.728469] __slab_alloc+0x43/0x70
> > > > > [  373.729222] __kmalloc+0x286/0x350
> > > > > [  373.729978] rpc_malloc+0x58/0xd0
> > > > > [  373.730590] call_allocate+0x333/0x690
> > > > > [  373.731428] __rpc_execute+0x187/0xad0
> > > > > [  373.734395] rpc_execute+0xe1/0x2c0
> > > > > [  373.735020] rpc_run_task+0x1ce/0x250
> > > > > [  373.735706] rpc_call_sync+0x93/0x150
> > > > > [  373.736387] nfs3_rpc_wrapper.constprop.12+0x9b/0x240
> > > > > [  373.742818] nfs3_proc_readdir+0x230/0x390
> > > > > [  373.750157] nfs_readdir_xdr_to_array+0x501/0x9b0
> > > > > [  373.753520] nfs_readdir_filler+0x68/0x160
> > > > > [  373.758455] do_read_cache_page+0x8c/0x3c0
> > > > > [  373.761745] read_cache_page+0x46/0x70
> > > > > [  373.763269] nfs_readdir+0x420/0x1380
> > > > > [  373.764078] INFO: Freed in rpc_free+0x41/0x70 age=64
> > > > > cpu=5 pid=7059
> > > > > [  373.765335] __slab_free+0x175/0x280
> > > > > [  373.766106] kfree+0x25c/0x2a0
> > > > > [  373.766809] rpc_free+0x41/0x70
> > > > > [  373.767629] xprt_release+0x2c5/0x8f0
> > > > > [  373.768430] rpc_release_resources_task+0x14/0x80
> > > > > [  373.769403] __rpc_execute+0x547/0xad0
> > > > > [  373.770249] rpc_execute+0xe1/0x2c0
> > > > > [  373.770995] rpc_run_task+0x1ce/0x250
> > > > > [  373.771786] rpc_call_sync+0x93/0x150
> > > > > [  373.772672] nfs3_rpc_wrapper.constprop.12+0x9b/0x240
> > > > > [  373.773704] nfs3_proc_access+0x1f1/0x330
> > > > > [  373.774544] nfs_do_access+0x94f/0x12d0
> > > > > [  373.775572] nfs_permission+0x469/0x580
> > > > > [  373.776465] __inode_permission+0x151/0x230
> > > > > [  373.780764] inode_permission+0x21/0xf0
> > > > > [  373.791392] may_open+0x14b/0x260
> > > > > 
> > > 
> > > The report misses the most interesting part -- the out-of-bounds
> > > access stack. It should be at the bottom of the report. If you still
> > > have the full report, please post it.
> > 
> > I'm triggering this as well on 4.7-rc3. I can reproduce it as far back
> > as 4.0, can't easily test any further back because that's when KASAN was
> > merged.
> > 
> > Logs and Kconfig follow. I can trigger this 100% of the time.
> 
> Hi Calvin, how are you triggering this?  I would guess this is getdents or a
> readdir that's been signaled before the server replies..

Unfortunately my current repro is "boot a specific server type at Facebook",
I'll drill down and see if I can get a minimal repro to send along.

Thanks,
Calvin


Re: [PATCH] ses: Fix racy cleanup of /sys in remove_dev()

2016-06-15 Thread Calvin Owens
On Thursday 06/02 at 15:50 -0700, Calvin Owens wrote:
> On 05/13/2016 01:28 PM, Calvin Owens wrote:
> > Currently we free the resources backing the enclosure device before we
> > call device_unregister(). This is racy: during rmmod of low-level SCSI
> > drivers that hook into enclosure, we end up with a small window of time
> > during which writing to /sys can OOPS. Example trace with mpt3sas:
> 
> Ping?

Any thoughts? Squinting at this more it still seems racy, but a narrow race
is surely better than just blatantly freeing everything while the file is
still exposed in /sys? Is there a better way you'd prefer I accomplish this?

(I have boxes that OOPS all the time from monitoring code reading the /sys
files, with this patch I haven't seen a single one.)

Thanks,
Calvin

> >general protection fault:  [#1] SMP KASAN
> >Modules linked in: mpt3sas(-) <...>
> >RIP: [] ses_get_page2_descriptor.isra.6+0x38/0x220 [ses]
> >Call Trace:
> > [] ses_set_fault+0xf4/0x400 [ses]
> > [] set_component_fault+0xa9/0xf0 [enclosure]
> > [] dev_attr_store+0x3c/0x70
> > [] sysfs_kf_write+0x115/0x180
> > [] kernfs_fop_write+0x275/0x3a0
> > [] __vfs_write+0xe0/0x3e0
> > [] vfs_write+0x13f/0x4a0
> > [] SyS_write+0x111/0x230
> > [] entry_SYSCALL_64_fastpath+0x13/0x94
> > 
> > Fortunately the solution is extremely simple: call device_unregister()
> > before we free the resources, and the race no longer exists. The driver
> > core holds a reference over ->remove_dev(), so AFAICT this is safe.
> > 
> > Signed-off-by: Calvin Owens <calvinow...@fb.com>
> > ---
> >   drivers/scsi/ses.c | 3 ++-
> >   1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/scsi/ses.c b/drivers/scsi/ses.c
> > index 53ef1cb..0e8601a 100644
> > --- a/drivers/scsi/ses.c
> > +++ b/drivers/scsi/ses.c
> > @@ -778,6 +778,8 @@ static void ses_intf_remove_enclosure(struct scsi_device *sdev)
> > if (!edev)
> > return;
> > 
> > +   enclosure_unregister(edev);
> > +
> > ses_dev = edev->scratch;
> > edev->scratch = NULL;
> > 
> > @@ -789,7 +791,6 @@ static void ses_intf_remove_enclosure(struct scsi_device *sdev)
> > kfree(edev->component[0].scratch);
> > 
> > put_device(&edev->edev);
> > -   enclosure_unregister(edev);
> >   }
> > 
> >   static void ses_intf_remove(struct device *cdev,
> > 
> 


Re: [PATCH] ses: Fix racy cleanup of /sys in remove_dev()

2016-06-02 Thread Calvin Owens

On 05/13/2016 01:28 PM, Calvin Owens wrote:

Currently we free the resources backing the enclosure device before we
call device_unregister(). This is racy: during rmmod of low-level SCSI
drivers that hook into enclosure, we end up with a small window of time
during which writing to /sys can OOPS. Example trace with mpt3sas:


Ping?


   general protection fault:  [#1] SMP KASAN
   Modules linked in: mpt3sas(-) <...>
   RIP: [] ses_get_page2_descriptor.isra.6+0x38/0x220 [ses]
   Call Trace:
[] ses_set_fault+0xf4/0x400 [ses]
[] set_component_fault+0xa9/0xf0 [enclosure]
[] dev_attr_store+0x3c/0x70
[] sysfs_kf_write+0x115/0x180
[] kernfs_fop_write+0x275/0x3a0
[] __vfs_write+0xe0/0x3e0
[] vfs_write+0x13f/0x4a0
[] SyS_write+0x111/0x230
[] entry_SYSCALL_64_fastpath+0x13/0x94

Fortunately the solution is extremely simple: call device_unregister()
before we free the resources, and the race no longer exists. The driver
core holds a reference over ->remove_dev(), so AFAICT this is safe.

Signed-off-by: Calvin Owens <calvinow...@fb.com>
---
  drivers/scsi/ses.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/ses.c b/drivers/scsi/ses.c
index 53ef1cb..0e8601a 100644
--- a/drivers/scsi/ses.c
+++ b/drivers/scsi/ses.c
@@ -778,6 +778,8 @@ static void ses_intf_remove_enclosure(struct scsi_device *sdev)
if (!edev)
return;

+   enclosure_unregister(edev);
+
ses_dev = edev->scratch;
edev->scratch = NULL;

@@ -789,7 +791,6 @@ static void ses_intf_remove_enclosure(struct scsi_device *sdev)
kfree(edev->component[0].scratch);

put_device(&edev->edev);
-   enclosure_unregister(edev);
  }

  static void ses_intf_remove(struct device *cdev,
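
The ordering argument in the patch can be modelled in user space. What follows is only a sketch, not the kernel code: `struct fake_enclosure`, `fake_store()`, `fake_create()` and `fake_remove()` are hypothetical names invented for illustration, and a plain flag stands in for the sysfs entry points being reachable. It is single-threaded, so it shows the teardown order (unregister first, free afterwards) rather than reproducing the race itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the enclosure device; not the kernel type. */
struct fake_enclosure {
	bool registered;	/* true while the "sysfs" entry point is reachable */
	int *scratch;		/* backing data the entry point dereferences */
};

/* Models a sysfs store handler: it may only run while registered. */
static int fake_store(struct fake_enclosure *e, int value)
{
	if (!e->registered)
		return -1;	/* entry point already torn down */
	*e->scratch = value;	/* would be a use-after-free if scratch went first */
	return 0;
}

/* Allocate the device and its backing data, then make it reachable. */
static struct fake_enclosure *fake_create(void)
{
	struct fake_enclosure *e = calloc(1, sizeof(*e));

	if (!e)
		return NULL;
	e->scratch = calloc(1, sizeof(*e->scratch));
	if (!e->scratch) {
		free(e);
		return NULL;
	}
	e->registered = true;
	return e;
}

/* Teardown in the order the patch adopts: unregister, then free. */
static void fake_remove(struct fake_enclosure *e)
{
	e->registered = false;	/* step 1: close the entry point */
	free(e->scratch);	/* step 2: nothing can reach scratch any more */
	e->scratch = NULL;
}
```

With the two steps in `fake_remove()` swapped, a `fake_store()` call arriving between them would dereference freed memory, which is the window the patch description refers to.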









