Re: [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}

2018-10-19 Thread Peter Zijlstra
On Thu, Oct 18, 2018 at 05:04:34PM +0200, Daniel Borkmann wrote:
> diff --git a/tools/include/linux/ring_buffer.h 
> b/tools/include/linux/ring_buffer.h
> new file mode 100644
> index 000..48200e0
> --- /dev/null
> +++ b/tools/include/linux/ring_buffer.h
> @@ -0,0 +1,69 @@
> +#ifndef _TOOLS_LINUX_RING_BUFFER_H_
> +#define _TOOLS_LINUX_RING_BUFFER_H_
> +
> +#include 
> +#include 
> +
> +/*
> + * Below barriers pair as follows (kernel/events/ring_buffer.c):
> + *
> + * Since the mmap() consumer (userspace) can run on a different CPU:
> + *
> + *   kernel user
> + *
> + *   if (LOAD ->data_tail) {LOAD ->data_head
> + *  (A) smp_rmb()   (C)
> + *  STORE $data LOAD $data
> + *  smp_wmb()   (B) smp_mb()(D)
> + *  STORE ->data_head   STORE ->data_tail
> + *   }
> + *
> + * Where A pairs with D, and B pairs with C.
> + *
> + * In our case A is a control dependency that separates the load
> + * of the ->data_tail and the stores of $data. In case ->data_tail
> + * indicates there is no room in the buffer to store $data we do not.
> + *
> + * D needs to be a full barrier since it separates the data READ
> + * from the tail WRITE.
> + *
> + * For B a WMB is sufficient since it separates two WRITEs, and for
> + * C an RMB is sufficient since it separates two READs.
> + */
> +
> +/*
> + * Note, instead of B, C, D we could also use smp_store_release()
> + * in B and D as well as smp_load_acquire() in C. However, this
> + * optimization makes sense not for all architectures since it
> + * would resolve into READ_ONCE() + smp_mb() pair for smp_load_acquire()
> + * and smp_mb() + WRITE_ONCE() pair for smp_store_release(), thus
> + * for those smp_wmb() in B and smp_rmb() in C would still be less
> + * expensive. For the case of D this has either the same cost or
> + * is less expensive. For example, due to TSO (total store order),
> + * x86 can avoid the CPU barrier entirely.
> + */
> +
> +static inline u64 ring_buffer_read_head(struct perf_event_mmap_page *base)
> +{
> +/*
> + * Architectures where smp_load_acquire() does not fallback to
> + * READ_ONCE() + smp_mb() pair.
> + */
> +#if defined(__x86_64__) || defined(__aarch64__) || defined(__powerpc64__) || 
> \
> +defined(__ia64__) || defined(__sparc__) && defined(__arch64__)
> + return smp_load_acquire(&base->data_head);
> +#else
> + u64 head = READ_ONCE(base->data_head);
> +
> + smp_rmb();
> + return head;
> +#endif
> +}
> +
> +static inline void ring_buffer_write_tail(struct perf_event_mmap_page *base,
> +   u64 tail)
> +{
> + smp_store_release(&base->data_tail, tail);
> +}
> +
> +#endif /* _TOOLS_LINUX_RING_BUFFER_H_ */

(for the whole patch, but in particular the above)

Acked-by: Peter Zijlstra (Intel) 
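
For illustration, a minimal consumer-loop sketch of how the two helpers
above pair up in userspace. The data/data_size parameters stand in for
the mmap'ed data area and its size (assumptions for the example, not
part of the patch), data_size is assumed to be a power of two, and
records that wrap around the ring end are glossed over for brevity:

	static void consume_ring(struct perf_event_mmap_page *base,
				 unsigned char *data, u64 data_size)
	{
		u64 head = ring_buffer_read_head(base);	/* C: rmb/acquire after loading ->data_head */
		u64 tail = base->data_tail;		/* only the consumer writes this, plain load is fine */

		while (tail != head) {
			struct perf_event_header *hdr;

			hdr = (struct perf_event_header *)(data + (tail & (data_size - 1)));
			/* ... process one record; wrap handling omitted ... */
			tail += hdr->size;
		}

		ring_buffer_write_tail(base, tail);	/* D: mb/release before storing ->data_tail */
	}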


Re: [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}

2018-10-19 Thread Peter Zijlstra
On Thu, Oct 18, 2018 at 08:33:09AM -0700, Alexei Starovoitov wrote:
> On Thu, Oct 18, 2018 at 05:04:34PM +0200, Daniel Borkmann wrote:
> >  #endif /* _TOOLS_LINUX_ASM_IA64_BARRIER_H */
> > diff --git a/tools/arch/powerpc/include/asm/barrier.h 
> > b/tools/arch/powerpc/include/asm/barrier.h
> > index a634da0..905a2c6 100644
> > --- a/tools/arch/powerpc/include/asm/barrier.h
> > +++ b/tools/arch/powerpc/include/asm/barrier.h
> > @@ -27,4 +27,20 @@
> >  #define rmb()  __asm__ __volatile__ ("sync" : : : "memory")
> >  #define wmb()  __asm__ __volatile__ ("sync" : : : "memory")
> > 
> > +#if defined(__powerpc64__)
> > +#define smp_lwsync()   __asm__ __volatile__ ("lwsync" : : : "memory")
> > +
> > +#define smp_store_release(p, v)\
> > +do {   \
> > +   smp_lwsync();   \
> > +   WRITE_ONCE(*p, v);  \
> > +} while (0)
> > +
> > +#define smp_load_acquire(p)\
> > +({ \
> > +   typeof(*p) ___p1 = READ_ONCE(*p);   \
> > +   smp_lwsync();   \
> > +   ___p1;  \
> 
> I don't like this proliferation of asm.
> Why do we think that we can do better job than compiler?
> can we please use gcc builtins instead?
> https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
> __atomic_load_n(ptr, __ATOMIC_ACQUIRE);
> __atomic_store_n(ptr, val, __ATOMIC_RELEASE);
> are done specifically for this use case if I'm not mistaken.
> I think it pays to learn what compiler provides.

My problem with using the C11 stuff for this is that we're then limited
to compilers that actually support that. The kernel has a minimum of
gcc-4.6 (and thus perf does too I think) and gcc-4.6 does not have C11.

What Daniel writes is also true; the kernel and C11 memory models don't
align; but you're right in that for this purpose the C11 load-acquire
and store-release would indeed suffice.
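
For reference, the compiler-builtin variant Alexei refers to would look
roughly like this (a sketch only; it requires a compiler new enough to
provide the __atomic builtins, which is exactly the concern above):

	static inline u64 ring_buffer_read_head(struct perf_event_mmap_page *base)
	{
		return __atomic_load_n(&base->data_head, __ATOMIC_ACQUIRE);
	}

	static inline void ring_buffer_write_tail(struct perf_event_mmap_page *base,
						  u64 tail)
	{
		__atomic_store_n(&base->data_tail, tail, __ATOMIC_RELEASE);
	}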


Re: [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}

2018-10-18 Thread Peter Zijlstra
On Thu, Oct 18, 2018 at 01:10:15AM +0200, Daniel Borkmann wrote:

> Wouldn't this then also allow the kernel side to use smp_store_release()
> when it updates the head? We'd be pretty much at the model as described
> in Documentation/core-api/circular-buffers.rst.
> 
> Meaning, rough pseudo-code diff would look as:
> 
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index 5d3cf40..3d96275 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -84,8 +84,9 @@ static void perf_output_put_handle(struct 
> perf_output_handle *handle)
>*
>* See perf_output_begin().
>*/
> - smp_wmb(); /* B, matches C */
> - rb->user_page->data_head = head;
> +
> + /* B, matches C */
> + smp_store_release(&rb->user_page->data_head, head);

Yes, this would be correct.

The reason we didn't do this is because smp_store_release() ends up
being smp_mb() + WRITE_ONCE() for a fair number of platforms, even if
they have a cheaper smp_wmb(). Most notably ARM.

(ARM64 OTOH would like to have smp_store_release() there I imagine;
while x86 doesn't care either way around).

A similar concern exists for the smp_load_acquire() I proposed for the
userspace side, ARM would have to resort to smp_mb() in that situation,
instead of the cheaper smp_rmb().

The smp_store_release() on the userspace side will actually be of equal
cost or cheaper, since it already has an smp_mb(). Most notably, x86 can
avoid barrier entirely, because TSO doesn't allow the LOAD-STORE reorder
(it only allows the STORE-LOAD reorder). And PowerPC can use LWSYNC
instead of SYNC.
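
Spelled out, the generic fallbacks being discussed (the READ_ONCE() +
smp_mb() and smp_mb() + WRITE_ONCE() shape described in the proposed
header's comment, for architectures without cheaper primitives) look
roughly like this (a sketch, not the verbatim tools barrier header):

	#ifndef smp_store_release
	# define smp_store_release(p, v)		\
	do {						\
		smp_mb();				\
		WRITE_ONCE(*p, v);			\
	} while (0)
	#endif

	#ifndef smp_load_acquire
	# define smp_load_acquire(p)			\
	({						\
		typeof(*p) ___p1 = READ_ONCE(*p);	\
		smp_mb();				\
		___p1;					\
	})
	#endif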




Re: [PATCH bpf-next 3/3] bpf, libbpf: use proper barriers in perf ring buffer walk

2018-10-17 Thread Peter Zijlstra
On Wed, Oct 17, 2018 at 04:41:56PM +0200, Daniel Borkmann wrote:
> +static __u64 bpf_perf_read_head(struct perf_event_mmap_page *header)
> +{
> + __u64 data_head = READ_ONCE(header->data_head);
> +
> + smp_rmb();
> + return data_head;
> +}
> +
> +static void bpf_perf_write_tail(struct perf_event_mmap_page *header,
> + __u64 data_tail)
> +{
> + smp_mb();
> + header->data_tail = data_tail;
> +}

Same coments, either smp_load_acquire()/smp_store_release() or at the
very least a WRITE_ONCE() there.


Re: [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}

2018-10-17 Thread Peter Zijlstra
On Wed, Oct 17, 2018 at 04:41:55PM +0200, Daniel Borkmann wrote:
> @@ -73,7 +73,8 @@ static inline u64 perf_mmap__read_head(struct perf_mmap *mm)
>  {
>   struct perf_event_mmap_page *pc = mm->base;
>   u64 head = READ_ONCE(pc->data_head);
> - rmb();
> +
> + smp_rmb();
>   return head;
>  }
>  
> @@ -84,7 +85,7 @@ static inline void perf_mmap__write_tail(struct perf_mmap 
> *md, u64 tail)
>   /*
>* ensure all reads are done before we write the tail out.
>*/
> - mb();
> + smp_mb();
>   pc->data_tail = tail;

Ideally that would be a WRITE_ONCE() to avoid store tearing.

Alternatively, I think we can use smp_store_release() here, all we care
about is that the prior loads stay prior.

Similarly, I suppose, we could use smp_load_acquire() for the data_head
load above.

>  }
>  
> -- 
> 2.9.5
> 


Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-09-21 Thread Peter Zijlstra
On Fri, Sep 21, 2018 at 09:25:00AM -0300, Arnaldo Carvalho de Melo wrote:
> There is another longstanding TODO list entry: PERF_RECORD_MMAP records
> should include a build-id

I thought the problem was that the kernel doesn't have the build-id in
the first place. So it cannot hand them out.



Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-09-21 Thread Peter Zijlstra
On Fri, Sep 21, 2018 at 09:25:00AM -0300, Arnaldo Carvalho de Melo wrote:
> > I consider synthetic perf events to be non-ABI. Meaning they're
> > emitted by perf user space into perf.data and there is a convention
> > on names, but it's not a kernel abi. Like RECORD_MMAP with
> > event.filename == "[module_name]" is an indication for perf report
> > to parse elf/build-id of dso==module_name.
> > There is no such support in the kernel. Kernel doesn't emit
> > such events for module load/unload. If in the future
> > we decide to extend kernel with such events they don't have
> > to match what user space perf does today.
> 
> Right, that is another unfortunate state of affairs, kernel module
> load/unload should already be supported, reported by the kernel via a
> proper PERF_RECORD_MODULE_LOAD/UNLOAD

Just wondering, is anyone actually doing enough module loading for this
to matter? (asks the CONFIG_MODULES=n guy).

I thought that was all a relatively static affair; you boot, you get
loaded a few modules for present hardware, the end.

Anyway, no real objection, just wonder if it's worth it.


Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-09-20 Thread Peter Zijlstra
On Thu, Sep 20, 2018 at 10:25:45AM -0300, Arnaldo Carvalho de Melo wrote:
> PeterZ provided a patch introducing PERF_RECORD_MUNMAP, went nowhere due
> to having to cope with munmapping parts of existing mmaps, etc.
> 
> I'm still more in favour of introduce PERF_RECORD_MUNMAP, even if for
> now it would be used just in this clean case for undoing a
> PERF_RECORD_MMAP for a BPF program.
> 
> The ABI is already complicated, starting to use something called
> PERF_RECORD_MMAP for unmmaping by just using a NULL name... too clever,
> I think.

Agreed, the PERF_RECORD_MUNMAP patch was fairly trivial, the difficult
part was getting the perf tool to dtrt for that use-case. But if we need
unmap events, doing the unmap record now is the right thing.




Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-09-20 Thread Peter Zijlstra
On Thu, Sep 20, 2018 at 10:44:24AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 19, 2018 at 03:39:34PM -0700, Alexei Starovoitov wrote:
> >  void bpf_prog_kallsyms_del(struct bpf_prog *fp)
> >  {
> > +   unsigned long symbol_start, symbol_end;
> > +   /* mmap_record.filename cannot be NULL and has to be u64 aligned */
> > +   char buf[sizeof(u64)] = {};
> > +
> > if (!bpf_prog_kallsyms_candidate(fp))
> > return;
> >  
> > spin_lock_bh(&bpf_lock);
> > bpf_prog_ksym_node_del(fp->aux);
> > spin_unlock_bh(&bpf_lock);
> > +   bpf_get_prog_addr_region(fp, &symbol_start, &symbol_end);
> > +   perf_event_mmap_bpf_prog(symbol_start, symbol_end - symbol_start,
> > +buf, sizeof(buf));
> >  }
> 
> So perf doesn't normally issue unmap events.. We've talked about doing
> that, but so far it's never really been needed I think.
> 
> It feels a bit weird to start issuing unmap events for this.

FWIW:

  
https://lkml.kernel.org/r/20170127130702.gi6...@twins.programming.kicks-ass.net

has talk of PERF_RECORD_MUNMAP


Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-09-20 Thread Peter Zijlstra
On Wed, Sep 19, 2018 at 03:39:34PM -0700, Alexei Starovoitov wrote:
>  void bpf_prog_kallsyms_del(struct bpf_prog *fp)
>  {
> + unsigned long symbol_start, symbol_end;
> + /* mmap_record.filename cannot be NULL and has to be u64 aligned */
> + char buf[sizeof(u64)] = {};
> +
>   if (!bpf_prog_kallsyms_candidate(fp))
>   return;
>  
>   spin_lock_bh(&bpf_lock);
>   bpf_prog_ksym_node_del(fp->aux);
>   spin_unlock_bh(&bpf_lock);
> + bpf_get_prog_addr_region(fp, &symbol_start, &symbol_end);
> + perf_event_mmap_bpf_prog(symbol_start, symbol_end - symbol_start,
> +  buf, sizeof(buf));
>  }

So perf doesn't normally issue unmap events.. We've talked about doing
that, but so far it's never really been needed I think.

It feels a bit weird to start issuing unmap events for this.


Re: [net-next, v6, 6/7] net-sysfs: Add interface for Rx queue(s) map per Tx queue

2018-07-19 Thread Peter Zijlstra
On Wed, Jul 18, 2018 at 11:22:36AM -0700, Andrei Vagin wrote:
> > > [1.085679]   lock(cpu_hotplug_lock.rw_sem);
> > > [1.085753]   lock(cpu_hotplug_lock.rw_sem);
> > > [1.085828] 
> > > [1.085828]  *** DEADLOCK ***

> Peter and Ingo, maybe you could explain why it isn't safe to take one
> reader lock twice?

Very simple, because rwsems are fair and !recursive.

So if another CPU were to issue a write-lock in between these, then the
second would block because there is a writer pending. But because we
then already have a reader we're deadlocked.
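
Spelled out as an interleaving (both read acquisitions are from the same
task):

	CPU0 (reader)			CPU1 (writer)
	-------------			-------------
	down_read(&rwsem)		/* granted */
					down_write(&rwsem)	/* blocks, writer now pending */
	down_read(&rwsem)		/* blocks behind the pending writer */
					/* writer waits for CPU0's first read lock */

	-> neither side can make progress: deadlock.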


Re: [PATCH bpf-next v2 1/7] perf/core: add perf_get_event() to return perf_event given a struct file

2018-05-18 Thread Peter Zijlstra
On Thu, May 17, 2018 at 10:32:53PM -0700, Yonghong Song wrote:
> A new extern function, perf_get_event(), is added to return a perf event
> given a struct file. This function will be used in later patches.

Can't you do a narrower interface? Like return the prog. I'm not too
keen on random !perf code frobbing around inside the event.


Re: [PATCH bpf-next 2/7] bpf: introduce bpf subcommand BPF_PERF_EVENT_QUERY

2018-05-16 Thread Peter Zijlstra
On Tue, May 15, 2018 at 04:45:16PM -0700, Yonghong Song wrote:
> Currently, suppose a userspace application has loaded a bpf program
> and attached it to a tracepoint/kprobe/uprobe, and a bpf
> introspection tool, e.g., bpftool, wants to show which bpf program
> is attached to which tracepoint/kprobe/uprobe. Such attachment
> information will be really useful to understand the overall bpf
> deployment in the system.
> 
> There is a name field (16 bytes) for each program, which could
> be used to encode the attachment point. There are some drawbacks
> for this approaches. First, bpftool user (e.g., an admin) may not
> really understand the association between the name and the
> attachment point. Second, if one program is attached to multiple
> places, encoding a proper name which can imply all these
> attachments becomes difficult.
> 
> This patch introduces a new bpf subcommand BPF_PERF_EVENT_QUERY.
> Given a pid and fd, if the <pid, fd> is associated with a
> tracepoint/kprobe/uprobe perf event, BPF_PERF_EVENT_QUERY will return
>. prog_id
>. tracepoint name, or
>. k[ret]probe funcname + offset or kernel addr, or
>. u[ret]probe filename + offset
> to the userspace.
> The user can use "bpftool prog" to find more information about
> bpf program itself with prog_id.
> 
> Signed-off-by: Yonghong Song 
> ---
>  include/linux/trace_events.h |  15 ++
>  include/uapi/linux/bpf.h |  25 ++
>  kernel/bpf/syscall.c | 113 
> +++
>  kernel/trace/bpf_trace.c |  53 
>  kernel/trace/trace_kprobe.c  |  29 +++
>  kernel/trace/trace_uprobe.c  |  22 +
>  6 files changed, 257 insertions(+)

Why is the command called *_PERF_EVENT_* ? Are there not a lot of !perf
places to attach BPF proglets?


Re: [PATCH bpf v3] x86/cpufeature: bpf hack for clang not supporting asm goto

2018-05-10 Thread Peter Zijlstra
On Thu, May 03, 2018 at 08:31:19PM -0700, Yonghong Song wrote:

> This approach is preferred since the already deployed bcc scripts, or
> any other bpf applicaitons utilizing LLVM JIT compilation functionality,
> will continue work with the new kernel without re-compilation and
> re-deployment.

So I really hate this and would much rather see the BPF build
environment changed. It not consistently having __BPF__ defined really
smells like a bug on your end.

Sometimes you just need to update tools... Is it really too hard to do
-D__BPF__ in the bpf build process that we need to molest the kernel
for it?

> Note that this is a hack in the kernel to workaround bpf compilation issue.
> The hack will be removed once clang starts to support asm goto.

Note that that ^^ already mandates people re-deploy their bpf tools, so
why is llvm supporting asm-goto a better point to re-deploy than fixing
a consistent __BPF__ define for the bpf build environment?

> diff --git a/Makefile b/Makefile
> index 83b6c54..cfd8759 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -504,6 +504,7 @@ export RETPOLINE_CFLAGS
>  ifeq ($(call shell-cached,$(CONFIG_SHELL) $(srctree)/scripts/gcc-goto.sh 
> $(CC) $(KBUILD_CFLAGS)), y)
>CC_HAVE_ASM_GOTO := 1
>KBUILD_CFLAGS += -DCC_HAVE_ASM_GOTO
> +  KBUILD_CFLAGS += -D__NO_CLANG_BPF_HACK
>KBUILD_AFLAGS += -DCC_HAVE_ASM_GOTO
>  endif

I really think this is the wrong thing to do; but if the x86 maintainers
are willing to take this, I'll grudgingly shut up.

Ingo, Thomas?

> diff --git a/arch/x86/include/asm/cpufeature.h 
> b/arch/x86/include/asm/cpufeature.h
> index b27da96..42edd5d 100644
> --- a/arch/x86/include/asm/cpufeature.h
> +++ b/arch/x86/include/asm/cpufeature.h
> @@ -140,6 +140,8 @@ extern void clear_cpu_cap(struct cpuinfo_x86 *c, unsigned 
> int bit);
>  
>  #define setup_force_cpu_bug(bit) setup_force_cpu_cap(bit)
>  
> +/* this macro is a temporary hack for bpf until clang gains asm-goto support 
> */
> +#ifdef __NO_CLANG_BPF_HACK
>  /*
>   * Static testing of CPU features.  Used the same as boot_cpu_has().
>   * These will statically patch the target code for additional
> @@ -195,6 +197,9 @@ static __always_inline __pure bool _static_cpu_has(u16 
> bit)
>   boot_cpu_has(bit) : \
>   _static_cpu_has(bit)\
>  )
> +#else
> +#define static_cpu_has(bit)  boot_cpu_has(bit)
> +#endif
>  
>  #define cpu_has_bug(c, bit)  cpu_has(c, (bit))
>  #define set_cpu_bug(c, bit)  set_cpu_cap(c, (bit))
> -- 
> 2.9.5
> 


Re: [RFC PATCH 2/2] net: mac808211: mac802154: use lockdep_assert_in_softirq() instead own warning

2018-05-04 Thread Peter Zijlstra
On Fri, May 04, 2018 at 09:07:35PM +0200, Sebastian Andrzej Siewior wrote:
> On 2018-05-04 20:51:32 [+0200], Peter Zijlstra wrote:

> > softirqs disabled, ack that is exactly what it checks.
> > 
> > But afaict the assertion you introduced tests that we are _in_ softirq
> > context, which is not the same.
> 
> indeed, now it clicked. Given what I wrote in the cover letter would you
> be in favour of (a proper) lockdep_assert_BH_disabled() or the cheaper
> local_bh_enable() (assuming the network folks don't mind the cheaper
> version)?

Depends a bit on what the code wants I suppose. If the code is in fact
fine with the stronger in-softirq assertion, that'd be best. Otherwise I
don't object to a lockdep_assert_bh_disabled() to accompany the
lockdep_assert_irq_disabled() we already have either.


Re: [RFC PATCH 2/2] net: mac808211: mac802154: use lockdep_assert_in_softirq() instead own warning

2018-05-04 Thread Peter Zijlstra
On Fri, May 04, 2018 at 08:45:39PM +0200, Sebastian Andrzej Siewior wrote:
> On 2018-05-04 20:32:49 [+0200], Peter Zijlstra wrote:
> > On Fri, May 04, 2018 at 07:51:44PM +0200, Sebastian Andrzej Siewior wrote:
> > > From: Anna-Maria Gleixner <anna-ma...@linutronix.de>
> > > 
> > > The warning in ieee802154_rx() and ieee80211_rx_napi() is there to ensure
> > > the softirq context for the subsequent netif_receive_skb() call. 
> > 
> > That's not in fact what it does though; so while that might indeed be
> > the intent that's not what it does.
> 
> It was introduced in commit d20ef63d3246 ("mac80211: document
> ieee80211_rx() context requirement"):
> 
> mac80211: document ieee80211_rx() context requirement
> 
> ieee80211_rx() must be called with softirqs disabled

softirqs disabled, ack that is exactly what it checks.

But afaict the assertion you introduced tests that we are _in_ softirq
context, which is not the same.




Re: [RFC PATCH 2/2] net: mac808211: mac802154: use lockdep_assert_in_softirq() instead own warning

2018-05-04 Thread Peter Zijlstra
On Fri, May 04, 2018 at 07:51:44PM +0200, Sebastian Andrzej Siewior wrote:
> From: Anna-Maria Gleixner 
> 
> The warning in ieee802154_rx() and ieee80211_rx_napi() is there to ensure
> the softirq context for the subsequent netif_receive_skb() call. 

That's not in fact what it does though; so while that might indeed be
the intent that's not what it does.


Re: [PATCH bpf-next 1/2] bpf: enable stackmap with build_id in nmi context

2018-05-02 Thread Peter Zijlstra
On Wed, May 02, 2018 at 04:48:32PM +, Song Liu wrote:
> > It's broken though, I bet you've never actually run this with lockdep
> > enabled for example.
> 
> I am not following here. I just run the new selftest with CONFIG_LOCKDEP on, 
> and got no warning for this. 

Weird, I would be expecting complaints about releasing an unheld lock.

nmi_enter(),nmi_exit() have lockdep_off(),lockdep_on() resp. Which means
that the down_trylock() will not be recorded. The up, which is done from
IRQ context, will not be so suppressed and should hit
print_unlock_imbalance_bug().

> > Also, you set work->sem before you do trylock, if the trylock fails you
> > return early and keep work->sem set, which will thereafter always result
> > in irq_work_busy.
> 
> work->sem was set after down_read_trylock(). I guess you misread the patch?

Argh, yes indeed. Sorry.


Re: [PATCH bpf-next 1/2] bpf: enable stackmap with build_id in nmi context

2018-05-02 Thread Peter Zijlstra
On Tue, May 01, 2018 at 05:02:19PM -0700, Song Liu wrote:
> @@ -267,17 +285,27 @@ static void stack_map_get_build_id_offset(struct 
> bpf_stack_build_id *id_offs,
>  {
>   int i;
>   struct vm_area_struct *vma;
> + bool in_nmi_ctx = in_nmi();
> + bool irq_work_busy = false;
> + struct stack_map_irq_work *work;
> +
> + if (in_nmi_ctx) {
> + work = this_cpu_ptr(&up_read_work);
> + if (work->sem)
> + /* cannot queue more up_read, fallback */
> + irq_work_busy = true;
> + }
>  
>   /*
> +  * We cannot do up_read() in nmi context. To do build_id lookup
> +  * in nmi context, we need to run up_read() in irq_work. We use
> +  * a percpu variable to do the irq_work. If the irq_work is
> +  * already used by another lookup, we fall back to report ips.
>*
>* Same fallback is used for kernel stack (!user) on a stackmap
>* with build_id.
>*/
> + if (!user || !current || !current->mm || irq_work_busy ||
>   down_read_trylock(&current->mm->mmap_sem) == 0) {
>   /* cannot access current->mm, fall back to ips */
>   for (i = 0; i < trace_nr; i++) {
> @@ -299,7 +327,13 @@ static void stack_map_get_build_id_offset(struct 
> bpf_stack_build_id *id_offs,
>   - vma->vm_start;
>   id_offs[i].status = BPF_STACK_BUILD_ID_VALID;
>   }
> +
> + if (!in_nmi_ctx)
> + up_read(&current->mm->mmap_sem);
> + else {
> + work->sem = &current->mm->mmap_sem;
> + irq_work_queue(&work->irq_work);
> + }
>  }

This is disgusting.. :-)

It's broken though, I bet you've never actually run this with lockdep
enabled for example.

Also, you set work->sem before you do trylock, if the trylock fails you
return early and keep work->sem set, which will thereafter always result
in irq_work_busy.




Re: [PATCH] x86/cpufeature: guard asm_volatile_goto usage with CC_HAVE_ASM_GOTO

2018-04-14 Thread Peter Zijlstra
On Fri, Apr 13, 2018 at 01:42:14PM -0700, Alexei Starovoitov wrote:
> On 4/13/18 11:19 AM, Peter Zijlstra wrote:
> > On Tue, Apr 10, 2018 at 02:28:04PM -0700, Alexei Starovoitov wrote:
> > > Instead of
> > > #ifdef CC_HAVE_ASM_GOTO
> > > we can replace it with
> > > #ifndef __BPF__
> > > or some other name,
> > 
> > I would prefer the BPF specific hack; otherwise we might be encouraging
> > people to build the kernel proper without asm-goto.
> > 
> 
> I don't understand this concern.

The thing is; this will be a (temporary) BPF specific hack. Hiding it
behind something that looks 'normal' (CC_HAVE_ASM_GOTO) is just not
right.


Re: linux-next: build failure after merge of the tip tree

2018-04-03 Thread Peter Zijlstra
On Tue, Apr 03, 2018 at 01:39:08PM +0100, David Howells wrote:
> Peter Zijlstra <pet...@infradead.org> wrote:
> 
> > I figured that since there were only a handful of users it wasn't a
> > popular API, also David very much knew of those patches changing it so
> > could easily have pulled in the special tip/sched/wait branch :/
> 
> I'm not sure I could, since I have to base on net-next.  I'm not sure what
> DaveM's policy on that is.
> 
> Also, it might've been better not to simply erase the atomic_t wait API
> immediately, but substitute wrappers for it to be removed one iteration hence.

Yeah, I know, but I really wasn't expecting new users of this thing, it
seemed like quite an exotic API with very limited users.

Ah well...



Re: linux-next: build failure after merge of the tip tree

2018-04-03 Thread Peter Zijlstra
On Tue, Apr 03, 2018 at 03:41:22PM +1000, Stephen Rothwell wrote:

> Caused by commit
> 
>   9b8cce52c4b5 ("sched/wait: Remove the wait_on_atomic_t() API")
> 
> interacting with commits
> 
>   d3be4d244330 ("xrpc: Fix potential call vs socket/net destruction race")
>   31f5f9a1691e ("rxrpc: Fix apparent leak of rxrpc_local objects")
> 
> from the net-next tree.
> 
> Haven't we figured out how to remove/change APIs yet? :-(

I figured that since there were only a handful of users it wasn't a
popular API, also David very much knew of those patches changing it so
could easily have pulled in the special tip/sched/wait branch :/



Re: [PATCH bpf-next v4 1/2] bpf: extend stackmap to save binary_build_id+offset instead of address

2018-03-12 Thread Peter Zijlstra
On Mon, Mar 12, 2018 at 01:39:56PM -0700, Song Liu wrote:
> +static void stack_map_get_build_id_offset(struct bpf_map *map,
> +   struct stack_map_bucket *bucket,
> +   u64 *ips, u32 trace_nr)
> +{
> + int i;
> + struct vm_area_struct *vma;
> + struct bpf_stack_build_id *id_offs;
> +
> + bucket->nr = trace_nr;
> + id_offs = (struct bpf_stack_build_id *)bucket->data;
> +
> + if (!current || !current->mm ||
> + down_read_trylock(&current->mm->mmap_sem) == 0) {

You probably want an in_nmi() before the down_read_trylock(). Doing
up_read() is an absolute no-no from NMI context.

And IIUC it's 'trivial' to use this stuff with hardware counters.

> + /* cannot access current->mm, fall back to ips */
> + for (i = 0; i < trace_nr; i++) {
> + id_offs[i].status = BPF_STACK_BUILD_ID_IP;
> + id_offs[i].ip = ips[i];
> + }
> + return;
> + }
> +
> + for (i = 0; i < trace_nr; i++) {
> + vma = find_vma(current->mm, ips[i]);
> + if (!vma || stack_map_get_build_id(vma, id_offs[i].build_id)) {
> + /* per entry fall back to ips */
> + id_offs[i].status = BPF_STACK_BUILD_ID_IP;
> + id_offs[i].ip = ips[i];
> + continue;
> + }
> + id_offs[i].offset = (vma->vm_pgoff << PAGE_SHIFT) + ips[i]
> + - vma->vm_start;
> + id_offs[i].status = BPF_STACK_BUILD_ID_VALID;
> + }
> + up_read(&current->mm->mmap_sem);
> +}
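
The in_nmi() check suggested above would slot into the quoted hunk
roughly as follows (an illustrative sketch, not the actual follow-up
patch):

	if (!current || !current->mm || in_nmi() ||
	    down_read_trylock(&current->mm->mmap_sem) == 0) {
		/* cannot safely take (or later release) mmap_sem,
		 * fall back to ips */
		for (i = 0; i < trace_nr; i++) {
			id_offs[i].status = BPF_STACK_BUILD_ID_IP;
			id_offs[i].ip = ips[i];
		}
		return;
	}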


Re: [PATCH v2 1/2] bpf: extend stackmap to save binary_build_id+offset instead of address

2018-03-06 Thread Peter Zijlstra
On Tue, Mar 06, 2018 at 10:09:13AM -0800, Song Liu wrote:
> +/* Parse build ID of ELF file mapped to vma */
> +static int stack_map_get_build_id(struct vm_area_struct *vma,
> +   unsigned char *build_id)
> +{
> + Elf32_Ehdr *ehdr;
> + struct page *page;
> + int ret;
> +
> + /*
> +  * vm_start is user memory, so we need to be careful with it.
> +  * We don't want too many copy_from_user to reduce overhead.
> +  * Most ELF file is at least one page long, and the build_id
> +  * is stored in the first page. Therefore, we limit the search of
> +  * build_id to the first page only. This can be made safe with
> +  * get_user_pages_fast(). If the file is smaller than PAGE_SIZE,
> +  * or the build_id is not in the first page, the look up fails.
> +  */
> + if (vma->vm_end - vma->vm_start < PAGE_SIZE ||
> + vma->vm_start & (PAGE_SIZE - 1))/* page aligned */
> + return -EINVAL;
> +
> + if (get_user_pages_fast(vma->vm_start, 1, 0, &page) != 1)
> + return -EFAULT;
> +
> + ret = -EINVAL;
> + ehdr = (Elf32_Ehdr *)vma->vm_start;
> +

You're still directly accessing a user pointer afaict. This will
insta-fault on SMAP enabled hardware due to the lack of STAC (or PAN on
ARM).

You _really_ cannot just access user pointers without the proper APIs.

Did you maybe mean to use:

ehdr = (Elf32_Ehdr *)page_address(page);

?

> + /* compare magic x7f "ELF" */
> + if (memcmp(ehdr->e_ident, ELFMAG, SELFMAG) != 0)
> + goto out;
> +
> + /* only support executable file and shared object file */
> + if (ehdr->e_type != ET_EXEC && ehdr->e_type != ET_DYN)
> + goto out;
> +
> + if (ehdr->e_ident[EI_CLASS] == 1)
> + ret = stack_map_get_build_id_32(vma, build_id);
> + else if (ehdr->e_ident[EI_CLASS] == 2)
> + ret = stack_map_get_build_id_64(vma, build_id);
> +out:
> + put_page(page);
> + return ret;
> +}


Re: [PATCH bpf-next 0/5] bpf, tracing: introduce bpf raw tracepoints

2018-03-06 Thread Peter Zijlstra
On Mon, Mar 05, 2018 at 02:36:07PM +0100, Daniel Borkmann wrote:
> On 03/01/2018 05:19 AM, Alexei Starovoitov wrote:
> > This patch set is a different way to address the pressing need to access
> > task_struct pointers in sched tracepoints from bpf programs.
> > 
> > The first approach simply added these pointers to sched tracepoints:
> > https://lkml.org/lkml/2017/12/14/753
> > which Peter nacked.
> > Few options were discussed and eventually the discussion converged on
> > doing bpf specific tracepoint_probe_register() probe functions.
> > Details here:
> > https://lkml.org/lkml/2017/12/20/929
> 
> Ping, Peter/Steven. If you have a chance, please review the series.

This series doesn't really touch anything I maintain, but the general
approach seems sane to me. I like the first patch that ensures
structures are passed by reference.

The rest is all tracepoint/bpf glue and I never really got into the bpf
internals, so I don't think I've got anything useful to say there.


Re: [PATCH bpf-next 1/2] bpf: extend stackmap to save binary_build_id+offset instead of address

2018-03-05 Thread Peter Zijlstra
On Mon, Feb 26, 2018 at 09:49:22AM -0800, Song Liu wrote:

> +/* Parse build ID of ELF file mapped to vma */
> +static int stack_map_get_build_id(struct vm_area_struct *vma,
> +   unsigned char *build_id)
> +{
> + Elf32_Ehdr *ehdr = (Elf32_Ehdr *)vma->vm_start;

How does this work? Isn't vma->vm_start a _user_ address?

> +
> + /* we need at least 1 file header and 1 program header */
> + if (vma->vm_end - vma->vm_start <
> + sizeof(Elf32_Ehdr) + sizeof(Elf32_Phdr))
> + return -EINVAL;
> +
> + /* compare magic x7f "ELF" */
> + if (memcmp(ehdr->e_ident, ELFMAG, SELFMAG) != 0)
> + return -EINVAL;

So how come you can just dereference it here, without
get_user()/copy_from_user() etc.. ?

> +
> + /* only support executable file and shared object file */
> + if (ehdr->e_type != ET_EXEC && ehdr->e_type != ET_DYN)
> + return -EINVAL;
> +
> + if (ehdr->e_ident[EI_CLASS] == 1)
> + return stack_map_get_build_id_32(vma, build_id);
> + else if (ehdr->e_ident[EI_CLASS] == 2)
> + return stack_map_get_build_id_64(vma, build_id);
> + return -EINVAL;
> +}
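
For contrast, accessing the header through the proper user-copy API
would look roughly like this (an illustrative sketch only; the v2
posting reviewed above in this archive went the get_user_pages_fast()
route instead):

	Elf32_Ehdr ehdr;

	if (copy_from_user(&ehdr, (void __user *)vma->vm_start, sizeof(ehdr)))
		return -EFAULT;

	/* compare magic x7f "ELF" */
	if (memcmp(ehdr.e_ident, ELFMAG, SELFMAG) != 0)
		return -EINVAL;

	/* only support executable file and shared object file */
	if (ehdr.e_type != ET_EXEC && ehdr.e_type != ET_DYN)
		return -EINVAL;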


Re: Serious performance degradation in Linux 4.15

2018-02-16 Thread Peter Zijlstra
On Fri, Feb 16, 2018 at 02:38:39PM +, Matt Fleming wrote:
> On Wed, 14 Feb, at 10:46:20PM, Matt Fleming wrote:

> > Here's some more numbers. This is with RETPOLINE=y but you'll see it
> > doesn't make much of a difference. Oh, this is also with powersave
> > cpufreq governor.
> 
> Feh, I was wrong. The differences in performance I see are entirely
> due to CONFIG_RETPOLINE and CONFIG_PAGE_TABLE_ISOLATION being enabled.

OK, so I'm not the only crazy person not being able to reproduce this :-)

Let's wait for more details from the original submitter.


Re: Serious performance degradation in Linux 4.15

2018-02-16 Thread Peter Zijlstra
On Wed, Feb 14, 2018 at 10:46:20PM +, Matt Fleming wrote:
>  3. ./run-mmtests.sh 
> --config=configs/config-global-dhp__network-netperf-unbound `uname -r`

Not a success.. firstly it attempts to install packages without asking
and then horribly fails at it..


root@ivb-ep:/usr/local/src/mmtests# ./run-mmtests.sh 
--config=configs/config-global-dhp__network-netperf-unbound `uname -r`
Reading package lists... Done
Building dependency tree
Reading state information... Done
Use 'apt autoremove' to remove them.
The following NEW packages will be installed:
  binutils-dev
0 upgraded, 1 newly installed, 0 to remove and 1 not upgraded.
Need to get 2,339 kB of archives.
After this operation, 21.1 MB of additional disk space will be used.
Get:1 http://ftp.nl.debian.org/debian testing/main amd64 binutils-dev amd64 
2.30-4 [2,339 kB]
Fetched 2,339 kB in 1s (3,451 kB/s)
Selecting previously unselected package binutils-dev.
(Reading database ... 126177 files and directories currently installed.)
Preparing to unpack .../binutils-dev_2.30-4_amd64.deb ...
Unpacking binutils-dev (2.30-4) ...
Setting up binutils-dev (2.30-4) ...
W: --force-yes is deprecated, use one of the options starting with --allow 
instead.
Installed binutils-dev
Reading package lists... Done
Building dependency tree
Reading state information... Done
W: --force-yes is deprecated, use one of the options starting with --allow 
instead.
E: Unable to locate package oprofile
Failed to install package oprofile for distro debian at 
/usr/local/src/mmtests/bin/install-depends line 121.
Reading package lists... Done
Building dependency tree
Reading state information... Done
W: --force-yes is deprecated, use one of the options starting with --allow 
instead.
E: Unable to locate package perl-Time-HiRes
Failed to install package perl-Time-HiRes for distro debian at 
/usr/local/src/mmtests/bin/install-depends line 121.
Reading package lists... Done
Building dependency tree
Reading state information... Done
W: --force-yes is deprecated, use one of the options starting with --allow 
instead.
E: Unable to locate package hwloc-lstopo
Failed to install package hwloc-lstopo for distro debian at 
/usr/local/src/mmtests/bin/install-depends line 121.
Reading package lists... Done
Building dependency tree
Reading state information... Done
W: --force-yes is deprecated, use one of the options starting with --allow 
instead.
E: Unable to locate package cpupower
Failed to install package cpupower for distro debian at 
/usr/local/src/mmtests/bin/install-depends line 121.
Using default SHELLPACK_DATA
grep: /usr/local/src/mmtests/shellpacks/shellpack-bench-netperf-udp: No such 
file or directory
Starting test netperf-udp
/usr/local/src/mmtests/bin/unbuffer: 8: exec: tclsh: not found
ls: cannot access '/proc/16070/fd/0': No such file or directory
ls: cannot access '/proc/16070/fd/0': No such file or directory

after which I killed it dead...


This is one dodgy script which I'll not touch again. Please provide a
small bash script that wraps the right netperf magic and I'll try and
run that.


Re: Serious performance degradation in Linux 4.15

2018-02-16 Thread Peter Zijlstra
On Wed, Feb 14, 2018 at 10:46:20PM +, Matt Fleming wrote:
> Peter, if you want to run this test yourself you can do:
> 
>  1. git clone https://github.com/gorman/mmmtests.git

root@ivb-ep:/usr/local/src# git clone https://github.com/gorman/mmmtests.git
Cloning into 'mmmtests'...
Username for 'https://github.com':


I'm thinking you meant this:

  https://github.com/gormanm/mmtests.git

right? (also, I still hate the github webthing and you made me look at
it :-)


Re: Serious performance degradation in Linux 4.15

2018-02-15 Thread Peter Zijlstra
On Wed, Feb 14, 2018 at 10:46:20PM +, Matt Fleming wrote:
> Here's some more numbers. This is with RETPOLINE=y but you'll see it
> doesn't make much of a difference. Oh, this is also with powersave
> cpufreq governor.

Hurmph, I'll go have a look when I can boot tip/master again :/

But didn't you bench those patches before we merged them? I can't
remember you reporting this..


Re: Serious performance degradation in Linux 4.15

2018-02-12 Thread Peter Zijlstra
On Fri, Feb 09, 2018 at 05:59:12PM +, Jon Maloy wrote:
> Command for TCP:
> "netperf TCP_STREAM  (netperf -n 4 -f m -c 4 -C 4 -P 1 -H 10.0.0.1 -t 
> TCP_STREAM -l 10 -- -O THROUGHPUT)"
> Command for TIPC:
> "netperf TIPC_STREAM (netperf -n 4 -f m -c 4 -C 4 -P 1 -H 10.0.0.1 -t 
> TCP_STREAM -l 10 -- -O THROUGHPUT)"

That looks like identical tests to me. And my netperf (debian testing)
doesn't appear to have -t TIPC_STREAM.

Please try a coherent report and I'll have another look. Don't (again)
forget to mention what kind of setup you're running this on.


On my IVB-EP (2 sockets, 10 cores, 2 threads), performance cpufreq,
PTI=n RETPOLINE=n, I get:


CPUS=`grep -c ^processor /proc/cpuinfo`

for test in TCP_STREAM
do
for i in 1 $((CPUS/4)) $((CPUS/2)) $((CPUS)) $((CPUS*2))
do
echo -n $test-$i ": "

(
  for ((j=0; j
+   stats->nr_running = READ_ONCE(sds->nr_running);
+   stats->load   = READ_ONCE(sds->load);
+   stats->capacity   = READ_ONCE(sds->capacity);
+   stats->has_capacity = stats->nr_running < per_cpu(sd_llc_size, cpu);
+
+   return true;
+}
+
+static int
+wake_affine_old(struct sched_domain *sd, struct task_struct *p,
+   int this_cpu, int prev_cpu, int sync)
+{
+   struct llc_stats prev_stats, this_stats;
+   s64 this_eff_load, prev_eff_load;
+   unsigned long task_load;
+
+   if (!get_llc_stats(&prev_stats, prev_cpu) ||
+   !get_llc_stats(&this_stats, this_cpu))
+   return nr_cpumask_bits;
+
+   if (sync) {
+   unsigned long current_load = task_h_load(current);
+   if (current_load > this_stats.load)
+   return this_cpu;
+
+   this_stats.load -= current_load;
+   }
+
+   if (prev_stats.has_capacity && prev_stats.nr_running < 
this_stats.nr_running+1)
+   return nr_cpumask_bits;
+
+   if (this_stats.has_capacity && this_stats.nr_running+1 < 
prev_stats.nr_running)
+   return this_cpu;
+
+   task_load = task_h_load(p);
+
+   this_eff_load = 100;
+   this_eff_load *= prev_stats.capacity;
+
+   prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
+   prev_eff_load *= this_stats.capacity;
+
+   this_eff_load *= this_stats.load + task_load;
+   prev_eff_load *= prev_stats.load - task_load;
+
+   return this_eff_load <= prev_eff_load ? this_cpu : nr_cpumask_bits;
+}
+
 /*
  * The purpose of wake_affine() is to quickly determine on which CPU we can run
  * soonest. For the purpose of speed we only consider the waking and previous
@@ -5756,6 +5818,9 @@ static int wake_affine(struct sched_domain *sd, struct 
task_struct *p,
int this_cpu = smp_processor_id();
int target = nr_cpumask_bits;
 
+   if (sched_feat(WA_OLD))
+   target = wake_affine_old(sd, p, this_cpu, prev_cpu, sync);
+
if (sched_feat(WA_IDLE))
target = wake_affine_idle(this_cpu, prev_cpu, sync);
 
@@ -6209,18 +6274,20 @@ static int select_idle_sibling(struct task_struct *p, 
int prev, int 

Re: Serious performance degradation in Linux 4.15

2018-02-10 Thread Peter Zijlstra
On Fri, Feb 09, 2018 at 05:59:12PM +, Jon Maloy wrote:
> The two commits 
> d153b153446f7 (" sched/core: Fix wake_affine() performance regression") and
> f2cdd9cc6c97 ("sched/core: Address more wake_affine() regressions")
> are causing a serious performance degradation in Linux 4.5.
> 
> The effect is worst on TIPC, but even TCP is affected, as the figures below 
> show. 

I did run a whole bunch of netperf and didn't see anything like that.
0-day also didn't report anything, and it too runs netperf.

I'll try and see if I can reproduce somewhere next week.


Re: [4.15-rc9] fs_reclaim lockdep trace

2018-01-29 Thread Peter Zijlstra
On Mon, Jan 29, 2018 at 08:47:20PM +0900, Tetsuo Handa wrote:
> Peter Zijlstra wrote:
> > On Sun, Jan 28, 2018 at 02:55:28PM +0900, Tetsuo Handa wrote:
> > > This warning seems to be caused by commit d92a8cfcb37ecd13
> > > ("locking/lockdep: Rework FS_RECLAIM annotation") which moved the
> > > location of
> > > 
> > >   /* this guy won't enter reclaim */
> > >   if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
> > >   return false;
> > > 
> > > check added by commit cf40bd16fdad42c0 ("lockdep: annotate reclaim context
> > > (__GFP_NOFS)").
> > 
> > I'm not entirely sure I get what you mean here. How did I move it? It was
> > part of lockdep_trace_alloc(), if __GFP_NOMEMALLOC was set, it would not
> > mark the lock as held.
> 
> d92a8cfcb37ecd13 replaced lockdep_set_current_reclaim_state() with
> fs_reclaim_acquire(), and removed current->lockdep_recursion handling.
> 
> --
> # git show d92a8cfcb37ecd13 | grep recursion
> -# define INIT_LOCKDEP  .lockdep_recursion = 0, 
> .lockdep_reclaim_gfp = 0,
> +# define INIT_LOCKDEP  .lockdep_recursion = 0,
> unsigned intlockdep_recursion;
> -   if (unlikely(current->lockdep_recursion))
> -   current->lockdep_recursion = 1;
> -   current->lockdep_recursion = 0;
> -* context checking code. This tests GFP_FS recursion (a lock taken
> --

That should not matter at all. The only case that would matter for is if
lockdep itself would ever call into lockdep again. Not something that
happens here.

> > The new code has it in fs_reclaim_acquire/release to the same effect, if
> > __GFP_NOMEMALLOC, we'll not acquire/release the lock.
> 
> Excuse me, but I can't catch.
> We currently acquire/release __fs_reclaim_map if __GFP_NOMEMALLOC.

Right, got the case inverted, same difference though. Before we'd do
mark_held_lock(), now we do acquire/release under the same conditions.

> > > Since __kmalloc_reserve() from __alloc_skb() adds
> > > __GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask, __need_fs_reclaim() is
> > > failing to return false despite PF_MEMALLOC context (and resulted in
> > > lockdep warning).
> > 
> > But that's correct right, __GFP_NOMEMALLOC should negate PF_MEMALLOC.
> > That's what the name says.
> 
> __GFP_NOMEMALLOC negates PF_MEMALLOC regarding what watermark that allocation
> request should use.

Right.

> But at the same time, PF_MEMALLOC negates __GFP_DIRECT_RECLAIM.

Ah indeed.

> Then, how can fs_reclaim contribute to deadlock?

Not sure it can. But if we're going to allow this, it needs to come with
a clear description on why. Not a few clues to a puzzle.

Now, even if its not strictly a deadlock, there is something to be said
for flagging GFP_FS allocs that lead to nested GFP_FS allocs, do we ever
want to allow that?



Re: [4.15-rc9] fs_reclaim lockdep trace

2018-01-29 Thread Peter Zijlstra
On Sun, Jan 28, 2018 at 02:55:28PM +0900, Tetsuo Handa wrote:
> This warning seems to be caused by commit d92a8cfcb37ecd13
> ("locking/lockdep: Rework FS_RECLAIM annotation") which moved the
> location of
> 
>   /* this guy won't enter reclaim */
>   if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
>   return false;
> 
> check added by commit cf40bd16fdad42c0 ("lockdep: annotate reclaim context
> (__GFP_NOFS)").

I'm not entirely sure I get what you mean here. How did I move it? It was
part of lockdep_trace_alloc(), if __GFP_NOMEMALLOC was set, it would not
mark the lock as held.

The new code has it in fs_reclaim_acquire/release to the same effect, if
__GFP_NOMEMALLOC, we'll not acquire/release the lock.


> Since __kmalloc_reserve() from __alloc_skb() adds
> __GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask, __need_fs_reclaim() is
> failing to return false despite PF_MEMALLOC context (and resulted in
> lockdep warning).

But that's correct right, __GFP_NOMEMALLOC should negate PF_MEMALLOC.
That's what the name says.

> Since there was no PF_MEMALLOC safeguard as of cf40bd16fdad42c0, checking
> __GFP_NOMEMALLOC might make sense. But since this safeguard was added by
> commit 341ce06f69abfafa ("page allocator: calculate the alloc_flags for
> allocation only once"), checking __GFP_NOMEMALLOC no longer makes sense.
> Thus, let's remove __GFP_NOMEMALLOC check and allow __need_fs_reclaim() to
> return false.

This does not in fact explain what's going on, it just points to
'random' patches.

Are you talking about this:

+   /* Avoid recursion of direct reclaim */
+   if (p->flags & PF_MEMALLOC)
+   goto nopage;

bit?

> Reported-by: Dave Jones <da...@codemonkey.org.uk>
> Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Cc: Nick Piggin <npig...@gmail.com>
> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 76c9688..7804b0e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3583,7 +3583,7 @@ static bool __need_fs_reclaim(gfp_t gfp_mask)
>   return false;
>  
>   /* this guy won't enter reclaim */
> - if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
> + if (current->flags & PF_MEMALLOC)
>   return false;

I'm _really_ uncomfortable doing that. Esp. without a solid explanation
of how this really can't possibly lead to trouble. Which the above semi
incoherent rambling is not.

Your backtrace shows the btrfs shrinker doing an allocation, that's the
exact kind of thing we need to be extremely careful with.


Re: net: r8169: a question of memory barrier in the r8169 driver

2018-01-19 Thread Peter Zijlstra
On Fri, Jan 19, 2018 at 02:11:18AM +0100, Francois Romieu wrote:
> Peter Zijlstra <pet...@infradead.org> :
> [...]
> > There is only 1 variable afaict. Memory barriers need at least 2 in
> > order to be able to do _anything_.
> 
> I don't get your point: why don't {cur_tx, dirty_tx} qualify as said
> two variables ?

There wasn't any cur_tx in the code you provided.


Re: net: r8169: a question of memory barrier in the r8169 driver

2018-01-18 Thread Peter Zijlstra
On Thu, Jan 18, 2018 at 10:06:17PM +0800, Jia-Ju Bai wrote:
> In the rt8169 driver, the function "rtl_tx" uses "smp_mb" to sync the
> writing operation with rtl8169_start_xmit:
> if (tp->dirty_tx != dirty_tx) {
> tp->dirty_tx = dirty_tx;
> smp_mb();
> ...
> }
> The function rtl8169_start_xmit reads tp->dirty_tx in TX_FRAGS_READY_FOR:
> if (unlikely(!TX_FRAGS_READY_FOR(tp, skb_shinfo(skb)->nr_frags))) {
> netif_err(tp, drv, dev, "BUG! Tx Ring full when queue awake!\n");
> goto err_stop_0;
> }
> But there is no memory barrier around this code.
> 
> Is there a possible data race here?
> If not, how this data race is avoided?

There is only 1 variable afaict. Memory barriers need at least 2 in
order to be able to do _anything_.


Re: dvb usb issues since kernel 4.9

2018-01-08 Thread Peter Zijlstra
On Mon, Jan 08, 2018 at 10:31:09PM +0100, Jesper Dangaard Brouer wrote:
> I did expected the issue to get worse, when you load the Pi with
> network traffic, as now the softirq time-budget have to be shared
> between networking and USB/DVB. Thus, I guess you are running TCP and
> USB/mpeg2ts on the same CPU (why when you have 4 CPUs?...)

Isn't networking also over USB on the Pi ?


Re: [PATCH 00/18] prevent bounds-check bypass via speculative execution

2018-01-08 Thread Peter Zijlstra
On Mon, Jan 08, 2018 at 11:43:42AM +, Alan Cox wrote:
> On Mon, 8 Jan 2018 11:08:36 +0100
> Peter Zijlstra <pet...@infradead.org> wrote:
> 
> > On Fri, Jan 05, 2018 at 10:30:16PM -0800, Dan Williams wrote:
> > > On Fri, Jan 5, 2018 at 6:22 PM, Eric W. Biederman <ebied...@xmission.com> 
> > > wrote:  
> > > > In at least one place (mpls) you are patching a fast path.  Compile out
> > > > or don't load mpls by all means.  But it is not acceptable to change the
> > > > fast path without even considering performance.  
> > > 
> > > Performance matters greatly, but I need help to identify a workload
> > > that is representative for this fast path to see what, if any, impact
> > > is incurred. Even better is a review that says "nope, 'index' is not
> > > subject to arbitrary userspace control at this point, drop the patch."  
> > 
> > I think we're focussing a little too much on pure userspace. That is, we
> > should be saying under the attackers control. Inbound network packets
> > could equally be under the attackers control.
> 
> Inbound network packets don't come with a facility to read back and do
> cache timimg. 

But could they not be used in conjunction with a local task to prime the
stuff?


Re: [PATCH 00/18] prevent bounds-check bypass via speculative execution

2018-01-08 Thread Peter Zijlstra
On Fri, Jan 05, 2018 at 10:30:16PM -0800, Dan Williams wrote:
> On Fri, Jan 5, 2018 at 6:22 PM, Eric W. Biederman  
> wrote:
> > In at least one place (mpls) you are patching a fast path.  Compile out
> > or don't load mpls by all means.  But it is not acceptable to change the
> > fast path without even considering performance.
> 
> Performance matters greatly, but I need help to identify a workload
> that is representative for this fast path to see what, if any, impact
> is incurred. Even better is a review that says "nope, 'index' is not
> subject to arbitrary userspace control at this point, drop the patch."

I think we're focussing a little too much on pure userspace. That is, we
should be saying under the attackers control. Inbound network packets
could equally be under the attackers control.

Sure, userspace is the most direct and highest bandwidth one, but I
think we should treat all (kernel) external values with the same
paranoia.


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-08 Thread Peter Zijlstra
On Sun, Jan 07, 2018 at 06:57:35PM -0800, Alexei Starovoitov wrote:
> On Sun, Jan 07, 2018 at 01:59:35PM +, Alan Cox wrote:

> > lfence timing is also heavily dependent upon what work has to be done to
> > retire previous live instructions. 
> > BPF does not normally do a lot of writing so you'd expect the cost to be 
> > low.
> 
> right. to retire previous loads. Not sure what 'not a lot of writing'
> has to do with lfence.

LFENCE will wait for completion on _ALL_ prior instructions, not just
loads.

Stores are by far the most expensive instructions to wait for, as they
only complete once their value is globally visible (on x86).


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-08 Thread Peter Zijlstra
On Sun, Jan 07, 2018 at 06:24:11PM -0800, Alexei Starovoitov wrote:
> How about:
> CONFIG_SPECTRE1_WORKAROUND_INDEX_MASK
> CONFIG_SPECTRE1_WORKAROUND_LOAD_FENCE

INSTRUCTION_FENCE if anything. LFENCE for Intel (and now also for AMD as
per 0592b0bce169) is a misnomer, IFENCE would be a better name for it.


Re: [PATCH v5 3/6] perf: implement pmu perf_kprobe

2017-12-20 Thread Peter Zijlstra
On Wed, Dec 20, 2017 at 06:10:11PM +, Song Liu wrote:
> I think there is one more thing to change:

OK, folded that too; it should all be at:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/core

Can you verify it all looks/works right?


Re: [PATCH v5 3/6] perf: implement pmu perf_kprobe

2017-12-20 Thread Peter Zijlstra
On Wed, Dec 20, 2017 at 11:03:01AM +0100, Peter Zijlstra wrote:
> On Wed, Dec 06, 2017 at 02:45:15PM -0800, Song Liu wrote:
> > @@ -8537,7 +8620,7 @@ static int perf_event_set_filter(struct perf_event 
> > *event, void __user *arg)
> > char *filter_str;
> > int ret = -EINVAL;
> >  
> > -   if ((event->attr.type != PERF_TYPE_TRACEPOINT ||
> > +   if ((!perf_event_is_tracing(event) ||
> >  !IS_ENABLED(CONFIG_EVENT_TRACING)) &&
> > !has_addr_filter(event))
> > return -EINVAL;
> 
> You actually missed an instance later in this same function... fixing
> that.


@@ -8518,23 +8601,19 @@ perf_event_set_addr_filter(struct perf_e
 
 static int perf_event_set_filter(struct perf_event *event, void __user *arg)
 {
-   char *filter_str;
int ret = -EINVAL;
-
-   if ((event->attr.type != PERF_TYPE_TRACEPOINT ||
-   !IS_ENABLED(CONFIG_EVENT_TRACING)) &&
-   !has_addr_filter(event))
-   return -EINVAL;
+   char *filter_str;
 
filter_str = strndup_user(arg, PAGE_SIZE);
if (IS_ERR(filter_str))
return PTR_ERR(filter_str);
 
-   if (IS_ENABLED(CONFIG_EVENT_TRACING) &&
-   event->attr.type == PERF_TYPE_TRACEPOINT)
-   ret = ftrace_profile_set_filter(event, event->attr.config,
-   filter_str);
-   else if (has_addr_filter(event))
+#ifdef CONFIG_EVENT_TRACING
+   if (perf_event_is_tracing(event))
+   ret = ftrace_profile_set_filter(event, event->attr.config, 
filter_str);
+   else
+#endif
+   if (has_addr_filter(event))
ret = perf_event_set_addr_filter(event, filter_str);
 
kfree(filter_str);



Is that right?


Re: [PATCH v5 3/6] perf: implement pmu perf_kprobe

2017-12-20 Thread Peter Zijlstra
On Wed, Dec 06, 2017 at 02:45:15PM -0800, Song Liu wrote:
> @@ -8537,7 +8620,7 @@ static int perf_event_set_filter(struct perf_event 
> *event, void __user *arg)
>   char *filter_str;
>   int ret = -EINVAL;
>  
> - if ((event->attr.type != PERF_TYPE_TRACEPOINT ||
> + if ((!perf_event_is_tracing(event) ||
>!IS_ENABLED(CONFIG_EVENT_TRACING)) &&
>   !has_addr_filter(event))
>   return -EINVAL;

You actually missed an instance later in this same function... fixing
that.


Re: [PATCH v5 0/6] enable creating [k,u]probe with perf_event_open

2017-12-19 Thread Peter Zijlstra


Took 1-4, Thanks!


Re: [PATCH net-next v4 1/2] bpf/tracing: allow user space to query prog array on the same tp

2017-12-12 Thread Peter Zijlstra
On Mon, Dec 11, 2017 at 11:39:02AM -0800, Yonghong Song wrote:
> The usage:
>   struct perf_event_query_bpf *query = malloc(...);
>   query.ids_len = ids_len;
>   err = ioctl(pmu_efd, PERF_EVENT_IOC_QUERY_BPF, &query);

You didn't spot the fixes to your changelog ;-) The above should read
something like:

struct perf_event_query_bpf *query =
malloc(sizeof(*query) + sizeof(u32) * ids_len);
query->ids_len = ids_len;
err = ioctl(pmu_efd, PERF_EVENT_IOC_QUERY_BPF, query);


Re: [PATCH net-next v3 1/2] bpf/tracing: allow user space to query prog array on the same tp

2017-12-11 Thread Peter Zijlstra
On Wed, Dec 06, 2017 at 02:07:45PM -0800, Yonghong Song wrote:
> Commit e87c6bc3852b ("bpf: permit multiple bpf attachments
> for a single perf event") added support to attach multiple
> bpf programs to a single perf event.
> Although this provides flexibility, users may want to know
> what other bpf programs attached to the same tp interface.
> Besides getting visibility for the underlying bpf system,
> such information may also help consolidate multiple bpf programs,
> understand potential performance issues due to a large array,
> and debug (e.g., one bpf program which overwrites return code
> may impact subsequent program results).
> 
> Commit 2541517c32be ("tracing, perf: Implement BPF programs
> attached to kprobes") utilized the existing perf ioctl
> interface and added the command PERF_EVENT_IOC_SET_BPF
> to attach a bpf program to a tracepoint. This patch adds a new
> ioctl command, given a perf event fd, to query the bpf program
> array attached to the same perf tracepoint event.
> 
> The new uapi ioctl command:
>   PERF_EVENT_IOC_QUERY_BPF
> 
> The new uapi/linux/perf_event.h structure:
>   struct perf_event_query_bpf {
>__u32  ids_len;
>__u32  prog_cnt;
>__u32  ids[0];
>   };
> 
> User space provides buffer "ids" for kernel to copy to.
> When returning from the kernel, the number of available
> programs in the array is set in "prog_cnt".
> 
> The usage:
struct perf_event_query_bpf *query =
malloc(sizeof(*query) + sizeof(u32) * ids_len);
query->ids_len = ids_len;
err = ioctl(pmu_efd, PERF_EVENT_IOC_QUERY_BPF, query);
>   if (err == 0) {
> /* query.prog_cnt is the number of available progs,
>  * number of progs in ids: (ids_len == 0) ? 0 : query.prog_cnt
>  */
>   } else if (errno == ENOSPC) {
> /* query.ids_len number of progs copied,
>  * query.prog_cnt is the number of available progs
>  */
>   } else {
>   /* other errors */
>   }
> 
> Signed-off-by: Yonghong Song <y...@fb.com>

Yes this looks much better, thanks!

Acked-by: Peter Zijlstra (Intel) <pet...@infradead.org>


Re: [PATCH net-next v2 1/2] bpf/tracing: allow user space to query prog array on the same tp

2017-12-06 Thread Peter Zijlstra
On Wed, Dec 06, 2017 at 12:56:36PM +0100, Peter Zijlstra wrote:
> On Tue, Dec 05, 2017 at 10:31:28PM -0800, Yonghong Song wrote:
> > Commit e87c6bc3852b ("bpf: permit multiple bpf attachments
> > for a single perf event") added support to attach multiple
> > bpf programs to a single perf event.
> > Commit 2541517c32be ("tracing, perf: Implement BPF programs
> > attached to kprobes") utilized the existing perf ioctl
> > interface and added the command PERF_EVENT_IOC_SET_BPF
> > to attach a bpf program to a tracepoint.
> > 
> > This patch adds a new ioctl
> > command, given a perf event fd, to query the bpf program array
> > attached to the same perf tracepoint event.
> > 
> > The new uapi ioctl command:
> >   PERF_EVENT_IOC_QUERY_BPF
> > 
> > The new uapi/linux/perf_event.h structure:
> >   struct perf_event_query_bpf {
> >__u64prog_ids;
> >__u32prog_cnt;
> >   };
> > 
> > The usage:
> >   struct perf_event_query_bpf query;
> >   query.prog_ids = (__u64)usr_prog_ids_buf;
> >   query.prog_cnt = usr_prog_ids_buf_len;
> >   err = ioctl(pmu_efd, PERF_EVENT_IOC_QUERY_BPF, &query);
> > 
> > Signed-off-by: Yonghong Song <y...@fb.com>
> > Acked-by: Alexei Starovoitov <a...@kernel.org>
> 
> Can you please fix that example to make it clear that prog_ids is in
> fact a pointer to an array of size prog_cnt. Ideally also describing
> what the type of array is.
> 
> In fact, would not something like:
> 
>   struct perf_event_query_bpf {
>   __u32 len;
>   __u32 __reserved;

I suppose we could use this field to store the number of entries
returned, retaining the len to indicate how large the structure is.

>   __u64 ids[0];
>   };
> 
> be a much clearer interface?
> 
> Also, you forgot to tell us why we need this interface at all.
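(Illustration only, not from the thread: a user-space sketch of the suggested
layout, with the __reserved field repurposed to return the count as floated
above; all names below are assumptions, and the usual stdio/stdlib/ioctl
headers are implied.)

	struct perf_event_query_bpf {
		__u32 len;	/* in: number of slots in ids[] */
		__u32 prog_cnt;	/* out: number of programs attached */
		__u64 ids[0];	/* out: program ids */
	};

	struct perf_event_query_bpf *q;
	__u32 i;

	q = calloc(1, sizeof(*q) + 64 * sizeof(__u64));
	q->len = 64;
	if (ioctl(pmu_efd, PERF_EVENT_IOC_QUERY_BPF, q) == 0) {
		for (i = 0; i < q->prog_cnt && i < q->len; i++)
			printf("prog id: %llu\n", (unsigned long long)q->ids[i]);
	}
	free(q);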


Re: [PATCH net-next v2 1/2] bpf/tracing: allow user space to query prog array on the same tp

2017-12-06 Thread Peter Zijlstra
On Tue, Dec 05, 2017 at 10:31:28PM -0800, Yonghong Song wrote:
> Commit e87c6bc3852b ("bpf: permit multiple bpf attachments
> for a single perf event") added support to attach multiple
> bpf programs to a single perf event.
> Commit 2541517c32be ("tracing, perf: Implement BPF programs
> attached to kprobes") utilized the existing perf ioctl
> interface and added the command PERF_EVENT_IOC_SET_BPF
> to attach a bpf program to a tracepoint.
> 
> This patch adds a new ioctl
> command, given a perf event fd, to query the bpf program array
> attached to the same perf tracepoint event.
> 
> The new uapi ioctl command:
>   PERF_EVENT_IOC_QUERY_BPF
> 
> The new uapi/linux/perf_event.h structure:
>   struct perf_event_query_bpf {
>   __u64   prog_ids;
>   __u32   prog_cnt;
>   };
> 
> The usage:
>   struct perf_event_query_bpf query;
>   query.prog_ids = (__u64)usr_prog_ids_buf;
>   query.prog_cnt = usr_prog_ids_buf_len;
> >   err = ioctl(pmu_efd, PERF_EVENT_IOC_QUERY_BPF, &query);
> 
> Signed-off-by: Yonghong Song 
> Acked-by: Alexei Starovoitov 

Can you please fix that example to make it clear that prog_ids is in
fact a pointer to an array of size prog_cnt. Ideally also describing
what the type of array is.

In fact, would not something like:

struct perf_event_query_bpf {
__u32 len;
__u32 __reserved;
__u64 ids[0];
};

be a much clearer interface?

Also, you forgot to tell us why we need this interface at all.


Re: [PATCH v4 1/6] perf: prepare perf_event.h for new types perf_kprobe and perf_uprobe

2017-12-06 Thread Peter Zijlstra
On Mon, Dec 04, 2017 at 05:27:24PM -0800, Song Liu wrote:
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 362493a..0f39b31 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -291,6 +291,16 @@ enum perf_event_read_format {
>   PERF_FORMAT_MAX = 1U << 4,  /* non-ABI */
>  };
>  
> +/*
> + * Flags in config, used by dynamic PMU kprobe and uprobe
> + *
> + * PERF_PROBE_CONFIG_IS_RETPROBE if set, create kretprobe/uretprobe
> + *   if not set, create kprobe/uprobe
> + */
> +enum perf_probe_config {
> + PERF_PROBE_CONFIG_IS_RETPROBE   = 1U << 0,  /* [k,u]retprobe */
> +};

This should not be in uapi; pmu's can describe their config format in
sysfs.

PMU_FORMAT_ATTR(retprobe, "config:0");

static struct attribute *kprobe_attrs[] = {
	&format_attr_retprobe.attr,
	NULL,
};

static struct attribute_group kprobe_format_group = {
	.name = "format",
	.attrs = kprobe_attrs,
};

static const struct attribute_group *kprobe_attr_groups[] = {
	&kprobe_format_group,
	NULL,
};

static struct pmu perf_kprobe = {
	...

	.attr_groups = kprobe_attr_groups,
};

Other than that, this series looks good to me. Thanks!
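(Illustration only, not from the thread: with the format group above, user
space discovers the retprobe bit from sysfs instead of from a uapi enum; the
sketch assumes the PMU is registered under the name "kprobe".)

	/*
	 * $ cat /sys/bus/event_source/devices/kprobe/type
	 *   -> value to put in perf_event_attr.type
	 * $ cat /sys/bus/event_source/devices/kprobe/format/retprobe
	 *   config:0
	 *   -> bit 0 of perf_event_attr.config requests a kretprobe
	 */
	attr.type = kprobe_pmu_type;	/* read from .../kprobe/type */
	attr.config |= 1ULL << 0;	/* retprobe, per the format spec */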





Re: [PATCH tip/core/rcu 21/21] drivers/vhost: Remove now-redundant read_barrier_depends()

2017-12-05 Thread Peter Zijlstra
On Tue, Dec 05, 2017 at 01:36:44PM -0800, Paul E. McKenney wrote:
> What we do in some code is to comment the pairings, allowing the other
> side of the pairing to be easily located.  Would that work for you?

I would say that that is mandatory for any memory ordering code ;-)


Re: [PATCH tip/core/rcu 21/21] drivers/vhost: Remove now-redundant read_barrier_depends()

2017-12-05 Thread Peter Zijlstra
On Tue, Dec 05, 2017 at 11:24:49PM +0200, Michael S. Tsirkin wrote:
> READ_ONCE is really all over the place (some code literally replaced all
> memory accesses with READ/WRITE ONCE).

Yeah, so? Complain to the compiler people for forcing us into that.

> Would an API like WRITE_POINTER()/smp_store_pointer make sense,
> and READ_POINTER for symmetry?

No, the whole point of the exercise was to get away from the fact that
dependent loads are special.


Re: [PATCH tip/core/rcu 21/21] drivers/vhost: Remove now-redundant read_barrier_depends()

2017-12-05 Thread Peter Zijlstra
On Tue, Dec 05, 2017 at 10:28:38PM +0200, Michael S. Tsirkin wrote:
> On Tue, Dec 05, 2017 at 08:57:52PM +0100, Peter Zijlstra wrote:
> > On Tue, Dec 05, 2017 at 09:51:48PM +0200, Michael S. Tsirkin wrote:
> > > > > WRITE_ONCE(obj->val, 1);
> > > > > smp_wmb();
> > > > > WRITE_ONCE(*foo, obj);
> > > > 
> > > > I believe Peter was instead suggesting:
> > > > 
> > > > WRITE_ONCE(obj->val, 1);
> > > > smp_store_release(foo, obj);
> > > 
> > > Isn't that more expensive though?
> > 
> > Depends on the architecture. The only architecture where it is more
> > expensive and people actually still care about is ARM I think.
> 
> Right. Why should I use the more expensive smp_store_release then?

Because it makes more sense. Memory ordering is hard enough, don't make
it harder still if you don't have to.


Re: [PATCH tip/core/rcu 21/21] drivers/vhost: Remove now-redundant read_barrier_depends()

2017-12-05 Thread Peter Zijlstra
On Tue, Dec 05, 2017 at 09:51:48PM +0200, Michael S. Tsirkin wrote:
> > > WRITE_ONCE(obj->val, 1);
> > > smp_wmb();
> > > WRITE_ONCE(*foo, obj);
> > 
> > I believe Peter was instead suggesting:
> > 
> > WRITE_ONCE(obj->val, 1);
> > smp_store_release(foo, obj);
> 
> Isn't that more expensive though?

Depends on the architecture. The only architecture where it is more
expensive and people actually still care about is ARM I think.


Re: [PATCH tip/core/rcu 21/21] drivers/vhost: Remove now-redundant read_barrier_depends()

2017-12-05 Thread Peter Zijlstra
On Tue, Dec 05, 2017 at 09:24:21PM +0200, Michael S. Tsirkin wrote:
> On Tue, Dec 05, 2017 at 08:17:33PM +0100, Peter Zijlstra wrote:
> > On Tue, Dec 05, 2017 at 08:57:46PM +0200, Michael S. Tsirkin wrote:
> > 
> > > I don't see WRITE_ONCE inserting any barriers, release or
> > > write.
> > 
> > Correct, never claimed there was.
> > 
> > Just saying that:
> > 
> > obj = READ_ONCE(*foo);
> > val = READ_ONCE(obj->val);
> > 
> > Never needs a barrier (except on Alpha and we want to make that go
> > away). Simply because a CPU needs to complete the load of @obj before it
> > can compute the address &obj->val. Thus the second load _must_ come
> > after the first load and we get LOAD-LOAD ordering.
> > 
> > Alpha messing that up is a royal pain, and Alpha not being an
> > active/living architecture is just not worth the pain of keeping this in
> > the generic model.
> > 
> 
> Right. What I am saying is that for writes you need
> 
> WRITE_ONCE(obj->val, 1);
> smp_wmb();
> WRITE_ONCE(*foo, obj);

You really should use smp_store_release() here instead of the smp_wmb().
But yes.
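(Illustration only, not from the thread: a minimal sketch of the pairing
being discussed, with made-up names; the publisher uses a release store, the
consumer relies only on the address dependency.)

	struct obj {
		int val;
	};

	struct obj *published;			/* shared pointer */

	void publish(struct obj *obj)
	{
		WRITE_ONCE(obj->val, 1);
		/* release: orders the ->val store before the pointer store */
		smp_store_release(&published, obj);
	}

	int consume(void)
	{
		struct obj *obj = READ_ONCE(published);

		if (!obj)
			return -1;
		/* address dependency: this load cannot pass the pointer load */
		return READ_ONCE(obj->val);
	}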

> and this barrier is no longer paired with anything until
> you realize there's a dependency barrier within READ_ONCE.

No, there isn't. read_dependency barriers are no more. They don't exist.
They never did, except on Alpha anyway.

There were a ton of sites that relied on this but never had the
smp_read_barrier_depends() annotation, similarly there were quite a few
sites that had the barrier but nobody could explain wtf for.

What these patches do is return the sane rule that dependent loads are
ordered.

And like all memory ordering; it should come with comments. Any piece of
code that relies on memory ordering and doesn't have big fat comments
that explain the required ordering are broken per definition. Maybe not
now, but they will be after a few years because someone changed it and
didn't know.

> Barrier pairing was a useful tool to check code validity,
> maybe there are other, better tools now.

Same is true for the closely related control dependency, you can pair
against those just fine but they don't have an explicit barrier
annotation.

There are no tools, use brain.



Re: [PATCH tip/core/rcu 21/21] drivers/vhost: Remove now-redundant read_barrier_depends()

2017-12-05 Thread Peter Zijlstra
On Tue, Dec 05, 2017 at 08:57:46PM +0200, Michael S. Tsirkin wrote:

> I don't see WRITE_ONCE inserting any barriers, release or
> write.

Correct, never claimed there was.

Just saying that:

obj = READ_ONCE(*foo);
val = READ_ONCE(obj->val);

Never needs a barrier (except on Alpha and we want to make that go
away). Simply because a CPU needs to complete the load of @obj before it
can compute the address &obj->val. Thus the second load _must_ come
after the first load and we get LOAD-LOAD ordering.

Alpha messing that up is a royal pain, and Alpha not being an
active/living architecture is just not worth the pain of keeping this in
the generic model.




Re: [PATCH tip/core/rcu 21/21] drivers/vhost: Remove now-redundant read_barrier_depends()

2017-12-05 Thread Peter Zijlstra
On Tue, Dec 05, 2017 at 08:31:20PM +0200, Michael S. Tsirkin wrote:

> Apropos, READ_ONCE is now asymmetrical with WRITE_ONCE.
> 
> I can read a pointer with READ_ONCE and be sure the value
> is sane, but only if I also remember to put in smp_wmb before
> WRITE_ONCE. Otherwise the pointer is ok but no guarantees
> about the data pointed to.

That was already the case on everything except Alpha. And the canonical
match for the data dependency is store_release, not wmb.


Re: [PATCH v3 3/6] perf: implement pmu perf_kprobe

2017-12-04 Thread Peter Zijlstra
On Thu, Nov 30, 2017 at 03:50:20PM -0800, Song Liu wrote:
> + tp_event = create_local_trace_kprobe(
> + func, (void *)(unsigned long)(p_event->attr.kprobe_addr),
> + p_event->attr.probe_offset, p_event->attr.config != 0);

So you want to explicitly test bit0 instead? That would leave the other
63 bits available for future extension. The current implementation
completely consumes all 64bits of that config space.


Re: [PATCH v3 3/6] perf: implement pmu perf_kprobe

2017-12-04 Thread Peter Zijlstra
On Thu, Nov 30, 2017 at 03:50:20PM -0800, Song Liu wrote:
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 494eca1..49bbf46 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c

>  static inline void perf_tp_register(void)
>  {
>   perf_pmu_register(&perf_tracepoint, "tracepoint", PERF_TYPE_TRACEPOINT);
> + perf_pmu_register(&perf_kprobe, "kprobe", -1);

Should we maybe not register this PMU when !KPROBE_EVENTS ? That looks
to save a bunch of #ifdeffery later on.

>  }
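(Illustration only, not from the thread: the suggested guard amounts to
something like this sketch.)

#ifdef CONFIG_KPROBE_EVENTS
	perf_pmu_register(&perf_kprobe, "kprobe", -1);
#endif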


Re: [PATCH v3 3/6] perf: implement pmu perf_kprobe

2017-12-04 Thread Peter Zijlstra
On Thu, Nov 30, 2017 at 03:50:20PM -0800, Song Liu wrote:

> +static struct pmu perf_kprobe = {

> +};

> +static inline bool perf_event_is_tracing(struct perf_event *event)
> +{
> + return event->attr.type == PERF_TYPE_TRACEPOINT ||
> + strncmp(event->pmu->name, "kprobe", 6) == 0;

Why a strcmp here? Wouldn't something like:

	event->pmu == &perf_kprobe

work much simpler?

> +}


Re: [PATCH 1/6] perf: Add new type PERF_TYPE_PROBE

2017-11-30 Thread Peter Zijlstra
On Thu, Nov 30, 2017 at 01:43:06AM +, Song Liu wrote:
> I added two fixed types (PERF_TYPE_KPROBE and PERF_TYPE_UPROBE) in the new 
> version. I know that perf doesn't need them any more. But currently bcc still 
> relies on these fixed types to use the probes/tracepoints. 

Yeah, sorry, that's just not going to fly. Fix bcc.


Re: [PATCH 1/6] perf: Add new type PERF_TYPE_PROBE

2017-11-26 Thread Peter Zijlstra
On Sat, Nov 25, 2017 at 05:59:54PM -0800, Alexei Starovoitov wrote:

> If we were poking into 'struct perf_event_attr __user *uptr'
> directly like get|put_user(.., &uptr->config)
> then 32-bit user space with 4-byte aligned u64s would cause
> 64-bit kernel to trap on archs like sparc.

But surely archs that have hardware alignment requirements have __u64 ==
__aligned_u64 ?

It's just that the structure layout can change between archs that have
__u64 != __aligned_u64 and __u64 == __aligned_u64.

But I would argue an architecture that has hardware alignment
requirements and has an unaligned __u64 is just plain broken.

> But in this case you're right. We can use config[12] as-is, since these
> u64 fields are passing the value one way only (into the kernel) and
> we do full perf_copy_attr() first and all further accesses are from
> copied structure and u64_to_user_ptr(event->attr.config) will be fine.

Right. Also note that there are no holes in perf_event_attr, if the
structure itself is allocated aligned the individual fields will be
aligned.

> Do you mind we do
> union {
>  __u64 file_path;
>  __u64 func_name;
>  __u64 config;
> };
> and similar with config1 ?

> Or prefer that we use 'config/config1' to store string+offset there?
> I think config/config1 is cleaner than config1/config2

I would prefer you use config1/config2 for this and leave config itself
for modifiers (like the retprobe thing). It also better lines up with
the BP stuff.

A little something like so perhaps:

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 362493a2f950..b6e76512f757 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -380,10 +380,14 @@ struct perf_event_attr {
__u32   bp_type;
union {
__u64   bp_addr;
+   __u64   uprobe_path;
+   __u64   kprobe_func;
__u64   config1; /* extension of config */
};
union {
__u64   bp_len;
+   __u64   kprobe_addr;
+   __u64   probe_offset;
__u64   config2; /* extension of config1 */
};
__u64   branch_sample_type; /* enum perf_branch_sample_type */





Re: [PATCH 1/6] perf: Add new type PERF_TYPE_PROBE

2017-11-24 Thread Peter Zijlstra
On Thu, Nov 23, 2017 at 10:31:29PM -0800, Alexei Starovoitov wrote:
> unfortunately 32-bit is more screwed than it seems:
> 
> $ cat align.c
> #include <stdio.h>
> 
> struct S {
>   unsigned long long a;
> } s;
> 
> struct U {
>   unsigned long long a;
> } u;
> 
> int main()
> {
> printf("%d, %d\n", sizeof(unsigned long long),
>__alignof__(unsigned long long));
> printf("%d, %d\n", sizeof(s), __alignof__(s));
> printf("%d, %d\n", sizeof(u), __alignof__(u));
> }
> $ gcc -m32 align.c
> $ ./a.out
> 8, 8
> 8, 4
> 8, 4

*blink* how is that even correct? I understood the spec to say the
alignment of composite types should be the max alignment of any of its
member types (otherwise it cannot guarantee the alignment of its
members).

> so we have to use __aligned_u64 in uapi.

Ideally yes, but effectively it most often doesn't matter.

> Otherwise, yes, we could have used config1 and config2 to pass pointers
> to the kernel, but since they're defined as __u64 already we cannot
> change them and have to do this ugly dance around 'config' field.

I don't understand the reasoning why you cannot use them. Even if they
are not naturally aligned on x86_32, why would it matter?

x86_32 needs two loads in any case, but there is no concurrency, so
split loads is not a problem. Add to that that 'intptr_t' on ILP32
is in fact only a single u32 and thus the other u32 will always be 0.

So yes, alignment is screwy, but I really don't see who cares and why it
would matter in practice.

Please explain.


Re: [PATCH 1/6] perf: Add new type PERF_TYPE_PROBE

2017-11-23 Thread Peter Zijlstra
On Wed, Nov 15, 2017 at 09:23:33AM -0800, Song Liu wrote:
> A new perf type PERF_TYPE_PROBE is added to allow creating [k,u]probe
> with perf_event_open. These [k,u]probe are associated with the file
> descriptor created by perf_event_open, thus are easy to clean when
> the file descriptor is destroyed.
> 
> Struct probe_desc and two flags, is_uprobe and is_return, are added
> to describe the probe being created with perf_event_open.

> ---
>  include/uapi/linux/perf_event.h | 35 +--
>  1 file changed, 33 insertions(+), 2 deletions(-)
> 
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 362493a..cc42d59 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -33,6 +33,7 @@ enum perf_type_id {
>   PERF_TYPE_HW_CACHE  = 3,
>   PERF_TYPE_RAW   = 4,
>   PERF_TYPE_BREAKPOINT= 5,
> + PERF_TYPE_PROBE = 6,

Not required.. these fixed types are mostly legacy at this point.

>  
>   PERF_TYPE_MAX,  /* non-ABI */
>  };
> @@ -299,6 +300,29 @@ enum perf_event_read_format {
>  #define PERF_ATTR_SIZE_VER4  104 /* add: sample_regs_intr */
>  #define PERF_ATTR_SIZE_VER5  112 /* add: aux_watermark */
>  
> +#define MAX_PROBE_FUNC_NAME_LEN  64
> +/*
> + * Describe a kprobe or uprobe for PERF_TYPE_PROBE.
> + * perf_event_attr.probe_desc will point to this structure. is_uprobe
> + * and is_return are used to differentiate different types of probe
> + * (k/u, probe/retprobe).
> + *
> + * The two unions should be used as follows:
> + * For uprobe: use path and offset;
> + * For kprobe: if func is empty, use addr
> + *   if func is not empty, use func and offset
> + */
> +struct probe_desc {
> + union {
> + __aligned_u64   func;
> + __aligned_u64   path;
> + };
> + union {
> + __aligned_u64   addr;
> + __u64   offset;
> + };
> +};
> +
>  /*
>   * Hardware event_id to monitor via a performance monitoring event:
>   *
> @@ -320,7 +344,10 @@ struct perf_event_attr {
>   /*
>* Type specific configuration information.
>*/
> - __u64   config;
> + union {
> + __u64   config;
> + __u64   probe_desc; /* ptr to struct probe_desc */
> + };
>  
>   union {
>   __u64   sample_period;
> @@ -370,7 +397,11 @@ struct perf_event_attr {
>   context_switch :  1, /* context switch data */
>   write_backward :  1, /* Write ring buffer from 
> end to beginning */
>   namespaces :  1, /* include namespaces data 
> */
> - __reserved_1   : 35;
> +
> + /* For PERF_TYPE_PROBE */
> + is_uprobe  :  1, /* 0: kprobe, 1: uprobe */
> + is_return  :  1, /* 0: probe, 1: retprobe */
> + __reserved_1   : 33;
>  
>   union {
>   __u32   wakeup_events;/* wakeup every n events */


So I hate this... there's so much that doesn't make sense.. not in the
least that __align_u64 fixation. Who cares about that?

Why does probe_desc exist? You could have overloaded config{1,2}
further, just like the breakpoint crap did.

And the extra flags seem misplaced too, why are they not config?

You could have simply registered two PMU types:

  perf_pmu_register(&perf_uprobe, "uprobe", -1);
  perf_pmu_register(&perf_kprobe, "kprobe", -1);

Where perf_{u,k}probe differ only in the init function.

Then for uprobe you use config1 as pointer to a path string and config2
as offset. And for kprobe you use config1 as function string pointer and
config2 as either address or offset.

This leaves you config in both cases to stuff further modifiers, like
for example the is_return thing for kprobes.
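(Illustration only, not from the thread: a user-space sketch of that split
when opening a kprobe event; the dynamically assigned PMU type and all names
below are assumptions.)

	struct perf_event_attr attr = {};
	const char *func = "do_sys_open";
	int fd;

	attr.size = sizeof(attr);
	attr.type = kprobe_pmu_type;			/* from sysfs, not uapi */
	attr.config = 0;				/* modifiers, e.g. bit 0 = retprobe */
	attr.config1 = (__u64)(unsigned long)func;	/* function name string */
	attr.config2 = 0;				/* offset (or address) */

	fd = syscall(__NR_perf_event_open, &attr, -1 /* pid */, 0 /* cpu */,
		     -1 /* group fd */, 0 /* flags */);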


Re: [PATCH 3/6] perf: implement kprobe support to PERF_TYPE_PROBE

2017-11-23 Thread Peter Zijlstra
On Wed, Nov 15, 2017 at 09:23:36AM -0800, Song Liu wrote:
> +int perf_probe_init(struct perf_event *p_event)
> +{

> + __aligned_u64 aligned_probe_desc;
> +
> + /*
> +  * attr.probe_desc may not be 64-bit aligned on 32-bit systems.
> +  * Make an aligned copy of it to before u64_to_user_ptr().
> +  */
> + memcpy(&aligned_probe_desc, &p_event->attr.probe_desc,
> +sizeof(__aligned_u64));
> +
> + if (copy_from_user(, u64_to_user_ptr(aligned_probe_desc),
> +sizeof(struct probe_desc)))
> + return -EFAULT;

That doesn't seem to make any sense what so ever.. the alignment has no
effect on this usecase. Not to mention that the kernel variable should
very much already be aligned.


Re: [PATCH 1/6] perf: Add new type PERF_TYPE_PROBE

2017-11-23 Thread Peter Zijlstra
On Wed, Nov 15, 2017 at 09:23:33AM -0800, Song Liu wrote:

> Note: We use type __u64 for pointer probe_desc instead of __aligned_u64.
> The reason here is to avoid changing the size of struct perf_event_attr,
> and breaking new-kernel-old-utility scenario. To avoid alignment problem
> with the pointer, we will (in the following patches) copy probe_desc to
> __aligned_u64 before using it as pointer.

ISTR there are only relatively few architectures where __u64 and
__aligned_u64 are not the same thing.

The comment that goes with it seems to suggest i386 has short alignment
for u64 but my compiler says differently:

printf("%d, %d\n", sizeof(unsigned long long), __alignof__(unsigned 
long long));

$ gcc -m32 -o align align.c && ./align
8, 8




Re: [PATCH 0/6] enable creating [k,u]probe with perf_event_open

2017-11-23 Thread Peter Zijlstra
On Thu, Nov 23, 2017 at 01:02:00AM -0800, Christoph Hellwig wrote:
> Just curious: why do you want to overload a multiplexer syscall even
> more instead of adding explicit syscalls?

Mostly because perf provides much of what they already want; fd-based
lifetime and bpf integration.


Re: [PATCH net-next 2/8] rtnetlink: add rtnl_register_module

2017-11-12 Thread Peter Zijlstra
On Mon, Nov 13, 2017 at 08:21:59AM +0100, Florian Westphal wrote:
> Reason is that some places do this:
> 
> rtnl_register(pf, RTM_FOO, doit, NULL, 0);
> rtnl_register(pf, RTM_FOO, NULL, dumpit, 0);

Sure, however,

> (from different call sites in the stack).
> > -   if (doit)
> > -   tab[msgindex].doit = doit;
> > -   if (dumpit)
> > -   tab[msgindex].dumpit = dumpit;
> 
> Which is the reason for these if () tests.

then we assign NULL, which is fine, no?



Re: [PATCH 0/2][v5] Add the ability to do BPF directed error injection

2017-11-10 Thread Peter Zijlstra
On Wed, Nov 08, 2017 at 06:43:25AM +0900, Alexei Starovoitov wrote:
> On 11/8/17 5:28 AM, Josef Bacik wrote:
> > I'm sending this through Dave since it'll conflict with other BPF changes 
> > in his
> > tree, but since it touches tracing as well Dave would like a review from
> > somebody on the tracing side.
> > 
> > v4->v5:
> > - disallow kprobe_override programs from being put in the prog map array so 
> > we
> >   don't tail call into something we didn't check.  This allows us to make 
> > the
> >   normal path still fast without a bunch of percpu operations.
> 
> looks great to me.
> Peter,
> could you please review x86 bits?

I did not in fact get the patch... But after digging it out of LKML I
don't immediately see something broken, but this is not stuff I
typically know well.


Re: [PATCH v3] scripts: add leaking_addresses.pl

2017-11-08 Thread Peter Zijlstra
On Tue, Nov 07, 2017 at 05:44:13PM -0500, Steven Rostedt wrote:
> On Tue, 7 Nov 2017 13:44:01 -0800
> Linus Torvalds  wrote:
> 
> > > Looking other places that stand out, it seems like
> > > /proc/lockdep_chains and /proc/lockdep (CONFIG_LOCKDEP=y) has a ton of
> > > %p usage. It's unclear to me if a hash is sufficient for meaningful
> > > debugging there?  
> > 
> > Maybe not, but that is also _so_ esoteric that I suspect the right fix
> > is to just make it root-only readable.
> 
> Also note, I don't believe anyone should be running a LOCKDEP
> configured kernel in a production (secured) environment. As it adds
> quite a bit of overhead. It's something you run on test environments to
> make sure it doesn't detect any possible deadlocks.
> 
> > 
> > I've never used it, we should check with people who have. I get the
> > feeling that this is purely for PeterZ debugging.
> 
> I've used it. But then again, I also debug lockdep ;-)
> 
> > 
> > The very first commit that introduced that code actually has a
> > 
> > (FIXME: should go into debugfs)
> > 
> > so I suspect it never should have been user-readable to begin with. I
> > guess it makes some things easier, but it really is *very* different
> > from things like profiling.
> 
> Want me to whip up a patch to move the file?

Fine by me; create /debug/lockdep/ for the 3 files or something like
that.

As to the actual addresses, they can be used to double check things are
in fact the same object (in case of name collisions), are in static
memory (as these things ought to be) etc.. But mostly they're not too
important.

And yes, as everybody says, LOCKDEP is debug tool; you run this on your
(local) dev kernels, anything else is out of spec.



Re: [PATCH net-next 2/8] rtnetlink: add rtnl_register_module

2017-11-07 Thread Peter Zijlstra
On Tue, Nov 07, 2017 at 10:47:51AM +0100, Florian Westphal wrote:
> I would expect this to trigger all the time, due to
> 
> rtnl_register(AF_INET, RTM_GETROUTE, ...
> rtnl_register(AF_INET, RTM_GETADDR, ...

Ah, sure, then something like so then...

There's bound to be bugs there too, as I pretty much typed this without
thinking, but it should show the idea.

---
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 5ace48926b19..de1336775602 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -63,6 +63,7 @@ struct rtnl_link {
rtnl_doit_func  doit;
rtnl_dumpit_funcdumpit;
unsigned intflags;
+   struct rcu_head rcu;
 };
 
 static DEFINE_MUTEX(rtnl_mutex);
@@ -127,7 +128,7 @@ bool lockdep_rtnl_is_held(void)
 EXPORT_SYMBOL(lockdep_rtnl_is_held);
 #endif /* #ifdef CONFIG_PROVE_LOCKING */
 
-static struct rtnl_link __rcu *rtnl_msg_handlers[RTNL_FAMILY_MAX + 1];
+static struct rtnl_link __rcu **rtnl_msg_handlers[RTNL_FAMILY_MAX + 1];
 static refcount_t rtnl_msg_handlers_ref[RTNL_FAMILY_MAX + 1];
 
 static inline int rtm_msgindex(int msgtype)
@@ -144,6 +145,23 @@ static inline int rtm_msgindex(int msgtype)
return msgindex;
 }
 
+static struct rtnl_link *rtnl_get_link(int protocol, int msgtype)
+{
+   struct rtnl_link **tab;
+
+   if (protocol >= ARRAY_SIZE(rtnl_msg_handlers))
+   protocol = PF_UNSPEC;
+
+   tab = rcu_dereference(rtnl_msg_handlers[protocol]);
+   if (!tab) {
+   tab = rcu_dereference(rtnl_msg_handlers[PF_UNSPEC]);
+   if (!tab)
+   return NULL;
+   }
+
+   return rcu_dereference(tab[rtm_msgindex(msgtype)]);
+}
+
 /**
  * __rtnl_register - Register a rtnetlink message type
  * @protocol: Protocol family or PF_UNSPEC
@@ -166,28 +184,39 @@ int __rtnl_register(int protocol, int msgtype,
rtnl_doit_func doit, rtnl_dumpit_func dumpit,
unsigned int flags)
 {
-   struct rtnl_link *tab;
+   struct rtnl_link **tab, *link;
int msgindex;
+   int ret = -ENOBUFS;
 
BUG_ON(protocol < 0 || protocol > RTNL_FAMILY_MAX);
msgindex = rtm_msgindex(msgtype);
 
-   tab = rcu_dereference_raw(rtnl_msg_handlers[protocol]);
+   rtnl_lock();
+   tab = rtnl_msg_handlers[protocol];
if (tab == NULL) {
-   tab = kcalloc(RTM_NR_MSGTYPES, sizeof(*tab), GFP_KERNEL);
-   if (tab == NULL)
-   return -ENOBUFS;
+   tab = kcalloc(RTM_NR_MSGTYPES, sizeof(void *), GFP_KERNEL);
+   if (!tab)
+   goto unlock;
 
+   /* ensures we see the 0 stores */
rcu_assign_pointer(rtnl_msg_handlers[protocol], tab);
}
 
-   if (doit)
-   tab[msgindex].doit = doit;
-   if (dumpit)
-   tab[msgindex].dumpit = dumpit;
-   tab[msgindex].flags |= flags;
+   WARN_ONCE(tab[msgindex], "Double allocated rtnl(%d:%d)\n", protocol, 
msgtype);
+   link = kzalloc(sizeof(struct rtnl_link), GFP_KERNEL);
+   if (!link)
+   goto unlock;
 
-   return 0;
+   link->doit = doit;
+   link->dumpit = dumpit;
+   link->flags |= flags;
+   /* publish protocol:msgtype */
+   rcu_assign_pointer(tab[msgindex], link);
+   ret = 0;
+unlock:
+   rtnl_unlock();
+
+   return ret;
 }
 EXPORT_SYMBOL_GPL(__rtnl_register);
 
@@ -220,22 +249,22 @@ EXPORT_SYMBOL_GPL(rtnl_register);
  */
 int rtnl_unregister(int protocol, int msgtype)
 {
-   struct rtnl_link *handlers;
+   struct rtnl_link **tab, *link;
int msgindex;
 
BUG_ON(protocol < 0 || protocol > RTNL_FAMILY_MAX);
msgindex = rtm_msgindex(msgtype);
 
rtnl_lock();
-   handlers = rtnl_dereference(rtnl_msg_handlers[protocol]);
-   if (!handlers) {
+   tab = rtnl_dereference(rtnl_msg_handlers[protocol]);
+   if (!tab) {
rtnl_unlock();
return -ENOENT;
}
 
-   handlers[msgindex].doit = NULL;
-   handlers[msgindex].dumpit = NULL;
-   handlers[msgindex].flags = 0;
+   link = tab[msgindex];
+   rcu_assign_pointer(tab[msgindex], NULL);
+   kfree_rcu(link, rcu);
rtnl_unlock();
 
return 0;
@@ -251,13 +280,22 @@ EXPORT_SYMBOL_GPL(rtnl_unregister);
  */
 void rtnl_unregister_all(int protocol)
 {
-   struct rtnl_link *handlers;
+   struct rtnl_link **tab, *link;
+   int msgindex;
 
BUG_ON(protocol < 0 || protocol > RTNL_FAMILY_MAX);
 
rtnl_lock();
-   handlers = rtnl_dereference(rtnl_msg_handlers[protocol]);
+   tab = rtnl_msg_handlers[protocol];
RCU_INIT_POINTER(rtnl_msg_handlers[protocol], NULL);
+   for (msgindex = 0; msgindex < RTM_NR_MSGTYPES; msgindex++) {
+   link = tab[msgindex];
+   if (!link)
+   continue;
+
+   

Re: [PATCH net-next 2/8] rtnetlink: add rtnl_register_module

2017-11-07 Thread Peter Zijlstra
On Tue, Nov 07, 2017 at 10:10:04AM +0100, Peter Zijlstra wrote:
> On Tue, Nov 07, 2017 at 07:11:56AM +0100, Florian Westphal wrote:
> > Peter Zijlstra <pet...@infradead.org> wrote:
> > > On Mon, Nov 06, 2017 at 11:51:07AM +0100, Florian Westphal wrote:
> > > > @@ -180,6 +164,12 @@ int __rtnl_register(int protocol, int msgtype,
> > > > rcu_assign_pointer(rtnl_msg_handlers[protocol], tab);
> > > > }
> > > >  
> > > > +   WARN_ON(tab[msgindex].owner && tab[msgindex].owner != owner);
> > > > +
> > > > +   tab[msgindex].owner = owner;
> > > > +   /* make sure owner is always visible first */
> > > > +   smp_wmb();
> > > > +
> > > > if (doit)
> > > > tab[msgindex].doit = doit;
> > > > if (dumpit)
> > > 
> > > > @@ -235,6 +279,9 @@ int rtnl_unregister(int protocol, int msgtype)
> > > > handlers[msgindex].doit = NULL;
> > > > handlers[msgindex].dumpit = NULL;
> > > > handlers[msgindex].flags = 0;
> > > > +   /* make sure we clear owner last */
> > > > +   smp_wmb();
> > > > +   handlers[msgindex].owner = NULL;
> > > > rtnl_unlock();
> > > >  
> > > > return 0;
> > > 
> > > These wmb()'s don't make sense; and the comments are incomplete. What do
> > > they pair with? Who cares about this ordering?
> > 
> > rtnetlink_rcv_msg:
> > 
> > 4406 dumpit = READ_ONCE(handlers[type].dumpit);
> > 4407 if (!dumpit)
> > 4408 goto err_unlock;
> > 4409 owner = READ_ONCE(handlers[type].owner);
> 
> So what stops the CPU from hoisting this load before the dumpit load?
> 
> > 4410 }
> > ..
> > 4417 if (!try_module_get(owner))
> > 4418 err = -EPROTONOSUPPORT;
> > 4419 
> > 
> > I don't want dumpit function address to be visible before owner.
> > Does that make sense?
> 
> And no. That's insane, how can it ever observe an incomplete tab in the
> first place.
> 
> The problem is that __rtnl_register() and rtnl_unregister are broken.
> 
> __rtnl_register() publishes the tab before it initializes it; allowing
> people to observe the thing incomplete.
> 
> Also, are we required to hold rtnl_lock() across __rtnl_register()? I'd
> hope so, otherwise what stops concurrent allocations and leaking of tab?
> 
> > Also, rtnl_register() doesn't seem to employ rtnl_lock() and panic()
> WTF?!
> 
> rtnl_unregister() should then RCU free the tab.
> 
> None of that is happening, so what is that RCU stuff supposed to do?

Something like the below would go some way toward sanitizing this stuff;
rcu_assign_pointer() is a store-release, meaning it happens after
everything coming before.

Therefore, when you observe that tab (through rcu_dereference) you're
guaranteed to see the thing initialized. The memory ordering on the
consume side is through an address dependency; we need to have completed
the load of the tab pointer before we can compute the address of its
members and load from there, these are not things a CPU is allowed to
reorder (lets forget about Alpha).

Quite possibly, if rtnl_unregister() is called from module unload, this
is broken; in that case we'd need something like:

rcu_assign_pointer(rtnl_msg_handlers[protocol], NULL);
/*
 * Ensure nobody can still observe our old protocol handler
 * before continuing to free the module that includes the
 * functions called from it.
 */
synchronize_rcu();

---

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 5ace48926b19..25391c7b9c5d 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -63,6 +63,7 @@ struct rtnl_link {
rtnl_doit_func  doit;
rtnl_dumpit_funcdumpit;
unsigned intflags;
+   struct rcu_head rcu;
 };
 
 static DEFINE_MUTEX(rtnl_mutex);
@@ -172,14 +173,15 @@ int __rtnl_register(int protocol, int msgtype,
BUG_ON(protocol < 0 || protocol > RTNL_FAMILY_MAX);
msgindex = rtm_msgindex(msgtype);
 
-   tab = rcu_dereference_raw(rtnl_msg_handlers[protocol]);
-   if (tab == NULL) {
-   tab = kcalloc(RTM_NR_MSGTYPES, sizeof(*tab), GFP_KERNEL);
-   if (tab == NULL)
-   return -ENOBUFS;
+   if (WARN_ONCE(rtnl_msg_handlers[protocol],
+ "Double registration for protocol:

Re: [PATCH net-next 2/8] rtnetlink: add rtnl_register_module

2017-11-07 Thread Peter Zijlstra
On Tue, Nov 07, 2017 at 07:11:56AM +0100, Florian Westphal wrote:
> Peter Zijlstra <pet...@infradead.org> wrote:
> > On Mon, Nov 06, 2017 at 11:51:07AM +0100, Florian Westphal wrote:
> > > @@ -180,6 +164,12 @@ int __rtnl_register(int protocol, int msgtype,
> > >   rcu_assign_pointer(rtnl_msg_handlers[protocol], tab);
> > >   }
> > >  
> > > + WARN_ON(tab[msgindex].owner && tab[msgindex].owner != owner);
> > > +
> > > + tab[msgindex].owner = owner;
> > > + /* make sure owner is always visible first */
> > > + smp_wmb();
> > > +
> > >   if (doit)
> > >   tab[msgindex].doit = doit;
> > >   if (dumpit)
> > 
> > > @@ -235,6 +279,9 @@ int rtnl_unregister(int protocol, int msgtype)
> > >   handlers[msgindex].doit = NULL;
> > >   handlers[msgindex].dumpit = NULL;
> > >   handlers[msgindex].flags = 0;
> > > + /* make sure we clear owner last */
> > > + smp_wmb();
> > > + handlers[msgindex].owner = NULL;
> > >   rtnl_unlock();
> > >  
> > >   return 0;
> > 
> > These wmb()'s don't make sense; and the comments are incomplete. What do
> > they pair with? Who cares about this ordering?
> 
> rtnetlink_rcv_msg:
> 
> 4406 dumpit = READ_ONCE(handlers[type].dumpit);
> 4407 if (!dumpit)
> 4408 goto err_unlock;
> 4409 owner = READ_ONCE(handlers[type].owner);

So what stops the CPU from hoisting this load before the dumpit load?

> 4410 }
> ..
> 4417 if (!try_module_get(owner))
> 4418 err = -EPROTONOSUPPORT;
> 4419 
> 
> I don't want dumpit function address to be visible before owner.
> Does that make sense?

And no. That's insane, how can it ever observe an incomplete tab in the
first place.

The problem is that __rtnl_register() and rtnl_unregister are broken.

__rtnl_register() publishes the tab before it initializes it; allowing
people to observe the thing incomplete.

Also, are we required to hold rtnl_lock() across __rtnl_register()? I'd
hope so, otherwise what stops concurrent allocations and leaking of tab?

Also, rtnl_register() doesn't seem to employ rtnl_lock() and panic()
WTF?!

rtnl_unregister() should then RCU free the tab.

None of that is happening, so what is that RCU stuff supposed to do?


Re: [PATCH net-next 2/8] rtnetlink: add rtnl_register_module

2017-11-06 Thread Peter Zijlstra
On Mon, Nov 06, 2017 at 11:51:07AM +0100, Florian Westphal wrote:
> @@ -180,6 +164,12 @@ int __rtnl_register(int protocol, int msgtype,
>   rcu_assign_pointer(rtnl_msg_handlers[protocol], tab);
>   }
>  
> + WARN_ON(tab[msgindex].owner && tab[msgindex].owner != owner);
> +
> + tab[msgindex].owner = owner;
> + /* make sure owner is always visible first */
> + smp_wmb();
> +
>   if (doit)
>   tab[msgindex].doit = doit;
>   if (dumpit)

> @@ -235,6 +279,9 @@ int rtnl_unregister(int protocol, int msgtype)
>   handlers[msgindex].doit = NULL;
>   handlers[msgindex].dumpit = NULL;
>   handlers[msgindex].flags = 0;
> + /* make sure we clear owner last */
> + smp_wmb();
> + handlers[msgindex].owner = NULL;
>   rtnl_unlock();
>  
>   return 0;

These wmb()'s don't make sense; and the comments are incomplete. What do
they pair with? Who cares about this ordering?


Re: linux-next: manual merge of the tip tree with the net-next tree

2017-11-01 Thread Peter Zijlstra
On Wed, Nov 01, 2017 at 09:27:43AM +0100, Ingo Molnar wrote:
> 
> * Peter Zijlstra <pet...@infradead.org> wrote:
> 
> > On Wed, Nov 01, 2017 at 06:15:54PM +1100, Stephen Rothwell wrote:
> > > Hi all,
> > > 
> > > Today's linux-next merge of the tip tree got a conflict in:
> > > 
> > >   kernel/trace/bpf_trace.c
> > > 
> > > between commits:
> > > 
> > >   97562633bcba ("bpf: perf event change needed for subsequent bpf 
> > > helpers")
> > > and more changes ...
> > > 
> > > from the net-next tree and commit:
> > > 
> > >   7d9285e82db5 ("perf/bpf: Extend the perf_event_read_local() interface, 
> > > a.k.a. "bpf: perf event change needed for subsequent bpf helpers"")
> > > 
> > > from the tip tree.
> > 
> > So those should be the exact same patch; except for Changelog and
> > subject. Code wise there shouldn't be a conflict.
> 
> So the problem is that then we have:
> 
>   0d3d73aac2ff ("perf/core: Rewrite event timekeeping")
> 
> which changes the code. This is a known conflict generation pattern: Git 
> isn't 
> smart enough to sort out that (probably because it would make merges too 
> expensive) - and it's a bad flow in any case.

Hmm, I thought having that same base patch in both trees would allow it
to resolve that conflict. Ah well...


Re: linux-next: manual merge of the tip tree with the net-next tree

2017-11-01 Thread Peter Zijlstra
On Wed, Nov 01, 2017 at 06:15:54PM +1100, Stephen Rothwell wrote:
> Hi all,
> 
> Today's linux-next merge of the tip tree got a conflict in:
> 
>   kernel/trace/bpf_trace.c
> 
> between commits:
> 
>   97562633bcba ("bpf: perf event change needed for subsequent bpf helpers")
> and more changes ...
> 
> from the net-next tree and commit:
> 
>   7d9285e82db5 ("perf/bpf: Extend the perf_event_read_local() interface, 
> a.k.a. "bpf: perf event change needed for subsequent bpf helpers"")
> 
> from the tip tree.

So those should be the exact same patch; except for Changelog and
subject. Code wise there shouldn't be a conflict.


Re: [PATCH net-next] bpf: avoid rcu_dereference inside bpf_event_mutex lock region

2017-10-31 Thread Peter Zijlstra
On Mon, Oct 30, 2017 at 01:50:22PM -0700, Yonghong Song wrote:
> Could you check whether the below change to remove rcu_dereference_protected
> is what you wanted or not?

Yep that looks fine. Thanks!

> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index b65011d..e7685c5 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -767,20 +767,19 @@ int perf_event_attach_bpf_prog(struct perf_event *event,
>   mutex_lock(&bpf_event_mutex);
>  
>   if (event->prog)
> - goto out;
> + goto unlock;
>  
> - old_array = rcu_dereference_protected(event->tp_event->prog_array,
> -   
> lockdep_is_held(&bpf_event_mutex));
> + old_array = event->tp_event->prog_array;
>   ret = bpf_prog_array_copy(old_array, NULL, prog, &new_array);
>   if (ret < 0)
> - goto out;
> + goto unlock;
>  
>   /* set the new array to event->tp_event and set event->prog */
>   event->prog = prog;
>   rcu_assign_pointer(event->tp_event->prog_array, new_array);
>   bpf_prog_array_free(old_array);
>  
> -out:
> +unlock:
>   mutex_unlock(&bpf_event_mutex);
>   return ret;
>  }
> @@ -794,11 +793,9 @@ void perf_event_detach_bpf_prog(struct perf_event *event)
>   mutex_lock(&bpf_event_mutex);
>  
>   if (!event->prog)
> - goto out;
> -
> - old_array = rcu_dereference_protected(event->tp_event->prog_array,
> -   
> lockdep_is_held(&bpf_event_mutex));
> + goto unlock;
>  
> + old_array = event->tp_event->prog_array;
>   ret = bpf_prog_array_copy(old_array, event->prog, NULL, &new_array);
>   if (ret < 0) {
>   bpf_prog_array_delete_safe(old_array, event->prog);
> @@ -810,6 +807,6 @@ void perf_event_detach_bpf_prog(struct perf_event *event)
>   bpf_prog_put(event->prog);
>   event->prog = NULL;
>  
> -out:
> +unlock:
>   mutex_unlock(&bpf_event_mutex);
>  }
> -- 
> 2.9.5
> 


Re: [PATCH net-next 2/3] bpf: permit multiple bpf attachments for a single perf event

2017-10-26 Thread Peter Zijlstra
On Mon, Oct 23, 2017 at 10:58:04AM -0700, Yonghong Song wrote:
> This patch enables multiple bpf attachments for a
> kprobe/uprobe/tracepoint single trace event.

This forgets to explain _why_ this is a good thing to do.

> +static DEFINE_MUTEX(bpf_event_mutex);
> +
> +int perf_event_attach_bpf_prog(struct perf_event *event,
> +struct bpf_prog *prog)
> +{
> + struct bpf_prog_array __rcu *old_array;
> + struct bpf_prog_array *new_array;
> + int ret;
> +
> + mutex_lock(&bpf_event_mutex);
> +
> + if (event->prog)
> + return -EEXIST;
> +
> + old_array = rcu_dereference_protected(event->tp_event->prog_array,
> +   
> lockdep_is_held(&bpf_event_mutex));

Since all modifications to prog_array are serialized by this one mutex;
you don't need rcu_dereference() here, there are no possible ordering
problems.

> + ret = bpf_prog_array_copy(old_array, NULL, prog, &new_array);
> + if (ret < 0)
> + goto out;
> +
> + /* set the new array to event->tp_event and set event->prog */
> + event->prog = prog;
> + rcu_assign_pointer(event->tp_event->prog_array, new_array);
> +
> + if (old_array)
> + bpf_prog_array_free(old_array);
> +
> +out:

It's customary to call that unlock:

> + mutex_unlock(&bpf_event_mutex);
> + return ret;
> +}


Re: problem with rtnetlink 'reference' count

2017-10-24 Thread Peter Zijlstra
On Mon, Oct 23, 2017 at 09:37:03PM +0200, Florian Westphal wrote:

> > OK, so then why not do something like so?
> > @@ -260,10 +259,18 @@ void rtnl_unregister_all(int protocol)
> > RCU_INIT_POINTER(rtnl_msg_handlers[protocol], NULL);
> > rtnl_unlock();
> >  
> > +   /*
> > +* XXX explain what this is for...
> > +*/
> > synchronize_net();
> >  
> > -   while (refcount_read(&rtnl_msg_handlers_ref[protocol]) > 1)
> > -   schedule();
> > +   /*
> > +* This serializes against the rcu_read_lock() section in
> > +* rtnetlink_rcv_msg() such that after this, all prior instances have
> > +* completed and future instances must observe the NULL written above.
> > +*/
> > +   synchronize_rcu();
> 
> Yes, but that won't help with running dumpers, see below.
> 
> > @@ -4218,7 +4223,6 @@ static int rtnetlink_rcv_msg(struct sk_buff *skb, 
> > struct nlmsghdr *nlh,
> > };
> > err = netlink_dump_start(rtnl, skb, nlh, &c);
> 
> This will copy .dumper function address to nlh->cb for later invocation
> when dump gets resumed (its called from netlink_recvmsg()),
> so this can return to userspace and dump can be resumed on next recv().
> 
> Because the dumper function was stored in the socket, NULLing the
> rtnl_msg_handlers[] only prevents new dumps from starting but not
> already set-up dumps from resuming.

netlink_dump_start() will actually grab a reference on the module, but it
does so too late.

Would it not be sufficient to put that try_module_get() under the
rcu_read_lock()?
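(Illustration only, not from the thread: roughly what moving the
try_module_get() under the RCU read side would look like inside
rtnetlink_rcv_msg(); a simplified sketch, not a tested patch.)

	rcu_read_lock();
	handlers = rcu_dereference(rtnl_msg_handlers[family]);
	dumpit = READ_ONCE(handlers[type].dumpit);
	owner  = READ_ONCE(handlers[type].owner);
	/*
	 * Take the module reference before leaving the RCU read-side
	 * critical section, so an unregister followed by synchronize_rcu()
	 * cannot complete while we still intend to call into the module.
	 */
	if (!try_module_get(owner))
		err = -EPROTONOSUPPORT;
	rcu_read_unlock();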


Re: problem with rtnetlink 'reference' count

2017-10-23 Thread Peter Zijlstra
On Mon, Oct 23, 2017 at 06:37:44PM +0200, Florian Westphal wrote:

> Is refcount_t only supposed to be used with dec_and_test patterns?

Yes, for reference counting objects.

> > This rtnetlink_rcv_msg() is called from softirq-context, right? Also,
> > all that stuff happens with rcu_read_lock() held.
> 
> No, its called from process context.

OK, so then why not do something like so?


---
 net/core/rtnetlink.c | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index d4bcdcc68e92..473cabd0a551 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -128,7 +128,6 @@ bool lockdep_rtnl_is_held(void)
 #endif /* #ifdef CONFIG_PROVE_LOCKING */
 
 static struct rtnl_link __rcu *rtnl_msg_handlers[RTNL_FAMILY_MAX + 1];
-static refcount_t rtnl_msg_handlers_ref[RTNL_FAMILY_MAX + 1];
 
 static inline int rtm_msgindex(int msgtype)
 {
@@ -260,10 +259,18 @@ void rtnl_unregister_all(int protocol)
RCU_INIT_POINTER(rtnl_msg_handlers[protocol], NULL);
rtnl_unlock();
 
+   /*
+* XXX explain what this is for...
+*/
synchronize_net();
 
-   while (refcount_read(&rtnl_msg_handlers_ref[protocol]) > 1)
-   schedule();
+   /*
+* This serializes against the rcu_read_lock() section in
+* rtnetlink_rcv_msg() such that after this, all prior instances have
+* completed and future instances must observe the NULL written above.
+*/
+   synchronize_rcu();
+
kfree(handlers);
 }
 EXPORT_SYMBOL_GPL(rtnl_unregister_all);
@@ -4203,8 +4210,6 @@ static int rtnetlink_rcv_msg(struct sk_buff *skb, struct 
nlmsghdr *nlh,
goto err_unlock;
}
 
-   refcount_inc(&rtnl_msg_handlers_ref[family]);
-
if (type == RTM_GETLINK - RTM_BASE)
min_dump_alloc = rtnl_calcit(skb, nlh);
 
@@ -4218,7 +4223,6 @@ static int rtnetlink_rcv_msg(struct sk_buff *skb, struct 
nlmsghdr *nlh,
};
err = netlink_dump_start(rtnl, skb, nlh, &c);
}
-   refcount_dec(&rtnl_msg_handlers_ref[family]);
return err;
}
 
@@ -4230,12 +4234,10 @@ static int rtnetlink_rcv_msg(struct sk_buff *skb, 
struct nlmsghdr *nlh,
 
flags = READ_ONCE(handlers[type].flags);
if (flags & RTNL_FLAG_DOIT_UNLOCKED) {
-   refcount_inc(&rtnl_msg_handlers_ref[family]);
doit = READ_ONCE(handlers[type].doit);
rcu_read_unlock();
if (doit)
err = doit(skb, nlh, extack);
-   refcount_dec(&rtnl_msg_handlers_ref[family]);
return err;
}
 
@@ -4333,9 +4335,6 @@ void __init rtnetlink_init(void)
 {
int i;
 
-   for (i = 0; i < ARRAY_SIZE(rtnl_msg_handlers_ref); i++)
-   refcount_set(&rtnl_msg_handlers_ref[i], 1);
-
if (register_pernet_subsys(&rtnl_net_ops))
panic("rtnetlink_init: cannot initialize rtnetlink\n");
 


Re: problem with rtnetlink 'reference' count

2017-10-23 Thread Peter Zijlstra
On Mon, Oct 23, 2017 at 05:32:00PM +0200, Florian Westphal wrote:

> >  1) it is not in fact a refcount, so using refcount_t is silly
> 
> Your suggestion is...?

Normal atomic_t

> >  2) there is a distinct lack of memory barriers, so we can easily
> > observe the decrement while the msg_handler is still in progress.
> 
> I guess you mean it needs:
> 
> + smp_mb__before_atomic();
>   refcount_dec(&rtnl_msg_handlers_ref[family]);
> ?

Yes, but also:

atomic_inc();
smp_mb__after_atomic();

To avoid the problem of the inc being observed late.

> However, this refcount_dec is misplaced anyway as it would need
> to occur from nlcb->done() (the handler function gets stored in socket for
> use by next recvmsg), so this change is indeed not helpful at all.
> 
> >  3) waiting with a schedule()/yield() loop is complete crap and subject
> > to live-locks, imagine doing that rtnl_unregister_all() from a RT task.

> Alternatively we can of course sleep instead of schedule() but that
> doesn't appear too appealing either (albeit it is a lot less intrusive).

That is much better than a yield loop.

> Any other idea?

This rtnetlink_rcv_msg() is called from softirq-context, right? Also,
all that stuff happens with rcu_read_lock() held.

So why isn't that synchronize_net() call sufficient? You first clear
rtnl_msg_handlers[protocol], and then you do synchronize_net() which
will wait for all concurrent softirq handlers to complete. Which, if
rtnetlink_rcv_msg() is called from softirq, guarantees nobody still uses
it.


Also, if that is all softirq, you should maybe use rcu_read_lock_bh();
alternatively you should use synchronize_rcu(), because as it is it's a bit
inconsistent.


problem with rtnetlink 'reference' count

2017-10-23 Thread Peter Zijlstra
Hi,

I just ran across commit:

  019a316992ee ("rtnetlink: add reference counting to prevent module unload 
while dump is in progress")

And that commit is _completely_ broken.

 1) it is not in fact a refcount, so using refcount_t is silly
 2) there is a distinct lack of memory barriers, so we can easily
observe the decrement while the msg_handler is still in progress.
 3) waiting with a schedule()/yield() loop is complete crap and subject
to live-locks, imagine doing that rtnl_unregister_all() from a RT task.

Please fix..


Re: [PATCH RFC tip/core/rcu 14/15] netfilter: Remove now-redundant smp_read_barrier_depends()

2017-10-10 Thread Peter Zijlstra
On Mon, Oct 09, 2017 at 05:22:48PM -0700, Paul E. McKenney wrote:
> READ_ONCE() now implies smp_read_barrier_depends(), which means that
> the instances in arpt_do_table(), ipt_do_table(), and ip6t_do_table()
> are now redundant.  This commit removes them and adjusts the comments.

Similar to the previous patch, the lack of READ_ONCE() in the original
code is a pre-existing bug. It would allow the compiler to tear the load
and observe a composite of two different pointer values, or reload the
private pointer and result in table_base and jumpstack being part of
different objects.

It would be good to point out this actually fixes a bug in the code.


Re: [PATCH net-next v7 1/5] bpf: perf event change needed for subsequent bpf helpers

2017-10-06 Thread Peter Zijlstra
On Thu, Oct 05, 2017 at 09:19:19AM -0700, Yonghong Song wrote:
> This patch does not impact existing functionalities.
> It contains the changes in perf event area needed for
> subsequent bpf_perf_event_read_value and
> bpf_perf_prog_read_value helpers.
> 
> Signed-off-by: Yonghong Song <y...@fb.com>

Acked-by: Peter Zijlstra (Intel) <pet...@infradead.org>

And as discussed, I'll take this patch into my dev tree while Dave will
take all of them into the network tree.


Re: [PATCH net-next v6 0/4] bpf: add two helpers to read perf event enabled/running time

2017-10-05 Thread Peter Zijlstra
On Wed, Oct 04, 2017 at 04:00:56PM -0700, David Miller wrote:
> From: Yonghong Song 
> Date: Mon, 2 Oct 2017 15:42:14 -0700
> 
> > [Dave, Peter,
> > 
> >  Previous communcation shows that this patch may potentially have
> >  merge conflict with upcoming tip changes in the next merge window.
> > 
> >  Could you advise how this patch should proceed?
> > 
> >  Thanks!
> > ]
> 
> Indeed, Peter how do you want to handle this?

I think Alexei suggested that we merge the one patch in two branches and
let git sort it out. I _think_ I've done something similar before and it
worked.


Re: [PATCH net] bpf: one perf event close won't free bpf program attached by another perf event

2017-09-21 Thread Peter Zijlstra
On Wed, Sep 20, 2017 at 10:20:13PM -0700, Yonghong Song wrote:
> > (2). trace_event_call->perf_events are per cpu data structure, that
> > means, some filtering logic is needed to avoid the same perf_event prog
> > is executing twice.
> 
> What I mean here is that the trace_event_call->perf_events need to be
> checked on ALL cpus since bpf prog should be executed regardless of
> cpu affiliation. It is possible that the same perf_event in different
> per_cpu bucket and hence filtering is needed to avoid the same
> perf_event bpf_prog is executed twice.

An event will only ever be on a single CPU's list at any one time IIRC.

Now, hysterically perf_event_set_bpf_prog used the tracepoint crud
because that already had bpf bits in. But it might make sense to look at
unifying the bpf stuff across all the different event types. Have them
all use event->prog.

I suspect that would break a fair bunch of bpf proglets, since the data
access to the trace data would be completely different, but it would be
much nicer to not have this distinction based on event type.


Re: [PATCH net-next v5 1/4] bpf: add helper bpf_perf_event_read_value for perf event array map

2017-09-20 Thread Peter Zijlstra
On Tue, Sep 19, 2017 at 11:09:32PM -0700, Yonghong Song wrote:
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 3e691b7..2d5bbe5 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3684,10 +3684,12 @@ static inline u64 perf_event_count(struct perf_event 
> *event)
>   * will not be local and we cannot read them atomically
>   *   - must not have a pmu::count method
>   */
> -int perf_event_read_local(struct perf_event *event, u64 *value)
> +int perf_event_read_local(struct perf_event *event, u64 *value,
> +   u64 *enabled, u64 *running)
>  {
>   unsigned long flags;
>   int ret = 0;
> + u64 now;
>  
>   /*
>* Disabling interrupts avoids all counter scheduling (context
> @@ -3718,14 +3720,21 @@ int perf_event_read_local(struct perf_event *event, 
> u64 *value)
>   goto out;
>   }
>  
> + now = event->shadow_ctx_time + perf_clock();
> + if (enabled)
> + *enabled = now - event->tstamp_enabled;
>   /*
>* If the event is currently on this CPU, its either a per-task event,
>* or local to this CPU. Furthermore it means its ACTIVE (otherwise
>* oncpu == -1).
>*/
> - if (event->oncpu == smp_processor_id())
> + if (event->oncpu == smp_processor_id()) {
>   event->pmu->read(event);
> -
> + if (running)
> + *running = now - event->tstamp_running;
> + } else if (running) {
> + *running = event->total_time_running;
> + }
>   *value = local64_read(&event->count);
>  out:
>   local_irq_restore(flags);

Yeah, this looks about right.

Dave, could we have this in a topic tree of sorts, because I have a
pending series to rework all the timekeeping and it might be nice to not
have sfr run into all sorts of conflicts.


Re: [PATCH v2 net-next 1/4] bpf: add helper bpf_perf_read_counter_time for perf event array map

2017-09-04 Thread Peter Zijlstra
On Fri, Sep 01, 2017 at 10:48:21PM -0700, Yonghong Song wrote:
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index b14095b..5a50808 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -898,7 +898,8 @@ perf_event_create_kernel_counter(struct perf_event_attr 
> *attr,
>   void *context);
>  extern void perf_pmu_migrate_context(struct pmu *pmu,
>   int src_cpu, int dst_cpu);
> -int perf_event_read_local(struct perf_event *event, u64 *value);
> +int perf_event_read_local(struct perf_event *event, u64 *value,
> +   u64 *enabled, u64 *running);
>  extern u64 perf_event_read_value(struct perf_event *event,
>u64 *enabled, u64 *running);
>  

> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 8c01572..20c4039 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3670,7 +3670,8 @@ static inline u64 perf_event_count(struct perf_event 
> *event)
>   * will not be local and we cannot read them atomically
>   *   - must not have a pmu::count method
>   */
> -int perf_event_read_local(struct perf_event *event, u64 *value)
> +int perf_event_read_local(struct perf_event *event, u64 *value,
> +   u64 *enabled, u64 *running)
>  {
>   unsigned long flags;
>   int ret = 0;
> @@ -3694,7 +3695,7 @@ int perf_event_read_local(struct perf_event *event, u64 
> *value)
>* It must not have a pmu::count method, those are not
>* NMI safe.
>*/
> - if (event->pmu->count) {
> + if (value && event->pmu->count) {
>   ret = -EOPNOTSUPP;
>   goto out;
>   }

No, value _must_ be !NULL. Otherwise you allow getting timestamps
independently from the count value and that is broken.

The {value, enabled, running} tuple is temporally related.

> @@ -3718,10 +3719,16 @@ int perf_event_read_local(struct perf_event *event, 
> u64 *value)
>* or local to this CPU. Furthermore it means its ACTIVE (otherwise
>* oncpu == -1).
>*/
> - if (event->oncpu == smp_processor_id())
> - event->pmu->read(event);
> -
> - *value = local64_read(&event->count);
> + if (value) {
> + if (event->oncpu == smp_processor_id())
> + event->pmu->read(event);
> + *value = local64_read(&event->count);
> + }
> + if (enabled && running) {
> + u64 ctx_time = event->shadow_ctx_time + perf_clock();
> + *enabled = ctx_time - event->tstamp_enabled;
> + *running = ctx_time - event->tstamp_running;
> + }
>  out:
>   local_irq_restore(flags);

Please make that something like:


u64 now = event->shadow_ctx_time + perf_clock();

if (enabled)
*enabled = now - event->tstamp_enabled;

if (event->oncpu == smp_processor_id()) {
event->pmu->read(event);
if (running)
*running = now - event->tstamp_running;
} else if (running) {
*running = event->total_time_running;
}

And I'll fix it up when I make:

  
https://lkml.kernel.org/r/20170831171837.njnc6r6elsvkl...@hirez.programming.kicks-ass.net

happen.


Re: [PATCH net-next 1/4] bpf: add helper bpf_perf_read_counter_time for perf event array map

2017-09-01 Thread Peter Zijlstra
On Fri, Sep 01, 2017 at 01:29:17PM -0700, Alexei Starovoitov wrote:

> >+BPF_CALL_4(bpf_perf_read_counter_time, struct bpf_map *, map, u64, flags,
> >+struct bpf_perf_counter_time *, buf, u32, size)
> >+{
> >+struct perf_event *pe;
> >+u64 now;
> >+int err;
> >+
> >+if (unlikely(size != sizeof(struct bpf_perf_counter_time)))
> >+return -EINVAL;
> >+	err = get_map_perf_counter(map, flags, &buf->counter, &pe);
> >+if (err)
> >+return err;
> >+
> >+	calc_timer_values(pe, &now, &buf->time.enabled, &buf->time.running);
> >+return 0;
> >+}
> 
> Peter,
> I believe we're doing it correctly above.
> It's a copy paste of the same logic as in total_time_enabled/running.
> We cannot expose total_time_enabled/running to bpf, since they are
> different counters. The above two are specific to bpf usage.
> See commit log.

No, the patch is atrocious and the usage is wrong.

Exporting a function called 'calc_timer_values' is a horrible violation
of the namespace.

And it's wrong because it should be done in conjunction with
perf_event_read_local(). You cannot afterwards call this because you
don't know if the event was active when you read it and you don't have
temporal guarantees; that is, reading these timestamps long after or
before the read is wrong, and this interface allows it.

So no, sorry this is just fail.


Re: [PATCH net-next 1/4] bpf: add helper bpf_perf_read_counter_time for perf event array map

2017-09-01 Thread Peter Zijlstra
On Fri, Sep 01, 2017 at 09:53:54AM -0700, Yonghong Song wrote:
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index b14095b..7fd5e94 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -901,6 +901,8 @@ extern void perf_pmu_migrate_context(struct pmu *pmu,
>  int perf_event_read_local(struct perf_event *event, u64 *value);
>  extern u64 perf_event_read_value(struct perf_event *event,
>u64 *enabled, u64 *running);
> +extern void calc_timer_values(struct perf_event *event, u64 *now,
> + u64 *enabled, u64 *running);
>  
>  

> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 8c01572..ef5c7fb 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -4883,7 +4883,7 @@ static int perf_event_index(struct perf_event *event)
>   return event->pmu->event_idx(event);
>  }
>  
> -static void calc_timer_values(struct perf_event *event,
> +void calc_timer_values(struct perf_event *event,
>   u64 *now,
>   u64 *enabled,
>   u64 *running)

Yeah, not going to happen...

Why not do the obvious thing and extend perf_event_read_local() to
optionally return the enabled/running times?




Re: [PATCH net-next v2 1/2] bpf: add support for sys_enter_* and sys_exit_* tracepoints

2017-08-03 Thread Peter Zijlstra
On Wed, Aug 02, 2017 at 10:28:27PM -0700, Yonghong Song wrote:
> Currently, bpf programs cannot be attached to sys_enter_* and sys_exit_*
> style tracepoints. The iovisor/bcc issue #748
> (https://github.com/iovisor/bcc/issues/748) documents this issue.
> For example, if you try to attach a bpf program to tracepoints
> syscalls/sys_enter_newfstat, you will get the following error:
># ./tools/trace.py t:syscalls:sys_enter_newfstat
>Ioctl(PERF_EVENT_IOC_SET_BPF): Invalid argument
>Failed to attach BPF to tracepoint
> 
> The main reason is that syscalls/sys_enter_* and syscalls/sys_exit_*
> tracepoints are treated differently from other tracepoints and there
> is no bpf hook to it.
> 
> This patch adds bpf support for these syscalls tracepoints by
>   . permitting bpf attachment in ioctl PERF_EVENT_IOC_SET_BPF
>   . calling bpf programs in perf_syscall_enter and perf_syscall_exit
> 
> Signed-off-by: Yonghong Song 

Ack for the perf bits, but you should've Cc'ed steve too I suppose.

> ---
>  include/linux/syscalls.h  |  6 +
>  kernel/events/core.c  |  8 ---
>  kernel/trace/trace_syscalls.c | 53 
> +--
>  3 files changed, 62 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 3cb15ea..00fa3eb 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -117,6 +117,12 @@ extern struct trace_event_class event_class_syscall_exit;
>  extern struct trace_event_functions enter_syscall_print_funcs;
>  extern struct trace_event_functions exit_syscall_print_funcs;
>  
> +static inline int is_syscall_trace_event(struct trace_event_call *tp_event)
> +{
> + return tp_event->class == &event_class_syscall_enter ||
> +        tp_event->class == &event_class_syscall_exit;
> +}
> +
>  #define SYSCALL_TRACE_ENTER_EVENT(sname) \
>   static struct syscall_metadata __syscall_meta_##sname;  \
>   static struct trace_event_call __used   \
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 426c2ff..750b8d3 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -8050,7 +8050,7 @@ static void perf_event_free_bpf_handler(struct 
> perf_event *event)
>  
>  static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd)
>  {
> - bool is_kprobe, is_tracepoint;
> + bool is_kprobe, is_tracepoint, is_syscall_tp;
>   struct bpf_prog *prog;
>  
>   if (event->attr.type != PERF_TYPE_TRACEPOINT)
> @@ -8061,7 +8061,8 @@ static int perf_event_set_bpf_prog(struct perf_event 
> *event, u32 prog_fd)
>  
>   is_kprobe = event->tp_event->flags & TRACE_EVENT_FL_UKPROBE;
>   is_tracepoint = event->tp_event->flags & TRACE_EVENT_FL_TRACEPOINT;
> - if (!is_kprobe && !is_tracepoint)
> + is_syscall_tp = is_syscall_trace_event(event->tp_event);
> + if (!is_kprobe && !is_tracepoint && !is_syscall_tp)
>   /* bpf programs can only be attached to u/kprobe or tracepoint 
> */
>   return -EINVAL;
>  
> @@ -8070,7 +8071,8 @@ static int perf_event_set_bpf_prog(struct perf_event 
> *event, u32 prog_fd)
>   return PTR_ERR(prog);
>  
>   if ((is_kprobe && prog->type != BPF_PROG_TYPE_KPROBE) ||
> - (is_tracepoint && prog->type != BPF_PROG_TYPE_TRACEPOINT)) {
> + (is_tracepoint && prog->type != BPF_PROG_TYPE_TRACEPOINT) ||
> + (is_syscall_tp && prog->type != BPF_PROG_TYPE_TRACEPOINT)) {
>   /* valid fd, but invalid bpf program type */
>   bpf_prog_put(prog);
>   return -EINVAL;
> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> index 5e10395..3bd9e1c 100644
> --- a/kernel/trace/trace_syscalls.c
> +++ b/kernel/trace/trace_syscalls.c
> @@ -559,11 +559,29 @@ static DECLARE_BITMAP(enabled_perf_exit_syscalls, 
> NR_syscalls);
>  static int sys_perf_refcount_enter;
>  static int sys_perf_refcount_exit;
>  
> +static int perf_call_bpf_enter(struct bpf_prog *prog, struct pt_regs *regs,
> +   struct syscall_metadata *sys_data,
> +   struct syscall_trace_enter *rec) {
> + struct syscall_tp_t {
> + unsigned long long regs;
> + unsigned long syscall_nr;
> + unsigned long args[6]; /* maximum 6 arguments */
> + } param;
> + int i;
> +
> + *(struct pt_regs **)&param = regs;
> + param.syscall_nr = rec->nr;
> + for (i = 0; i < sys_data->nb_args && i < 6; i++)
> + param.args[i] = rec->args[i];
> + return trace_call_bpf(prog, &param);
> +}
> +
>  static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
>  {
>   struct syscall_metadata *sys_data;
>   struct syscall_trace_enter *rec;
>   struct hlist_head *head;
> + struct bpf_prog *prog;
>   int syscall_nr;
>   int rctx;
>   int size;
> @@ -578,8 +596,9 @@ static 
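
For reference, a rough user-space sketch of the attach path this patch
enables (hypothetical helper, error handling trimmed; prog_fd is a
BPF_PROG_TYPE_TRACEPOINT program already loaded via bpf(BPF_PROG_LOAD),
and tp_id is the tracepoint id read from
/sys/kernel/debug/tracing/events/syscalls/sys_enter_newfstat/id):

	#include <linux/perf_event.h>
	#include <string.h>
	#include <sys/ioctl.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static int attach_bpf_to_syscall_tp(int prog_fd, int tp_id)
	{
		struct perf_event_attr attr;
		int fd;

		memset(&attr, 0, sizeof(attr));
		attr.size = sizeof(attr);
		attr.type = PERF_TYPE_TRACEPOINT;
		attr.config = tp_id;		/* the syscalls:* tracepoint */
		attr.sample_period = 1;

		fd = syscall(__NR_perf_event_open, &attr, -1 /* pid */,
			     0 /* cpu */, -1 /* group_fd */, 0);
		if (fd < 0)
			return -1;

		/*
		 * Before this patch the ioctl below failed with -EINVAL for
		 * the syscalls:* tracepoints; with it, the BPF program is
		 * attached and invoked from perf_syscall_enter/exit.
		 */
		if (ioctl(fd, PERF_EVENT_IOC_SET_BPF, prog_fd) < 0 ||
		    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0) < 0) {
			close(fd);
			return -1;
		}
		return fd;
	}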

Re: [PATCH net-next 1/2] bpf: add support for sys_{enter|exit}_* tracepoints

2017-08-02 Thread Peter Zijlstra
On Tue, Aug 01, 2017 at 11:30:04PM -0700, Yonghong Song wrote:
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 426c2ff..623c977 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -8050,7 +8050,7 @@ static void perf_event_free_bpf_handler(struct 
> perf_event *event)
>  
>  static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd)
>  {
> - bool is_kprobe, is_tracepoint;
> + bool is_cap_any, is_kprobe, is_tracepoint;
>   struct bpf_prog *prog;
>  
>   if (event->attr.type != PERF_TYPE_TRACEPOINT)
> @@ -8059,9 +8059,11 @@ static int perf_event_set_bpf_prog(struct perf_event 
> *event, u32 prog_fd)
>   if (event->tp_event->prog)
>   return -EEXIST;
>  
> + /* currently, CAP_ANY only for sys_enter_* and sys_exit_* tracepoints */
> + is_cap_any = event->tp_event->flags & TRACE_EVENT_FL_CAP_ANY;
>   is_kprobe = event->tp_event->flags & TRACE_EVENT_FL_UKPROBE;
>   is_tracepoint = event->tp_event->flags & TRACE_EVENT_FL_TRACEPOINT;
> - if (!is_kprobe && !is_tracepoint)
> + if (!is_cap_any && !is_kprobe && !is_tracepoint)
>   /* bpf programs can only be attached to u/kprobe or tracepoint 
> */
>   return -EINVAL;
>  
> @@ -8070,7 +8072,8 @@ static int perf_event_set_bpf_prog(struct perf_event 
> *event, u32 prog_fd)
>   return PTR_ERR(prog);
>  
>   if ((is_kprobe && prog->type != BPF_PROG_TYPE_KPROBE) ||
> - (is_tracepoint && prog->type != BPF_PROG_TYPE_TRACEPOINT)) {
> + (is_tracepoint && prog->type != BPF_PROG_TYPE_TRACEPOINT) ||
> + (is_cap_any && prog->type != BPF_PROG_TYPE_TRACEPOINT)) {
>   /* valid fd, but invalid bpf program type */
>   bpf_prog_put(prog);
>   return -EINVAL;

This looks wrong. FL_CAP_ANY is a privilege thing, not something that
identifies syscall hooks (it just so happens only those now have the bit
set, but that's an accident more than anything else).



Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-07 Thread Peter Zijlstra
On Fri, Jul 07, 2017 at 12:33:49PM +0200, Ingo Molnar wrote:

>  [1997/04] v2.1.36:
> 
>   the spin_unlock_wait() primitive gets introduced as part of release()


Whee, that goes _way_ further back than I thought it did :-)

>  [2017/07] v4.12:
> 
>   wait_task_inactive() is still alive and kicking. Its poll loop has
>   increased in complexity, but it still does not use spin_unlock_wait()
> 

I've tried 'fixing' that wait_task_inactive() thing a number of times,
but always failed :/ That is very nasty code indeed.


Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-07 Thread Peter Zijlstra
On Fri, Jul 07, 2017 at 10:31:28AM +0200, Ingo Molnar wrote:
> Here's a quick list of all the use cases:
> 
>  net/netfilter/nf_conntrack_core.c:
> 
>- This is I believe the 'original', historic spin_unlock_wait() usecase 
> that
>  still exists in the kernel. spin_unlock_wait() is only used in a rare 
> case, 
>  when the netfilter hash is resized via nf_conntrack_hash_resize() - 
> which is 
>  a very heavy operation to begin with. It will no doubt get slower with 
> the 
>  proposed changes, but it probably does not matter. A networking person 
>  Acked-by would be nice though.
> 
>  drivers/ata/libata-eh.c:
> 
>- Locking of the ATA port in ata_scsi_cmd_error_handler(), presumably this 
> can
>  race with IRQs and ioctls() on other CPUs. Very likely not performance 
>  sensitive in any fashion, on IO errors things stop for many seconds 
> anyway.
> 
>  ipc/sem.c:
> 
>- A rare race condition branch in the SysV IPC semaphore freeing code in 
>  exit_sem() - where even the main code flow is not performance sensitive, 
>  because typical database workloads get their semaphore arrays during 
> startup 
>  and don't ever do heavy runtime allocation/freeing of them.
> 
>  kernel/sched/completion.c:
> 
>- completion_done(). This is actually a (comparatively) rarely used 
> completion 
>  API call - almost all the upstream usecases are in drivers, plus two in 
>  filesystems - neither usecase seems in a performance critical hot path. 
>  Completions typically involve scheduling and context switching, so in 
> the 
>  worst case the proposed change adds overhead to a scheduling slow path.
> 

You missed the one in do_exit(), which I thought was the original one.


Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-06 Thread Peter Zijlstra
On Thu, Jul 06, 2017 at 12:49:12PM -0400, Alan Stern wrote:
> On Thu, 6 Jul 2017, Paul E. McKenney wrote:
> 
> > On Thu, Jul 06, 2017 at 06:10:47PM +0200, Peter Zijlstra wrote:
> > > On Thu, Jul 06, 2017 at 08:21:10AM -0700, Paul E. McKenney wrote:
> > > > And yes, there are architecture-specific optimizations for an
> > > > empty spin_lock()/spin_unlock() critical section, and the current
> > > > arch_spin_unlock_wait() implementations show some of these 
> > > > optimizations.
> > > > But I expect that performance benefits would need to be demonstrated at
> > > > the system level.
> > > 
> > > I do in fact contend there aren't any optimizations for the exact
> > > lock+unlock semantics.
> > 
> > You lost me on this one.
> > 
> > > The current spin_unlock_wait() is weaker. Most notably it will not (with
> > > exception of ARM64/PPC for other reasons) cause waits on other CPUs.
> > 
> > Agreed, weaker semantics allow more optimizations.  So use cases needing
> > only the weaker semantics should more readily show performance benefits.
> > But either way, we need compelling use cases, and I do not believe that
> > any of the existing spin_unlock_wait() calls are compelling.  Perhaps I
> > am confused, but I am not seeing it for any of them.
> 
> If somebody really wants the full spin_unlock_wait semantics and
> doesn't want to interfere with other CPUs, wouldn't synchronize_sched()
> or something similar do the job?  It wouldn't be as efficient as
> lock+unlock, but it also wouldn't affect other CPUs.

So please don't do that. That'll create massive pain for RT. Also I
don't think it works. The whole point was that spin_unlock_wait() is
_cheaper_ than lock()+unlock(). If it gets to be more expensive there is
absolutely no point in using it.




Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-06 Thread Peter Zijlstra
On Thu, Jul 06, 2017 at 09:20:24AM -0700, Paul E. McKenney wrote:
> On Thu, Jul 06, 2017 at 06:05:55PM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 06, 2017 at 02:12:24PM +, David Laight wrote:
> > > From: Paul E. McKenney
> 
> [ . . . ]
> 
> > Now on the one hand I feel like Oleg that it would be a shame to lose
> > the optimization, OTOH this thing is really really tricky to use,
> > and has led to a number of bugs already.
> 
> I do agree, it is a bit sad to see these optimizations go.  So, should
> this make mainline, I will be tagging the commits that spin_unlock_wait()
> so that they can be easily reverted should someone come up with good
> semantics and a compelling use case with compelling performance benefits.

Ha! But what would constitute 'good semantics'?

The current thing is something along the lines of:

  "Waits for the currently observed critical section
   to complete with ACQUIRE ordering such that it will observe
   whatever state was left by said critical section."

With the 'obvious' benefit of limited interference on those actually
wanting to acquire the lock, and a shorter wait time on our side too,
since we only need to wait for completion of the current section, and
not for however many contenders are before us.

Not sure I have an actual (micro) benchmark that shows a difference
though.



Is this all good enough to retain the thing? I dunno. Like I said, I'm
conflicted on the whole thing. On the one hand it's a nice optimization,
on the other hand I don't want to have to keep fixing these bugs.
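
To make the trade-off concrete, the shape of the usage being debated is
roughly the following (a sketch with made-up names, not lifted from any
of the call sites in this series):

	struct thing {
		spinlock_t	lock;
		bool		done;
	};

	static void cleanup(struct thing *t);	/* hypothetical */

	/* Some other CPU's critical section leaves state behind. */
	static void finish(struct thing *t)
	{
		spin_lock(&t->lock);
		t->done = true;
		spin_unlock(&t->lock);
	}

	/*
	 * Old form: wait for the currently observed critical section to
	 * complete, with ACQUIRE ordering, then look at whatever state it
	 * left behind.
	 */
	static void teardown_old(struct thing *t)
	{
		spin_unlock_wait(&t->lock);
		if (t->done)
			cleanup(t);
	}

	/*
	 * What the series converts such sites to: a full lock+unlock pair,
	 * which also partakes in the fairness protocol but may be costlier.
	 */
	static void teardown_new(struct thing *t)
	{
		spin_lock(&t->lock);
		spin_unlock(&t->lock);
		if (t->done)
			cleanup(t);
	}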


Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-06 Thread Peter Zijlstra
On Thu, Jul 06, 2017 at 09:24:12AM -0700, Paul E. McKenney wrote:
> On Thu, Jul 06, 2017 at 06:10:47PM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 06, 2017 at 08:21:10AM -0700, Paul E. McKenney wrote:
> > > And yes, there are architecture-specific optimizations for an
> > > empty spin_lock()/spin_unlock() critical section, and the current
> > > arch_spin_unlock_wait() implementations show some of these optimizations.
> > > But I expect that performance benefits would need to be demonstrated at
> > > the system level.
> > 
> > I do in fact contend there aren't any optimizations for the exact
> > lock+unlock semantics.
> 
> You lost me on this one.

For the exact semantics you'd have to fully participate in the fairness
protocol. You have to in fact acquire the lock in order to have the
other contending CPUs wait (otherwise my earlier case 3 will fail).

At that point I'm not sure there is much actual code you can leave out.

What actual optimization is there left at that point?


Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-06 Thread Peter Zijlstra
On Thu, Jul 06, 2017 at 08:21:10AM -0700, Paul E. McKenney wrote:
> And yes, there are architecture-specific optimizations for an
> empty spin_lock()/spin_unlock() critical section, and the current
> arch_spin_unlock_wait() implementations show some of these optimizations.
> But I expect that performance benefits would need to be demonstrated at
> the system level.

I do in fact contend there aren't any optimizations for the exact
lock+unlock semantics.

The current spin_unlock_wait() is weaker. Most notably it will not (with
exception of ARM64/PPC for other reasons) cause waits on other CPUs.



Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-06 Thread Peter Zijlstra
On Thu, Jul 06, 2017 at 02:12:24PM +, David Laight wrote:
> From: Paul E. McKenney
> > Sent: 06 July 2017 00:30
> > There is no agreed-upon definition of spin_unlock_wait()'s semantics,
> > and it appears that all callers could do just as well with a lock/unlock
> > pair.  This series therefore removes spin_unlock_wait() and changes
> > its users to instead use a lock/unlock pair.  The commits are as follows,
> > in three groups:
> > 
> > 1-7.  Change uses of spin_unlock_wait() and raw_spin_unlock_wait()
> > to instead use a spin_lock/spin_unlock pair.  These may be
> > applied in any order, but must be applied before any later
> > commits in this series.  The commit logs state why I believe
> > that these commits won't noticeably degrade performance.
> 
> I can't help feeling that it might be better to have a spin_lock_sync()
> call that is equivalent to a spin_lock/spin_unlock pair.
> The default implementation being an inline function that does exactly that.
> This would let an architecture implement a more efficient version.

So that has the IRQ inversion issue found earlier in this patch-set. Not
actually doing the acquire though breaks other guarantees. See later.
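
(For reference, the default implementation David describes would
presumably be nothing more than the sketch below, with architectures
free to override it with something cheaper:

	static inline void spin_lock_sync(spinlock_t *lock)
	{
		spin_lock(lock);
		spin_unlock(lock);
	}

The rest of this reply is about whether such a cheaper override can
exist at all while keeping the "equivalent to lock+unlock" promise.)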

> It might even be that this is the defined semantics of spin_unlock_wait().

As is, spin_unlock_wait() is somewhat ill defined. IIRC it grew from an
optimization by Oleg and subsequently got used elsewhere. And it being
the subtle bugger it is, there were bugs.

But part of the problem with that definition is fairness. For fair locks,
spin_lock()+spin_unlock() partakes in the fairness protocol. Many of the
things that would make spin_lock_sync() cheaper preclude it doing that.

(with the exception of ticket locks, those could actually do this).

But I think we can argue we don't in fact want that; all we really need
is to ensure the completion of the _current_ critical section. But then
you've violated that equivalence.

As is, this is all a bit up in the air -- some of the ticket lock
variants are fair or minimal, the qspinlock one is prone to starvation
(although I think I can in fact fix that for qspinlock).

> Note that it can only be useful to do a spin_lock/unlock pair if it is
> impossible for another code path to try to acquire the lock.
> (Or, at least, the code can't care if the lock is acquired just after.)
> So if it can be determined that the lock isn't held (a READ_ONCE()
> might be enough) the lock itself need not be acquired (with the
> associated slow bus cycles).

Now look at the ARM64/PPC implementations that do explicit stores.

READ_ONCE() can only ever be sufficient on strongly ordered
architectures, but given many performance critical architectures are in
fact weakly ordered, you've opened up the exact can of worms we want to
get rid of the thing for.


So given:

(1)

  spin_lock()
  spin_unlock()

does not in fact provide memory ordering. Any prior/subsequent
load/stores can leak into the section and cross there.

What the thing does do however is serialize against other critical
sections in that:

(2)

  CPU0                          CPU1

                                spin_lock()
  spin_lock()
                                X = 5
                                spin_unlock()
  spin_unlock()
  r = X;

we must have r == 5 (due to address dependency on the lock; CPU1 does
the store to X, then a store-release to the lock. CPU0 then does a
load-acquire on the lock and that fully orders the subsequent load of X
to the prior store of X).


The other way around is more difficult though:

(3)

  CPU0                          CPU1

  X = 5
  spin_lock()
                                spin_lock()
  spin_unlock()
                                r = X;
                                spin_unlock()

While the above will in fact observe r == 5, this will be very difficult
to achieve with anything that will not let CPU1 wait. Which was the
entire premise of the original optimization by Oleg.

One of the later fixes we did to spin_unlock_wait() is to give it
ACQUIRE semantics to deal with (2) but even that got massively tricky,
see for example the ARM64 / PPC implementations.



In short, I doubt there is any real optimization possible if you want to
retain exact lock+unlock semantics.

Now on the one hand I feel like Oleg that it would be a shame to lose
the optimization, OTOH this thing is really really tricky to use,
and has led to a number of bugs already.


Re: [PATCH v3 net-next 1/3] perf, bpf: Add BPF support to all perf_event types

2017-06-02 Thread Peter Zijlstra
On Thu, Jun 01, 2017 at 07:03:34PM -0700, Alexei Starovoitov wrote:
> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> index 172dc8ee0e3b..ab93490d1a00 100644
> --- a/kernel/bpf/arraymap.c
> +++ b/kernel/bpf/arraymap.c
> @@ -452,39 +452,18 @@ static void bpf_event_entry_free_rcu(struct 
> bpf_event_entry *ee)
>  static void *perf_event_fd_array_get_ptr(struct bpf_map *map,
>struct file *map_file, int fd)
>  {
> - const struct perf_event_attr *attr;
>   struct bpf_event_entry *ee;
> - struct perf_event *event;
>   struct file *perf_file;
>  
>   perf_file = perf_event_get(fd);
>   if (IS_ERR(perf_file))
>   return perf_file;
>  
> - event = perf_file->private_data;
> - ee = ERR_PTR(-EINVAL);
> -
> - attr = perf_event_attrs(event);
> - if (IS_ERR(attr) || attr->inherit)
> - goto err_out;
> -
> - switch (attr->type) {
> - case PERF_TYPE_SOFTWARE:
> - if (attr->config != PERF_COUNT_SW_BPF_OUTPUT)
> - goto err_out;
> - /* fall-through */
> - case PERF_TYPE_RAW:
> - case PERF_TYPE_HARDWARE:
> - ee = bpf_event_entry_gen(perf_file, map_file);
> - if (ee)
> - return ee;
> - ee = ERR_PTR(-ENOMEM);
> - /* fall-through */
> - default:
> - break;
> - }

Would it make sense to call perf_event_read_local() on the events here
in order to weed out the -EOPNOTSUPP  ones?

> + ee = bpf_event_entry_gen(perf_file, map_file);
> + if (ee)
> + return ee;
>  
> -err_out:
> + ee = ERR_PTR(-ENOMEM);
>   fput(perf_file);
>   return ee;
>  }
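
One possible shape of the check being asked about (a sketch only, not a
posted patch): probe the event once via perf_event_read_local() at map
update time, so that events the helper can never read are rejected up
front rather than at run time.

	static void *perf_event_fd_array_get_ptr(struct bpf_map *map,
						 struct file *map_file, int fd)
	{
		struct bpf_event_entry *ee;
		struct file *perf_file;
		u64 value;
		int err;

		perf_file = perf_event_get(fd);
		if (IS_ERR(perf_file))
			return perf_file;

		/*
		 * Weed out events that perf_event_read_local() can never
		 * read because they have a pmu::count method; other errors
		 * (e.g. the event living on another CPU right now) may be
		 * transient, so let those through.
		 */
		err = perf_event_read_local(perf_file->private_data, &value);
		if (err == -EOPNOTSUPP) {
			fput(perf_file);
			return ERR_PTR(err);
		}

		ee = bpf_event_entry_gen(perf_file, map_file);
		if (ee)
			return ee;

		ee = ERR_PTR(-ENOMEM);
		fput(perf_file);
		return ee;
	}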

