On Wed, Sep 21, 2016 at 20:19:18 +0200, Paolo Bonzini wrote:
(snip)
> No, this is not true. Barriers order stores and loads within a thread
> _and_ establish synchronizes-with edges.
>
> In the example above you are violating causality:
>
> - cpu0 stores cpu->running before loading pending_cpus
>
> - because pending_cpus == 0, cpu1 stores pending_cpus = 1 after cpu0
>   loads it
>
> - cpu1 loads cpu->running after it stores pending_cpus
OK. So I simplified the example to understand this better:

    cpu0                    cpu1
    ----                    ----
  { x = y = 0; r0 and r1 are private variables }
    x = 1                   y = 1
    smp_mb()                smp_mb()
    r0 = y                  r1 = x

Turns out this is scenario 10 here: https://lwn.net/Articles/573436/

The source of my confusion was not paying due attention to smp_mb(),
which is necessary for maintaining transitivity.

> > Is there a performance (scalability) reason behind this patch?
>
> Yes: it speeds up all cpu_exec_start/end, _not_ start/end_exclusive.
>
> With this patch, as long as there are no start/end_exclusive (which are
> supposed to be rare) there is no contention on multiple CPUs doing
> cpu_exec_start/end.
>
> Without it, as CPUs increase, the global cpu_list_mutex is going to
> become a bottleneck.

I see. Scalability-wise I wouldn't expect much improvement for MTTCG
full-system emulation, since the iothread lock is still acquired on every
CPU loop exit (just like in KVM). For user-mode, however, this should
yield measurable improvements =D

Thanks,

		E.