On 21/09/2016 19:24, Emilio G. Cota wrote:
> On Mon, Sep 19, 2016 at 14:50:59 +0200, Paolo Bonzini wrote:
>> Set cpu->running without taking the cpu_list lock, only look at it if
>> there is a concurrent exclusive section.  This requires adding a new
>> field to CPUState, which records whether a running CPU is being counted
>> in pending_cpus.  When an exclusive section is started concurrently with
>> cpu_exec_start, cpu_exec_start can use the new field to wait for the end
>> of the exclusive section.
>>
>> This is a separate patch for easier bisection of issues.
>>
>> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
>> ---
>>  cpus-common.c              | 73 ++++++++++++++++++++++++++++++++++++++++------
>>  docs/tcg-exclusive.promela | 53 +++++++++++++++++++++++++++++++--
>>  include/qom/cpu.h          |  5 ++--
>>  3 files changed, 117 insertions(+), 14 deletions(-)
>>
>> diff --git a/cpus-common.c b/cpus-common.c
>> index f7ad534..46cf8ef 100644
>> --- a/cpus-common.c
>> +++ b/cpus-common.c
>> @@ -184,8 +184,12 @@ void start_exclusive(void)
>>  
>>      /* Make all other cpus stop executing.  */
>>      pending_cpus = 1;
>> +
>> +    /* Write pending_cpus before reading other_cpu->running.  */
>> +    smp_mb();
>>      CPU_FOREACH(other_cpu) {
>>          if (other_cpu->running) {
>> +            other_cpu->has_waiter = true;
>>              pending_cpus++;
>>              qemu_cpu_kick(other_cpu);
>>          }
>> @@ -212,24 +216,75 @@ void end_exclusive(void)
>>  /* Wait for exclusive ops to finish, and begin cpu execution.  */
>>  void cpu_exec_start(CPUState *cpu)
>>  {
>> -    qemu_mutex_lock(&qemu_cpu_list_mutex);
>> -    exclusive_idle();
>>      cpu->running = true;
>> -    qemu_mutex_unlock(&qemu_cpu_list_mutex);
>> +
>> +    /* Write cpu->running before reading pending_cpus.  */
>> +    smp_mb();
>> +
>> +    /* 1. start_exclusive saw cpu->running == true and pending_cpus >= 1.
>> +     * After taking the lock we'll see cpu->has_waiter == true and run---not
>> +     * for long because start_exclusive kicked us.  cpu_exec_end will
>> +     * decrement pending_cpus and signal the waiter.
>> +     *
>> +     * 2. start_exclusive saw cpu->running == false but pending_cpus >= 1.
>> +     * This includes the case when an exclusive item is running now.
>> +     * Then we'll see cpu->has_waiter == false and wait for the item to
>> +     * complete.
>> +     *
>> +     * 3. pending_cpus == 0.  Then start_exclusive is definitely going to
>> +     * see cpu->running == true, and it will kick the CPU.
>> +     */
>> +    if (pending_cpus) {
>> +        qemu_mutex_lock(&qemu_cpu_list_mutex);
>> +        if (!cpu->has_waiter) {
>> +            /* Not counted in pending_cpus, let the exclusive item
>> +             * run.  Since we have the lock, set cpu->running to true
>> +             * while holding it instead of retrying.
>> +             */
>> +            cpu->running = false;
>> +            exclusive_idle();
>> +            /* Now pending_cpus is zero.  */
>> +            cpu->running = true;
>> +        } else {
>> +            /* Counted in pending_cpus, go ahead.  */
>> +        }
>> +        qemu_mutex_unlock(&qemu_cpu_list_mutex);
>> +    }
> 
> wrt scenario (3): I don't think other threads will always see cpu->running == true.
> Consider the following:
> 
> cpu0                                  cpu1
> ----                                  ----
> 
> cpu->running = true;                  pending_cpus = 1;
> smp_mb();                             smp_mb();
> if (pending_cpus) { /* false */ }     CPU_FOREACH(other_cpu) { if (other_cpu->running) { /* false */ } }
> 
> The barriers here don't guarantee that changes are immediately visible to others
> (for that we need strong ops, i.e. atomics).

No, this is not true.  Barriers order stores and loads within a thread
_and_ establish synchronizes-with edges.

In the example above you are violating causality:

- cpu0 stores cpu->running before loading pending_cpus

- because cpu0 reads pending_cpus == 0, its load must happen before cpu1's
store of pending_cpus = 1

- cpu1 loads cpu->running after it stores pending_cpus

hence the only valid ordering is

  cpu->running = true
  if (pending_cpus)
                                        pending_cpus = 1
                                        if (other_cpu->running)

> Is there a performance (scalability) reason behind this patch?

Yes: it speeds up all cpu_exec_start/end, _not_ start/end_exclusive.

With this patch, as long as there are no start/end_exclusive calls (which are
supposed to be rare), there is no contention among multiple CPUs doing
cpu_exec_start/end.

Without it, as the number of CPUs grows, the global qemu_cpu_list_mutex is
going to become a bottleneck.

Paolo
