On Wed, Sep 19, 2012 at 4:01 PM, Avi Kivity <a...@redhat.com> wrote:
> On 09/17/2012 05:24 AM, liu ping fan wrote:
>> On Thu, Sep 13, 2012 at 4:19 PM, Avi Kivity <a...@redhat.com> wrote:
>>> On 09/13/2012 09:55 AM, liu ping fan wrote:
>>>> On Tue, Sep 11, 2012 at 8:41 PM, Avi Kivity <a...@redhat.com> wrote:
>>>>> On 09/11/2012 03:24 PM, Avi Kivity wrote:
>>>>>> On 09/11/2012 12:57 PM, Jan Kiszka wrote:
>>>>>>> On 2012-09-11 11:44, liu ping fan wrote:
>>>>>>>> On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity <a...@redhat.com> wrote:
>>>>>>>>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>>>>>>>>>> From: Liu Ping Fan <pingf...@linux.vnet.ibm.com>
>>>>>>>>>>
>>>>>>>>>> The func call chain can suffer from recursively holding
>>>>>>>>>> qemu_mutex_lock_iothread. We introduce a lockmap to record the
>>>>>>>>>> lock depth.
>>>>>>>>>
>>>>>>>>> What is the root cause? io handlers initiating I/O?
>>>>>>>>>
>>>>>>>> cpu_physical_memory_rw() can be called nested, and when called, it
>>>>>>>> can be protected by no lock, the device lock, or the big lock.
>>>>>>>> I think that even without the big lock, the io-dispatcher faces the
>>>>>>>> same issue. As to the main-loop, I have not considered it
>>>>>>>> carefully, but at least dma-helper will call
>>>>>>>> cpu_physical_memory_rw() with the big lock held.
>>>>>>>
>>>>>>> That is our core problem: inconsistent invocation of existing
>>>>>>> services w.r.t. locking. For portio, I was lucky that there is no
>>>>>>> nesting and I was able to drop the big lock around all (x86) call
>>>>>>> sites. But MMIO is way more tricky due to DMA nesting.
>>>>>>
>>>>>> Maybe we need to switch to a continuation style. Instead of
>>>>>> expecting cpu_physical_memory_rw() to complete synchronously, it
>>>>>> becomes an asynchronous call and you provide it with a completion.
>>>>>> That means devices which use it are forced to drop the lock in
>>>>>> between. Block and network clients will be easy to convert since
>>>>>> they already use APIs that drop the lock (except for accessing the
>>>>>> descriptors).
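The lockmap idea from the quoted patch description (record a per-thread lock depth so that nested qemu_mutex_lock_iothread() calls don't self-deadlock) could be sketched roughly as below. The names big_lock_acquire()/big_lock_release() and the depth counter are illustrative assumptions, not the actual patch:

```c
#include <assert.h>
#include <pthread.h>

/* Illustrative sketch (not the actual patch): a per-thread depth
 * counter so nested acquisitions by the same thread don't deadlock. */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
static __thread int lock_depth;      /* this thread's nesting depth */

void big_lock_acquire(void)
{
    if (lock_depth++ == 0) {
        pthread_mutex_lock(&big_lock);   /* outermost call really locks */
    }
}

void big_lock_release(void)
{
    assert(lock_depth > 0);
    if (--lock_depth == 0) {
        pthread_mutex_unlock(&big_lock); /* outermost call really unlocks */
    }
}

int big_lock_depth(void)
{
    return lock_depth;
}
```

Only the outermost acquire/release touches the real mutex; inner pairs just move the counter, which is what makes the nested call chain safe.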
>>>>>>
>>>>>>> We could try to introduce a different version of
>>>>>>> cpu_physical_memory_rw, cpu_physical_memory_rw_unlocked. But the
>>>>>>> problem remains that an MMIO request can trigger the very same
>>>>>>> access in a nested fashion, and we will have to detect this to
>>>>>>> avoid locking up QEMU (locking up the guest might be OK).
>>>>>>
>>>>>> An async version of c_p_m_rw() will just cause a completion to
>>>>>> bounce around, consuming cpu but not deadlocking anything. If we
>>>>>> can keep a count of the bounces, we might be able to stall it
>>>>>> indefinitely or at least ratelimit it.
>>>>>>
>>>>>
>>>>> Another option is to require all users of c_p_m_rw() and related to
>>>>> use a coroutine or thread. That makes the programming easier (but
>>>>> still requires a revalidation after the dropped lock).
>>>>>
>>>> For the nested cpu_physical_memory_rw(), we change its internals but
>>>> keep the sync API as it is (wrapping the current
>>>> cpu_physical_memory_rw() into cpu_physical_memory_rw_internal()):
>>>>
>>>> LOCK()  // can be the device lock, the big lock, or both; depends on caller
>>>> ..............
>>>> cpu_physical_memory_rw()
>>>> {
>>>>     UNLOCK()  // unlock all the locks
>>>>     queue_work_on_thread(cpu_physical_memory_rw_internal, completion);
>>>>     // cpu_physical_memory_rw_internal can take locks (device, biglock) again
>>>>     wait_for_completion(completion)
>>>>     LOCK()
>>>> }
>>>> ..................
>>>> UNLOCK()
>>>>
>>> This is dangerous. The caller expects to hold the lock across the
>>> call, and will not re-validate its state.
>>>
>>>> cpu_physical_memory_rw_internal() is then free to take locks, but
>>>> with a precondition -- we still need to trace the lock stack taken
>>>> by cpu_physical_memory_rw(), so that it can return to the caller
>>>> correctly. Is that OK?
>>>
>>> I'm convinced that we need a recursive lock if we don't convert
>>> everything at once.
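The wrapper sketched in the quoted pseudocode (keep the caller-facing API synchronous, push the real access onto another thread, and block on a completion) might look like this in plain pthreads. RwJob, rw_internal(), and cpu_physical_memory_rw_sketch() are made-up stand-ins, and the actual lock drop/re-acquire is only marked by comments:

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

/* A minimal completion primitive: a flag guarded by a mutex/condvar. */
typedef struct Completion {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    int done;
} Completion;

static void completion_init(Completion *c)
{
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->cond, NULL);
    c->done = 0;
}

static void complete(Completion *c)
{
    pthread_mutex_lock(&c->lock);
    c->done = 1;
    pthread_cond_signal(&c->cond);
    pthread_mutex_unlock(&c->lock);
}

static void wait_for_completion(Completion *c)
{
    pthread_mutex_lock(&c->lock);
    while (!c->done) {
        pthread_cond_wait(&c->cond, &c->lock);
    }
    pthread_mutex_unlock(&c->lock);
}

struct RwJob {
    char *buf;                /* destination of the (fake) memory access */
    Completion completion;
};

/* Stands in for cpu_physical_memory_rw_internal(): free to take the
 * device lock and the big lock again, since the caller dropped them. */
static void *rw_internal(void *opaque)
{
    struct RwJob *job = opaque;
    strcpy(job->buf, "data");           /* pretend DMA */
    complete(&job->completion);
    return NULL;
}

/* The caller-facing API stays synchronous. */
void cpu_physical_memory_rw_sketch(char *buf)
{
    struct RwJob job = { .buf = buf };
    pthread_t worker;

    completion_init(&job.completion);
    /* UNLOCK() would go here: drop device lock and big lock */
    pthread_create(&worker, NULL, rw_internal, &job);
    wait_for_completion(&job.completion);
    pthread_join(&worker, NULL);
    /* LOCK() would go here: re-acquire and, as Avi notes, revalidate */
}
```

Avi's objection applies exactly at the final comment: once the lock has been dropped and re-taken, the caller's cached view of device state may be stale.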
>>>
>>> So how about:
>>>
>>> - make bql a recursive lock (use pthreads, don't invent our own)
>>> - change c_p_m_rw() calling convention to "caller may hold the BQL,
>>>   but no device lock"
>>>
>>> this allows devices to DMA each other. Unconverted device models will
>>> work as they used to. Converted device models will need to drop the
>>> device lock, re-acquire it, and revalidate any state. That will cause
>>
>> I think we are cornered by devices DMAing each other, which raises
>> unavoidable nested locking. Also, to avoid deadlocks caused by device
>> lock ordering, we should drop the current device lock before acquiring
>> another one. The only small divergence is about the "revalidate": do
>> we need to roll back?
>
> It basically means you can't hold contents of device state in local
> variables. You need to read everything again from the device. That
> includes things like DMA enable bits.
>
I think that reading everything again from the device cannot work.
Suppose the following scene: the device's state is changed as a series
of internal registers (say partA+partB, then partC+partD). After partA
is changed, the device's lock is dropped. At that point another access
to the device will read partA+partB to determine C+D, but since partB
has not been correctly updated yet, C+D may be decided from a broken
context and come out wrong.
> (btw: do we respect PCI_COMMAND_MASTER? it seems that we do not. This
> is a trivial example of an iommu, we should get that going).
>
>> Or for a converted device, we can just tag it with a busy flag, that
>> is, check&set the busy flag at the entry of the device's
>> mmio-dispatch. So when we re-acquire the device's lock, the device's
>> state is intact.
>
> The state can be changed by a parallel access to another register,
> which is valid.
>
Do you mean that the device can be accessed in parallel? But how? We
use the device's lock. What I suggest is:

  lock();
  if (set_and_test(dev->busy)) {
      unlock();
      return;
  }
  /* change device registers */
  /* do other things, including calling c_p_m_rw();
   * here the lock is broken, but set_and_test() still guards us */
  clear(dev->busy);
  unlock();

So the changing of device registers is protected and unbreakable.

Regards,
pingfan

>>
>> Has anybody any other suggestion?
>
>
> --
> error compiling committee.c: too many arguments to function
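A rough sketch of the check&set busy-flag guard proposed above, using C11 atomics; the Dev struct, the reg_a/reg_b registers, and mmio_dispatch() are made-up names for illustration, and the device lock is only marked by comments:

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative device with two registers that must be updated as a
 * pair; the busy flag protects the pair across a dropped lock. */
typedef struct Dev {
    atomic_flag busy;       /* the check&set busy flag */
    int reg_a, reg_b;       /* registers updated together */
} Dev;

/* Returns 0 if the dispatch ran, -1 if the device was busy. */
int mmio_dispatch(Dev *dev, int val)
{
    /* lock(dev) would be taken here */
    if (atomic_flag_test_and_set(&dev->busy)) {
        /* unlock(dev) */
        return -1;          /* another dispatch is mid-update: bail out */
    }
    dev->reg_a = val;
    /* the device lock would be dropped here around c_p_m_rw(); the
     * busy flag keeps a second dispatch from seeing the half-done
     * reg_a/reg_b update */
    dev->reg_b = val;
    atomic_flag_clear(&dev->busy);
    /* unlock(dev) */
    return 0;
}
```

Note this encodes the trade-off Avi points at: the second access is rejected outright, even when it targets an unrelated register that could legally be written in parallel.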