On Wed, Sep 19, 2012 at 4:01 PM, Avi Kivity <a...@redhat.com> wrote:
> On 09/17/2012 05:24 AM, liu ping fan wrote:
>> On Thu, Sep 13, 2012 at 4:19 PM, Avi Kivity <a...@redhat.com> wrote:
>>> On 09/13/2012 09:55 AM, liu ping fan wrote:
>>>> On Tue, Sep 11, 2012 at 8:41 PM, Avi Kivity <a...@redhat.com> wrote:
>>>>> On 09/11/2012 03:24 PM, Avi Kivity wrote:
>>>>>> On 09/11/2012 12:57 PM, Jan Kiszka wrote:
>>>>>>> On 2012-09-11 11:44, liu ping fan wrote:
>>>>>>>> On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity <a...@redhat.com> wrote:
>>>>>>>>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>>>>>>>>>> From: Liu Ping Fan <pingf...@linux.vnet.ibm.com>
>>>>>>>>>>
>>>>>>>>>> The func call chain can suffer from recursively holding
>>>>>>>>>> qemu_mutex_lock_iothread. We introduce a lockmap to record the
>>>>>>>>>> lock depth.
>>>>>>>>>
>>>>>>>>> What is the root cause? io handlers initiating I/O?
>>>>>>>>>
>>>>>>>> cpu_physical_memory_rw() can be called nested, and when called, it
>>>>>>>> can be protected by no lock, the device lock, or the big lock.
>>>>>>>> I think that even without the big lock, the io-dispatcher faces the
>>>>>>>> same issue. As to the main-loop, I have not considered it
>>>>>>>> carefully, but at least dma-helper will call
>>>>>>>> cpu_physical_memory_rw() with the big lock held.
>>>>>>>
>>>>>>> That is our core problem: inconsistent invocation of existing
>>>>>>> services w.r.t. locking. For portio, I was lucky that there is no
>>>>>>> nesting and I was able to drop the big lock around all (x86) call
>>>>>>> sites. But MMIO is way more tricky due to DMA nesting.
>>>>>>
>>>>>> Maybe we need to switch to a continuation style. Instead of
>>>>>> expecting cpu_physical_memory_rw() to complete synchronously, it
>>>>>> becomes an asynchronous call and you provide it with a completion.
>>>>>> That means devices which use it are forced to drop the lock in
>>>>>> between. Block and network clients will be easy to convert since
>>>>>> they already use APIs that drop the lock (except for accessing the
>>>>>> descriptors).
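The lockmap idea from the quoted patch description (record a per-thread lock depth so that nested qemu_mutex_lock_iothread() calls don't self-deadlock) could be sketched roughly as below. The names big_lock_acquire()/big_lock_release() and the depth counter are illustrative assumptions, not the actual patch:

```c
#include <assert.h>
#include <pthread.h>

/* Illustrative sketch (not the actual patch): a per-thread depth
 * counter so nested acquisitions by the same thread don't deadlock. */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
static __thread int lock_depth;      /* this thread's nesting depth */

void big_lock_acquire(void)
{
    if (lock_depth++ == 0) {
        pthread_mutex_lock(&big_lock);   /* outermost call really locks */
    }
}

void big_lock_release(void)
{
    assert(lock_depth > 0);
    if (--lock_depth == 0) {
        pthread_mutex_unlock(&big_lock); /* outermost call really unlocks */
    }
}

int big_lock_depth(void)
{
    return lock_depth;
}
```

Only the outermost acquire/release touches the real mutex; inner pairs just move the counter, which is what makes the nested call chain safe.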
>>>>>>
>>>>>>> We could try to introduce a different version of
>>>>>>> cpu_physical_memory_rw, cpu_physical_memory_rw_unlocked. But the
>>>>>>> problem remains that an MMIO request can trigger the very same
>>>>>>> access in a nested fashion, and we will have to detect this to
>>>>>>> avoid locking up QEMU (locking up the guest might be OK).
>>>>>>
>>>>>> An async version of c_p_m_rw() will just cause a completion to
>>>>>> bounce around, consuming cpu but not deadlocking anything. If we
>>>>>> can keep a count of the bounces, we might be able to stall it
>>>>>> indefinitely or at least ratelimit it.
>>>>>>
>>>>>
>>>>> Another option is to require all users of c_p_m_rw() and related to
>>>>> use a coroutine or thread. That makes the programming easier (but
>>>>> still requires a revalidation after the dropped lock).
>>>>>
>>>> For the nested cpu_physical_memory_rw(), we change its internals but
>>>> keep the sync API as it is (wrapping the current
>>>> cpu_physical_memory_rw() into cpu_physical_memory_rw_internal()):
>>>>
>>>> LOCK()  // can be the device lock, the big lock, or both; depends on caller
>>>> ..............
>>>> cpu_physical_memory_rw()
>>>> {
>>>>     UNLOCK()  // unlock all the locks
>>>>     queue_work_on_thread(cpu_physical_memory_rw_internal, completion);
>>>>     // cpu_physical_memory_rw_internal can take locks (device, biglock) again
>>>>     wait_for_completion(completion)
>>>>     LOCK()
>>>> }
>>>> ..................
>>>> UNLOCK()
>>>>
>>> This is dangerous. The caller expects to hold the lock across the
>>> call, and will not re-validate its state.
>>>
>>>> cpu_physical_memory_rw_internal() is then free to take locks, but
>>>> with a precondition -- we still need to trace the lock stack taken
>>>> by cpu_physical_memory_rw(), so that it can return to the caller
>>>> correctly. Is that OK?
>>>
>>> I'm convinced that we need a recursive lock if we don't convert
>>> everything at once.
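The wrapper sketched in the quoted pseudocode (keep the caller-facing API synchronous, push the real access onto another thread, and block on a completion) might look like this in plain pthreads. RwJob, rw_internal(), and cpu_physical_memory_rw_sketch() are made-up stand-ins, and the actual lock drop/re-acquire is only marked by comments:

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

/* A minimal completion primitive: a flag guarded by a mutex/condvar. */
typedef struct Completion {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    int done;
} Completion;

static void completion_init(Completion *c)
{
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->cond, NULL);
    c->done = 0;
}

static void complete(Completion *c)
{
    pthread_mutex_lock(&c->lock);
    c->done = 1;
    pthread_cond_signal(&c->cond);
    pthread_mutex_unlock(&c->lock);
}

static void wait_for_completion(Completion *c)
{
    pthread_mutex_lock(&c->lock);
    while (!c->done) {
        pthread_cond_wait(&c->cond, &c->lock);
    }
    pthread_mutex_unlock(&c->lock);
}

struct RwJob {
    char *buf;                /* destination of the (fake) memory access */
    Completion completion;
};

/* Stands in for cpu_physical_memory_rw_internal(): free to take the
 * device lock and the big lock again, since the caller dropped them. */
static void *rw_internal(void *opaque)
{
    struct RwJob *job = opaque;
    strcpy(job->buf, "data");           /* pretend DMA */
    complete(&job->completion);
    return NULL;
}

/* The caller-facing API stays synchronous. */
void cpu_physical_memory_rw_sketch(char *buf)
{
    struct RwJob job = { .buf = buf };
    pthread_t worker;

    completion_init(&job.completion);
    /* UNLOCK() would go here: drop device lock and big lock */
    pthread_create(&worker, NULL, rw_internal, &job);
    wait_for_completion(&job.completion);
    pthread_join(&worker, NULL);
    /* LOCK() would go here: re-acquire and, as Avi notes, revalidate */
}
```

Avi's objection applies exactly at the final comment: once the lock has been dropped and re-taken, the caller's cached view of device state may be stale.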
>>>
>>> So how about:
>>>
>>> - make bql a recursive lock (use pthreads, don't invent our own)
>>> - change c_p_m_rw() calling convention to "caller may hold the BQL,
>>>   but no device lock"
>>>
>>> this allows devices to DMA each other. Unconverted device models will
>>> work as they used to. Converted device models will need to drop the
>>> device lock, re-acquire it, and revalidate any state. That will cause
>>
>> I think we are cornered by devices DMAing each other, which raises
>> unavoidable nested locking. Also, to avoid deadlocks caused by device
>> lock ordering, we should drop the current device lock before acquiring
>> another one. The only small divergence is about the "revalidate": do
>> we need to roll back?
>
> It basically means you can't hold contents of device state in local
> variables. You need to read everything again from the device. That
> includes things like DMA enable bits.
>
I think that reading everything again from the device cannot work.
Suppose the following scene: the device's state is changed as a series
of internal registers (say partA+partB, then partC+partD). After partA
is changed, the device's lock is dropped. At that point another access
to the device will read partA+partB to determine C+D, but since partB
has not been correctly updated yet, C+D may be decided from a broken
context and come out wrong.
> (btw: do we respect PCI_COMMAND_MASTER? it seems that we do not. This
> is a trivial example of an iommu, we should get that going).
>
>> Or for a converted device, we can just tag it with a busy flag, that
>> is, check&set the busy flag at the entry of the device's
>> mmio-dispatch. So when we re-acquire the device's lock, the device's
>> state is intact.
>
> The state can be changed by a parallel access to another register,
> which is valid.
>
Do you mean that the device can be accessed in parallel? But how? We
use the device's lock. What I suggest is:

  lock();
  if (set_and_test(dev->busy)) {
      unlock();
      return;
  }
  /* change device registers */
  /* do other things, including calling c_p_m_rw();
   * here the lock is broken, but set_and_test() still guards us */
  clear(dev->busy);
  unlock();

So the changing of device registers is protected and unbreakable.

Regards,
pingfan

>>
>> Has anybody any other suggestion?
>
>
> --
> error compiling committee.c: too many arguments to function
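A rough sketch of the check&set busy-flag guard proposed above, using C11 atomics; the Dev struct, the reg_a/reg_b registers, and mmio_dispatch() are made-up names for illustration, and the device lock is only marked by comments:

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative device with two registers that must be updated as a
 * pair; the busy flag protects the pair across a dropped lock. */
typedef struct Dev {
    atomic_flag busy;       /* the check&set busy flag */
    int reg_a, reg_b;       /* registers updated together */
} Dev;

/* Returns 0 if the dispatch ran, -1 if the device was busy. */
int mmio_dispatch(Dev *dev, int val)
{
    /* lock(dev) would be taken here */
    if (atomic_flag_test_and_set(&dev->busy)) {
        /* unlock(dev) */
        return -1;          /* another dispatch is mid-update: bail out */
    }
    dev->reg_a = val;
    /* the device lock would be dropped here around c_p_m_rw(); the
     * busy flag keeps a second dispatch from seeing the half-done
     * reg_a/reg_b update */
    dev->reg_b = val;
    atomic_flag_clear(&dev->busy);
    /* unlock(dev) */
    return 0;
}
```

Note this encodes the trade-off Avi points at: the second access is rejected outright, even when it targets an unrelated register that could legally be written in parallel.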