Hi Sergey,

Thank you for this precise summary.
On Thu, Jun 9, 2016 at 1:42 PM, Sergey Fedorov <serge.f...@gmail.com> wrote:
> Hi,
>
> On 19/04/16 16:39, Alvise Rigo wrote:
>> This patch series provides an infrastructure for atomic instruction
>> implementation in QEMU, thus offering a 'legacy' solution for
>> translating guest atomic instructions. Moreover, it can be considered as
>> a first step toward a multi-thread TCG.
>>
>> The underlying idea is to provide new TCG helpers (sort of softmmu
>> helpers) that guarantee atomicity to some memory accesses or in general
>> a way to define memory transactions.
>>
>> More specifically, the new softmmu helpers behave as LoadLink and
>> StoreConditional instructions, and are called from TCG code by means of
>> target specific helpers. This work includes the implementation for all
>> the ARM atomic instructions, see target-arm/op_helper.c.
>
> I think it is generally a good idea to provide LL/SC TCG operations
> for emulating guest atomic instruction behaviour, as those operations
> make it easy to implement other atomic primitives such as
> compare-and-swap and atomic arithmetic. Another advantage of these
> operations is that they are free from the ABA problem.
>
>> The implementation heavily uses the software TLB together with a new
>> bitmap that has been added to the ram_list structure which flags, on a
>> per-CPU basis, all the memory pages that are in the middle of a LoadLink
>> (LL), StoreConditional (SC) operation. Since all these pages can be
>> accessed directly through the fast-path and alter a vCPU's linked value,
>> the new bitmap has been coupled with a new TLB flag for the TLB virtual
>> address which forces the slow-path execution for all the accesses to a
>> page containing a linked address.
>
> But I'm afraid we've got a scalability problem using the software TLB
> engine heavily. This approach relies on a TLB flush of all CPUs, which
> is not a very cheap operation.
> That is going to be even more expensive in the case of MTTCG, as you
> need to exit the CPU execution loop in order to avoid deadlocks.
>
> I see you try to mitigate this issue by introducing a history of the
> N last pages touched by an exclusive access. That would work fine,
> avoiding excessive TLB flushes, as long as the current working set of
> exclusively accessed pages does not go beyond N. Once we exceed this
> limit we'll get a global TLB flush on most LL operations.

Indeed, if the guest runs a loop over N+1 atomic operations, at each
iteration of the loop we will trigger a flush for every LL, since each
access evicts a history entry that is about to be needed again.

> I'm afraid we can get a dramatic performance decrease as guest code
> implements a finer-grained locking scheme. I would like to emphasise
> that performance can degrade sharply and dramatically as soon as the
> limit gets exceeded. How could we tackle this problem?

In my opinion, the length of the history should not be fixed, to avoid
the drawback described above. We can make the history's length dynamic
(up to a given threshold) according to the pressure of atomic
instructions. What should remain constant is the time it takes to make
a full cycle through the history's array. We could, for instance, store
in the lower bits of the addresses in the history a sort of timestamp,
use it to measure that period, and adjust the length of the history
accordingly. What do you think?

I will also try to explore other ways to tackle the problem.

Best regards,
alvise

>
> Kind regards,
> Sergey