Hi Sergey,

Thank you for this precise summary.
On Thu, Jun 9, 2016 at 1:42 PM, Sergey Fedorov <serge.f...@gmail.com> wrote:
> Hi,
>
> On 19/04/16 16:39, Alvise Rigo wrote:
>> This patch series provides an infrastructure for atomic instruction
>> implementation in QEMU, thus offering a 'legacy' solution for
>> translating guest atomic instructions. Moreover, it can be considered as
>> a first step toward a multi-thread TCG.
>>
>> The underlying idea is to provide new TCG helpers (sort of softmmu
>> helpers) that guarantee atomicity to some memory accesses or in general
>> a way to define memory transactions.
>>
>> More specifically, the new softmmu helpers behave as LoadLink and
>> StoreConditional instructions, and are called from TCG code by means of
>> target specific helpers. This work includes the implementation for all
>> the ARM atomic instructions, see target-arm/op_helper.c.
>
> I think it is generally a good idea to provide LL/SC TCG operations
> for emulating guest atomic instruction behaviour, as those operations
> make it easy to implement other atomic primitives such as
> compare-and-swap and atomic arithmetic. Another advantage of these
> operations is that they are free from the ABA problem.
>
>> The implementation heavily uses the software TLB together with a new
>> bitmap that has been added to the ram_list structure which flags, on a
>> per-CPU basis, all the memory pages that are in the middle of a LoadLink
>> (LL), StoreConditional (SC) operation. Since all these pages can be
>> accessed directly through the fast-path and alter a vCPU's linked value,
>> the new bitmap has been coupled with a new TLB flag for the TLB virtual
>> address which forces the slow-path execution for all the accesses to a
>> page containing a linked address.
>
> But I'm afraid we've got a scalability problem using the software TLB
> engine heavily. This approach relies on a TLB flush of all CPUs, which
> is not a very cheap operation.
> That is going to be even more expensive in the case of MTTCG, as you
> need to exit the CPU execution loop in order to avoid deadlocks.
>
> I see you try to mitigate this issue by introducing a history of the
> N last pages touched by an exclusive access. That would work fine,
> avoiding excessive TLB flushes, as long as the current working set of
> exclusively accessed pages does not go beyond N. Once we exceed this
> limit we'll get a global TLB flush on most LL operations.

Indeed, if the guest runs a loop over N+1 atomic operations, at each
iteration of the loop we will trigger a flush for every LL, since each
access evicts a history entry that is about to be needed again.

> I'm afraid we can get a dramatic performance decrease as guest code
> implements a finer-grained locking scheme. I would like to emphasise
> that performance can degrade sharply and dramatically as soon as the
> limit gets exceeded. How could we tackle this problem?

In my opinion, the length of the history should not be fixed, to avoid
the drawback described above. We can make the history's length dynamic
(up to a given threshold) according to the pressure of atomic
instructions. What should remain constant is the time it takes to make
a full cycle through the history's array. We could, for instance, store
in the lower bits of the addresses in the history a sort of timestamp,
use it to measure that period, and adjust the length of the history
accordingly. What do you think?

I will also try to explore other ways to tackle the problem.

Best regards,
alvise

>
> Kind regards,
> Sergey