On Wednesday 07 October 2009, at 11:51:37, you wrote:
[snip]
> So, this produces "normal" mov instructions, which fetch data from the
> cache if available. But on a modern computer with several cores (or
> several processors), you have one cache per processor, or at least one
> per group of cores (on a Core 2 Quad, you have 2 caches, one per pair
> of cores).
> 
> On the contrary, fetch-and-set and friends issue a LOCK# signal that
> ensures cache consistency between your processors (see the Intel
> Architectures Software Developer's Manual, Volume 3A, Section 8.1.4
> for details). So even if get/set are atomic operations, they do not
> provide the same thread-safety as lock+xadd, lock+cmpxchg or xchg,
> since they don't ensure we're not fetching the data from an outdated
> cache line.

I don't think the LOCK# is necessary. Or that it's even possible.

The LOCK# signal is necessary to guarantee exclusive access to the memory 
shared between processors. But we don't need exclusive access to the memory 
when all we're doing is a single read or a single write.

According to Volume 2A of the manual, the entry for the LOCK prefix says:

> The LOCK prefix can be prepended only to the following instructions and
> only to those forms of the instructions where the destination operand is a
> memory operand: ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCH8B, DEC, INC,
> NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG. If the LOCK prefix is used
> with one of these instructions and the source operand is a memory operand,
> an undefined opcode exception (#UD) may be generated. An undefined opcode
> exception will also be generated if the LOCK prefix is used with any
> instruction not in the above list.

Note that MOV isn't on that list. So a LOCK prefix on a MOV wouldn't simply be 
ignored: it would generate #UD (which the OS delivers as SIGILL).
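
To make the contrast concrete, here's a sketch using GCC-style inline 
assembly (the function names are mine, not from any existing API): a plain 
MOV load next to a LOCK'ed XADD, which is on the list above.

/* Sketch, assuming GCC/Clang on x86 or x86-64. */
static inline int plain_load(volatile int *p)
{
        int v;
        /* A plain mov: atomic for an aligned int, but no LOCK#. */
        __asm__ volatile("movl %1, %0" : "=r"(v) : "m"(*p));
        return v;
}

static inline int fetch_and_add(volatile int *p, int add)
{
        /* lock xadd: atomically adds and returns the old value.
         * XADD may take the LOCK prefix; MOV may not. */
        __asm__ volatile("lock; xaddl %0, %1"
                         : "+r"(add), "+m"(*p)
                         : : "memory");
        return add;
}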

Anyway, what you're asking for isn't necessary. If we forget for a moment all 
the other variables in memory and concentrate only on the atomic variable 
itself, what you're asking for amounts to this:

volatile int i;

thread1:
        i = 1;

thread2:
        if (i == 1)
                do something
        else
                do something else

Note that there's no actual synchronisation between the two threads. You could 
say you want to know if thread1 has finished doing its work (i.e., imagine that 
it's actually "volatile bool finished") but it's not really a synchronisation.

It's not a synchronisation because thread2 could miss the store to i by one 
instruction, and that would be enough to take the wrong branch. Any code 
working like this needs to go back and test again: that's the principle of 
the spinlock, as in the sketch below.
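
A minimal sketch of that retry loop, using C11 <stdatomic.h> (my example; 
"finished" echoes the "volatile bool finished" above):

#include <stdatomic.h>
#include <stdbool.h>

atomic_bool finished = false;

/* thread1: publish that the work is done */
void worker(void)
{
        /* ... do the actual work ... */
        atomic_store(&finished, true);
}

/* thread2: a single test could miss the store by one instruction,
 * so it must loop and test again -- the spinlock principle. */
void wait_for_worker(void)
{
        while (!atomic_load(&finished))
                ;       /* spin; real code would pause or yield here */
}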

This is the Read-After-Write case. The same also applies to the 
Write-After-Write (WAW) case:

thread1:
        i = 1;

thread2:
        i = 2;

There's no synchronisation. Either result is possible.

Therefore, we don't need anything special to load or store to memory, not on 
x86 or x86-64 at least.

It only gets interesting when we deal with more than one variable, like the 
case in the blog:

volatile int x, y, z;

thread1:
        x = 1;
        y = 2;
        z = 3;

thread2:
        if (z == 3)
                function(x, y)
        else
                function(-1, -1)

The x86 and x86-64 processors don't have explicit memory-ordering 
instructions for this. Ordinary loads and stores already come with strong 
ordering guarantees: stores are not reordered with other stores, and loads 
are not reordered with other loads (Volume 3A, Section 8.2.2). The compiler 
must generate the necessary Store instructions in the right order, and the 
processor must either execute them in that order or make it so that the 
externally-visible effects happen in that order (this is the Release memory 
barrier).

In the other thread, the reverse also applies: the compiler generates the 
Loads in the proper order and the processor must guarantee that loads happen 
in the right order, or that they behave as if they did. That is, since z is 
loaded first, the processor must ensure that any writes that happened to x and 
y before z was written are also observed when x and y are read (that's the 
Acquire memory barrier).

The example I gave in the blog was y sitting in a separate cache line that 
had recently been touched. So if a stale copy of y is still in the cache, but 
the read of z causes a cache miss and a fetch from memory, then that stale 
copy of y has to be discarded too.

How? I don't care. The processor has to ensure it, because on x86 ordinary 
loads and stores come with those ordering guarantees.
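
In portable terms, this is exactly what C11's release/acquire atomics 
express; a sketch (mine, not from the blog -- and on x86, both of these 
atomics compile down to plain movs):

#include <stdatomic.h>

void function(int a, int b);    /* as in the example above */

int x, y;                       /* ordinary data */
atomic_int z;                   /* the flag that publishes x and y */

void thread1(void)
{
        x = 1;
        y = 2;
        /* Release: the stores above become visible no later than z. */
        atomic_store_explicit(&z, 3, memory_order_release);
}

void thread2(void)
{
        /* Acquire: if we see z == 3, we also see x == 1 and y == 2. */
        if (atomic_load_explicit(&z, memory_order_acquire) == 3)
                function(x, y);
        else
                function(-1, -1);
}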

On IA-64 it gets interesting, because there the instructions don't have 
memory-barrier semantics unless you ask for them. So, if you translated the 
above to IA-64 assembly (very dumbly, with none of the optimisation or 
reordering a compiler would definitely do):

common:
        mov     loc0 = x                // loc0..loc2 hold the addresses
        mov     loc1 = y                // of x, y and z
        mov     loc2 = z
        ;;

thread1:
        mov     r8 = 1
        mov     r9 = 2
        mov     r10 = 3
        ;;
        st4     [loc0] = r8             // x = 1
        st4     [loc1] = r9             // y = 2
        st4     [loc2] = r10            // z = 3 (no ordering among these!)

thread2:
        ld4     r10 = [loc2]            // load z
        ;;
        cmp.eq  p6, p7 = r10, 3         // p6 = (z == 3), p7 = its negation
        ;;
(p6)    ld4     out0 = [loc0]           // if z == 3: load x and y
(p6)    ld4     out1 = [loc1]
(p7)    mov     out0 = -1               // else: pass -1, -1
(p7)    mov     out1 = -1
        ;;
        br.call rp = function

If every instruction had full memory-barrier semantics, then the only 
possible values for out0 and out1 would be 1 and 2, respectively.

But ld4 has no memory barrier, so, as I said in the blog, the possible values 
are (assuming x and y were initialised to 0 beforehand):
        x       y
        0       0
        1       0
        0       2
        1       2

To fix this, we need proper memory barriers. We get them by making the final 
store a store-release and the initial load a load-acquire:

thread1:
        mov     r8 = 1
        mov     r9 = 2
        mov     r10 = 3
        ;;
        st4     [loc0] = r8             // x = 1
        st4     [loc1] = r9             // y = 2
        st4.rel [loc2] = r10            // z = 3, release: orders the
                                        // stores above before it

thread2:
        ld4.acq r10 = [loc2]            // load z, acquire: orders the
        ;;                              // loads below after it
        cmp.eq  p6, p7 = r10, 3
        ;;
(p6)    ld4     out0 = [loc0]
(p6)    ld4     out1 = [loc1]
(p7)    mov     out0 = -1
(p7)    mov     out1 = -1
        ;;
        br.call rp = function


In reality, since we declared x and y to be volatile too, the compiler would 
generate st4.rel and ld4.acq for all of them, even though only the final 
store and the initial load strictly need it. (The compiler doesn't know which 
one matters.)

If we don't declare them volatile, the compiler could reorder the loads or 
reuse previously-loaded values -- i.e., it could move them across the memory 
barriers, defeating their purpose.

The inline-assembly functions carry a special marker (the "memory" clobber) 
telling the compiler not to reorder memory accesses across the function. 
There's no such marker on the volatiles, so this is the only point where I'm 
not sure we're doing the right thing...
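
For reference, a minimal sketch of that marker in GCC-style inline assembly: 
the empty asm emits no instructions, but the "memory" clobber acts as a 
compiler-only barrier.

/* Compiler barrier: no machine code, but GCC/Clang will neither cache
 * values across it nor reorder memory accesses around it. */
static inline void compiler_barrier(void)
{
        __asm__ volatile("" : : : "memory");
}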

-- 
Thiago Macieira - thiago.macieira (AT) nokia.com
  Senior Product Manager - Nokia, Qt Development Frameworks
     Sandakerveien 116, NO-0402 Oslo, Norway

Qt Developer Days 2009 | Registration Now Open!
Munich, Germany: Oct 12 - 14     San Francisco, California: Nov 2 - 4
      http://qt.nokia.com/qtdevdays2009
