I think X86 is excessively red. The timing simple CPU should be yellow for
SE and FS, uni- and multiprocessor, with the caveat that
atomic accesses aren't atomic. The implementation is always going to be
incomplete because X86 is so large and frequently ambiguous and because
only a fraction of it is actually useful. Marking it as "definitely does
not work" is a bit draconian. O3 support in SE should be the same, and
I'd say O3 in FS should be orange.

On the other hand, MIPS isn't red enough. There is no MIPS_FS target
because it wouldn't compile as of today, so everything FS should be red.

I don't know the status of Ruby on anything so I can't comment on those.

Gabe

On 09/18/11 18:17, Steve Reinhardt wrote:
> Yea, whether you call it a bug or an unimplemented feature, it still
> doesn't work... it's definitely a bug that that's not documented
> though.  I updated the status matrix to reflect this problem:
> http://gem5.org/Status_Matrix
>
> (I also did a bunch of general editing on the status matrix...
> Gabe, you may want to check it out and see what you think.)
>
> Note that this is a problem only in the "classic" m5 cache models;
> Ruby does support x86 locking.  However, Ruby doesn't support O3 LSQ
> probes to enforce stronger consistency models, so this gets you
> TimingSimple CPUs but not O3 CPUs.
>
> Adding locked RMW access to the classic caches is doable, but not
> completely trivial... basically if a snoop (probe) arrives that would
> downgrade access to a locked block, that snoop has to be deferred and
> processed after the lock is released.  There's already support in the
> protocol for deferring snoops that hit on an MSHR, but the details of
> how that's extended to handle locked blocks are TBD.  I expect the
> solution involves either (1) adding a bit to the tags to mark locked
> blocks or (2) allocating an MSHR or MSHR-like structure for each
> locked block.  There are pros and cons to each.  I don't have time to
> implement this myself, but if someone else wants to take a crack, I'd
> be glad to consult.
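>
> To make that concrete, here's a minimal, purely illustrative C++ sketch of
> option (1), assuming a hypothetical per-line locked bit and a per-line list
> of deferred snoops; the names are made up and are not the actual
> classic-cache interfaces:
>
>     #include <deque>
>
>     // Hypothetical stand-ins, not the real gem5 classes.
>     struct Snoop { bool downgrades; };
>
>     struct CacheLine {
>         bool locked = false;         // option (1): a "locked" bit in the tags
>         std::deque<Snoop> deferred;  // snoops held back while the line is locked
>     };
>
>     void processSnoop(CacheLine &line, const Snoop &s) { /* normal handling */ }
>
>     // A snoop that would downgrade a locked line is deferred, not processed.
>     void handleSnoop(CacheLine &line, const Snoop &s)
>     {
>         if (line.locked && s.downgrades) {
>             line.deferred.push_back(s);
>             return;
>         }
>         processSnoop(line, s);
>     }
>
>     // The LOCKED store clears the bit and replays whatever was deferred.
>     void unlockLine(CacheLine &line)
>     {
>         line.locked = false;
>         while (!line.deferred.empty()) {
>             Snoop s = line.deferred.front();
>             line.deferred.pop_front();
>             handleSnoop(line, s);
>         }
>     }
>
> Option (2) would presumably key the same kind of deferral off an MSHR-like
> entry instead of a tag bit, reusing the existing deferred-snoop machinery.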
>
> Do we have an issue in O3 with speculatively issuing the read part
> of a locked RMW (which locks the block) but then, due to a squash, not
> issuing the write that unlocks it?  That seems like a tricky bit... I
> don't know if Ruby handles this or not.
>
> Steve
>
> On Sat, Sep 17, 2011 at 4:39 PM, Gabriel Michael Black
> <[email protected]> wrote:
>
>     Hi Meredydd. I'd say this isn't a bug, per se, but it is wrong.
>     Basically the support for locking memory operations is incomplete.
>
>     The way this is supposed to work is that a load with the LOCKED
>     flag set will lock a chunk of memory, and then a subsequent store
>     with the LOCKED flag set will unlock it. All stores with LOCKED
>     set must be preceded by a load with that flag set. You could think of
>     the load as acquiring a mutex and the store as releasing it.
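>
>     As a rough model of those semantics (not gem5 code; the names here are
>     just for illustration), the pair behaves like a lock held across the
>     read-modify-write:
>
>         #include <mutex>
>
>         std::mutex memLock;  // stands in for the locked chunk of memory
>
>         // The LOCKED load acquires; the matching LOCKED store releases
>         // (here, on scope exit).
>         unsigned long lockedAdd(volatile unsigned long &m, unsigned long delta)
>         {
>             std::lock_guard<std::mutex> guard(memLock);
>             unsigned long old = m;   // the LOCKED load
>             m = old + delta;         // modify
>             return old;              // the LOCKED store completes under the lock
>         }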
>
>     In atomic mode, because gem5 is single threaded and because atomic
>     memory accesses complete immediately, the only thing you need to
>     do to make sure locked memory accesses aren't interrupted by
>     anything is to make sure the CPU keeps control until the locked
>     section is complete. To do that we just keep track of whether or
>     not we've executed a locked load and don't stop executing
>     instructions until we see a locked store. This is what you're
>     seeing in the atomic mode CPU.
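>
>     In sketch form, the bookkeeping amounts to something like the following
>     (the real logic lives in cpu/simple/atomic.cc; these names are
>     illustrative, not the actual ones):
>
>         // Illustrative only; not the actual gem5 code.
>         bool lockedSectionActive = false;
>
>         void noteMemAccess(bool isLoad, bool hasLockedFlag)
>         {
>             if (hasLockedFlag)
>                 lockedSectionActive = isLoad;  // locked load sets, locked store clears
>         }
>
>         // The atomic CPU keeps executing instructions back-to-back, without
>         // letting anything else run, while this returns false.
>         bool mayStopExecuting()
>         {
>             return !lockedSectionActive;
>         }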
>
>     In timing mode, which is what all other CPUs use including the
>     timing simple CPU, something more complex is needed because memory
>     accesses take "time" and other things can happen while the CPU
>     waits for a response. In that case, the locking would have to
>     actually happen in the memory system and the various components
>     (caches, memory, or something else) would have to keep track of
>     what areas of memory (if any) are currently locked. This is the
>     part that isn't yet implemented.
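>
>     One possible shape for that missing piece, assuming the memory system
>     kept a table mapping locked block addresses to the locking CPU (again
>     purely a sketch, not an existing gem5 interface):
>
>         #include <map>
>
>         // Hypothetical lock table; not an existing gem5 structure.
>         struct LockTable {
>             std::map<unsigned long, int> owner;  // locked block addr -> CPU id
>
>             // A LOCKED load tries to take the lock; false means stall/retry.
>             bool tryLock(unsigned long blockAddr, int cpu) {
>                 auto it = owner.find(blockAddr);
>                 if (it != owner.end() && it->second != cpu)
>                     return false;
>                 owner[blockAddr] = cpu;
>                 return true;
>             }
>
>             // The matching LOCKED store releases it.
>             void unlock(unsigned long blockAddr, int cpu) {
>                 auto it = owner.find(blockAddr);
>                 if (it != owner.end() && it->second == cpu)
>                     owner.erase(it);
>             }
>         };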
>
>     So in summary, yes it is known to not work properly, but I
>     wouldn't call it a bug, I'd say that it's just not finished yet.
>
>     Gabe
>
>
>     Quoting Meredydd Luff <[email protected]>:
>
>         It appears that the CAS (LOCK; CMPXCHGx) instruction doesn't do
>         what it says on the tin, at least using the O3 model and X86_SE.
>         When I run the following code (inside a container that runs this
>         code once on each of four processors):
>
>            volatile unsigned long *x;
>            [...]
>            for (a = 0; a < 1000; a++) {
>                while (lastx = *x, oldx = cas(x, lastx, lastx + 1),
>                       oldx != lastx);
>            }
>
>
>         ...I get final x values of 1200 or so (rather than 4000, as would
>         happen if the compare-and-swap were atomic). This is using the
>         standard se.py, and a fresh checkout of the gem5 repository - my
>         command line is:
>         build/X86_SE/m5.opt configs/example/se.py -d --caches -n 4 -c
>         /path/to/my/binary
>
>
>         Is this a known bug? Looking at the x86 microcode, it appears
>         that the relevant microops are ldstl and stul. Their only
>         difference from what appear to be their unlocked equivalents
>         (ldst and st) is the addition of the Request::LOCKED flag. A
>         quick grep indicates that the LOCKED flag is only accessed by
>         the Request::isLocked() accessor function, and that isLocked()
>         is not referenced anywhere except twice in cpu/simple/atomic.cc.
>
>         Unless I'm missing something, it appears that atomic memory
>         accesses are simply not implemented. Is this true?
>
>         Meredydd
>
>
>         PS - This is the CAS I'm using:
>
>         static inline unsigned long cas(volatile unsigned long* ptr,
>                                         unsigned long old,
>                                         unsigned long _new)
>         {
>            unsigned long prev;
>            asm volatile("lock;"
>                         "cmpxchgq %1, %2;"
>                         : "=a"(prev)
>                         : "q"(_new), "m"(*ptr), "0"(old)
>                         : "memory");
>            return prev;
>         }
>
>
>
>         PPS - I searched around this issue, and the only relevant thing
>         I found was a mailing list post from last year, indicating that
>         ldstl and stul were working for someone (no indication that they
>         were using O3, though):
>         http://www.mail-archive.com/[email protected]/msg07297.html
>         This would indicate that at least one CPU model does support
>         atomicity - but even looking in atomic.cc, I can't immediately
>         see why that would work!
>
>         There is some code for handling a flag called
>         Request::MEM_SWAP_COND/isCondSwap(), but it appears to be
>         generated only by the SPARC ISA, and examined only by the
>         simple timing and atomic models.

_______________________________________________
gem5-users mailing list
[email protected]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
