I'd say that if atomic accesses aren't atomic, then multiprocessor systems do not work, and that's not a particularly small exception. I suppose you could still run multiprogrammed single-thread workloads in SE mode, but that's a small exception to it not working, not the other way around. Uniprocessor FS is also very suspect since you can still have atomicity violations with device accesses. I think only uniprocessor SE mode can be considered working if atomic accesses aren't atomic.
This is not really related to completeness; it's a fundamental operation that is implemented but is not guaranteed to give the correct answer. The same thing goes for not getting the consistency model right. I didn't touch the MIPS table at all, so if it's out of date, go ahead and update it.

Steve

On Sun, Sep 18, 2011 at 6:59 PM, Gabe Black <[email protected]> wrote:

> I think X86 is excessively red. Timing simple CPU should be yellow for SE and FS uni- and multiprocessor, except for the small exception that atomic accesses aren't atomic. The implementation is always going to be incomplete because X86 is so large and frequently ambiguous, and because only a fraction of it is actually useful. Marking it as "definitely does not work" is a bit draconian. O3 support in SE should be the same, and I'd say O3 in FS should be orange.
>
> On the other hand, MIPS is overly not red. There is no MIPS_FS target because one wouldn't compile as of today, so everything FS should be red.
>
> I don't know the status of Ruby on anything, so I can't comment on those.
>
> Gabe
>
> On 09/18/11 18:17, Steve Reinhardt wrote:
>
> Yea, whether you call it a bug or an unimplemented feature, it still doesn't work... it's definitely a bug that that's not documented, though. I updated the status matrix to reflect this problem: http://gem5.org/Status_Matrix
>
> (I also did a bunch of general editing on the status matrix too... Gabe, you may want to check it out and see what you think.)
>
> Note that this is a problem only in the "classic" m5 cache models; Ruby does support x86 locking. However, Ruby doesn't support O3 LSQ probes to enforce stronger consistency models, so this gets you TimingSimple CPUs but not O3 CPUs.
>
> Adding locked RMW access to the classic caches is doable, but not completely trivial...
> basically, if a snoop (probe) arrives that would downgrade access to a locked block, that snoop has to be deferred and processed after the lock is released. There's already support in the protocol for deferring snoops that hit on an MSHR, but the details of how that's extended to handle locked blocks are TBD. I expect the solution involves either (1) adding a bit to the tags to mark locked blocks or (2) allocating an MSHR or MSHR-like structure for each locked block. There are pros and cons to each. I don't have time to implement this myself, but if someone else wants to take a crack, I'd be glad to consult.
>
> Do we have an issue in O3 with speculatively issuing the read part of a locked RMW (which locks the block) but then, due to a squash, not issuing the write that unlocks it? That seems like a tricky bit... I don't know if Ruby handles this or not.
>
> Steve
>
> On Sat, Sep 17, 2011 at 4:39 PM, Gabriel Michael Black <[email protected]> wrote:
>
>> Hi Meredydd. I'd say this isn't a bug, per se, but it is wrong. Basically the support for locking memory operations is incomplete.
>>
>> The way this is supposed to work is that a load with the LOCKED flag set will lock a chunk of memory, and then a subsequent store with the LOCKED flag set will unlock it. All stores with LOCKED set must be preceded by a load with that flag set. You could think of the load as acquiring a mutex and the store as releasing it.
>>
>> In atomic mode, because gem5 is single-threaded and because atomic memory accesses complete immediately, the only thing you need to do to make sure locked memory accesses aren't interrupted by anything is to make sure the CPU keeps control until the locked section is complete. To do that we just keep track of whether or not we've executed a locked load and don't stop executing instructions until we see a locked store. This is what you're seeing in the atomic mode CPU.
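The atomic-mode bookkeeping described above (a locked load sets a flag, and the CPU keeps control until the matching locked store clears it) can be sketched roughly as follows. All names here are illustrative, not gem5's actual code:

```c
#include <stdbool.h>

/* Hypothetical sketch of atomic-mode locked-access tracking: a locked
 * load marks the CPU as holding a lock, and the CPU refuses to give up
 * control until the matching locked store releases it. */
typedef struct {
    bool locked; /* saw a locked load with no matching locked store yet */
} CpuLockState;

static void execute_locked_load(CpuLockState *cpu)  { cpu->locked = true;  }
static void execute_locked_store(CpuLockState *cpu) { cpu->locked = false; }

/* The CPU may only stop executing instructions when no lock is held. */
static bool may_yield(const CpuLockState *cpu) { return !cpu->locked; }
```

Because atomic-mode accesses complete instantly and the simulator is single-threaded, this flag alone is enough to make the locked sequence appear atomic; nothing else can run in between.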
>>
>> In timing mode, which is what all other CPUs use including the timing simple CPU, something more complex is needed because memory accesses take "time" and other things can happen while the CPU waits for a response. In that case, the locking would have to actually happen in the memory system, and the various components (caches, memory, or something else) would have to keep track of which areas of memory (if any) are currently locked. This is the part that isn't yet implemented.
>>
>> So in summary, yes, it is known to not work properly, but I wouldn't call it a bug; I'd say that it's just not finished yet.
>>
>> Gabe
>>
>> Quoting Meredydd Luff <[email protected]>:
>>
>>> It appears that the CAS (LOCK; CMPXCHGx) instruction doesn't do what it says on the tin, at least using the O3 model and X86_SE. When I run the following code (inside a container that runs this code once on each of four processors):
>>>
>>> volatile unsigned long *x;
>>> [...]
>>> for(a=0; a<1000; a++) {
>>>     while(lastx = *x, oldx = cas(x, lastx, lastx+1), oldx != lastx);
>>> }
>>>
>>> ...I get final x values of 1200 or so (rather than 4000, as would happen if the compare-and-swap were atomic). This is using the standard se.py and a fresh checkout of the gem5 repository - my command line is:
>>>
>>> build/X86_SE/m5.opt configs/example/se.py -d --caches -n 4 -c /path/to/my/binary
>>>
>>> Is this a known bug? Looking at the x86 microcode, it appears that the relevant microops are ldstl and stul. Their only difference from what appear to be their unlocked equivalents (ldst and st) is the addition of the Request::LOCKED flag. A quick grep indicates that that LOCKED flag is only accessed by the Request::isLocked() accessor function, and that isLocked() is not referenced anywhere except twice in cpu/simple/atomic.cc.
>>>
>>> Unless I'm missing something, it appears that atomic memory accesses are simply not implemented.
>>> Is this true?
>>>
>>> Meredydd
>>>
>>> PS - This is the CAS I'm using:
>>>
>>> static inline unsigned long cas(volatile unsigned long *ptr,
>>>                                 unsigned long old, unsigned long _new)
>>> {
>>>     unsigned long prev;
>>>     asm volatile("lock;"
>>>                  "cmpxchgq %1, %2;"
>>>                  : "=a"(prev)
>>>                  : "q"(_new), "m"(*ptr), "0"(old)
>>>                  : "memory");
>>>     return prev;
>>> }
>>>
>>> PPS - I searched around this issue, and the only relevant thing I found was a mailing list post from last year indicating that ldstl and stul were working for someone (no indication that was using O3, though): http://www.mail-archive.com/[email protected]/msg07297.html This would indicate that at least one CPU model does support atomicity - but even looking in atomic.cc, I can't immediately see why that would work!
>>>
>>> There is some code for handling a flag called Request::MEM_SWAP_COND/isCondSwap(), but it appears to be generated only by the SPARC ISA, and examined only by the simple timing and atomic models.
>>>
>>> _______________________________________________
>>> gem5-users mailing list
>>> [email protected]
>>> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
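For comparison, the same compare-and-swap can be written without inline assembly using the GCC/Clang `__sync_val_compare_and_swap` builtin, which on x86-64 should compile down to the same lock-prefixed `cmpxchgq`. A minimal equivalent sketch:

```c
/* Equivalent of the inline-asm cas() above using a compiler builtin.
 * Returns the value that was in *ptr before the operation, matching
 * the semantics of the hand-written version. */
static inline unsigned long cas(volatile unsigned long *ptr,
                                unsigned long old_val,
                                unsigned long new_val)
{
    return __sync_val_compare_and_swap(ptr, old_val, new_val);
}
```

The swap succeeded exactly when the returned value equals `old_val`, which is the condition the poster's retry loop tests.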
_______________________________________________
gem5-users mailing list
[email protected]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
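The poster's expectation is the standard correctness check for CAS: four threads each performing 1000 CAS-based increments must leave the counter at exactly 4000 if and only if the compare-and-swap is atomic. A self-contained recreation of that check for real hardware (using the compiler builtin rather than inline asm; names are my own):

```c
#include <pthread.h>
#include <stddef.h>

static volatile unsigned long counter;

/* Each thread performs 1000 CAS-based increments, retrying on failure,
 * exactly like the loop in the original report. */
static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        unsigned long lastx, oldx;
        do {
            lastx = counter;
            oldx = __sync_val_compare_and_swap(&counter, lastx, lastx + 1);
        } while (oldx != lastx);
    }
    return NULL;
}

/* Run four threads and return the final counter value; with an atomic
 * CAS this is always 4000, never the ~1200 seen in the simulator. */
static unsigned long run_cas_test(void)
{
    pthread_t threads[4];
    counter = 0;
    for (int i = 0; i < 4; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);
    return counter;
}
```

A result below 4000 means some increments were lost to a non-atomic read-modify-write, which is precisely the symptom reported against X86_SE with the classic caches.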
