I'd like to second Gabe here - if I saw "known not to work", I'd have gone elsewhere, whereas in fact I am getting useful results out of O3 x86 SE. I like how the x86 status matrix reads at the time of writing.
As for my applications, I'm considering a grotty little hack along the lines of marking the stul microop as SerializeBefore, then modifying the caches to pin locked lines in place, buffering snoop requests until the stul happens. Does that sound reasonable (if grubby) to you, or would you suggest alternative approaches? Clearing the ROB is probably an unpleasant performance penalty, but at least my benchmarks would run...
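To make that concrete, the cache-side part I'm imagining is roughly the following. This is pure hand-waving rather than real gem5 code - the names (lockedByCpu, deferredSnoops, and so on) are all invented, and I've ignored coherence-state details entirely:

    // Sketch only: invented names, no real gem5 classes or APIs.
    #include <cstdint>
    #include <list>

    typedef uint64_t Addr;
    struct Packet;  // stand-in for an incoming snoop/probe request

    struct CacheLineState
    {
        Addr tag = 0;
        bool lockedByCpu = false;            // set when the locked load (ldstl) hits this line
        std::list<Packet *> deferredSnoops;  // snoops buffered while the line is pinned

        // A snoop that would touch a pinned line gets buffered instead of
        // being serviced immediately.
        bool trySnoop(Packet *snoop)
        {
            if (lockedByCpu) {
                deferredSnoops.push_back(snoop);
                return false;  // caller must not respond yet
            }
            return true;       // handle the snoop as usual
        }

        // The locked store (stul) unpins the line; buffered snoops are then
        // replayed in arrival order.
        template <typename ReplaySnoop>
        void unlockAndReplay(ReplaySnoop replay)
        {
            lockedByCpu = false;
            while (!deferredSnoops.empty()) {
                replay(deferredSnoops.front());
                deferredSnoops.pop_front();
            }
        }
    };

As far as I can tell this is basically Steve's option (1) below (a bit in the tags marking locked blocks), with the deferral kept per-line rather than going through an MSHR-like structure.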
Meredydd

On Mon, Sep 19, 2011 at 4:07 AM, Gabe Black <[email protected]> wrote:
> There's a heck of a lot more to an ISA implementation than one primitive. By making a blanket statement that it doesn't work at all, you negate all the things it does do perfectly correctly and has done correctly for years. The fact that no one has bothered to fix the memory system is not a defect of x86, and the fact that some instructions may not work in all situations because of it does not mean that x86 is in the same category as functionality that doesn't even compile.
>
> Gabe
>
> On 09/18/11 19:53, Steve Reinhardt wrote:
>
> I'd say that if atomic accesses aren't atomic, then multiprocessor systems do not work, and that's not a particularly small exception. I suppose you could still run multiprogrammed single-thread workloads in SE mode, but that's a small exception to it not working, not the other way around. Uniprocessor FS is also very suspect, since you can still have atomicity violations with device accesses. I think only uniprocessor SE mode can be considered working if atomic accesses aren't atomic.
>
> This is not really related to completeness; it's a fundamental operation that is implemented but is not guaranteed to give the correct answer. The same thing goes for not getting the consistency model right.
>
> I didn't touch the MIPS table at all, so if it's out of date, go ahead and update it.
>
> Steve
>
> On Sun, Sep 18, 2011 at 6:59 PM, Gabe Black <[email protected]> wrote:
>>
>> I think X86 is excessively red. Timing simple CPU should be yellow for SE and FS uni and multiprocessor, except for the small exception that atomic accesses aren't atomic. The implementation is always going to be incomplete because X86 is so large and frequently ambiguous and because only a fraction of it is actually useful. Marking it as "definitely does not work" is a bit draconian. O3 support in SE should be the same, and I'd say O3 in FS should be orange.
>>
>> On the other hand, MIPS is overly not red. There is no MIPS_FS target because one wouldn't compile as of today, so everything FS should be red.
>>
>> I don't know the status of Ruby on anything so I can't comment on those.
>>
>> Gabe
>>
>> On 09/18/11 18:17, Steve Reinhardt wrote:
>>
>> Yea, whether you call it a bug or an unimplemented feature, it still doesn't work... it's definitely a bug that that's not documented, though. I updated the status matrix to reflect this problem: http://gem5.org/Status_Matrix
>> (I also did a bunch of general editing on the status matrix too... Gabe, you may want to check it out and see what you think.)
>> Note that this is a problem only in the "classic" m5 cache models; Ruby does support x86 locking. However, Ruby doesn't support O3 LSQ probes to enforce stronger consistency models, so this gets you TimingSimple CPUs but not O3 CPUs.
>> Adding locked RMW access to the classic caches is doable, but not completely trivial... basically if a snoop (probe) arrives that would downgrade access to a locked block, that snoop has to be deferred and processed after the lock is released. There's already support in the protocol for deferring snoops that hit on an MSHR, but the details of how that's extended to handle locked blocks are TBD. I expect the solution involves either (1) adding a bit to the tags to mark locked blocks or (2) allocating an MSHR or MSHR-like structure for each locked block. There are pros and cons to each. I don't have time to implement this myself, but if someone else wants to take a crack, I'd be glad to consult.
>> Do we have an issue in O3 with speculatively issuing the read part of a locked RMW (which locks the block) but then, due to a squash, not issuing the write that unlocks it? That seems like a tricky bit... I don't know if Ruby handles this or not.
>> Steve
>> On Sat, Sep 17, 2011 at 4:39 PM, Gabriel Michael Black <[email protected]> wrote:
>>>
>>> Hi Meredydd. I'd say this isn't a bug, per se, but it is wrong. Basically the support for locking memory operations is incomplete.
>>>
>>> The way this is supposed to work is that a load with the LOCKED flag set will lock a chunk of memory, and then a subsequent store with the LOCKED flag set will unlock it. All stores with LOCKED set must be preceded by a load with that flag set. You could think of the load as acquiring a mutex and the store as releasing it.
>>>
>>> In atomic mode, because gem5 is single threaded and because atomic memory accesses complete immediately, the only thing you need to do to make sure locked memory accesses aren't interrupted by anything is to make sure the cpu keeps control until the locked section is complete. To do that we just keep track of whether or not we've executed a locked load and don't stop executing instructions until we see a locked store. This is what you're seeing in the atomic mode CPU.
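(Just to check my understanding of that atomic-mode behaviour: I think the bookkeeping Gabe describes has roughly the following shape. This is only my paraphrase in code, not the actual cpu/simple/atomic.cc source - apart from the fact that the real code consults Request::isLocked(), every name below is invented.)

    // Paraphrase of the atomic-mode bookkeeping described above - NOT real
    // gem5 code. In the real CPU the "is this access locked?" test would come
    // from req->isLocked(); everything else here is invented for illustration.
    class AtomicishCpu
    {
        bool inLockedSection = false;  // set by a locked load, cleared by a locked store

      public:
        // Called for every memory access the CPU executes.
        void noteMemAccess(bool accessIsLocked, bool isStore)
        {
            if (!accessIsLocked)
                return;
            if (isStore)
                inLockedSection = false;  // locked store: "release the mutex"
            else
                inLockedSection = true;   // locked load: "acquire the mutex"
        }

        // The instruction loop keeps executing back-to-back, never yielding to
        // other simulation events, while this returns true, so nothing can
        // interleave with the locked section.
        bool mustKeepControl() const { return inLockedSection; }
    };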
>>> In timing mode, which is what all other CPUs use including the timing simple CPU, something more complex is needed because memory accesses take "time" and other things can happen while the CPU waits for a response. In that case, the locking would have to actually happen in the memory system, and the various components (caches, memory, or something else) would have to keep track of what areas of memory (if any) are currently locked. This is the part that isn't yet implemented.
>>>
>>> So in summary, yes, it is known to not work properly, but I wouldn't call it a bug; I'd say that it's just not finished yet.
>>>
>>> Gabe
>>>
>>> Quoting Meredydd Luff <[email protected]>:
>>>
>>>> It appears that the CAS (LOCK; CMPXCHGx) instruction doesn't do what it says on the tin, at least using the O3 model and X86_SE. When I run the following code (inside a container that runs this code once on each of four processors):
>>>>
>>>> volatile unsigned long *x;
>>>> [...]
>>>> for (a = 0; a < 1000; a++) {
>>>>     while (lastx = *x, oldx = cas(x, lastx, lastx+1), oldx != lastx);
>>>> }
>>>>
>>>> ...I get final x values of 1200 or so (rather than 4000, as would happen if the compare-and-swap were atomic, since each of the four processors increments it 1000 times). This is using the standard se.py, and a fresh checkout of the gem5 repository - my command line is:
>>>>
>>>> build/X86_SE/m5.opt configs/example/se.py -d --caches -n 4 -c /path/to/my/binary
>>>>
>>>> Is this a known bug? Looking at the x86 microcode, it appears that the relevant microops are ldstl and stul. Their only difference from what appear to be their unlocked equivalents (ldst and st) is the addition of the Request::LOCKED flag. A quick grep indicates that the LOCKED flag is only accessed by the Request::isLocked() accessor function, and that isLocked() is not referenced anywhere except twice in cpu/simple/atomic.cc.
>>>>
>>>> Unless I'm missing something, it appears that atomic memory accesses are simply not implemented. Is this true?
>>>>
>>>> Meredydd
>>>>
>>>> PS - This is the CAS I'm using:
>>>>
>>>> static inline unsigned long cas(volatile unsigned long *ptr,
>>>>                                 unsigned long old, unsigned long _new)
>>>> {
>>>>     unsigned long prev;
>>>>     asm volatile("lock;"
>>>>                  "cmpxchgq %1, %2;"
>>>>                  : "=a"(prev)
>>>>                  : "q"(_new), "m"(*ptr), "0"(old)
>>>>                  : "memory");
>>>>     return prev;
>>>> }
>>>>
>>>> PPS - I searched around this issue, and the only relevant thing I found was a mailing list post from last year, indicating that ldstl and stul were working for someone (no indication that they were using O3, though): http://www.mail-archive.com/[email protected]/msg07297.html
>>>> This would indicate that at least one CPU model does support atomicity - but even looking in atomic.cc, I can't immediately see why that would work!
>>>>
>>>> There is some code for handling a flag called Request::MEM_SWAP_COND/isCondSwap(), but it appears to be generated only by the SPARC ISA, and examined only by the simple timing and atomic models.

_______________________________________________
gem5-users mailing list
[email protected]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
