Yeah, I think we've talked about this topic in the past, but it was a while ago and I don't remember exactly what all we talked about or the conclusion(s) we reached.
The problem at the ISA level is that x86 has lots of basic, heavily used instructions (adds, subtracts, etc.) which compute condition codes every time in case you need them. On top of that, the instructions which update the condition codes update somewhat erratic combinations of bits. The result is that lots of instructions write the condition code bits, and those same common instructions have to read them too so they can do a partial update.

A milder version of this has come up before, where control-like bits and condition-code-like bits live in the same register. To my knowledge that's happened at least on SPARC, ARM, and x86. That's dealt with by splitting the condition code bits out into their own register, which is treated as a renamed integer register, while the control bits are treated as a misc reg with all the overhead and special precautions that implies. That doesn't entirely work on x86, though, because even among the condition code bits there are a lot of partial accesses as described above.

The cc bits could be broken down into individual bits, but that's pretty cumbersome since there are, I believe, 8 of them including the two artificial ones for microcode. That would be a lot of registers to rename, would slow down the simulator, wouldn't be that realistic, etc. What real CPUs do, according to someone in the know at AMD that I talked to, is gather up one group of flags, about 4 if I recall, and treat those as a unit. The others are handled individually. The group of 4 is still not 100% treated as a unit since some instructions modify just one of them, for instance, but it's pretty close, it optimizes for the common case, and the odd cases can still work like they do today.

The difficulty in implementing this is that exactly which condition code bits to set, and which to check for conditional microops, are decided at the microcode level and are arbitrary combinations.
Those combinations don't need to be completely arbitrary, but it does mean that microops effectively only know which condition code registers they need, and how many, at construction time as opposed to compile time. So what we'd need to do is allow the constructor for a microop to look at the flags it's being given and use them to figure out more programmatically which registers it has as sources or destinations, and how many. The body of the instruction itself would need to be sophisticated enough to pull together the different source registers, whatever they are, and process them appropriately with a consistent bit of code (and not 18 different parameters to some function where 14 aren't used at any particular time). It would also have to know how to split things back up again when writing out the results.

What I did to move us a little bit in this direction is make the types of operands much more flexible so that we can have structures, typedefs, etc. What we'd still need is truly composite operand types, where a single operand, for instance the condition code bits, is built from a set of registers (determined in some way appropriate to the operand) and/or written back to a set of registers, but can be handled easily as a single value inside the code blob. Then we can avoid having hundreds of versions of microops for all the different combinations of flag bits, which would be a terrible thing to have to live with.

As far as easier ways to deal with this go, there is only one, which is what I was alluding to in what I think was my earliest email: just hack around it so that the instructions you know you're using in the performance-sensitive part behave incorrectly generally speaking, but do what you expect for the benchmark. Maybe they'd even have to know where they were running from, that they were in a range of ticks, etc. It's a gross and terrible hack unfit to check in, but something that would get the poster unstuck for now.
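To make the composite operand idea a bit more concrete, here's a rough Python sketch. It is not gem5 code and not how the ISA parser works today. The class and function names, the register names, the placeholder names and bit positions for the two microcode-only flags, and the choice of which four flags form the group are all assumptions made up for illustration. It just shows a microop working out its flag sources and destinations in its constructor from the flag mask it's given, gathering and splitting the flags as one value, and how that changes the ADD-to-ADD dependency.

```python
# Hypothetical sketch, not gem5 code. All names and the flag grouping are assumptions.

# Architectural x86 arithmetic flags at their RFLAGS bit positions.
CF, PF, AF, ZF, SF, OF = 1 << 0, 1 << 2, 1 << 4, 1 << 6, 1 << 7, 1 << 11
# Placeholders for the two microcode-only flags; names and positions are made up.
MCF, MZF = 1 << 16, 1 << 17

ARITH = CF | PF | AF | ZF | SF | OF  # the flags a plain ADD writes

# Layout 1: every condition code bit lives in one renamed register.
MONOLITHIC = {"ccflags": ARITH | MCF | MZF}
# Layout 2 (assumed grouping): four flags as a unit, the rest individually.
SPLIT = {"ccgroup": PF | AF | ZF | SF, "cf": CF, "of": OF, "mcf": MCF, "mzf": MZF}


class Microop:
    """Works out its flag sources and destinations in the constructor."""

    def __init__(self, name, write_mask, read_mask=0, layout=SPLIT):
        self.name = name
        self.write_mask = write_mask
        self.layout = layout
        self.src_flag_regs = set()
        self.dest_flag_regs = set()
        for reg, bits in layout.items():
            written = write_mask & bits
            if written:
                self.dest_flag_regs.add(reg)
                if written != bits:
                    # Partial update: the untouched bits must be read back in.
                    self.src_flag_regs.add(reg)
            if read_mask & bits:
                self.src_flag_regs.add(reg)


def read_cc(op, regfile):
    """Gather the op's source flag registers into a single flags value."""
    value = 0
    for reg in op.src_flag_regs:
        value |= regfile[reg] & op.layout[reg]
    return value


def write_cc(op, regfile, new_flags):
    """Split one flags value back out into the op's destination registers."""
    for reg in op.dest_flag_regs:
        written = op.write_mask & op.layout[reg]
        # For a full write nothing is kept, so the old value does not matter.
        kept = regfile[reg] & op.layout[reg] & ~written
        regfile[reg] = kept | (new_flags & written)


def flag_raw_dependency(older, younger):
    """True if the younger op must wait for flags the older op writes."""
    return bool(older.dest_flag_regs & younger.src_flag_regs)


for layout_name, layout in (("monolithic", MONOLITHIC), ("split", SPLIT)):
    add1 = Microop("add", ARITH, layout=layout)
    add2 = Microop("add", ARITH, layout=layout)
    print(layout_name, "ADD -> ADD flag dependency:", flag_raw_dependency(add1, add2))
```

In this toy model the monolithic layout makes the second ADD read the flags register (it has to preserve the two bits it doesn't write), while the split layout leaves back-to-back ADDs with no flag source at all. An odd microop that only sets ZF would still pick up the four-flag group as a source, so the uncommon partial updates keep working, just with a dependency.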
Doing things the "right" way will take some infrastructure work, and that may not be very quick. I don't think there's any real shortcut around doing the infrastructure work that doesn't have a pretty heavy cost (like blowing up the number of microop classes 100 fold).

Gabe

On 04/07/12 11:32, Steve Reinhardt wrote:
> Hi Gabe,
>
> Your earlier email said "I've made some changes over time which should make
> it easier to do this like a real x86 CPU would". Could you expand on that?
> It sounded like you had some sort of plan or direction at least. If we're
> going to start working on this ourselves, it would be best if we can
> benefit from whatever insights you've had or preliminary work you've done.
>
> I see your later email says "I don't have any ideas for how to make it much
> simpler", but that seems to contradict what you said at first. In
> particular, you also earlier said "If you have an idea of how to get it to
> do what you want locally, feel free. That will get you going, and when I
> get it fixed for real then you can start using that.". I'd like to
> explicitly reject that idea... for one thing, I'm not sure what a "local"
> solution would look like, and more importantly, this issue seems
> complicated enough that us doing some sort of temporary or stopgap solution
> like you're implying, only to throw it away once you've done it "for real",
> seems like a huge waste of effort. So overall I'd like to be sure we're in
> sync with whatever you're thinking to make sure that our efforts are
> additive and complementary and not redundant.
>
> Thanks,
>
> Steve
>
> On Fri, Apr 6, 2012 at 3:43 PM, Watanabe, Yasuko
> <[email protected]> wrote:
>
>> Hi Gabe,
>>
>> I also went through the code and got a sense of changes that need to be
>> made. You are right. The current infrastructure makes it difficult to fix
>> this issue.
>>
>> Yasuko
>>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On
>> Behalf Of Gabe Black
>> Sent: Friday, April 06, 2012 12:18 PM
>> To: [email protected]
>> Subject: Re: [gem5-dev] Data dependency caused by flags
>>
>> It's complicated. Looking at it again I reminded myself of all the ways it
>> doesn't fit into the way the ISA parser does things, so it's going to be
>> quite a bit of work to fix properly. I don't have any ideas for how to make
>> it much simpler that would be at all practical.
>>
>> Gabe
>>
>> On 04/05/12 21:10, Watanabe, Yasuko wrote:
>>> Hi Gabe,
>>>
>>> Do you already have an idea of how to fix this? If so, can you give me
>>> some pointers?
>>>
>>> Yasuko
>>>
>>> -----Original Message-----
>>> From: [email protected] [mailto:[email protected]] On
>>> Behalf Of Gabe Black
>>> Sent: Thursday, April 05, 2012 6:12 PM
>>> To: [email protected]
>>> Subject: Re: [gem5-dev] Data dependency caused by flags
>>>
>>> Yes, you guys are right. This is a recognized problem, and I've made
>>> some changes over time which should make it easier to do this like a real
>>> x86 CPU would. I haven't yet, but it's on the horizon. I tend to be very
>>> busy, although circumstances may mean I have a little more or less time
>>> than normal for a little while so I don't know for sure when I'll get it
>>> fixed. If you have an idea of how to get it to do what you want locally,
>>> feel free. That will get you going, and when I get it fixed for real then
>>> you can start using that.
>>>
>>> Gabe
>>>
>>> On 04/05/12 17:18, Watanabe, Yasuko wrote:
>>>> Nilay,
>>>>
>>>> I agree with you. I think the dependencies of those flag bits should be
>>>> evaluated at bit level.
>>>>
>>>> Gabe and others,
>>>>
>>>> This change seems invasive. Do you know the best way to handle this?
>>>>
>>>> Yasuko
>>>>
>>>> -----Original Message-----
>>>> From: [email protected] [mailto:[email protected]] On
>>>> Behalf Of Nilay Vaish
>>>> Sent: Thursday, April 05, 2012 3:35 AM
>>>> To: gem5 Developer List
>>>> Subject: Re: [gem5-dev] Data dependency caused by flags
>>>>
>>>> The code for the function genFlags() in
>>>> src/arch/x86/insts/microregop.cc suggests that the values of flag bits not
>>>> updated by the ADD instruction need to be retained. This means that the
>>>> previous values need to be read and written again, which means the second
>>>> ADD can be dependent on a value written by the first ADD. If the
>>>> dependencies were evaluated at bit level, then these instructions would not
>>>> be dependent.
>>>>
>>>> --
>>>> Nilay
>>>>
>>>> On Thu, 5 Apr 2012, Watanabe, Yasuko wrote:
>>>>
>>>>> I ran O3 CPU in FS mode in x86 with a simple microbenchmark and got
>>>>> a much lower IPC than the theoretical IPC. The issue seems to be
>>>>> data dependencies caused by (control) flags, not registers, and I am
>>>>> wondering if anyone has come across the same issue.
>>>>>
>>>>> The microbenchmark has many data independent ADD instructions
>>>>> (http://repo.gem5.org/gem5/file/570b44fe6e04/src/arch/x86/isa/insts/general_purpose/arithmetic/add_and_subtract.py#l41)
>>>>> in a loop. On a 2-wide out-of-order machine with enough resources,
>>>>> the IPC should be two at a steady state. However, the IPC only goes
>>>>> up to one. What is happening is that even though the ADDs have two
>>>>> source and one destination registers and a flag to set in x86, gem5
>>>>> adds one extra flag source register to the ADDs. As a result, each
>>>>> ADD becomes dependent on the earlier ADD's destination flag,
>>>>> constraining the achievable IPC to one.
>>>>>
>>>>> Here is an example sequence with physical register mappings:
>>>>> ADD: S1=98, S2=9, S3=2, D1=82, D2=105 (flag)
>>>>> ADD: S1=92, S2=9, S3=105 (flag), D1=79, D2=90 ...
>>>>>
>>>>> Physical registers 98, 9, and 92 are ready when those two ADDs are
>>>>> renamed; however, as you can see, the second ADD has to wait for the
>>>>> first ADD because of the extra flag source register S3. When I
>>>>> removed those flags in the macroop definition, the IPC jumped up from
>>>>> 1 to 1.7.
>>>>>
>>>>> Does anyone know why the ADD has to read the flags, even though the
>>>>> x86 manual does not say that? Those flags should just cause
>>>>> write-after-write dependency, not read-after-write.
>>>>>
>>>>> Yasuko
>>>>>
>>>>> _______________________________________________
>>>>> gem5-dev mailing list
>>>>> [email protected]
>>>>> http://m5sim.org/mailman/listinfo/gem5-dev
