Re: [gem5-dev] Data dependency caused by flags

Gabe Black Sun, 08 Apr 2012 01:11:28 -0700

I don't think that's true, although I'm not willing to trawl through the
ISA manual to determine one way or the other. You'll need to just trust
me, since if AMD once implemented chips that way they intended to sell,
I doubt they just picked something from a hat. The group of four which
are considered together are I think ZAPS, zero, auxiliary carry (? it's
been a while), parity and sign, leaving overflow, carry, and the
artificial emulation zero and emulation carry flags. That's five groups
which is reasonable as far as having registers to rename, etc., but
still having a full cross product of combinations would be 64 variants.
I did look through the entire ISA manual's instruction listing once a
couple of years ago when I was researching this, and I think there are
only a small handful of instructions which behave badly with these
groups and, say, write to only the zero flag.
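
As a rough illustration of how that grouping would decide flag sources
and destinations, here is a minimal Python sketch. It is not gem5 code:
the group table, the EZF/ECF spellings and the flag_operands() helper
are all made up for the example, and the grouping is only the one
recalled above. The rule it assumes is that a group is a destination if
any of its flags is written, and a source if any of its flags is read or
if only part of the group is written (so the old bits must be merged in).

```python
# Hypothetical grouping, following the message above: ZAPS is renamed as
# one unit; OF, CF and the two emulation flags each get their own register.
GROUPS = {
    "zaps": frozenset({"ZF", "AF", "PF", "SF"}),
    "of": frozenset({"OF"}),
    "cf": frozenset({"CF"}),
    "ezf": frozenset({"EZF"}),
    "ecf": frozenset({"ECF"}),
}


def flag_operands(written, read=frozenset()):
    """Return (sources, dests) as sorted lists of group names."""
    written, read = frozenset(written), frozenset(read)
    sources, dests = [], []
    for name, flags in GROUPS.items():
        touched = written & flags
        if touched:
            dests.append(name)
        # A partial write must carry the untouched bits over from the old
        # value, so the group also becomes a source.
        if (read & flags) or (touched and touched != flags):
            sources.append(name)
    return sorted(sources), sorted(dests)


ALL_SIX = {"OF", "SF", "ZF", "AF", "PF", "CF"}

print("ADD    ", flag_operands(ALL_SIX))
print("INC    ", flag_operands(ALL_SIX - {"CF"}))
print("ADC    ", flag_operands(ALL_SIX, read={"CF"}))
print("ZF only", flag_operands({"ZF"}))
```

With this split, ADD and INC end up with no flag sources at all, ADC
reads only the carry group, and an instruction that writes only ZF still
has to read the ZAPS group, which is the awkward case mentioned above.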


Gabe

On 04/07/12 22:38, Nilay Vaish wrote:
> It seems to me if we can cover the following cases most of the
> instructions would get covered --
> 1. All six condition codes -- (OF, SF, ZF, AF, PF, CF)
> 2. Two classes of five condition codes -- (OF,SF,ZF,PF,CF), (OF, SF,
> ZF, AF, PF)
> 3. One class of two condition codes -- (OF,CF)
>
> Yasuko's current problem about ADD instructions will get solved if we
> just handle the first case i.e. specify that if an instruction is
> writing all the condition codes, then do not assume condition code
> register to be a source register.
>
> -- 
> Nilay
>
> On Sat, 7 Apr 2012, Gabe Black wrote:
>
>> Yeah, I think we've talked about this topic in the past, but it was a
>> while ago and I don't remember exactly what all we talked about or the
>> conclusion(s) we reached.
>>
>> The problem at the ISA level is that there are lots of instructions in
>> x86 which are pretty basic and used a lot (adds, subtracts, etc.) which
>> compute condition codes every time in case you need them. That, combined
>> with the fact that the instructions which update the condition codes
>> update somewhat erratic combinations of bits, means that lots of
>> instructions write the condition code bits, and those same common
>> instructions read them too so they can do a partial update.
>>
>> This has happened to a lesser extent before where there are control like
>> bits and condition code like bits in the same register. To my knowledge
>> that's happened at least on SPARC, ARM, and x86. That's dealt with by
>> splitting the condition code bits out into their own register, which is
>> treated as a renamed integer register, and the control bits which are
>> treated as a misc reg with all the overhead and special precautions.
>> That doesn't entirely work on x86, though, because even among the
>> condition code bits there are a lot of partial accesses as described
>> above. The cc bits could be broken down into individual bits, but that's
>> pretty cumbersome since there are, including the two artificial ones for
>> microcode, 8 of them I believe? That would be a lot of registers to
>> rename, would slow down the simulator, wouldn't be that realistic, etc.
>> What real CPUs do, after talking to someone in the know at AMD, is that
>> they gather up one group of flags, about 4 if I recall, and treat those
>> as a unit. The others are handled individually. The group of 4 is still
>> not 100% treated as a unit since some instructions modify just one of
>> them, for instance, but it's pretty close, optimizes for the common
>> case, and the odd cases can still work like they do today.
>>
>> The difficulty implementing this is that exactly which condition code
>> bits to set and which to check for conditional microops are decided at
>> the microcode level and are arbitrary combinations. They don't need to
>> be completely arbitrary, but that means that microops really effectively
>> know which, how many, etc., condition code registers they need at
>> construction time as opposed to compile time. So what we'd need to do is
>> to allow the constructor for a microop to look at the flags it was being
>> given and to use that to more programmatically figure out which registers
>> it had as sources or destinations, and how many. The body of the
>> instructions themselves would need to be sophisticated enough to pull
>> together the different source registers, whatever they are, and to
>> process them appropriately with a consistent bit of code (and not 18
>> different parameters to some function where 14 aren't used at any
>> particular time). It would also have to know how to split things back up
>> again when writing out the results.
>>
>> What I did to move us a little bit in this direction is to make the
>> types of operands much more flexible so that we can have structures,
>> typedefs, etc. What we'd still need is truly composite operand types
>> where a single operand, for instance the condition code bits, is built
>> from a set of registers (determined in some way appropriate to the
>> operand) and/or written back to a set of registers, but which could be
>> handled easily as a single value inside the code blob. Then we can avoid
>> having 100(s) of versions of microops for all the different combinations
>> of flag bits, which would be a terrible thing to have to live with.
>>
>> As far as easier ways to deal with this, there is only one which is what
>> I was alluding to in what I think was my earliest email, and that's to
>> just hack around it so the instructions you know you're using in the
>> performance sensitive part behave incorrectly generally speaking, but do
>> what you expect for the benchmark. Maybe they'd even have to know where
>> they were running from, that they were in a range of ticks, etc. A gross
>> and terrible hack unfit to check in, but something that would get the
>> poster unstuck for now. Doing things the "right" way will take some
>> infrastructure work, and that may not be very quick. I don't think
>> there's any real shortcut around doing the infrastructure work that
>> doesn't have a pretty heavy cost (like blowing up the number of microop
>> classes 100 fold).
>>
>> Gabe
>>
>> On 04/07/12 11:32, Steve Reinhardt wrote:
>>> Hi Gabe,
>>>
>>> Your earlier email said "I've made some changes over time which should
>>> make it easier to do this like a real x86 CPU would". Could you expand
>>> on that? It sounded like you had some sort of plan or direction at
>>> least. If we're going to start working on this ourselves, it would be
>>> best if we can benefit from whatever insights you've had or preliminary
>>> work you've done.
>>>
>>> I see your later email says "I don't have any ideas for how to make it
>>> much simpler", but that seems to contradict what you said at first. In
>>> particular, you also earlier said "If you have an idea of how to get it
>>> to do what you want locally, feel free. That will get you going, and
>>> when I get it fixed for real then you can start using that.". I'd like
>>> to explicitly reject that idea... for one thing, I'm not sure what a
>>> "local" solution would look like, and more importantly, this issue
>>> seems complicated enough that us doing some sort of temporary or
>>> stopgap solution like you're implying, only to throw it away once
>>> you've done it "for real", seems like a huge waste of effort. So
>>> overall I'd like to be sure we're in sync with whatever you're thinking
>>> to make sure that our efforts are additive and complementary and not
>>> redundant.
>>>
>>> Thanks,
>>>
>>> Steve
>>>
>>> On Fri, Apr 6, 2012 at 3:43 PM, Watanabe, Yasuko
>>> <[email protected]> wrote:
>>>
>>>> Hi Gabe,
>>>>
>>>> I also went through the code and got a sense of changes that need to
>>>> be made. You are right. The current infrastructure makes it difficult
>>>> to fix this issue.
>>>>
>>>> Yasuko
>>>>
>>>> -----Original Message-----
>>>> From: [email protected] [mailto:[email protected]] On
>>>> Behalf Of Gabe Black
>>>> Sent: Friday, April 06, 2012 12:18 PM
>>>> To: [email protected]
>>>> Subject: Re: [gem5-dev] Data dependency caused by flags
>>>>
>>>> It's complicated. Looking at it again I reminded myself of all the ways
>>>> it doesn't fit into the way the ISA parser does things, so it's going
>>>> to be quite a bit of work to fix properly. I don't have any ideas for
>>>> how to make it much simpler that would be at all practical.
>>>>
>>>> Gabe
>>>>
>>>> On 04/05/12 21:10, Watanabe, Yasuko wrote:
>>>>> Hi Gabe,
>>>>>
>>>>> Do you already have an idea of how to fix this? If so, can you give
>>>>> me some pointers?
>>>>>
>>>>> Yasuko
>>>>>
>>>>> -----Original Message-----
>>>>> From: [email protected] [mailto:[email protected]] On
>>>>> Behalf Of Gabe Black
>>>>> Sent: Thursday, April 05, 2012 6:12 PM
>>>>> To: [email protected]
>>>>> Subject: Re: [gem5-dev] Data dependency caused by flags
>>>>>
>>>>> Yes, you guys are right. This is a recognized problem, and I've made
>>>>> some changes over time which should make it easier to do this like a
>>>>> real x86 CPU would. I haven't yet, but it's on the horizon. I tend to
>>>>> be very busy, although circumstances may mean I have a little more or
>>>>> less time than normal for a little while so I don't know for sure
>>>>> when I'll get it fixed. If you have an idea of how to get it to do
>>>>> what you want locally, feel free. That will get you going, and when I
>>>>> get it fixed for real then you can start using that.
>>>>>
>>>>> Gabe
>>>>>
>>>>> On 04/05/12 17:18, Watanabe, Yasuko wrote:
>>>>>> Nilay,
>>>>>>
>>>>>> I agree with you. I think the dependencies of those flag bits should
>>>>>> be evaluated at bit level.
>>>>>>
>>>>>> Gabe and others,
>>>>>>
>>>>>> This change seems invasive. Do you know the best way to handle this?
>>>>>>
>>>>>> Yasuko
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: [email protected]
>>>>>> [mailto:[email protected]] On
>>>>>> Behalf Of Nilay Vaish
>>>>>> Sent: Thursday, April 05, 2012 3:35 AM
>>>>>> To: gem5 Developer List
>>>>>> Subject: Re: [gem5-dev] Data dependency caused by flags
>>>>>>
>>>>>> The code for the function genFlags() in
>>>>>> src/arch/x86/insts/microregop.cc suggests that the values of flag
>>>>>> bits not updated by the ADD instruction need to be retained. This
>>>>>> means that the previous values need to be read and written again,
>>>>>> which means the second ADD can be dependent on a value written by
>>>>>> the first ADD. If the dependencies were evaluated at bit level, then
>>>>>> these instructions would not be dependent.
>>>>>>
>>>>>> -- 
>>>>>> Nilay
>>>>>>
>>>>>> On Thu, 5 Apr 2012, Watanabe, Yasuko wrote:
>>>>>>
>>>>>>> I ran O3 CPU in FS mode in x86 with a simple microbenchmark and got
>>>>>>> a much lower IPC than the theoretical IPC. The issue seems to be
>>>>>>> data dependencies caused by (control) flags, not registers, and I
>>>>>>> am wondering if anyone has come across the same issue.
>>>>>>>
>>>>>>> The microbenchmark has many data-independent ADD instructions
>>>>>>> (http://repo.gem5.org/gem5/file/570b44fe6e04/src/arch/x86/isa/insts/general_purpose/arithmetic/add_and_subtract.py#l41)
>>>>>>> in a loop. On a 2-wide out-of-order machine with enough resources,
>>>>>>> the IPC should be two at steady state. However, the IPC only goes
>>>>>>> up to one. What is happening is that even though the ADDs have two
>>>>>>> source and one destination registers and a flag to set in x86, gem5
>>>>>>> adds one extra flag source register to the ADDs. As a result, each
>>>>>>> ADD becomes dependent on the earlier ADD's destination flag,
>>>>>>> constraining the achievable IPC to one.
>>>>>>>
>>>>>>> Here is an example sequence with physical register mappings:
>>>>>>> ADD: S1=98, S2=9, S3=2, D1=82, D2=105 (flag)
>>>>>>> ADD: S1=92, S2=9, S3=105 (flag), D1=79, D2=90 ...
>>>>>>>
>>>>>>> Physical registers 98, 9, and 92 are ready when those two ADDs are
>>>>>>> renamed; however, as you can see, the second ADD has to wait for
>>>>>>> the first ADD because of the extra flag source register S3. When I
>>>>>>> removed those flags in the macroop definition, the IPC jumped up
>>>>>>> from 1 to 1.7.
>>>>>>>
>>>>>>> Does anyone know why the ADD has to read the flags, even though the
>>>>>>> x86 manual does not say that? Those flags should just cause
>>>>>>> write-after-write dependency, not read-after-write.
>>>>>>>
>>>>>>> Yasuko
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> gem5-dev mailing list
>>>>>>> [email protected]
>>>>>>> http://m5sim.org/mailman/listinfo/gem5-dev
