It seems to me that if we can cover the following cases, most of the instructions would be covered:
1. All six condition codes -- (OF, SF, ZF, AF, PF, CF)
2. Two classes of five condition codes -- (OF,SF,ZF,PF,CF), (OF, SF, ZF, AF, PF)
3. One class of two condition codes -- (OF,CF)

Yasuko's current problem with ADD instructions would be solved if we handled just the first case, i.e., specified that if an instruction writes all the condition codes, then the condition code register is not assumed to be a source register.
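
The rule for the first case can be sketched roughly as follows. This is only an illustration, not gem5 code: the function name needs_cc_source and the string flag names are hypothetical, and the check covers only the read needed to preserve unwritten bits (an instruction such as ADC that consumes a flag as an input would still need the source for that reason).

```python
# Hypothetical sketch, not gem5 code: decide whether an instruction that
# writes some condition codes must also read the old flag register.
ALL_CC = frozenset({"OF", "SF", "ZF", "AF", "PF", "CF"})


def needs_cc_source(written_flags):
    """Return True if old flag values must be read to keep unwritten bits."""
    written = frozenset(written_flags)
    unknown = written - ALL_CC
    if unknown:
        raise ValueError("unknown flags: %s" % sorted(unknown))
    # No flags written: nothing to preserve. All six written: the old
    # values are fully overwritten, so no read is required.
    return bool(written) and written != ALL_CC
```

With this rule an ADD that writes all six flags has no flag source, while an instruction writing only (OF, CF) still does.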

--
Nilay

On Sat, 7 Apr 2012, Gabe Black wrote:

Yeah, I think we've talked about this topic in the past, but it was a
while ago and I don't remember exactly what all we talked about or the
conclusion(s) we reached.

The problem at the ISA level is that there are lots of instructions in
x86 which are pretty basic and used a lot (adds, subtracts, etc.) which
compute condition codes every time in case you need them. That, combined
with the fact that the instructions which update the condition codes
update somewhat erratic combinations of bits, means that lots of
instructions write the condition code bits, and those same common
instructions read them too so they can do a partial update.

This has happened to a lesser extent before, where there are control-like
bits and condition-code-like bits in the same register. To my knowledge
that's happened at least on SPARC, ARM, and x86. That's dealt with by
splitting the condition code bits out into their own register, which is
treated as a renamed integer register, while the control bits are
treated as a misc reg with all the overhead and special precautions.
That doesn't entirely work on x86, though, because even among the
condition code bits there are a lot of partial accesses as described
above. The cc bits could be broken down into individual bits, but that's
pretty cumbersome since there are, including the two artificial ones for
microcode, 8 of them I believe? That would be a lot of registers to
rename, would slow down the simulator, wouldn't be that realistic, etc.
What real CPUs do, according to someone in the know at AMD, is
gather up one group of flags, about 4 if I recall, and treat those
as a unit. The others are handled individually. The group of 4 is still
not 100% treated as a unit since some instructions modify just one of
them, for instance, but it's pretty close, optimizes for the common
case, and the odd cases can still work like they do today.
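
As a rough illustration of the grouping idea, the sketch below partitions the six flags into one grouped register plus individually renamed ones, then works out which registers a microop would need as destinations and which it would also have to read. The partition itself (SF, ZF, AF, PF as the group; OF and CF on their own) and the names FLAG_REGS and flag_operands are assumptions made for the example; the email does not say which flags belong to the group.

```python
# Hypothetical partition of the condition codes into renameable registers.
# The membership of "GROUP" is an assumption for illustration only.
FLAG_REGS = {
    "GROUP": frozenset({"SF", "ZF", "AF", "PF"}),
    "OF": frozenset({"OF"}),
    "CF": frozenset({"CF"}),
}


def flag_operands(written_flags):
    """Return (sources, destinations) as lists of flag register names."""
    written = frozenset(written_flags)
    sources, destinations = [], []
    for name, members in FLAG_REGS.items():
        touched = written & members
        if not touched:
            continue  # register untouched: neither source nor destination
        destinations.append(name)
        if touched != members:
            # Partial write: the old value must be read to keep other bits.
            sources.append(name)
    return sources, destinations
```

Under this assumed partition, an instruction writing all six flags or only (OF, CF) needs no flag sources at all, and one writing (OF, SF, ZF, PF, CF) reads only the group register.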

The difficulty implementing this is that exactly which condition code
bits to set and which to check for conditional microops are decided at
the microcode level and are arbitrary combinations. They don't need to
be completely arbitrary, but that means that microops really effectively
know which, how many, etc., condition code registers they need at
construction time as opposed to compile time. So what we'd need to do is
to allow the constructor for a microop to look at the flags it was being
given and to use that to more programmatically figure out which registers
it had as sources or destinations, and how many. The body of the
instructions themselves would need to be sophisticated enough to pull
together the different source registers, whatever they are, and to
process them appropriately with a consistent bit of code (and not 18
different parameters to some function where 14 aren't used at any
particular time). It would also have to know how to split things back up
again when writing out the results.

What I did to move us a little bit in this direction is to make the
types of operands much more flexible so that we can have structures,
typedefs, etc. What we'd still need is truly composite operand types
where a single operand, for instance the condition code bits, is built
from a set of registers (determined in some way appropriate to the
operand) and/or written back to a set of registers, but which could be
handled easily as a single value inside the code blob. Then we can avoid
having 100(s) of versions of microops for all the different combinations
of flag bits, which would be a terrible thing to have to live with.
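
A composite operand of this kind might look something like the sketch below: one object gathers a single flags value from several backing registers and scatters a result back to only the registers it touches. The class CompositeFlags, its methods, and the dictionary standing in for a register file are all hypothetical; only the bit positions (CF=0, PF=2, AF=4, ZF=6, SF=7, OF=11) are the architectural RFLAGS positions.

```python
# Hypothetical sketch of a composite operand; not the ISA parser's API.
BIT = {"CF": 0, "PF": 2, "AF": 4, "ZF": 6, "SF": 7, "OF": 11}


def mask_of(flags):
    """Build a bit mask covering the named flags."""
    mask = 0
    for flag in flags:
        mask |= 1 << BIT[flag]
    return mask


class CompositeFlags:
    def __init__(self, reg_members):
        # reg_members maps a register name to the flags it holds.
        self.reg_masks = {r: mask_of(fs) for r, fs in reg_members.items()}

    def gather(self, regfile):
        """Combine the backing registers into one flags value."""
        value = 0
        for reg, mask in self.reg_masks.items():
            value |= regfile[reg] & mask
        return value

    def scatter(self, regfile, value, written_mask):
        """Write back only the bits in written_mask, register by register."""
        for reg, mask in self.reg_masks.items():
            hit = mask & written_mask
            if hit:
                regfile[reg] = (regfile[reg] & ~hit) | (value & hit)
```

The code blob for a microop would then work on the single gathered value, and the set of registers actually read or written would follow from the masks rather than from a separate microop class per combination.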

As far as easier ways to deal with this, there is only one which is what
I was alluding to in what I think was my earliest email, and that's to
just hack around it so the instructions you know you're using in the
performance sensitive part behave incorrectly generally speaking, but do
what you expect for the benchmark. Maybe they'd even have to know where
they were running from, that they were in a range of ticks, etc. A gross
and terrible hack unfit to check in, but something that would get the
poster unstuck for now. Doing things the "right" way will take some
infrastructure work, and that may not be very quick. I don't think
there's any real shortcut around doing the infrastructure work that
doesn't have a pretty heavy cost (like blowing up the number of microop
classes 100 fold).

Gabe

On 04/07/12 11:32, Steve Reinhardt wrote:
Hi Gabe,

Your earlier email said "I've made some changes over time which should make
it easier to do this like a real x86 CPU would".  Could you expand on that?
 It sounded like you had some sort of plan or direction at least.  If we're
going to start working on this ourselves, it would be best if we can
benefit from whatever insights you've had or preliminary work you've done.

I see your later email says "I don't have any ideas for how to make it much
simpler", but that seems to contradict what you said at first.  In
particular, you also earlier said "If you have an idea of how to get it to
do what you want locally, feel free. That will get you going, and when I
get it fixed for real then you can start using that.".  I'd like to
explicitly reject that idea... for one thing, I'm not sure what a "local"
solution would look like, and more importantly, this issue seems
complicated enough that us doing some sort of temporary or stopgap solution
like you're implying, only to throw it away once you've done it "for real",
seems like a huge waste of effort.  So overall I'd like to be sure we're in
sync with whatever you're thinking to make sure that our efforts are
additive and complementary and not redundant.

Thanks,

Steve

On Fri, Apr 6, 2012 at 3:43 PM, Watanabe, Yasuko <[email protected]> wrote:

Hi Gabe,

I also went through the code and got a sense of changes that need to be
made. You are right. The current infrastructure makes it difficult to fix
this issue.

Yasuko

-----Original Message-----
From: [email protected] [mailto:[email protected]] On
Behalf Of Gabe Black
Sent: Friday, April 06, 2012 12:18 PM
To: [email protected]
Subject: Re: [gem5-dev] Data dependency caused by flags

It's complicated. Looking at it again I reminded myself of all the ways it
doesn't fit into the way the ISA parser does things, so it's going to be quite
a bit of work to fix properly. I don't have any ideas for how to make it
much simpler that would be at all practical.

Gabe

On 04/05/12 21:10, Watanabe, Yasuko wrote:
Hi Gabe,

Do you already have an idea of how to fix this? If so, can you give me
some pointers?
Yasuko

-----Original Message-----
From: [email protected] [mailto:[email protected]] On
Behalf Of Gabe Black
Sent: Thursday, April 05, 2012 6:12 PM
To: [email protected]
Subject: Re: [gem5-dev] Data dependency caused by flags

Yes, you guys are right. This is a recognized problem, and I've made
some changes over time which should make it easier to do this like a real
x86 CPU would. I haven't yet, but it's on the horizon. I tend to be very
busy, although circumstances may mean I have a little more or less time
than normal for a little while so I don't know for sure when I'll get it
fixed. If you have an idea of how to get it to do what you want locally,
feel free. That will get you going, and when I get it fixed for real then
you can start using that.
Gabe

On 04/05/12 17:18, Watanabe, Yasuko wrote:
Nilay,

I agree with you. I think the dependencies of those flag bits should be
evaluated at bit level.
Gabe and others,

This change seems invasive. Do you know the best way to handle this?

Yasuko

-----Original Message-----
From: [email protected] [mailto:[email protected]] On
Behalf Of Nilay Vaish
Sent: Thursday, April 05, 2012 3:35 AM
To: gem5 Developer List
Subject: Re: [gem5-dev] Data dependency caused by flags

The code for the function genFlags() in
src/arch/x86/insts/microregop.cc suggests that the values of flag bits not
updated by the ADD instruction need to be retained. This means that the
previous values need to be read and written again, which means the second
ADD can be dependent on a value written by the first ADD. If the
dependencies were evaluated at the bit level, then these instructions would not
be dependent.
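
The read-modify-write pattern described above can be sketched as follows. This is not the body of genFlags(); merge_flags and CC_MASK are hypothetical names, and the example only shows why the old value stops mattering once the write mask covers every condition code bit.

```python
# Hypothetical sketch of a partial flag update; not the gem5 genFlags() code.
# Mask of the six condition codes at their RFLAGS bit positions:
# CF=0, PF=2, AF=4, ZF=6, SF=7, OF=11.
CC_MASK = sum(1 << bit for bit in (0, 2, 4, 6, 7, 11))


def merge_flags(old_flags, computed_flags, write_mask):
    """Keep old bits outside write_mask, take computed bits inside it."""
    return (old_flags & ~write_mask) | (computed_flags & write_mask)
```

When write_mask equals CC_MASK, the result is the same for any old_flags value restricted to those six bits, so the read is only needed for partial updates.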
--
Nilay

On Thu, 5 Apr 2012, Watanabe, Yasuko wrote:

I ran O3 CPU in FS mode in x86 with a simple microbenchmark and got
a much lower IPC than the theoretical IPC. The issue seems to be
data dependencies caused by (control) flags, not registers, and I am
wondering if anyone has come across the same issue.

The microbenchmark has many data independent ADD instructions
(http://repo.gem5.org/gem5/file/570b44fe6e04/src/arch/x86/isa/insts/general_purpose/arithmetic/add_and_subtract.py#l41)
in a loop. On a 2-wide out-of-order machine with enough resources,
the IPC should be two at steady state. However, the IPC only goes
up to one. What is happening is that even though the ADDs have two
source and one destination registers and a flag to set in x86, gem5
adds one extra flag source register to the ADDs. As a result, each
ADD becomes dependent on the earlier ADD's destination flag,
constraining the achievable IPC to one.

Here is an example sequence with physical register mappings:
ADD: S1=98, S2=9, S3=2, D1=82, D2=105 (flag)
ADD: S1=92, S2=9, S3=105 (flag), D1=79, D2=90 ...

Physical registers 98, 9, and 92 are ready when those two ADDs are
renamed; however, as you can see, the second ADD has to wait for the
first ADD because of the extra flag source register S3. When I
removed those flags in the macroop definition, the IPC jumped up from
1 to 1.7.
Does anyone know why the ADD has to read the flags, even though the
x86 manual does not say that? Those flags should just cause
write-after-write dependency, not read-after-write.
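
The effect of the extra flag source on IPC can be reproduced with a toy model. The sketch below is a deliberately idealized scheduler and not the O3 CPU: it assumes single-cycle ADDs, unlimited resources apart from issue width, and no front-end limits, and the name simulate_ipc is hypothetical.

```python
# Toy model, not gem5: issue a stream of ADDs whose register operands are
# always ready, optionally chaining each ADD to the previous one's flags.
def simulate_ipc(num_adds, width, chain_on_flags):
    """Return instructions per cycle for the idealized ADD stream."""
    done_cycle = []       # cycle at which each ADD's result is available
    issued_in_cycle = {}  # number of ADDs issued in each cycle
    for i in range(num_adds):
        earliest = 0
        if chain_on_flags and i > 0:
            # Read-after-write on the flag register written by the prior ADD.
            earliest = done_cycle[i - 1]
        cycle = earliest
        while issued_in_cycle.get(cycle, 0) >= width:
            cycle += 1  # issue slots full, try the next cycle
        issued_in_cycle[cycle] = issued_in_cycle.get(cycle, 0) + 1
        done_cycle.append(cycle + 1)  # single-cycle execution latency
    return num_adds / max(done_cycle)
```

In this model a 2-wide machine reaches an IPC of 2.0 without the flag chain and 1.0 with it. The measured 1.7 after removing the flags is lower than the ideal 2.0 because a real pipeline has overheads this sketch leaves out.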

Yasuko

_______________________________________________
gem5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/gem5-dev
