Quoting Steve Reinhardt:
On Fri, Oct 22, 2010 at 10:57 AM, Gabe Black wrote:
Is this just to get STUPD to be a single uop instead of two
uops that communicate via a temp reg, without forcing dependent
instructions to wait for the STUPD to commit to get the updated base
value?
I wouldn't say "just", but essentially yes.
So it seems like the overriding question is: is all this hassle really
worth it? How often do we use a STUPD uop dynamically anyway?
That is a good question. Since the operation isn't visible
architecturally (at least as far as I remember), I think its use boils
down to stack pushes and perhaps some other microcoded operations
using the stack like constructing exception stack frames and that sort
of thing. Pushes and pops would normally be considered a core part of
the ISA, but I think compilers generally just add or subtract from the
stack pointer instead. For some things, like the flags register,
pushes and pops are the only expedient way to get at them, so it's at
least partially unavoidable.
I'll instrument the simulator in the near future and measure what
percentage of instructions are STUPDs. I'll run both SE and FS
workloads, since the results will likely differ.
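As a rough sketch of the kind of tally I have in mind (UopTally and its
members are made up for illustration and are not the simulator's stats
interface; it also assumes the op can be recognized by its mnemonic):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical tally of dynamically executed ops; illustration only.
struct UopTally {
    std::uint64_t total = 0;  // every op recorded
    std::uint64_t stupd = 0;  // ops whose mnemonic is "stupd"

    // Record one dynamic op by mnemonic.
    void record(const std::string &mnemonic) {
        ++total;
        if (mnemonic == "stupd")
            ++stupd;
    }

    // Percentage of recorded ops that were stupd; 0 if nothing recorded.
    double stupdPercent() const {
        assert(stupd <= total);
        if (total == 0)
            return 0.0;
        return 100.0 * static_cast<double>(stupd) /
               static_cast<double>(total);
    }
};
```

Keeping one tally per run would let the SE and FS numbers be compared
directly.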
Do we need another execution phase like completeTrans() that can be
overridden here? Generally it's not unreasonable to say that any
exception that occurs post-translation on a store is imprecise... I
don't know if x86 specifically has any exceptions to that rule.
I think that would be a fairly major change, and 99% of the time
completeTrans either wouldn't be used or wouldn't do anything, depending
on how it's implemented.
I'm not overwhelmingly concerned about that... O3 is slow enough that
doing one more virtual function call per dynamic memory access (that
will typically hit in the BTB if all the no-op versions point to the
same base implementation) probably won't make a major difference.
Same with calling completeAcc() on stores, though in that case I agree
that it still isn't really the right point to do the update. In fact,
since O3 explicitly checks to see if an instruction is a
store-conditional to know whether to call completeAcc(), it might even
be faster to call completeAcc() unconditionally and let the virtual
function call replace that if test.
That's true.
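To make the two options concrete, here is a toy sketch. ToyInst,
ToyStoreCond and the finish* helpers are invented stand-ins, not the
real O3 or StaticInst code, and the register is reduced to a plain int:

```cpp
#include <cassert>

// Toy stand-in for a static instruction; not the simulator's class.
struct ToyInst {
    bool storeCond = false;  // flag the CPU model would test
    virtual ~ToyInst() = default;
    // Shared no-op default, so most instructions resolve to one target.
    virtual void completeAcc(int &reg) const { (void)reg; }
};

// Toy store-conditional that reports success through a register.
struct ToyStoreCond : ToyInst {
    ToyStoreCond() { storeCond = true; }
    void completeAcc(int &reg) const override { reg = 1; }
};

// Option 1: test the flag, then make the call only when needed.
inline void finishChecked(const ToyInst &inst, int &reg) {
    if (inst.storeCond)
        inst.completeAcc(reg);
}

// Option 2: always call; the no-op default takes the place of the test.
inline void finishUnconditional(const ToyInst &inst, int &reg) {
    inst.completeAcc(reg);
}
```

Both options leave the register in the same state for either kind of
instruction; the only difference is whether the dispatch or the flag
test does the filtering.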
I don't think we're talking about exceptions
post translation, just during translation.
Yeah, what I meant was that if you do the update post translation
(including waiting for a delayed translation, so you know the
translation didn't fault), then you don't have to worry about rolling
it back because the instruction won't take a later exception, so it
would be safe to "commit" the value at that point. That does force
the update to potentially wait for a page-table walk though which is
still not ideal.
So one annoying thing is that there's no benefit to doing the update
in initiateAcc() for TimingSimpleCPU; the only reason to make that
work is so that we can do it in initiateAcc() in O3 and have the same
code work in both places. It seems like the problem is that we either
call execute() or initiateAcc()/completeAcc(), and in this case we
really want to continue to call execute() to do the update in addition
to using initiateAcc()/completeAcc(). Again, the easy way to do this
is to use two uops. If we really feel we need an alternative, it
still feels to me like the right thing to do is to define some new
StaticInst method that gets called when initiateAcc() gets called in
O3, but gets called when the instruction commits in TimingSimpleCPU.
Either that or find a way for the instruction to know which model it's
in, and do the update in initiateAcc() for O3 and in completeAcc() for
TimingSimpleCPU. (I really don't like that last one, but I still like
it better than implementing speculation via a temp reg inside the
instruction definition itself.)
Yeah, I -really- don't like the last one :-). The STUPD measurement
will likely help shape the conversation here, and if Ali were willing
to look at how often this sort of thing happens in ARM that would be
good to know too. Since it's architecturally visible there, I'd guess
it happens a lot more often. I'd like to try to do this with a really
light mechanism, but an extra static inst method may be the way to go
in the end.
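For reference, the extra-method idea might look roughly like this. It
is only a toy: ToyStore, updateBase() and the two run* functions are
hypothetical names, and the real CPU models are far more involved:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy store-with-update instruction; not the real StaticInst.
struct ToyStore {
    int base = 100;                // base register value
    int disp = 8;                  // displacement added by the update
    std::vector<std::string> log;  // order in which methods ran

    void initiateAcc() { log.push_back("initiateAcc"); }
    void completeAcc() { log.push_back("completeAcc"); }
    // Hypothetical extra hook that holds the base-register update.
    void updateBase() {
        base += disp;
        log.push_back("updateBase");
    }
};

// O3-style model: run the hook alongside initiateAcc() so dependent
// instructions can see the new base early.
inline void runO3Style(ToyStore &inst) {
    inst.initiateAcc();
    inst.updateBase();
    inst.completeAcc();
}

// TimingSimpleCPU-style model: defer the hook until commit.
inline void runTimingStyle(ToyStore &inst) {
    inst.initiateAcc();
    inst.completeAcc();
    inst.updateBase();  // stands in for the commit point
}
```

The instruction definition is the same in both cases; only the CPU
model decides when the hook fires.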
Gabe