Quoting Steve Reinhardt <[email protected]>:
On Fri, Oct 22, 2010 at 11:14 AM, Gabe Black <[email protected]> wrote:
One complication here is instructions that cross (pick unit)
boundaries. [...]
That's not a fundamental problem, but it would mean that we'd have to
have some extra mechanism to do the aggregation.
I agree, that doesn't strike me as so hard to handle. You can compare
the cached raw machine instruction with the fetch byte stream up to
the point you've fetched, and if they differ before then, you know
it's a miss; if they match up to that point, you know that the
predecoder would have had to fetch more bytes anyway to distinguish
the cached StaticInst from any other possibilities. So there'd be
some restructuring but that's it.
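To sketch the kind of prefix check I mean (the names here, like
DecodeCacheEntry and prefixMatches, are made up for illustration, not
existing code):

#include <cstdint>
#include <cstring>

struct DecodeCacheEntry {
    uint8_t rawBytes[16];   // instruction bytes as originally predecoded
    int length;             // how many of them the predecoder consumed
    // ... cached StaticInst pointer, etc.
};

// Returns true if the cached entry can't be ruled out yet: the fetched
// bytes match as far as they go, so either it's a hit (once fetchedLen
// reaches the cached length) or we'd have had to fetch more bytes anyway
// to tell this instruction apart from any other possibility.
bool
prefixMatches(const DecodeCacheEntry &e, const uint8_t *fetched, int fetchedLen)
{
    int n = fetchedLen < e.length ? fetchedLen : e.length;
    if (std::memcmp(e.rawBytes, fetched, n) != 0)
        return false;       // bytes differ before the end: definite miss
    return true;
}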
I'd have to think about this more, but my instinct is that it might be
messier than it first appears.
A nice property of putting that in the predecoder is that it hides the
special handling from all the other parts of the CPU, etc.
Even for RISC ISAs we should still be grabbing a cache line at a time
from the icache/memory system, IMO. (Do we do this already or not?)
We do in O3 but not in the simple CPUs if I remember correctly. I
don't remember what InOrder does. There are some complications doing
things this way with self-modifying code. I remember one instance at
VMware where someone mentioned that the Intel manuals supposedly said
that between control flow instructions code wouldn't necessarily be
checked for modification, and a recently ex-AMD engineer was surprised
by that. There's apparently some ambiguity in how that sort of thing
is supposed to work.
One downside of putting it in the predecoder, which you've brought up,
is that it means there are two levels of caching and one level just
feeds a lookup in the next. Most of the time there would be a hit, I'd
imagine, and it would be nice to short-circuit that and just go directly
to the StaticInst. I don't see a good way to capture both of these
benefits, and choosing between the performance boost and the
cleaner/simpler/more compartmentalized implementation I'd go with the
latter. If you -do- see a way, please let me know.
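Roughly, the two-level arrangement I'm picturing looks like the
following sketch. The names are made up, and keying the first level on a
fixed-width chunk of raw bytes is an oversimplification for a
variable-length ISA:

#include <cstdint>
#include <memory>
#include <unordered_map>

struct ExtMachInst { uint64_t bits; /* plus contextualizing fields */ };
struct StaticInst { /* decoded instruction */ };

struct PredecodeEntry {
    ExtMachInst emi;                     // what the predecoder produced
    std::shared_ptr<StaticInst> inst;    // optional short circuit past level two
};

// Level one, keyed on raw instruction bytes; level two, keyed on the EMI.
std::unordered_map<uint64_t, PredecodeEntry> predecodeCache;
std::unordered_map<uint64_t, std::shared_ptr<StaticInst>> decodeCache;

std::shared_ptr<StaticInst>
lookup(uint64_t rawBytes)
{
    auto it = predecodeCache.find(rawBytes);
    if (it == predecodeCache.end())
        return nullptr;                  // level-one miss: predecode from scratch
    if (it->second.inst)
        return it->second.inst;          // short circuit: skip the second lookup
    auto dit = decodeCache.find(it->second.emi.bits);
    return dit == decodeCache.end() ? nullptr : dit->second;
}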
Given the potential magnitude of the performance increase, I think
compartmentalizing things by having two versions of the code (one for
fixed-length ISAs and one for variable-length ISAs) and an
ISA-dependent flag to choose between them is good enough. Obviously
more modularity is better, but even if that's the best we can do I
wouldn't throw the idea out on that basis alone. And as I said above,
a decoupled block-at-a-time fetch stage might end up being a good idea
even for the RISC ISAs.
I have the start of an idea floating around in my head of putting this
sort of caching mechanism in the predecoder but then making it
cooperate with the inst cache somehow. Or maybe having an index into
the cache based on either the ExtMachInst or the byte stream. I think
splitting things out would really make things a lot bigger and harder
to understand. Also, having multiple scenarios, and hence sets of
requirements something can run under (be that the CPUs or the ISAs),
makes development harder. We're seeing that with the base update
stuff, I think.
The idea behind having multiple decoders is that rather than have one
decoder that decides every time what mode you're in, there would be
multiple decoders, one for each mode. When the control state changed in
a way that dictated a different decoder be used, the other one would be
switched in. Similarly, for predecoders, information like the default
size of various registers, etc., could be baked in and a variant
selected ahead of time based on the current mode. In both cases the
individual decoder/predecoder could maintain its own separate cache, or
selectively share when possible, and not worry about the contextualizing
info as much. Basically this maintains the one-to-one mapping into a
particular pool of instructions but drops it globally.
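As a rough sketch of what I mean (the class names are made up, not
anything that exists in the tree today):

#include <cstdint>

struct MachInst { uint64_t bits; };
struct StaticInst { /* decoded instruction */ };

class Decoder {
  public:
    virtual ~Decoder() {}
    virtual StaticInst *decode(MachInst) = 0;
};

class LongModeDecoder : public Decoder {
  public:
    // Defaults for 64-bit long mode are baked in here instead of being
    // rediscovered for every instruction that passes through.
    StaticInst *decode(MachInst) override { /* ... */ return nullptr; }
};

class LegacyModeDecoder : public Decoder {
  public:
    StaticInst *decode(MachInst) override { /* ... */ return nullptr; }
};

class DecodeStage {
    LongModeDecoder longMode;
    LegacyModeDecoder legacyMode;
    Decoder *current = &legacyMode;
  public:
    // Called only when the control state actually changes, not per fetch.
    void modeChanged(bool longModeNow) {
        if (longModeNow)
            current = &longMode;
        else
            current = &legacyMode;
    }
    StaticInst *decode(MachInst inst) { return current->decode(inst); }
};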
I guess I understand the idea but I don't really grasp the
significance. If you structure your single decoder like:
decode(ExtMachInst i) {
    if (i.context.inFooMode()) {
        decode_foo(i);
    } else if (i.context.inBarMode()) {
        decode_bar(i);
    } ...
}
or even if you split out the context so the signature is:
decode(DecodeContext ctx, MachInst i)
then what is the distinction between that and having multiple decoders?
For instance, right now the predecoder in x86 computes, for every
single instruction that passes through it, what the operand size,
address size, stack size, and mode should be. Some of that information
may not change in hours of simulation, and at most would change very
infrequently, but it gets rediscovered over and over and over and over
to contextualize ExtMachInsts for the regular decoder. It would be a
huge performance win, I think, considering how often that's called, if
the 64-bit long mode predecoder could be installed and called through a
virtual function that already knew all that stuff and just plunked it
in place with a simple copy. This is the strongest use case, I think.
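Sketching that with made-up names, and with the sizes hard-coded for
illustration rather than computed once from machine state when the
variant is installed:

#include <cstdint>

struct ExtMachInst {
    uint8_t opSize, addrSize, stackSize, mode;
    // ... opcode bytes, prefixes, immediates, etc.
};

class Predecoder {
  public:
    virtual ~Predecoder() {}
    virtual void contextualize(ExtMachInst &emi) = 0;
};

class LongModePredecoder : public Predecoder {
  public:
    void contextualize(ExtMachInst &emi) override {
        // Everything here was decided when this variant was selected,
        // so per instruction it's a simple copy instead of re-deriving
        // the sizes and mode from control registers and segment state.
        emi.opSize = 4;
        emi.addrSize = 8;
        emi.stackSize = 8;
        emi.mode = 2;       // stand-in encoding for 64-bit long mode
    }
};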
Doesn't decode(DecodeContext *this, MachInst i) look like the C
version of DecodeContext::decode(MachInst i)? :-)
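To spell that out with placeholder types:

struct MachInst { unsigned long long bits; };
struct StaticInst;

struct DecodeContext {
    int mode;   // whatever state decoding depends on
    // the C++ spelling
    StaticInst *decode(MachInst) { /* dispatch on mode, etc. */ return nullptr; }
};

// the C spelling: same state, same dispatch, just passed explicitly
StaticInst *decode(DecodeContext *ctx, MachInst i)
{
    return ctx->decode(i);
}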
The regular decoder is less that way since it feeds a lot of that
information into the instructions where they use it, but it does still
have to make an occasional decision about what mode it's in.
Steve
_______________________________________________
m5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/m5-dev