I've been tinkering around with this (reworking the
predecode/decode/caching flow), and I think we'll need to be somewhat
clever with how the pipeline is split up and how the pieces are hooked
together to enable code reuse between ISAs and keep things efficient.
The way things work now is as follows:
1. MachInsts come in from fetch. These are fixed-size chunks of raw
bytes; on Alpha, MIPS, and Power, a MachInst is the same as an
instruction in memory.
2. The predecoder takes in MachInsts and uses one or more of them to
generate one or more ExtMachInsts. One MachInst may generate several
ExtMachInsts, and one ExtMachInst may come from several MachInsts. An
ExtMachInst is all the information in the source MachInsts plus
contextualizing state (CPU mode, etc.), all put in a fixed-layout structure.
3. Page cache. This is a cache of page-sized arrays of StaticInstPtrs
(i.e. one StaticInstPtr per byte). The appropriate array is selected, the
StaticInstPtr is read out based on the page offset, and if there's
something there, its ExtMachInst is compared against the incoming one.
If there's a match, that StaticInstPtr is returned and decoding stops.
4. ExtMachInst-based hash. The incoming ExtMachInst is used to index
into a hash, ignoring the PC. If there's a hit, the resulting StaticInstPtr
is used to update the page cache, is returned, and decoding stops.
5. The actual decode function. The ExtMachInst is passed to the ISA
defined decode function. The result is used to update the ExtMachInst
hash, the page cache, is returned, and this is the end of the line for
decoding.
In summary:
(MachInsts) -> predecoder -> (ExtMachInsts) -> page cache -> ExtMachInst
hash -> decode function.
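As a rough sketch of how steps 3-5 chain together, using simplified, hypothetical stand-ins for gem5's real MachInst/ExtMachInst/StaticInstPtr types (not the actual gem5 code), it might look something like:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <memory>
#include <unordered_map>

// Hypothetical, simplified stand-ins for the real gem5 types.
using ExtMachInst = uint64_t;
struct StaticInst { ExtMachInst machInst; };
using StaticInstPtr = std::shared_ptr<StaticInst>;

static const size_t PageBytes = 4096;

// One StaticInstPtr slot per byte of the page, tagged by the
// ExtMachInst that produced it.
struct DecodePage { std::array<StaticInstPtr, PageBytes> insts; };

class DecodeCache {
    std::unordered_map<uint64_t, DecodePage> pages;         // keyed by page addr
    std::unordered_map<ExtMachInst, StaticInstPtr> instMap; // PC-agnostic hash

  public:
    StaticInstPtr decode(ExtMachInst emi, uint64_t pc) {
        DecodePage &page = pages[pc & ~(uint64_t)(PageBytes - 1)];
        StaticInstPtr &slot = page.insts[pc & (PageBytes - 1)];
        // Step 3. Page cache: hit only if the cached inst came from
        // the same ExtMachInst.
        if (slot && slot->machInst == emi)
            return slot;
        // Step 4. ExtMachInst hash, ignoring the PC; on a hit, update
        // the page cache on the way out.
        auto it = instMap.find(emi);
        if (it != instMap.end())
            return slot = it->second;
        // Step 5. Fall through to the ISA's real decode function
        // (faked here), then update both caches.
        StaticInstPtr si = std::make_shared<StaticInst>(StaticInst{emi});
        instMap[emi] = si;
        return slot = si;
    }
};
```

The second lookup with the same ExtMachInst at a different PC hits the hash rather than the page cache, which is the point of having both levels.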
There are several problems with this pipeline that we talked about in
the earlier email. First, caching doesn't kick in until the middle of
the process. For ISAs like Alpha where the "predecoder" trivially ORs
in a bit or not, this still hides most of the work on hits. On x86,
and to some extent ARM, the predecoder does a lot of work to find the end
of an instruction, identify its parts, and add contextualizing
information. All of this is done for every single instruction, whether or
not there's a hit in the cache.
there's a hit in the cache. Also, the predecoder copies the same
information into the ExtMachInst over and over and over to communicate
it to later stages of the decode pipeline. That means lots of moving
data around when we could just put it in one place and let the relevant
parties see it.
From a design perspective, there are two separate phases of decode, the
predecoder and decoder, but the CPU doesn't care about the step in the
middle and just funnels the intermediate values, the ExtMachInsts,
between the two steps. That makes life harder for the CPU, and it also
makes things less flexible for the ISA because it has to maintain an
interface in the middle of its decode process.
There are several changes I want to make to how decoding works. First, I
want to refactor the decoding interface so that there's a single object
which handles all of the decoding on the ISA side. The front end
interface to it will be the same as the current predecoder's, and it will
handle passing things to the decode pieces internally.
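A rough sketch of what that front-end interface might look like for a fixed-size ISA, assuming it mirrors today's predecoder (the names moreBytes/instReady and the trivial bodies are illustrative, not the real signatures):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical simplified types, not the real gem5 ones.
using MachInst = uint32_t;
struct StaticInst { MachInst mi; };
using StaticInstPtr = StaticInst *;

// For a fixed-size ISA the predecode step is trivial: moreBytes()
// just latches the fetched word and an instruction is immediately
// ready. The CPU never sees an ExtMachInst; it just calls decode().
class Decoder {
    MachInst emi = 0;
    bool ready = false;

  public:
    void moreBytes(uint64_t pc, MachInst data) {
        emi = data;
        ready = true;
    }
    bool instReady() const { return ready; }
    StaticInstPtr decode() {
        ready = false;
        // Stand-in for the ISA's decode pipeline (caches + decode
        // function), all hidden behind this one object. Leaks in
        // this toy sketch.
        return new StaticInst{emi};
    }
};
```

The CPU's loop becomes "feed bytes, ask if ready, pull an instruction", with everything in between owned by the ISA.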
Building on that, I want to make the decode function a member of the
ISA's Decoder class. Then it can use information stored in the object as
context when decoding, preventing that information from having to be
copied into every ExtMachInst. To handle the loss of context when using
the ExtMachInst as a hash key or a tag in the page cache, the caches
themselves would be selected based on the particular current
configuration of the contextualizing state.
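One way to select caches by context, with a made-up DecodeContext key standing in for whatever contextualizing state would otherwise get copied into every ExtMachInst:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <tuple>
#include <unordered_map>

// Hypothetical context key: the state that used to be baked into
// every ExtMachInst (CPU mode, default operand size, etc.).
struct DecodeContext {
    uint8_t mode;
    uint8_t opSize;
    bool operator<(const DecodeContext &o) const {
        return std::tie(mode, opSize) < std::tie(o.mode, o.opSize);
    }
};

using ExtMachInst = uint64_t;
struct StaticInst {};
using StaticInstPtr = StaticInst *;
using InstMap = std::unordered_map<ExtMachInst, StaticInstPtr>;

class Decoder {
    DecodeContext ctx{};              // local state, not a global
    std::map<DecodeContext, InstMap> caches; // one hash per context

  public:
    void setContext(const DecodeContext &c) { ctx = c; }
    // The ExtMachInst no longer needs to carry the context: the
    // whole cache instance is chosen by it instead.
    InstMap &currentCache() { return caches[ctx]; }
};
```

An entry cached under one context is invisible under another, so the same raw bits can decode differently in different modes without ever colliding.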
Because the decoder can now contain state, it no longer has to read the
global FS/SE bool when decoding instructions; it can use its own local
copy. That means different decoders in the same simulation can operate
in different modes at the same time.
To address the problem of the predecoder running all the time, I want to
put a cache in front of it which knows three things about the
instructions it's already seen: where they started, how long they were,
and what bytes went into them. As the MachInsts come in, if the current
PC matches one that was already seen, and the bytes that were seen last
time are the same as the incoming bytes, then the whole decode process
can be short-circuited and the StaticInstPtr that was generated the last
time can be returned again. Steve suggests this may make one of the
other caches redundant, like the page cache. I think this is quite
possibly true.
The pipeline would then look like this:
(MachInsts) -> MachInst cache[ctx] -> predecoder -> (ExtMachInsts) ->
?page cache? -> ExtMachInst hash[ctx] -> decode function.
Here [ctx] is a reminder that the specific instance of the cache/hash
has been selected based on the current contextual state.
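The MachInst cache described above might be sketched like this (all names are hypothetical, and a real version would also need invalidation on writes to cached code):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <unordered_map>
#include <vector>

struct StaticInst {};
using StaticInstPtr = StaticInst *;

// One entry per instruction already seen: keyed by where it started,
// remembering exactly which bytes produced it last time (and hence
// how long it was).
struct CachedInst {
    std::vector<uint8_t> bytes; // the bytes that were decoded
    StaticInstPtr inst;         // the result from last time
};

class MachInstCache {
    std::unordered_map<uint64_t, CachedInst> entries; // keyed by start PC

  public:
    // If the PC was seen before and the incoming bytes match what was
    // decoded last time, short-circuit the whole decode process and
    // return the cached StaticInstPtr; otherwise return nullptr and
    // let the caller fall through to the predecoder.
    StaticInstPtr lookup(uint64_t pc, const uint8_t *fetched, size_t avail) {
        auto it = entries.find(pc);
        if (it == entries.end())
            return nullptr;
        const CachedInst &ci = it->second;
        if (ci.bytes.size() > avail ||
            memcmp(ci.bytes.data(), fetched, ci.bytes.size()) != 0)
            return nullptr;
        return ci.inst;
    }

    void insert(uint64_t pc, const uint8_t *fetched, size_t len,
                StaticInstPtr inst) {
        entries[pc] = CachedInst{{fetched, fetched + len}, inst};
    }
};
```

For a fixed-size ISA the length check and byte shuffling collapse to a single word compare, which is the motivation for making these pieces swappable per ISA rather than one-size-fits-all.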
Up to here is what we've talked about before in various emails/threads.
This new layout makes good sense for x86, but I don't want to make
things less efficient for ISAs that have fixed size instructions or
trivial predecoders. On the Alpha side, it may well be more efficient to
leave out the MachInst cache and continue to use the page cache. If we
do keep the MachInst cache, then it would be a waste to have machinery
in there that stored/checked the size of the incoming instruction and
moved bytes around to compare only the relevant parts. To me this all
sounds like these pieces should be interchangeable and replaceable so
that we can have a standard toolbox of steps and apply them
intelligently per ISA.
Along those lines, it would be nice to organize the decode flow as a
call stack. Without one, it looks like this:
A
if (A hit) {
    return result;
}
B
if (B hit) {
    update A;
    return result;
}
C
if (C hit) {
    update B;
    update A;
    return result;
}
D
if (D hit) {
    update C;
    update B;
    update A;
    return result;
}
That's simple, but also clumsy and inflexible. When organized as a call
stack it looks more like this:
A()
{
    if (!hit) {
        B();
        update;
    }
    return result;
}

B()
{
    if (!hit) {
        C();
        update;
    }
    return result;
}
That lends itself to being more modular, I think. The tricky part is
that A and B and C, etc., need to be plugged into each other at compile
time, probably with templates, and because each step has somewhat
different function signatures it gets a little more complicated. This
may not be as complicated as I imagine it is, but a clean implementation
hasn't dropped out of this yet.
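One possible shape for the template plumbing, with each caching stage parameterized on the stage below it (all names and types are made up for illustration, and the signature mismatches between real stages are glossed over here):

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

using ExtMachInst = uint64_t;
struct StaticInst { ExtMachInst emi; };
using StaticInstPtr = StaticInst *;

// Innermost stage: stand-in for the ISA's real decode function.
struct RealDecode {
    StaticInstPtr decode(ExtMachInst emi) {
        return new StaticInst{emi}; // leaks in this toy sketch
    }
};

// A generic caching stage wrapping whatever stage comes next. On a
// miss it calls down the stack and updates itself on the way back up,
// which replaces the repeated "update B; update A;" boilerplate.
template <class Next>
struct HashStage {
    Next next;
    std::unordered_map<ExtMachInst, StaticInstPtr> cache;

    StaticInstPtr decode(ExtMachInst emi) {
        auto it = cache.find(emi);
        if (it != cache.end())
            return it->second;               // hit: stop here
        StaticInstPtr si = next.decode(emi); // miss: recurse down
        cache[emi] = si;                     // update on the way up
        return si;
    }
};

// Per-ISA pipelines are then just different compile-time compositions:
using AlphaDecoder = HashStage<RealDecode>;            // one cache level
using X86Decoder   = HashStage<HashStage<RealDecode>>; // two cache levels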
So that's basically the state of things right now. Please feel free to
let me know your thoughts.
Gabe
_______________________________________________
gem5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/gem5-dev