Quoting Steve Reinhardt <[email protected]>:
On Fri, Oct 22, 2010 at 11:14 AM, Gabe Black <[email protected]> wrote:
One complication here is instructions that cross (pick unit)
boundaries. [...]
That's not a fundamental problem, but it would mean that we'd have to
have some extra mechanism to do the aggregation.
I agree, that doesn't strike me as so hard to handle. You can compare
the cached raw machine instruction with the fetch byte stream up to
the point you've fetched, and if they differ before then, you know
it's a miss; if they match up to that point, you know that the
predecoder would have had to fetch more bytes anyway to distinguish
the cached StaticInst from any other possibilities. So there'd be
some restructuring but that's it.
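To sketch the kind of prefix check I mean (the names here, like
DecodeCacheEntry and prefixMatches, are made up for illustration, not
existing code):

#include <cstdint>
#include <cstring>

struct DecodeCacheEntry {
    uint8_t rawBytes[16];   // instruction bytes as originally predecoded
    int length;             // how many of them the predecoder consumed
    // ... cached StaticInst pointer, etc.
};

// Returns true if the cached entry can't be ruled out yet: the fetched
// bytes match as far as they go, so either it's a hit (once fetchedLen
// reaches the cached length) or we'd have had to fetch more bytes anyway
// to tell this instruction apart from any other possibility.
bool
prefixMatches(const DecodeCacheEntry &e, const uint8_t *fetched, int fetchedLen)
{
    int n = fetchedLen < e.length ? fetchedLen : e.length;
    if (std::memcmp(e.rawBytes, fetched, n) != 0)
        return false;       // bytes differ before the end: definite miss
    return true;
}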
I'd have to think about this more, but my instinct is that it might be
messier than it first appears.
A nice property of putting that in the predecoder is that it hides the
special handling from all the other parts of the CPU, etc.
Even for RISC ISAs we should still be grabbing a cache line at a time
from the icache/memory system, IMO. (Do we do this already or not?)
We do in O3 but not in the simple CPUs if I remember correctly. I
don't remember what InOrder does. There are some complications doing
things this way with self-modifying code. I remember one instance at
VMware where someone mentioned that the Intel manuals supposedly said
that between control flow instructions code wouldn't necessarily be
checked for modification, and a recently ex-AMD engineer was surprised
by that. There's apparently some ambiguity in how that sort of thing
is supposed to work.
One downside of putting it in the predecoder, which you've brought up,
is that it means there are two levels of caching and one level just
feeds a lookup in the next. Most of the time there would be a hit, I'd
imagine, and it would be nice to short-circuit that and just go directly
to the StaticInst. I don't see a good way to capture both of these
benefits, and choosing between the performance boost and the
cleaner/simpler/more compartmentalized implementation I'd go with the
latter. If you -do- see a way, please let me know.
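Roughly, the two-level arrangement I'm picturing looks like the
following sketch. The names are made up, and keying the first level on a
fixed-width chunk of raw bytes is an oversimplification for a
variable-length ISA:

#include <cstdint>
#include <memory>
#include <unordered_map>

struct ExtMachInst { uint64_t bits; /* plus contextualizing fields */ };
struct StaticInst { /* decoded instruction */ };

struct PredecodeEntry {
    ExtMachInst emi;                     // what the predecoder produced
    std::shared_ptr<StaticInst> inst;    // optional short circuit past level two
};

// Level one, keyed on raw instruction bytes; level two, keyed on the EMI.
std::unordered_map<uint64_t, PredecodeEntry> predecodeCache;
std::unordered_map<uint64_t, std::shared_ptr<StaticInst>> decodeCache;

std::shared_ptr<StaticInst>
lookup(uint64_t rawBytes)
{
    auto it = predecodeCache.find(rawBytes);
    if (it == predecodeCache.end())
        return nullptr;                  // level-one miss: predecode from scratch
    if (it->second.inst)
        return it->second.inst;          // short circuit: skip the second lookup
    auto dit = decodeCache.find(it->second.emi.bits);
    return dit == decodeCache.end() ? nullptr : dit->second;
}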
Given the potential magnitude of the performance increase, I think
compartmentalizing things by having two versions of the code (one for
fixed-length ISAs and one for variable-length ISAs) and an
ISA-dependent flag to choose between them is good enough. Obviously
more modularity is better, but even if that's the best we can do I
wouldn't throw the idea out on that basis alone. And as I said above,
a decoupled block-at-a-time fetch stage might end up being a good idea
even for the RISC ISAs.
I have the start of an idea floating around in my head of putting this
sort of caching mechanism in the predecoder but then making it
cooperate with the inst cache somehow. Or maybe having an index into
the cache based on either the ExtMachInst or the byte stream. I think
splitting things out would really make things a lot bigger and harder
to understand. Also, having multiple scenarios, and hence sets of
requirements something can run under (be that the CPUs or the ISAs),
makes development harder. We're seeing that with the base update
stuff, I think.
The idea behind having multiple decoders is that rather than have one
decoder that decides every time what mode you're in, there would be
multiple decoders, one for each mode. When the control state changed in
a way that dictated a different decoder be used, the other one would be
switched in. Similarly, for predecoders, information like the default
size of various registers, etc., could be baked in and a variant
selected ahead of time based on the current mode. In both cases the
individual decoder/predecoder could maintain its own separate cache, or
selectively share when possible, and not worry about the contextualizing
info as much. Basically this maintains the one-to-one mapping into a
particular pool of instructions but drops it globally.
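As a rough sketch of what I mean (the class names are made up, not
anything that exists in the tree today):

#include <cstdint>

struct MachInst { uint64_t bits; };
struct StaticInst { /* decoded instruction */ };

class Decoder {
  public:
    virtual ~Decoder() {}
    virtual StaticInst *decode(MachInst) = 0;
};

class LongModeDecoder : public Decoder {
  public:
    // Defaults for 64-bit long mode are baked in here instead of being
    // rediscovered for every instruction that passes through.
    StaticInst *decode(MachInst) override { /* ... */ return nullptr; }
};

class LegacyModeDecoder : public Decoder {
  public:
    StaticInst *decode(MachInst) override { /* ... */ return nullptr; }
};

class DecodeStage {
    LongModeDecoder longMode;
    LegacyModeDecoder legacyMode;
    Decoder *current = &legacyMode;
  public:
    // Called only when the control state actually changes, not per fetch.
    void modeChanged(bool longModeNow) {
        if (longModeNow)
            current = &longMode;
        else
            current = &legacyMode;
    }
    StaticInst *decode(MachInst inst) { return current->decode(inst); }
};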
I guess I understand the idea but I don't really grasp the
significance. If you structure your single decoder like:
decode(ExtMachInst i) {
    if (i.context.inFooMode()) {
        decode_foo(i);
    } else if (i.context.inBarMode()) {
        decode_bar(i);
    } ...
}
or even if you split out the context so the signature is:
decode(DecodeContext ctx, MachInst i)
then what is the distinction between that and having multiple decoders?
For instance, right now the predecoder in x86 computes, for every
single instruction that passes through it, what the operand size,
address size, stack size, and mode should be. Some of that information
may not change in hours of simulation, and at most would change very
infrequently, but it gets rediscovered over and over and over and over
to contextualize ExtMachInsts for the regular decoder. It would be a
huge performance win, I think, considering how often that's called, if
the 64-bit long mode predecoder could be installed and called through a
virtual function that already knew all that stuff and just plunked it
in place with a simple copy. This is the strongest use case, I think.
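Sketching that with made-up names, and with the sizes hard-coded for
illustration rather than computed once from machine state when the
variant is installed:

#include <cstdint>

struct ExtMachInst {
    uint8_t opSize, addrSize, stackSize, mode;
    // ... opcode bytes, prefixes, immediates, etc.
};

class Predecoder {
  public:
    virtual ~Predecoder() {}
    virtual void contextualize(ExtMachInst &emi) = 0;
};

class LongModePredecoder : public Predecoder {
  public:
    void contextualize(ExtMachInst &emi) override {
        // Everything here was decided when this variant was selected,
        // so per instruction it's a simple copy instead of re-deriving
        // the sizes and mode from control registers and segment state.
        emi.opSize = 4;
        emi.addrSize = 8;
        emi.stackSize = 8;
        emi.mode = 2;       // stand-in encoding for 64-bit long mode
    }
};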
Doesn't decode(DecodeContext *this, MachInst i) look like the C
version of DecodeContext::decode(MachInst i)? :-)
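To spell that out with placeholder types:

struct MachInst { unsigned long long bits; };
struct StaticInst;

struct DecodeContext {
    int mode;   // whatever state decoding depends on
    // the C++ spelling
    StaticInst *decode(MachInst) { /* dispatch on mode, etc. */ return nullptr; }
};

// the C spelling: same state, same dispatch, just passed explicitly
StaticInst *decode(DecodeContext *ctx, MachInst i)
{
    return ctx->decode(i);
}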
The regular decoder is less that way since it feeds a lot of that
information into the instructions where they use it, but it does still
have to make an occasional decision about what mode it's in.
Steve
_______________________________________________
m5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/m5-dev