On Fri, Oct 22, 2010 at 11:14 AM, Gabe Black <[email protected]> wrote:
> One complication here is instructions that cross (pick unit)
> boundaries. [...]
> That's not a fundamental problem, but it would mean that we'd have to
> have some extra mechanism to do the aggregation.
I agree, that doesn't strike me as so hard to handle. You can compare
the cached raw machine instruction with the fetch byte stream up to
the point you've fetched, and if they differ before then, you know
it's a miss; if they match up to that point, you know that the
predecoder would have had to fetch more bytes anyway to distinguish
the cached StaticInst from any other possibilities. So there'd be
some restructuring but that's it.
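The prefix comparison I'm describing could look something like this (a minimal sketch; the CachedInst structure and the field names are made up for illustration, not gem5 code):

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical cache entry: the raw bytes of a previously predecoded
// instruction and how many of them the predecoder consumed.
struct CachedInst {
    uint8_t bytes[16];
    int length;  // bytes the predecoder consumed for this instruction
};

// Compare the cached raw machine instruction against the bytes fetched
// so far.  If they differ within the fetched prefix, it's definitely a
// miss.  If they match, we can't rule the cached StaticInst out yet:
// the predecoder would have needed at least these bytes anyway to
// distinguish it from any other possibility, so we keep fetching until
// we've covered the cached length.
bool prefixMatches(const CachedInst &c, const uint8_t *fetched, int nFetched)
{
    int n = nFetched < c.length ? nFetched : c.length;
    return std::memcmp(c.bytes, fetched, n) == 0;
}
```

A false return here is an immediate miss; a true return on a partial prefix just means "fetch more and re-check," which is where the extra aggregation mechanism would come in.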
> A nice property of putting that in the predecoder is that it hides the
> special handling from all the other parts of the CPU, etc.
Even for RISC ISAs we should still be grabbing a cache line at a time
from the icache/memory system, IMO. (Do we do this already or not?)
> One downside of putting it in the predecoder, which you've brought up,
> is that it means there are two levels of caching and one level just
> feeds a lookup in the next. Most of the time there would be a hit, I'd
> imagine, and it would be nice to short-circuit that and just go directly
> to the StaticInst. I don't see a good way to capture both of these
> benefits, and choosing between the performance boost and the
> cleaner/simpler/more compartmentalized implementation I'd go with the
> latter. If you -do- see a way, please let me know.
Given the potential magnitude of the performance increase, I think
compartmentalizing things by having two versions of the code (one for
fixed-length ISAs and one for variable-length ISAs) and an
ISA-dependent flag to choose between them is good enough. Obviously
more modularity is better, but even if that's the best we can do I
wouldn't throw the idea out on that basis alone. And as I said above,
a decoupled block-at-a-time fetch stage might end up being a good idea
even for the RISC ISAs.
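To make that concrete, the split could be as simple as a per-ISA flag consulted once to pick a fetch/decode path, rather than a decision made per instruction (names here are illustrative, not existing gem5 classes):

```cpp
// Hypothetical per-ISA trait: fixed-length ISAs could bypass the
// predecoder's byte-stream cache and map raw instructions straight to
// StaticInsts; variable-length ISAs would take the predecoder path.
struct IsaTraits {
    bool fixedLength;  // set per ISA at build or configure time
};

enum class FetchPath { Direct, Predecode };

// Chosen once from the ISA flag, not re-decided per instruction.
FetchPath choosePath(const IsaTraits &isa)
{
    return isa.fixedLength ? FetchPath::Direct : FetchPath::Predecode;
}
```

That keeps the two implementations separate without forcing the RISC path to pay for the variable-length machinery.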
>
> The idea behind having multiple decoders is that rather than have one
> decoder that decides every time what mode you're in, there would be
> multiple decoders, one for each mode. When the control state changed in
> a way that dictated a different decoder be used, the other one would be
> switched in. Similarly, for predecoders, information like the default
> size of various registers, etc., could be baked in and a variant
> selected ahead of time based on the current mode. In both cases the
> individual decoder/predecoder could maintain its own separate cache or
> selectively share when possible and not worry about the contextualizing
> info as much. Basically this maintains the one-to-one mapping into a
> particular pool of instructions but drops it globally.
I guess I understand the idea but I don't really grasp the
significance. If you structure your single decoder like:
decode(ExtMachInst i) {
    if (i.context.inFooMode()) {
        decode_foo(i);
    } else if (i.context.inBarMode()) {
        decode_bar(i);
    } ...
}
or even if you split out the context so the signature is:
decode(DecodeContext ctx, MachInst i)
then what is the distinction between that and having multiple decoders?
Steve
_______________________________________________
m5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/m5-dev