Re: [m5-dev] X86 performance

Gabe Black Fri, 22 Oct 2010 11:14:25 -0700

Steve Reinhardt wrote:
> So I've always felt that for x86 we should move to having the raw
> instruction bytes and a length field in the StaticInst so that we can
> check for decode page cache hits without going through the predecoder.
>  Wouldn't this solve both of your problems, since in the hit case you
> would neither call the predecoder to generate an ExtMachInst nor need
> to compare ExtMachInsts to see if you have a hit?
>
> I agree this requires some different handling of decode context info,
> since you'll need to compare that along with the raw bytes.  I didn't
> completely follow your earlier argument about having multiple decoders
> and why that's better than what we have now, but maybe we should get
> back into that.
>
> Steve
> _______________________________________________
> m5-dev mailing list
> [email protected]
> http://m5sim.org/mailman/listinfo/m5-dev
>



One complication here is instructions that cross (pick unit)
boundaries. One of the functions of the predecoder is to collect these
instructions as you go along and aggregate them into one blob of bytes.
X86 instructions can theoretically be up to 15 bytes, but with some of
the new encodings they're using even that might have to be bumped up.
That's not a fundamental problem, but it would mean that we'd have to
have some extra mechanism to do the aggregation. I was actually thinking
we'd need something like that for a predecoder cache too, and then
something to feed the possibly partially collected blob of bytes into
the predecoder when there's a miss.
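
To make the aggregation idea concrete, here is a minimal sketch of a buffer that collects instruction bytes across fetch chunks. Everything in it (ByteBlob, append, MaxInstBytes) is a hypothetical name for illustration, not M5's actual predecoder interface, and it assumes the 15 byte limit mentioned above. Deciding where an instruction ends is still the predecoder's job and is not shown.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch; none of these names come from the M5 source.
constexpr std::size_t MaxInstBytes = 15;  // assumed x86 length limit

struct ByteBlob {
    std::array<std::uint8_t, MaxInstBytes> bytes{};
    std::size_t len = 0;

    // Copy up to n bytes from a fetch chunk into the blob.
    // Returns how many bytes were actually taken, so the caller
    // knows where the next instruction starts in the chunk.
    std::size_t append(const std::uint8_t *src, std::size_t n) {
        std::size_t taken = 0;
        while (taken < n && len < MaxInstBytes)
            bytes[len++] = src[taken++];
        return taken;
    }

    // Start over once a complete instruction has been handed off.
    void clear() { len = 0; }
};
```

An instruction that straddles a boundary would show up as two append() calls, one per chunk, before the blob is either looked up in a cache or fed to the predecoder on a miss.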

A nice property of putting that in the predecoder is that it hides the
special handling from all the other parts of the CPU, etc. This sort of
mechanism would be pure overhead in ISAs like SPARC and Alpha, but it
would just go away there, since they'd use a different predecoder. Also,
the predecoder
would know what contextualizing state was in effect, so it could manage
that part of things as well.

One downside of putting it in the predecoder, which you've brought up,
is that there are then two levels of caching, and one level just
feeds a lookup in the next. Most of the time there would be a hit, I'd
imagine, and it would be nice to short circuit that and just go directly
to the StaticInst. I don't see a good way to capture both of these
benefits, and choosing between the performance boost and the
cleaner/simpler/more compartmentalized implementation, I'd go with the
latter. If you -do- see a way, please let me know.
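
The tradeoff might look roughly like the sketch below, which contrasts a two-level lookup (bytes to predecoded form, then predecoded form to StaticInst) with a direct one. All of the types here are simplified stand-ins invented for illustration (StaticInstSketch, ExtMachInstSketch as a plain integer, Context as a single byte); the real ExtMachInst and StaticInst are much richer, and std::map is used only to keep the example short.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-ins, not the real M5 types.
struct StaticInstSketch { std::string mnemonic; };
using StaticInstPtr = std::shared_ptr<StaticInstSketch>;
using RawBytes = std::string;             // aggregated instruction bytes
using ExtMachInstSketch = std::uint64_t;  // stand-in for the predecoded form
using Context = std::uint8_t;             // stand-in for mode, default sizes

// Two levels: the first cache only produces a key for the second.
struct TwoLevel {
    std::map<std::pair<Context, RawBytes>, ExtMachInstSketch> predecodeCache;
    std::map<ExtMachInstSketch, StaticInstPtr> decodeCache;

    StaticInstPtr lookup(Context ctx, const RawBytes &bytes) const {
        auto p = predecodeCache.find(std::make_pair(ctx, bytes));
        if (p == predecodeCache.end())
            return nullptr;               // miss: run the predecoder
        auto d = decodeCache.find(p->second);
        return d == decodeCache.end() ? nullptr : d->second;
    }
};

// One level: context plus raw bytes map straight to the StaticInst.
struct Direct {
    std::map<std::pair<Context, RawBytes>, StaticInstPtr> cache;

    StaticInstPtr lookup(Context ctx, const RawBytes &bytes) const {
        auto it = cache.find(std::make_pair(ctx, bytes));
        return it == cache.end() ? nullptr : it->second;
    }
};
```

The direct version saves a lookup on every hit, but the cache then has to know about raw bytes and context, which is exactly the special handling the two-level version keeps inside the predecoder.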

The idea behind having multiple decoders is that rather than have one
decoder that decides every time what mode you're in, there would be
multiple decoders, one for each mode. When the control state changed in
a way that called for a different decoder, that decoder would be
switched in. Similarly, for predecoders, information like the default
size of various registers, etc., could be baked in and a variant selected
ahead of time based on the current mode. In both cases the individual
decoder/predecoder could maintain its own separate cache, or selectively
share when possible, and not worry about the contextualizing info as
much. Basically this maintains the one-to-one mapping into a particular
pool of instructions but drops it globally.
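
As a rough illustration of the per-mode arrangement, the sketch below keeps one decoder per mode, each with its own cache keyed only by raw bytes, and selects the active one when the mode changes. The mode list, the class names, and the placeholder "decode" (which just builds a string) are all assumptions made for the example and do not reflect how M5 is actually structured.

```cpp
#include <array>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical mode list for illustration only.
enum class Mode : std::size_t { Real = 0, Protected = 1, Long64 = 2, Count = 3 };

// One decoder per mode. Its cache needs no context in the key,
// because the mode is implied by which decoder is in use.
struct ModeDecoder {
    std::map<std::string, std::string> cache;  // raw bytes -> decoded form
    std::size_t misses = 0;

    const std::string &decode(const std::string &bytes) {
        auto it = cache.find(bytes);
        if (it == cache.end()) {
            ++misses;
            // Placeholder for real decoding work.
            it = cache.emplace(bytes, "inst:" + bytes).first;
        }
        return it->second;
    }
};

class DecoderSet {
    std::array<ModeDecoder, static_cast<std::size_t>(Mode::Count)> decoders;
    std::size_t currentIdx = 0;

  public:
    // Called when control state changes; no per-instruction mode check.
    void setMode(Mode m) { currentIdx = static_cast<std::size_t>(m); }
    ModeDecoder &active() { return decoders[currentIdx]; }
};
```

The same byte sequence can live in several pools at once with different meanings, while each pool on its own still maps bytes to exactly one instruction. Sharing entries between pools where the decoding is mode-independent is left out of the sketch.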

Gabe
