Quoting Steve Reinhardt <[email protected]>:

On Tue, Jul 13, 2010 at 11:20 AM, Gabe Black <[email protected]> wrote:
I can't say it was -the- reason, but one reason is that the TLBs as is
don't actually send the packets for the CPU, so they can't split
anything into multiple transactions easily. I'm intrigued by the idea of
putting the TLB behind a port or port like interface, maybe even
exporting the TLB outside of the CPU's guts and putting it inline with
external accesses.

I see from your subsequent email that you've already thought of some
drawbacks to this... I agree it's nice in the common case, but has the
problem that it constrains the pipeline design perhaps more than you
really want to.  We're probably better off finding a way to embed two
physical addresses in a Request.

I'd rather not pollute the Request objects with this stuff. What would happen if (and I'm not saying we'd want to) we decide we need to support accesses split into three pieces? Then we'd have all these request objects with three addresses in them when 99% of the time they only needed one. I like the idea of the separation happening before the packets/requests are sent out with just enough baggage attached to put it all back together when the pieces come back.
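
To make the "separate first, reassemble later" idea concrete, here is a minimal sketch, not the actual m5 code. Piece, splitAccess and SplitTracker are hypothetical names, and the split boundary is assumed to be a power-of-two block size.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical description of one piece of a split access.
struct Piece {
    uint64_t vaddr;   // start address of this piece
    unsigned size;    // bytes covered by this piece
    unsigned offset;  // where this piece's data sits in the original access
};

// Split [vaddr, vaddr + size) at every multiple of blockSize.
// blockSize is assumed to be a power of two.
std::vector<Piece> splitAccess(uint64_t vaddr, unsigned size,
                               unsigned blockSize)
{
    std::vector<Piece> pieces;
    unsigned done = 0;
    while (done < size) {
        uint64_t cur = vaddr + done;
        // Bytes left before the next block boundary.
        unsigned room =
            blockSize - static_cast<unsigned>(cur & (blockSize - 1));
        unsigned len = (size - done < room) ? size - done : room;
        pieces.push_back({cur, len, done});
        done += len;
    }
    return pieces;
}

// The "baggage": shared by all pieces so the sender can tell when
// the whole access has come back.
struct SplitTracker {
    unsigned outstanding;
    explicit SplitTracker(unsigned n) : outstanding(n) {}
    // Call once per returning piece; true when the last one arrives.
    bool pieceDone() { return --outstanding == 0; }
};
```

Nothing here limits the number of pieces to two, so a three-way split would cost nothing extra in the common single-address case.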


There are three problems with that, though. First,
the TLB would likely need some alternative way to pass a fault back to
the CPU. Maybe the request would have a fault pointer field?

Adding a field to contain a fault code seems pretty simple.

Second, the
TLB is the thing that recognizes when an access is to memory mapped
control state within the CPU. It would need a way to communicate with
the CPU to get/set those values.

Or better yet just to communicate back to the CPU that it needs to
access its internal state.  Is it possible to remap this memory-mapped
state to virtual addresses?  If not, we could even move that check out
of the TLB and into the CPU (not saying that's the best thing, just
that it would be a possibility).

There are accesses like wrmsr and rdmsr in x86 that know they're going for internal state with an address-like index and purposefully flag the virtual address as such, but then there are also regions of the per-CPU physical address space, like the local APIC page, that pretty much need to be checked after translation. You could do this in two passes, one before and one after translation, but that's a less flexible approach and seems more cumbersome compared to doing it in the TLB, where all the information is available at once.

I think this problem basically goes away, though, if the TLB isn't the last step before memory. The request object is marked by a flag that says it's for memory mapped state (we're missing a p in there in some places, I think), and then the CPU knows to handle it specially instead of actually accessing memory. This does contribute to some of the complexity in the memory chopping up and gluing back together code, but no better mechanism jumps to mind right away.
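
A rough sketch of that flag-and-dispatch arrangement follows. Req, MMAPPED_IPR, finalizePhysical and route are stand-in names assumed for illustration, not the real m5 classes; 0xFEE00000 is used as the default x86 local APIC base.

```cpp
#include <cstdint>

// Hypothetical request flag: the access targets CPU-internal,
// memory mapped state rather than real memory.
enum ReqFlags : uint32_t {
    MMAPPED_IPR = 1u << 0,
};

// Hypothetical, stripped-down request.
struct Req {
    uint64_t paddr = 0;
    uint32_t flags = 0;
};

constexpr uint64_t ApicBase = 0xFEE00000ull;  // default local APIC page
constexpr uint64_t ApicSize = 0x1000;

// Post-translation check done in the TLB: it only tags the request,
// it does not touch any CPU state itself.
void finalizePhysical(Req &req)
{
    if (req.paddr >= ApicBase && req.paddr < ApicBase + ApicSize)
        req.flags |= MMAPPED_IPR;
}

enum class Route { Memory, Internal };

// The CPU looks at the flag and decides where the access goes.
Route route(const Req &req)
{
    return (req.flags & MMAPPED_IPR) ? Route::Internal : Route::Memory;
}
```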


Third, the control state that actually
-runs- the TLB is maintained by the CPU, namely what mode it's in, etc.

I think you're on to something with the discussion below...

This also brings up another idea I've been rolling around for a while.
Why is all the control state local to the miscregfile/its descendant, the
ISA object? Why don't we put control state that matters to the TLB, or
at least a copy of it, in the TLB itself and then communicate it back
and forth as necessary? That would be easier to code (or at least I'm
guessing) since you'd just have the state right there, faster since it
avoids calling out for it, and would more conceptually match real
hardware where all the control state isn't put in one huge blob
someplace.
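
As a sketch of what keeping a copy in the TLB might look like (all names here are hypothetical, and the translation itself is a toy stand-in for a real page table lookup):

```cpp
#include <cstdint>

// Hypothetical subset of control state that translation depends on.
struct TlbControlState {
    bool pagingEnabled = false;
    bool longMode = false;
};

class SketchTlb {
    TlbControlState ctrl;  // local copy; no call-out to the CPU per access
  public:
    // Called by the CPU whenever the relevant control state changes.
    void updateControl(const TlbControlState &c) { ctrl = c; }

    // Toy translation: identity when paging is off, otherwise a fixed
    // offset standing in for a real lookup.
    uint64_t translate(uint64_t vaddr) const
    {
        return ctrl.pagingEnabled ? vaddr + 0x100000 : vaddr;
    }
};

class SketchCpu {
    TlbControlState ctrl;  // architectural copy owned by the CPU/ISA object
    SketchTlb &tlb;
  public:
    explicit SketchCpu(SketchTlb &t) : tlb(t) {}

    // A control register write pushes the new value to the TLB's copy.
    void setPaging(bool on)
    {
        ctrl.pagingEnabled = on;
        tlb.updateControl(ctrl);
    }
};
```

The cost of this arrangement is that every write path to the shared state has to remember to push the update, which is where checkpointing and similar code would need care.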

When I discovered that your x86 implementation has 200+ miscregs I
began to think that there was a problem here :-).  I agree that
finding a way to spread it out makes sense.  Just putting the
indirection in readMiscRegs/writeMiscRegs would be one way to do it, I
guess, but it would be nice to clean things up further to avoid this
giant linear index space (like you were alluding to in a previous
email).

Yeah, x86 just has gobs and gobs of control state. A significant portion of that is what's called MSRs, or model specific registers, although a lot of those are specified in the architecture manual and one, the EFER, is required to enter 64 bit mode, so they aren't necessarily all that model specific. Some of those, like the MTRRs or memory type range registers, control whether regions of memory are cacheable, etc., so those could probably go in the TLBs. Unfortunately these are accessed with a 32 bit index, so it's not clear we could chop them up into different register files easily.

The thing I was getting at before would be more for situations where you'd have separately indexable register spaces like the x87/mmx 80 bit/64 bit registers (those overlap) vs. the 128 bit/256 bit XMM/YMM registers (those also overlap) vs. the integer GPRs vs. the pseudo integer control registers vs. the MSRs vs. the artificially numbered non-MSR control state vs. the segmentation related registers vs. the control registers (CRn) vs. the debug registers (DRn) vs. the performance counter registers. It would be great not to have these all artificially squished together into only three groups, but more importantly not squished into only one and then possibly ambiguously reseparated. One is a little ugly, the other has frequently been the source of bugs.

I'm imagining a utopia where you'd specify that both the YMM and MMX register files were floating point, the GPRs and the pseudo integer control registers were integer, the various other control register files were integer but couldn't be written non-speculatively or be renamed, that certain register groups had side effects when written but not when read, blah blah blah. They could each have their own register disambiguation function so you wouldn't have to do so much work to figure out that the condition codes don't do anything interesting but the GPRs might. Basically I'm hoping for a richer and more flexible system for describing the register architecture of an ISA than putting everything into one of three predefined buckets. It's not bad the way it is (except the squish everything down to one index space thing), especially since we got rid of the somewhat sketchy situation with the ISA defined integer and floating point register files, but I think it could be improved so it fits the ISAs a little more naturally.
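
One possible shape for such a description, purely as a sketch: RegClassDesc, RegRef and RegArchDesc are hypothetical names, and the property set is just the handful mentioned above.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical per-class description supplied by the ISA.
struct RegClassDesc {
    std::string name;
    unsigned numRegs;
    bool floatingPoint;     // treated as FP by the CPU model
    bool renameable;        // may be renamed by an out-of-order model
    bool writeSideEffects;  // writes do more than store a value
};

// A register is named by (class, index) rather than one flat index.
struct RegRef {
    std::size_t cls;
    unsigned idx;
};

class RegArchDesc {
    std::vector<RegClassDesc> classes;
  public:
    // Register a class and return its handle.
    std::size_t addClass(RegClassDesc d)
    {
        classes.push_back(std::move(d));
        return classes.size() - 1;
    }

    bool valid(RegRef r) const
    {
        return r.cls < classes.size() && r.idx < classes[r.cls].numRegs;
    }

    // A CPU model could use this to decide when it has to be careful.
    bool mustSerialize(RegRef r) const
    {
        const RegClassDesc &c = classes[r.cls];
        return !c.renameable || c.writeSideEffects;
    }
};
```

Because the class travels with the index, there is no single squished index space to reseparate later.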

It would also be nice, but probably too hard, to be able to store values other than integers or floating point numbers in the register files. One common optimization for x86 simulators, I'm told, is to put off calculating flags until the last minute. For us to do that we'd need to keep around all the information needed to actually compute the flags. I've always imagined just keeping the StaticInst pointer around in a "register" and calling a computeFlags function on it when needed. There are problems with this, like checkpoints and general complication, and I'm not 100% convinced it would actually make enough of a difference (or maybe any, after the overhead) to be worthwhile.
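
For reference, the general lazy-flags technique looks roughly like the sketch below. It records the operation and operands instead of a StaticInst pointer, so it is a simplified variation on the idea rather than the scheme described above; LazyFlags and its members are hypothetical names.

```cpp
#include <cstdint>

// Remember what the last flag-setting operation was and derive the
// individual flags only when something actually reads them.
class LazyFlags {
  public:
    enum class Op { None, Add, Sub };

    // Called by every flag-setting instruction; cheap, no flag math.
    void record(Op o, uint32_t lhs, uint32_t rhs)
    {
        op = o;
        a = lhs;
        b = rhs;
    }

    // Computed on demand from the recorded operation.
    bool zero() const { return op != Op::None && result() == 0; }

    bool carry() const
    {
        switch (op) {
          case Op::Add: return result() < a;  // unsigned wraparound
          case Op::Sub: return a < b;         // borrow
          default: return false;
        }
    }

  private:
    uint32_t result() const
    {
        switch (op) {
          case Op::Add: return a + b;
          case Op::Sub: return a - b;
          default: return 0;
        }
    }

    Op op = Op::None;
    uint32_t a = 0;
    uint32_t b = 0;
};
```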



The same thing could be done for other structures like the
interrupt controller, and maybe the decoder and/or predecoder. Speaking
of the decoder, it would be nice to make that a little stateful as well.
As it is in, say, ARM, the decoder has to rediscover what mode it's in
over and over. I'm guessing it would be better to explicitly switch its
state (or it entirely) when changing modes instead, although that might
add a fair amount of complexity. Perhaps the decoder should be an object
instead of a bare function? I'm less sure how that would work. It could,
hypothetically, allow us to return the two PC bits commandeered to
signal the mode.

The predecoder is already stateful, right?  I'm not so convinced about
the decoder; you still need a way to externalize the state that
influences the decode process to allow the decode cache to work.  But
it seems like you could easily build a stateful decoder if you wanted
by calling the stateless decode function via an object that contains
the additional state.
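
A minimal sketch of that wrapping, with hypothetical names (InstSet, decodeStateless, StatefulDecoder) and a string standing in for the decoded instruction:

```cpp
#include <cstdint>
#include <string>

enum class InstSet { Arm, Thumb };

// Stateless decode: everything that influences the result is an
// explicit argument, so a decode cache can key on (inst, set).
std::string decodeStateless(uint32_t machInst, InstSet set)
{
    return (set == InstSet::Thumb ? "thumb:" : "arm:") +
           std::to_string(machInst & 0xff);
}

// Stateful wrapper: remembers the current mode so callers do not
// have to rediscover and pass it on every decode.
class StatefulDecoder {
    InstSet set = InstSet::Arm;
  public:
    void setInstSet(InstSet s) { set = s; }
    std::string decode(uint32_t machInst) const
    {
        return decodeStateless(machInst, set);
    }
};
```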

Yes, the predecoder is sort of stateful. It's stateful in that it keeps state, but no attempt was made (yet, maybe) to make it work with mispredicts, for instance. The idea was that it could speculatively update state to keep the instructions flowing without having to stall every time the decoding context changed, which in ARM is potentially very often. This is a hard and related problem, but is a little different.

Maybe the decode cache should be instantiated statically by the decode object? If you're, say, decoding 64 bit instructions, there's no reason to have a bunch of 32 bit instructions in the cache getting in the way. The decode object could instantiate a cache for each decoding "mode", leave out the contextualizing state, and just start with the right batch of instructions. To get sharing in multi CPU or multi core simulations they'd be static so all decode objects would have access to the same cache per mode. For something as heavyweight as x86's various mode changes there could be a stall to update the decoder mode, but then again in ARM, where every add instruction might switch to thumb mode (correct me if that's a mischaracterization, Ali), you can't introduce all those stalls and get realistic performance.
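
Roughly, the per-mode static cache could look like this sketch. CachingDecoder, DecodeMode and the miss counter are hypothetical, and a string again stands in for the decoded instruction.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <unordered_map>

enum class DecodeMode { Bits32, Bits64 };

class CachingDecoder {
    using Cache = std::unordered_map<uint32_t, std::string>;

    // One cache per mode, shared by every decoder object. The mode
    // is not part of the key inside a cache.
    static Cache &cacheFor(DecodeMode m)
    {
        static std::map<DecodeMode, Cache> caches;
        return caches[m];
    }

    DecodeMode mode;

  public:
    explicit CachingDecoder(DecodeMode m) : mode(m) {}

    // Shared miss counter, only here to make the sharing observable.
    static unsigned &missCount()
    {
        static unsigned n = 0;
        return n;
    }

    // Switching modes just selects a different cache.
    void setMode(DecodeMode m) { mode = m; }

    const std::string &decode(uint32_t machInst)
    {
        Cache &cache = cacheFor(mode);
        auto it = cache.find(machInst);
        if (it != cache.end())
            return it->second;
        ++missCount();
        std::string decoded =
            (mode == DecodeMode::Bits64 ? "64:" : "32:") +
            std::to_string(machInst);
        return cache.emplace(machInst, decoded).first->second;
    }
};
```

With this layout a second core decoding the same instruction in the same mode hits the shared cache, while the same bits in another mode get their own entry.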


Steve
_______________________________________________
m5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/m5-dev


