Re: [m5-dev] cleaning up TimingSimpleCPU

Gabriel Michael Black Tue, 13 Jul 2010 17:47:14 -0700

Quoting Steve Reinhardt <[email protected]>:

On Tue, Jul 13, 2010 at 11:20 AM, Gabe Black <[email protected]> wrote:

I can't say it was -the- reason, but one reason is that the TLBs as is
don't actually send the packets for the CPU, so they can't split
anything into multiple transactions easily. I'm intrigued by the idea of
putting the TLB behind a port or port like interface, maybe even
exporting the TLB outside of the CPU's guts and putting it inline with
external accesses.


I see from your subsequent email that you've already thought of some
drawbacks to this... I agree it's nice in the common case, but has the
problem that it constrains the pipeline design perhaps more than you
really want to.  We're probably better off finding a way to embed two
physical addresses in a Request.

I'd rather not pollute the Request objects with this stuff. What would happen if (and I'm not saying we'd want to) we decide we need to support accesses split into three pieces? Then we'd have all these request objects with three addresses in them when 99% of the time they only needed one. I like the idea of the separation happening before the packets/requests are sent out with just enough baggage attached to put it all back together when the pieces come back.
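
A minimal sketch of that "split first, carry just enough baggage" idea, assuming a 4 KiB page and byte-granular accesses. Every name here (SplitState, Fragment, splitAccess, completeFragment) is hypothetical and is not an existing M5 class or function; the point is only that the Request-like object keeps a single address while the split bookkeeping lives in a small shared side structure.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical types for illustration only; not M5's Request/Packet.
using Addr = std::uint64_t;
constexpr Addr PageBytes = 4096;  // assumed page size

// Bookkeeping shared by all fragments of one original access.
struct SplitState {
    unsigned outstanding = 0;        // fragments still in flight
    std::vector<std::uint8_t> data;  // buffer the pieces are glued into
};

// One piece of the access: a single address plus a pointer to the baggage.
struct Fragment {
    Addr vaddr;
    unsigned size;
    unsigned offset;  // where this piece lands in the whole access
    std::shared_ptr<SplitState> parent;
};

// Chop [vaddr, vaddr + size) at page boundaries. Works for one, two,
// three, or more pieces without changing the Fragment layout.
std::vector<Fragment> splitAccess(Addr vaddr, unsigned size)
{
    auto state = std::make_shared<SplitState>();
    state->data.resize(size);
    std::vector<Fragment> frags;
    unsigned done = 0;
    while (done < size) {
        Addr cur = vaddr + done;
        Addr pageEnd = (cur / PageBytes + 1) * PageBytes;
        unsigned chunk = static_cast<unsigned>(
            std::min<Addr>(pageEnd - cur, size - done));
        frags.push_back({cur, chunk, done, state});
        done += chunk;
    }
    state->outstanding = static_cast<unsigned>(frags.size());
    return frags;
}

// Called when a fragment's response comes back, in any order.
// Returns true once the whole original access has been reassembled.
bool completeFragment(const Fragment &f, const std::uint8_t *bytes)
{
    std::copy(bytes, bytes + f.size, f.parent->data.begin() + f.offset);
    assert(f.parent->outstanding > 0);
    return --f.parent->outstanding == 0;
}
```

With this shape, an 8-byte access at address 4092 yields two 4-byte fragments, and an unsplit access pays only for one Fragment and one small SplitState.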

There are three problems with that, though. First,
the TLB would likely need some alternative way to pass a fault back to
the CPU. Maybe the request would have a fault pointer field?


Adding a field to contain a fault code seems pretty simple.

Second, the
TLB is the thing that recognizes when an access is to memory mapped
control state within the CPU. It would need a way to communicate with
the CPU to get/set those values.


Or better yet just to communicate back to the CPU that it needs to
access its internal state.  Is it possible to remap this memory-mapped
state to virtual addresses?  If not, we could even move that check out
of the TLB and into the CPU (not saying that's the best thing, just
that it would be a possibility).

There are accesses like wrmsr and rdmsr in x86 that know they're going for internal state with an address-like index and purposefully flag the virtual address as such, but then there are also regions of the per CPU physical address space like the local APIC page that pretty much need to be checked after translation. You could do this in two passes, one before and one after translation, but that's a less flexible approach and seems more cumbersome compared to doing it in the TLB where all information is available at once.

I think this problem basically goes away, though, if the TLB isn't the last step before memory. The request object is marked by a flag that says it's for memory mapped state (we're missing a p in there in some places, I think), and then the CPU knows to handle it specially instead of actually accessing memory. This does contribute to some of the complexity in the memory chopping up and gluing back together code, but no better mechanism jumps to mind right away.
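
A rough sketch of how the flag and the fault field discussed above could ride along on the request, so the CPU decides what to do after translation. All names (Request, ToyTlb, Route, routeAccess, mmappedIpr) and the toy page-table layout are assumptions made for illustration; this is not M5's actual Request or TLB interface, and the fault is reduced to an enum where the real thing would be a fault pointer.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical stand-ins; not M5's Request, Fault, or TLB classes.
using Addr = std::uint64_t;
constexpr Addr PageMask = ~Addr(0xFFF);
constexpr Addr ApicPage = 0xFEE00000;  // example "internal state" page

enum class Fault { None, PageFault };  // stand-in for a fault pointer

struct Request {
    Addr vaddr = 0;
    Addr paddr = 0;
    bool mmappedIpr = false;    // set by the TLB: targets CPU-internal state
    Fault fault = Fault::None;  // set by the TLB when translation fails
};

// Toy TLB: a map from virtual page to physical page. It annotates the
// request instead of calling back into the CPU.
struct ToyTlb {
    std::map<Addr, Addr> pages;

    void translate(Request &req) const
    {
        auto it = pages.find(req.vaddr & PageMask);
        if (it == pages.end()) {
            req.fault = Fault::PageFault;
            return;
        }
        req.paddr = it->second | (req.vaddr & ~PageMask);
        // The check happens after translation, when both addresses are known.
        req.mmappedIpr = (it->second == ApicPage);
    }
};

enum class Route { Memory, InternalState, RaiseFault };

// What the CPU does with a translated request.
Route routeAccess(const Request &req)
{
    if (req.fault != Fault::None)
        return Route::RaiseFault;
    return req.mmappedIpr ? Route::InternalState : Route::Memory;
}
```

The TLB never touches CPU state here; it only marks the request, and the CPU picks the memory path, the internal-state path, or the fault path.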

Third, the control state that actually
-runs- the TLB is maintained by the CPU, namely what mode it's in, etc.


I think you're on to something with the discussion below...

This also brings up another idea I've been rolling around for a while.
Why is all the control state local to the miscregfile/its descendant the
ISA object? Why don't we put control state that matters to the TLB, or
at least a copy of it, in the TLB itself and then communicate it back
and forth as necessary? That would be easier to code (or at least I'm
guessing) since you'd just have the state right there, faster since it
avoids calling out for it, and would more conceptually match real
hardware where all the control state isn't put in one huge blob
someplace.


When I discovered that your x86 implementation has 200+ miscregs I
began to think that there was a problem here :-).  I agree that
finding a way to spread it out makes sense.  Just putting the
indirection in readMiscRegs/writeMiscRegs would be one way to do it, I
guess, but it would be nice to clean things up further to avoid this
giant linear index space (like you were alluding to in a previous
email).

Yeah, x86 just has gobs and gobs of control state. A significant portion of that is what are called MSRs, or model specific registers, although a lot of those are specified in the architecture manual and one, the EFER, is required to enter 64 bit mode, so they aren't necessarily all that model specific. Some of those, like the MTRRs or memory type range registers, control whether regions of memory are cacheable, etc., so those could probably go in the TLBs. Unfortunately these are accessed with a 32 bit index, so it's not clear we could chop them up into different register files easily.

The thing I was getting at before would be more for situations where you'd have separately indexable register spaces like the x87/mmx 80 bit/64 bit registers (those overlap) vs. the 128 bit/256 bit XMM/YMM registers (those also overlap) vs. the integer GPRs vs. the pseudo integer control registers vs. the MSRs vs. the artificially numbered non-MSR control state, vs. the segmentation related registers vs. the control registers (CRn) vs. the debug registers (DRn) vs. the performance counter registers. It would be great not to have these all artificially squished together into only three groups, but more importantly not squished into only one and then possibly ambiguously reseparated. One is a little ugly, the other has frequently been the source of bugs.

I'm imagining a utopia where you'd specify that both the YMM and MMX register files were floating point, the GPRs and the pseudo integer control registers were integer, the various other control register files were integer but couldn't be written non-speculatively or be renamed, that certain register groups had side effects when written but not when read, blah blah blah. They could each have their own register disambiguation function so you wouldn't have to do so much work to figure out that the condition codes don't do anything interesting but the GPRs might. Basically I'm hoping for a richer and more flexible system for describing the register architecture of an ISA than putting everything into one of three predefined buckets. It's not bad the way it is (except the squish everything down to one index space thing), especially since we got rid of the somewhat sketchy situation with the ISA defined integer and floating point register files, but I think it could be improved so it fits the ISAs a little more naturally.
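
One way such a description could look, as a sketch only: each register file is declared with its own properties and registers are named by a (class, index) pair rather than one flat index. RegClassDesc, RegId, RegArch, and mustSerialize are hypothetical names, and the property set and the rule in mustSerialize are assumptions chosen to mirror the attributes listed above; nothing like this is claimed to exist in M5.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical description of an ISA's register architecture.
enum class RegKind { Integer, FloatingPoint };

struct RegClassDesc {
    std::string name;
    RegKind kind;
    unsigned count;          // registers in this file
    bool renameable;         // may a detailed CPU model rename these?
    bool sideEffectOnWrite;  // does writing do more than store a value?
    bool sideEffectOnRead;
};

// A register is named by its file and its index within that file,
// so no file has to be squeezed into a shared linear index space.
struct RegId {
    unsigned classIdx;
    unsigned index;
};

class RegArch {
  public:
    // Registers a new file and returns its class index.
    unsigned addClass(RegClassDesc d)
    {
        classes.push_back(std::move(d));
        return static_cast<unsigned>(classes.size() - 1);
    }

    const RegClassDesc &desc(RegId id) const
    {
        assert(id.classIdx < classes.size());
        assert(id.index < classes[id.classIdx].count);
        return classes[id.classIdx];
    }

    // Assumed policy for illustration: anything that cannot be renamed
    // or has write side effects is handled non-speculatively.
    bool mustSerialize(RegId id) const
    {
        const RegClassDesc &d = desc(id);
        return !d.renameable || d.sideEffectOnWrite;
    }

    std::size_t numClasses() const { return classes.size(); }

  private:
    std::vector<RegClassDesc> classes;
};
```

An ISA would then declare as many files as it naturally has (GPRs, x87/MMX, XMM/YMM, CRn, DRn, and so on) instead of fitting them into three fixed buckets.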

It would also be nice but probably too hard to be able to store non-integral or floating point values in the register files. One common optimization for x86 simulators, I'm told, is to put off calculating flags until the last minute. For us to do that we'd need to keep around all the information needed to actually compute the flags. I've always imagined just keeping the StaticInst pointer around in a "register" and calling a computeFlags function on it when needed. There are problems with this like checkpoints and general complication, and I'm not 100% convinced it would actually make enough of a difference (or maybe any, after the overhead) to be worthwhile.
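
A small sketch of lazy flag evaluation for 32 bit add and sub. Note that this is a simpler variant than the StaticInst pointer idea: it records the operation kind and operands as plain values, which sidesteps storing a pointer in a register but only covers the operations it knows about. LazyFlags and its members are hypothetical names, not M5 code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical lazy condition-code state: remember what the last
// flag-setting operation was and compute individual flags on demand.
struct LazyFlags {
    enum class Op { None, Add, Sub };

    Op op = Op::None;
    std::uint32_t a = 0;
    std::uint32_t b = 0;
    std::uint32_t result = 0;

    // Executing an instruction only records its inputs and result.
    void recordAdd(std::uint32_t x, std::uint32_t y)
    {
        op = Op::Add; a = x; b = y; result = x + y;
    }

    void recordSub(std::uint32_t x, std::uint32_t y)
    {
        op = Op::Sub; a = x; b = y; result = x - y;
    }

    // Flags are derived only when something actually reads them.
    bool zero() const { return result == 0; }
    bool sign() const { return (result >> 31) != 0; }

    bool carry() const
    {
        switch (op) {
          case Op::Add: return result < a;  // unsigned wraparound
          case Op::Sub: return a < b;       // borrow
          default:      return false;
        }
    }

    bool overflow() const
    {
        switch (op) {
          // Operands share a sign that the result does not.
          case Op::Add: return (((a ^ result) & (b ^ result)) >> 31) != 0;
          // Operands differ in sign and the result's sign differs from a's.
          case Op::Sub: return (((a ^ b) & (a ^ result)) >> 31) != 0;
          default:      return false;
        }
    }
};
```

Because the recorded state is plain integers, it could be checkpointed like any other register contents, at the cost of a switch on every flag read.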

The same thing could be done for other structures like the
interrupt controller, and maybe the decoder and/or predecoder. Speaking
of the decoder, it would be nice to make that a little stateful as well.
As it is in, say, ARM, the decoder has to rediscover what mode it's in
over and over. I'm guessing it would be better to explicitly switch its
state (or it entirely) when changing modes instead, although that might
add a fair amount of complexity. Perhaps the decoder should be an object
instead of a bare function? I'm less sure how that would work. It could,
hypothetically, allow us to return the two PC bits commandeered to
signal the mode.


The predecoder is already stateful, right?  I'm not so convinced about
the decoder; you still need a way to externalize the state that
influences the decode process to allow the decode cache to work.  But
it seems like you could easily build a stateful decoder if you wanted
by calling the stateless decode function via an object that contains
the additional state.

Yes, the predecoder is sort of stateful. It's stateful in that it keeps state, but no attempt was made (yet, maybe) to make it work with mispredicts, for instance. The idea was that it could speculatively update state to keep the instructions flowing without having to stall every time the decoding context changed, which in ARM is potentially very often. This is a hard and related problem, but is a little different.

Maybe the decode cache should be instantiated statically by the decode object? If you're, say, decoding 64 bit instructions, there's no reason to have a bunch of 32 bit instructions in the cache getting in the way. The decode object could instantiate a cache for each decoding "mode", leave out the contextualizing state, and just start with the right batch of instructions. To get sharing in multi CPU or multi core simulations they'd be static so all decode objects would have access to the same cache per mode. For something as heavy weight as x86's various mode changes there could be a stall to update the decoder mode, but then again in ARM where every add instruction might switch to thumb mode (correct me if that's a mischaracterization, Ali) you can't introduce all those stalls and get realistic performance.
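
Roughly what I have in mind, as a sketch under assumptions: a decoder object that holds the current mode, wraps a stateless decode function, and looks instructions up in a static cache per mode so every decoder in the simulation shares them. Decoder, DecodeMode, DecodedInst, and decodeStateless are hypothetical names (DecodedInst stands in for a StaticInst), the two modes are just an ARM-flavored example, and the sketch ignores thread safety.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical decode modes; a real ISA would define its own set.
enum class DecodeMode : unsigned { Arm = 0, Thumb = 1 };
constexpr std::size_t NumDecodeModes = 2;

struct DecodedInst {
    std::string mnemonic;  // stand-in for a decoded StaticInst
};

class Decoder {
  public:
    using InstPtr = std::shared_ptr<const DecodedInst>;
    using Cache = std::unordered_map<std::uint32_t, InstPtr>;

    // Mode changes are explicit; the decoder does not rediscover them.
    void setMode(DecodeMode m) { mode = m; }

    InstPtr decode(std::uint32_t machInst)
    {
        // The mode selects the cache, so it is not part of the key.
        Cache &cache = caches()[static_cast<unsigned>(mode)];
        auto it = cache.find(machInst);
        if (it != cache.end())
            return it->second;
        InstPtr inst =
            std::make_shared<DecodedInst>(decodeStateless(machInst, mode));
        cache.emplace(machInst, inst);
        return inst;
    }

    static std::size_t cachedCount(DecodeMode m)
    {
        return caches()[static_cast<unsigned>(m)].size();
    }

  private:
    // One cache per mode, static so all Decoder objects share them.
    static std::array<Cache, NumDecodeModes> &caches()
    {
        static std::array<Cache, NumDecodeModes> c;
        return c;
    }

    // Placeholder for the existing stateless decode function.
    static DecodedInst decodeStateless(std::uint32_t machInst, DecodeMode m)
    {
        const char *prefix = (m == DecodeMode::Thumb) ? "thumb_" : "arm_";
        return DecodedInst{prefix + std::to_string(machInst)};
    }

    DecodeMode mode = DecodeMode::Arm;
};
```

Two decoders in the same mode get the same cached object back for the same bits, while switching mode is just a change of which cache gets consulted, with no flush.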


Steve
_______________________________________________
m5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/m5-dev


