Quoting Steve Reinhardt <[email protected]>:
On Tue, Jul 13, 2010 at 11:20 AM, Gabe Black <[email protected]> wrote:
I can't say it was -the- reason, but one reason is that the TLBs as is
don't actually send the packets for the CPU, so they can't split
anything into multiple transactions easily. I'm intrigued by the idea of
putting the TLB behind a port or port-like interface, maybe even
exporting the TLB outside of the CPU's guts and putting it inline with
external accesses.
I see from your subsequent email that you've already thought of some
drawbacks to this... I agree it's nice in the common case, but has the
problem that it constrains the pipeline design perhaps more than you
really want to. We're probably better off finding a way to embed two
physical addresses in a Request.
I'd rather not pollute the Request objects with this stuff. What would
happen if (and I'm not saying we'd want to) we decide we need to
support accesses split into three pieces? Then we'd have all these
request objects with three addresses in them when 99% of the time they
only needed one. I like the idea of the separation happening before
the packets/requests are sent out with just enough baggage attached to
put it all back together when the pieces come back.
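
To make the "split before sending, carry just enough baggage to reassemble" idea concrete, here is a minimal sketch in C++. Every name in it (Fragment, splitAccess, SplitState, PageBytes) is hypothetical and not an existing M5 class, and it assumes 4 KB pages and that each fragment would be translated and sent as its own request.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical names throughout; none of these are existing M5 classes.
constexpr uint64_t PageBytes = 4096;  // assumed page size

struct Fragment {
    uint64_t vaddr;   // virtual address of this piece
    unsigned size;    // bytes covered by this piece
    unsigned offset;  // byte offset of this piece within the original access
};

// Split [vaddr, vaddr + size) at page boundaries. An access may end up in
// one, two, or more pieces without the Request needing extra address fields.
std::vector<Fragment> splitAccess(uint64_t vaddr, unsigned size)
{
    assert(size > 0);
    std::vector<Fragment> frags;
    unsigned done = 0;
    while (done < size) {
        uint64_t cur = vaddr + done;
        uint64_t room = PageBytes - (cur % PageBytes);  // bytes left in page
        unsigned left = size - done;
        unsigned chunk = left < room ? left : static_cast<unsigned>(room);
        frags.push_back({cur, chunk, done});
        done += chunk;
    }
    return frags;
}

// The "baggage": a shared record that glues the responses back together.
struct SplitState {
    std::vector<uint8_t> data;  // reassembled bytes of the original access
    unsigned outstanding;       // fragments still in flight

    SplitState(unsigned size, unsigned pieces)
        : data(size), outstanding(pieces) {}

    // Copy one fragment's response into place. Returns true once every
    // fragment has come back, in whatever order they arrived.
    bool complete(const Fragment &f, const uint8_t *bytes)
    {
        assert(outstanding > 0 && f.offset + f.size <= data.size());
        std::copy(bytes, bytes + f.size, data.begin() + f.offset);
        return --outstanding == 0;
    }
};
```

With this shape, supporting three pieces costs nothing extra in the common single-piece case: splitAccess simply returns a one-element vector.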
There are three problems with that, though. First,
the TLB would likely need some alternative way to pass a fault back to
the CPU. Maybe the request would have a fault pointer field?
Adding a field to contain a fault code seems pretty simple.
Second, the
TLB is the thing that recognizes when an access is to memory mapped
control state within the CPU. It would need a way to communicate with
the CPU to get/set those values.
Or better yet just to communicate back to the CPU that it needs to
access its internal state. Is it possible to remap this memory-mapped
state to virtual addresses? If not, we could even move that check out
of the TLB and into the CPU (not saying that's the best thing, just
that it would be a possibility).
There are accesses like wrmsr and rdmsr in x86 that know they're going
for internal state with an address-like index and purposefully flag
the virtual address as such, but then there are also regions of the
per-CPU physical address space, like the local APIC page, that pretty
much need to be checked after translation. You could do this in two
passes, one before and one after translation, but that's a less flexible
approach and seems more cumbersome compared to doing it in the TLB
where all the information is available at once.
I think this problem basically goes away, though, if the TLB isn't the
last step before memory. The request object is marked by a flag that
says it's for memory mapped state (we're missing a p in there in some
places, I think), and then the CPU knows to handle it specially
instead of actually accessing memory. This does contribute to some of
the complexity in the memory chopping up and gluing back together
code, but no better mechanism jumps to mind right away.
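
A rough C++ sketch of that flag-based arrangement follows. The Request, Tlb, and Cpu types here are simplified stand-ins invented for illustration, not the real M5 classes, and the translation is an identity mapping so the post-translation check stays visible. The only hard fact assumed is the default local APIC base of 0xFEE00000 with a 4 KB page.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Simplified stand-in for a request; the field names are hypothetical.
struct Request {
    uint64_t paddr = 0;
    bool mmappedIpr = false;  // set when the access targets CPU-internal state
};

constexpr uint64_t ApicBase = 0xFEE00000;  // default local APIC base on x86
constexpr uint64_t ApicSize = 0x1000;      // one 4 KB page

struct Tlb {
    // Identity "translation" for brevity. The point is that the check runs
    // after the physical address is known, so it can catch the APIC page.
    void translate(uint64_t vaddr, Request &req) const
    {
        req.paddr = vaddr;
        req.mmappedIpr =
            req.paddr >= ApicBase && req.paddr < ApicBase + ApicSize;
    }
};

class Cpu {
    std::map<uint64_t, uint32_t> internalState;  // memory mapped registers
    std::map<uint64_t, uint32_t> memory;         // stand-in for real memory

  public:
    void write(const Request &req, uint32_t val)
    {
        assert(req.paddr % 4 == 0);  // sketch models aligned 32 bit accesses
        (req.mmappedIpr ? internalState : memory)[req.paddr] = val;
    }

    uint32_t read(const Request &req)
    {
        assert(req.paddr % 4 == 0);
        return (req.mmappedIpr ? internalState : memory)[req.paddr];
    }

    bool touchedMemory() const { return !memory.empty(); }
};
```

The CPU never needs to know why the flag was set; it only has to route flagged requests away from the memory system.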
Third, the control state that actually
-runs- the TLB is maintained by the CPU, namely what mode it's in, etc.
I think you're on to something with the discussion below...
This also brings up another idea I've been rolling around for a while.
Why is all the control state local to the miscregfile/its descendant, the
ISA object? Why don't we put control state that matters to the TLB, or
at least a copy of it, in the TLB itself and then communicate it back
and forth as necessary? That would be easier to code (or at least I'm
guessing) since you'd just have the state right there, faster since it
avoids calling out for it, and would more conceptually match real
hardware where all the control state isn't put in one huge blob
someplace.
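
One way this could look, sketched in C++ under stated assumptions: the Isa and Tlb classes, the MiscReg indices, and the controlChanged hook are all hypothetical. The bit positions (CR0.PG at bit 31, EFER.LMA at bit 10) are the architectural ones. The ISA object stays the owner of the register values and pushes a copy of the relevant bits to the TLB on each write.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical register indices; the real miscreg numbering is different.
enum MiscReg { MISCREG_CR0, MISCREG_EFER, MISCREG_OTHER, NUM_MISCREGS };

constexpr uint64_t CR0_PG = 1ULL << 31;    // CR0.PG: paging enabled
constexpr uint64_t EFER_LMA = 1ULL << 10;  // EFER.LMA: long mode active

// The TLB keeps its own copy of the state it consults on every translation.
struct Tlb {
    bool pagingEnabled = false;
    bool longModeActive = false;

    // Called by the ISA object whenever a relevant register is written.
    void controlChanged(MiscReg reg, uint64_t val)
    {
        if (reg == MISCREG_CR0)
            pagingEnabled = (val & CR0_PG) != 0;
        else if (reg == MISCREG_EFER)
            longModeActive = (val & EFER_LMA) != 0;
    }
};

class Isa {
    uint64_t regs[NUM_MISCREGS] = {};
    Tlb &tlb;

  public:
    explicit Isa(Tlb &t) : tlb(t) {}

    void setMiscReg(MiscReg reg, uint64_t val)
    {
        assert(reg < NUM_MISCREGS);
        regs[reg] = val;
        // Forward only the registers the TLB cares about.
        if (reg == MISCREG_CR0 || reg == MISCREG_EFER)
            tlb.controlChanged(reg, val);
    }

    uint64_t readMiscReg(MiscReg reg) const
    {
        assert(reg < NUM_MISCREGS);
        return regs[reg];
    }
};
```

The translation path then reads tlb.pagingEnabled directly instead of calling back into the CPU for it.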
When I discovered that your x86 implementation has 200+ miscregs I
began to think that there was a problem here :-). I agree that
finding a way to spread it out makes sense. Just putting the
indirection in readMiscRegs/writeMiscRegs would be one way to do it, I
guess, but it would be nice to clean things up further to avoid this
giant linear index space (like you were alluding to in a previous
email).
Yeah, x86 just has gobs and gobs of control state. A significant
portion of that is what are called MSRs, or model specific
registers, although a lot of those are specified in the architecture
manual and one, the EFER, is required to enter 64 bit mode, so they
aren't necessarily all that model specific. Some of those, like
the MTRRs (memory type range registers), control whether regions of
memory are cacheable, etc., so those could probably go in the TLBs.
Unfortunately these are accessed with a 32 bit index, so it's not
clear we could chop them up into different register files easily.
The thing I was getting at before would be more for situations where
you'd have separately indexable register spaces like the x87/mmx 80
bit/64 bit registers (those overlap) vs. the 128 bit/256 bit XMM/YMM
registers (those also overlap) vs. the integer GPRs vs. the pseudo
integer control registers vs. the MSRs vs. the artificially numbered
non-MSR control state, vs. the segmentation related registers vs. the
control registers (CRn) vs. the debug registers (DRn) vs. the
performance counter registers. It would be great not to have these all
artificially squished together into only three groups, but more
importantly not squished into only one and then possibly ambiguously
reseparated. One is a little ugly, the other has frequently been the
source of bugs.
I'm imagining a utopia where you'd specify that both the
YMM and MMX register files were floating point, the GPRs and the
pseudo integer control registers were integer, the various other
control register files were integer but couldn't be written
speculatively or be renamed, that certain register groups had side
effects when written but not when read, blah blah blah. They could
each have their own register disambiguation function so you wouldn't
have to do so much work to figure out that the condition codes don't
do anything interesting but the GPRs might. Basically I'm hoping for a
richer and more flexible system for describing the register
architecture of an ISA than putting everything into one of three
predefined buckets. It's not bad the way it is (except the squish
everything down to one index space thing), especially since we got rid
of the somewhat sketchy situation with the ISA defined integer and
floating point register files, but I think it could be improved so it
fits the ISAs a little more naturally.
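
As a sketch of what such a description might look like in C++: each register space gets its own class record with its own properties and its own index range, so nothing has to be flattened into a single linear index and separated again. RegClass, RegId, RegArch, and every field name are hypothetical, and the register counts in the usage note are only illustrative.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

enum class ValueKind { Integer, FloatingPoint };

// Hypothetical per-class description of a register space.
struct RegClass {
    std::string name;
    unsigned numRegs;
    ValueKind kind;
    bool renameable;         // may the CPU rename registers in this class?
    bool speculativeWrites;  // may writes happen speculatively?
    bool writeSideEffects;   // do writes have side effects?
};

// A register is named by (class, index within class), never a flat index.
struct RegId {
    unsigned classIdx;
    unsigned regIdx;
};

class RegArch {
    std::vector<RegClass> classes;

  public:
    // Returns the class index to use in RegId.
    unsigned addClass(RegClass c)
    {
        classes.push_back(std::move(c));
        return static_cast<unsigned>(classes.size() - 1);
    }

    const RegClass &classOf(RegId id) const
    {
        assert(id.classIdx < classes.size());
        assert(id.regIdx < classes[id.classIdx].numRegs);
        return classes[id.classIdx];
    }

    // A CPU model could ask per register whether a write must be serialized.
    bool needsSerializedWrite(RegId id) const
    {
        const RegClass &c = classOf(id);
        return !c.speculativeWrites || c.writeSideEffects;
    }
};
```

An ISA would then register its spaces once, for example arch.addClass({"gpr", 16, ValueKind::Integer, true, true, false}) and arch.addClass({"cr", 16, ValueKind::Integer, false, false, true}), and a CPU model would query properties instead of hard-coding three buckets.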
It would also be nice, but probably too hard, to be able to store
values that are neither integer nor floating point in the register files. One
common optimization for x86 simulators, I'm told, is to put off
calculating flags until the last minute. For us to do that we'd need
to keep around all the information needed to actually compute the
flags. I've always imagined just keeping the StaticInst pointer around
in a "register" and calling a computeFlags function on it when needed.
There are problems with this like checkpoints and general
complication, and I'm not 100% convinced it would actually make enough
of a difference (or maybe any, after the overhead) to be worthwhile.
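
For illustration, a minimal C++ sketch of deferred flag computation. Instead of a StaticInst pointer it stores the operation and its operands, which is a simplification chosen here to keep the example self-contained. FlagsReg, Flags, and Op are hypothetical names, and only ZF, SF, and CF for 32 bit add and subtract are modelled.

```cpp
#include <cassert>
#include <cstdint>

struct Flags {
    bool zf;  // result was zero
    bool sf;  // result had its top bit set
    bool cf;  // carry out (add) or borrow (sub)
};

enum class Op { Add, Sub };

// A flags "register" that holds either computed flags or the recipe
// needed to compute them later.
class FlagsReg {
    Flags cached{false, false, false};
    bool pending = false;
    Op op = Op::Add;
    uint32_t a = 0, b = 0;

    Flags compute() const
    {
        assert(pending);
        uint32_t r = (op == Op::Add) ? a + b : a - b;
        Flags f;
        f.zf = (r == 0);
        f.sf = (r >> 31) != 0;
        f.cf = (op == Op::Add) ? (r < a) : (a < b);
        return f;
    }

  public:
    unsigned computeCount = 0;  // how often flags were actually computed

    // Called by every flag-setting instruction: cheap, no flag math.
    void record(Op o, uint32_t x, uint32_t y)
    {
        op = o;
        a = x;
        b = y;
        pending = true;
    }

    // Called only by instructions that consume flags.
    const Flags &read()
    {
        if (pending) {
            cached = compute();
            pending = false;
            ++computeCount;
        }
        return cached;
    }
};
```

A run of flag-setting instructions with no consumer between them pays for one computation at most. Checkpointing would have to serialize the pending recipe as well as the cached flags, which is part of the complication mentioned above.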
The same thing could be done for other structures like the
interrupt controller, and maybe the decoder and/or predecoder. Speaking
of the decoder, it would be nice to make that a little stateful as well.
As it is in, say, ARM, the decoder has to rediscover what mode it's in
over and over. I'm guessing it would be better to explicitly switch its
state (or swap it out entirely) when changing modes instead, although that might
add a fair amount of complexity. Perhaps the decoder should be an object
instead of a bare function? I'm less sure how that would work. It could,
hypothetically, allow us to return the two PC bits commandeered to
signal the mode.
The predecoder is already stateful, right? I'm not so convinced about
the decoder; you still need a way to externalize the state that
influences the decode process to allow the decode cache to work. But
it seems like you could easily build a stateful decoder if you wanted
by calling the stateless decode function via an object that contains
the additional state.
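
A small C++ sketch of that wrapper approach, with hypothetical names throughout (Mode, ExtMachInst, decodeInst, Decoder). The stateless function sees all the state that affects decoding as part of its argument, so a decode cache keyed on that argument still works, while the object in front of it remembers the mode. The placeholder decode just returns a label, and the sketch models only 16 bit Thumb encodings.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

enum class Mode : uint8_t { Arm, Thumb };

// Everything that influences decoding is externalized into this value.
struct ExtMachInst {
    uint32_t bits;
    Mode mode;
};

// Stateless decode: the result depends only on the argument.
// Placeholder body that returns a label instead of an instruction object.
std::string decodeInst(const ExtMachInst &inst)
{
    // This sketch models only 16 bit Thumb encodings.
    assert(inst.mode != Mode::Thumb || inst.bits <= 0xFFFF);
    return (inst.mode == Mode::Thumb ? "thumb:" : "arm:") +
           std::to_string(inst.bits);
}

// Stateful front end: holds the mode and folds it into each decode call.
class Decoder {
    Mode mode = Mode::Arm;

  public:
    void setMode(Mode m) { mode = m; }
    Mode getMode() const { return mode; }

    std::string decode(uint32_t bits) const
    {
        return decodeInst({bits, mode});
    }
};
```

The mode is switched once, at the point where it changes, rather than being rediscovered on every decode.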
Yes, the predecoder is sort of stateful. It's stateful in that it
keeps state, but no attempt was made (yet, maybe) to make it work with
mispredicts, for instance. The idea was that it could speculatively
update state to keep the instructions flowing without having to stall
every time the decoding context changed, which in ARM is potentially
very often. This is a hard and related problem, but is a little
different.
Maybe the decode cache should be instantiated statically by the decode
object? If you're, say, decoding 64 bit instructions, there's no
reason to have a bunch of 32 bit instructions in the cache getting in
the way. The decode object could instantiate a cache for each decoding
"mode", leave out the contextualizing state, and just start with the
right batch of instructions. To get sharing in multi-CPU or multi-core
simulations they'd be static so all decode objects would have access
to the same cache per mode. For something as heavyweight as x86's
various mode changes there could be a stall to update the decoder
mode, but then again in ARM, where every add instruction might switch
to Thumb mode (correct me if that's a mischaracterization, Ali), you
can't introduce all those stalls and get realistic performance.
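
The per-mode static cache could be sketched in C++ roughly as follows. The names (Mode, StaticInst, Decoder, cacheFor) are hypothetical, the "decode" step just builds a placeholder object, and the sketch is not thread safe. The cache key is the raw instruction bits alone, since the mode selects which cache to look in.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <unordered_map>

enum class Mode : std::size_t { Bits32 = 0, Bits64 = 1, NumModes = 2 };

// Placeholder for a decoded instruction.
struct StaticInst {
    uint64_t bits;
    Mode mode;
};
using InstPtr = std::shared_ptr<const StaticInst>;

class Decoder {
    using Cache = std::unordered_map<uint64_t, InstPtr>;

    // One cache per mode, shared by every Decoder in the simulation.
    static Cache &cacheFor(Mode m)
    {
        static std::array<Cache, static_cast<std::size_t>(Mode::NumModes)>
            caches;
        auto idx = static_cast<std::size_t>(m);
        assert(idx < caches.size());
        return caches[idx];
    }

    Mode mode;

  public:
    explicit Decoder(Mode m) : mode(m) {}

    void setMode(Mode m) { mode = m; }

    InstPtr decode(uint64_t bits)
    {
        Cache &cache = cacheFor(mode);
        auto it = cache.find(bits);
        if (it != cache.end())
            return it->second;  // hit: no contextualizing state in the key
        InstPtr inst = std::make_shared<const StaticInst>(StaticInst{bits, mode});
        cache.emplace(bits, inst);
        return inst;
    }
};
```

Two decoders in the same mode get the same object back for the same bits, while a decoder in another mode never sees those entries.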
Steve
_______________________________________________
m5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/m5-dev