Re: [m5-dev] cleaning up TimingSimpleCPU

Steve Reinhardt Tue, 13 Jul 2010 23:00:23 -0700

On Tue, Jul 13, 2010 at 5:47 PM, Gabriel Michael Black
<[email protected]> wrote:
> Quoting Steve Reinhardt <[email protected]>:
>
>> On Tue, Jul 13, 2010 at 11:20 AM, Gabe Black <[email protected]>
>> wrote:
>>>
>>> I can't say it was -the- reason, but one reason is that the TLBs as is
>>> don't actually send the packets for the CPU, so they can't split
>>> anything into multiple transactions easily. I'm intrigued by the idea of
>>> putting the TLB behind a port or port like interface, maybe even
>>> exporting the TLB outside of the CPU's guts and putting it inline with
>>> external accesses.
>>
>> I see from your subsequent email that you've already thought of some
>> drawbacks to this... I agree it's nice in the common case, but has the
>> problem that it constrains the pipeline design perhaps more than you
>> really want to.  We're probably better off finding a way to embed two
>> physical addresses in a Request.
>
> I'd rather not pollute the Request objects with this stuff. What would
> happen if (and I'm not saying we'd want to) we decide we need to support
> accesses split into three pieces? Then we'd have all these request objects
> with three addresses in them when 99% of the time they only needed one. I
> like the idea of the separation happening before the packets/requests are
> sent out with just enough baggage attached to put it all back together when
> the pieces come back.


I'm pretty comfortable assuming that x86 is the most arcanely complex
ISA we'll implement, so if it only needs two, then I think a max of
two is reasonable.  If someone ever wants to do VAX they're on their
own.

The real issue is, if the TLB response goes back to the CPU before
going to the cache (which I think we are all recognizing is the right
way to do it), how do we return two translations when we only sent one
request?
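
To make the question concrete, here is a minimal sketch of one possible shape for the answer: a single translation call whose result carries either one or two physical fragments, so the CPU gets both translations back before anything goes to the cache. Every name here (TranslationResult, Fragment, translateSplit, PageBytes) is hypothetical and not existing M5 code, and the 4 KB page size is an assumption made only for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical names throughout; none of this is existing M5 code.
using Addr = std::uint64_t;

struct Fragment {
    Addr paddr;      // physical address of this piece
    unsigned size;   // bytes covered by this piece
};

struct TranslationResult {
    Fragment frag[2];   // at most two pieces, per the "max of two" assumption
    int count;          // 1 in the common case, 2 for a page-crossing access
};

constexpr Addr PageBytes = 4096;  // assumed page size for this sketch

// Translate [vaddr, vaddr + size) with a caller-supplied lookup that maps
// a virtual page base to a physical page base.
inline TranslationResult
translateSplit(Addr vaddr, unsigned size,
               const std::function<Addr(Addr)> &lookupPage)
{
    assert(size > 0 && size <= PageBytes);
    const Addr offset = vaddr % PageBytes;
    const Addr firstPage = vaddr - offset;
    TranslationResult res{};
    if (offset + size <= PageBytes) {
        // Common case: the access fits in one page.
        res.frag[0] = {lookupPage(firstPage) + offset, size};
        res.count = 1;
    } else {
        // Page-crossing case: one request yields two translations.
        const unsigned firstSize = static_cast<unsigned>(PageBytes - offset);
        res.frag[0] = {lookupPage(firstPage) + offset, firstSize};
        res.frag[1] = {lookupPage(firstPage + PageBytes), size - firstSize};
        res.count = 2;
    }
    return res;
}
```

The trade-off is the one discussed above: the two-entry array is cheap for the 99% case but hard-codes the limit of two, so an ISA needing three pieces would need a different container.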

> The thing I was getting at before would be more for situations where you'd
> have separately indexable register spaces like the x87/mmx 80 bit/64 bit
> registers (those overlap) vs. the 128 bit/256 bit XMM/YMM registers (those
> also overlap) vs. the integer GPRs vs. the pseudo integer control registers
> vs. the MSRs vs. the artificially numbered non-MSR control state, vs. the
> segmentation related registers vs. the control registers (CRn) vs. the debug
> registers (DRn) vs. the performance counter registers. It would be great not
> to have these all artificially squished together into only three groups, but
> more importantly not squished into only one and then possibly ambiguously
> reseparated. One is a little ugly, the other has frequently been the source
> of bugs.

My thought was that this was nicely complementary in that, once you
split the regs into an arbitrary number of independent spaces, then
you could have some of those spaces be owned by components outside the
CPU core (like the TLB).
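
As a rough illustration of that idea, the sketch below identifies a register by (space, index) rather than by one flattened index, and records an owner per space so that a space can belong to a component outside the CPU core. The types RegSpace, RegId and RegSpaces, and the space names used with them, are invented for this example and do not correspond to anything in M5.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch; these types do not exist in M5.
struct RegSpace {
    std::string owner;                 // e.g. "cpu" or "tlb"
    std::vector<std::uint64_t> regs;   // storage for this space only
};

struct RegId {
    std::string space;   // which independent register space
    unsigned index;      // index within that space, never a global index
};

class RegSpaces {
  public:
    // Register a new space with 'n' zero-initialized registers.
    void addSpace(const std::string &name, const std::string &owner,
                  unsigned n)
    {
        assert(spaces.count(name) == 0);
        spaces[name] = RegSpace{owner, std::vector<std::uint64_t>(n, 0)};
    }

    std::uint64_t read(const RegId &id) const
    { return spaces.at(id.space).regs.at(id.index); }

    void write(const RegId &id, std::uint64_t value)
    { spaces.at(id.space).regs.at(id.index) = value; }

    // Lets a CPU model route an access to whichever component owns it.
    const std::string &ownerOf(const RegId &id) const
    { return spaces.at(id.space).owner; }

  private:
    std::map<std::string, RegSpace> spaces;
};
```

Because index 3 in one space and index 3 in another never collide, there is no squash-then-reseparate step of the kind described in the quoted text.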

> I'm imagining a utopia where you'd specify the control registers both YMM
> and MMX register files were floating point, the GPRs and the pseudo integer
> control registers were integer, the various other control register files
> were integer but couldn't be written non-speculatively or be renamed, that
> certain register groups had side effects when written but not when read,
> blah blah blah. They could each have their own register disambiguation
> function so you wouldn't have to do so much work to figure out that the
> condition codes don't do anything interesting but the GPRs might. Basically
> I'm hoping for a richer and more flexible system for describing the register
> architecture of an ISA than putting everything into one of three predefined
> buckets. It's not bad the way it is (except the squish everything down to
> one index space thing), especially since we got rid of the somewhat sketchy
> situation with the ISA defined integer and floating point register files,
> but I think it could be improved so it fits the ISAs a little more
> naturally.

Sounds nice to me!

> It would also be nice but probably too hard to be able to store non-integral
> or floating point values in the register files. One common optimization for
> x86 simulators, I'm told, is to put off calculating flags until the last
> minute. For us to do that we'd need to keep around all the information
> needed to actually compute the fault. I've always imagined just keeping the
> StaticInst pointer around in a "register" and calling a computeFlags
> function on it when needed. There are problems with this like checkpoints
> and general complication, and I'm not 100% convinced it would actually make
> enough of a difference (or maybe any, after the overhead) to be worthwhile.

I don't think this is worth pursuing... lazy flag evaluation makes
sense if you're pressing hard for performance (like Transmeta say) but
we're light years from that being an issue for us, and we really have
no serious intent to close that gap.  (Not that I would mind being
faster, but frankly from where I'm sitting it would be both redundant
and foolish to try and duplicate something like SimNow.)

> Yes, the predecoder is sort of stateful. It's stateful in that it keeps
> state, but no attempt was made (yet, maybe) to make it work with
> mispredicts, for instance. The idea was that it could speculatively update
> state to keep the instructions flowing without having to stall every time
> the decoding context changed, which in ARM is potentially very often. This
> is a hard and related problem, but is a little different.

Hmm, that brief description makes it sound to me like the statefulness
of the predecoder is the problem, not the solution; if you keep the
state external and pass it in on each call, then that makes it easier
for something like O3 to manage multiple versions of it without
imposing that burden on (say) SimpleCPU.
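
A small sketch of what "keep the state external and pass it in on each call" could look like. This is not the actual M5 predecoder interface: PredecodeState, PredecodeResult and predecode are hypothetical, and the decoding rule (opcode byte 0xFF toggles the mode, instructions are 2 or 4 bytes) is a toy stand-in for real ISA context.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch; not the actual M5 predecoder interface.
struct PredecodeState {
    bool thumb;   // stand-in for whatever decoding context the ISA needs
};

struct PredecodeResult {
    unsigned length;       // bytes consumed by this instruction
    PredecodeState next;   // context to use for the following instruction
};

// Pure function: all context arrives through 'state' and nothing is stored.
// Toy rule: opcode 0xFF toggles the mode; thumb instructions are 2 bytes,
// all others are 4.
inline PredecodeResult
predecode(const PredecodeState &state, std::uint8_t opcode)
{
    PredecodeResult r{state.thumb ? 2u : 4u, state};
    if (opcode == 0xFF)
        r.next.thumb = !state.thumb;
    return r;
}
```

With this shape, a model like O3 could hold one PredecodeState per in-flight speculation point and discard the wrong ones on a mispredict, while a simple CPU just keeps a single copy and overwrites it with `next` after each call.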

> Maybe the decode cache should be instantiated statically by the decode
> object? If you're, say, decoding 64 bit instructions, there's no reason to
> have a bunch of 32 bit instructions in the cache getting in the way. The
> decode object could instantiate a cache for each decoding "mode", leave out
> the contextualizing state, and just start with the right batch of
> instructions. To get sharing in multi CPU or multi core simulations they'd
> be static so all decode objects would have access to the same cache per
> mode.

Yea, that's an interesting thought... if we could get rid of
ExtMachInst entirely that might be worth it.
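
For what it's worth, the per-mode cache idea might be sketched roughly as follows. This is only an illustration under my own assumptions: Mode, Decoded and Decoder are made-up names, the "decoded instruction" is a placeholder struct, and the cache key is just the raw instruction bits with no contextualizing state folded in.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical sketch; none of these names come from M5.
enum class Mode { Bits32, Bits64, NumModes };

struct Decoded {
    std::uint32_t raw;   // placeholder for a real decoded instruction
    Mode mode;
};

class Decoder {
  public:
    explicit Decoder(Mode m) : mode(m) {}

    // Look up 'raw' in the cache for this decoder's mode only. The key is
    // the raw bits alone because the mode already selected the cache.
    const Decoded &decode(std::uint32_t raw)
    {
        auto &cache = caches[static_cast<int>(mode)];
        auto it = cache.find(raw);
        if (it == cache.end()) {
            ++misses;
            it = cache.emplace(raw, Decoded{raw, mode}).first;
        }
        return it->second;
    }

    static unsigned misses;   // counts decodes that were not cached

  private:
    Mode mode;
    // One cache per mode, static so every Decoder instance shares them.
    static std::unordered_map<std::uint32_t, Decoded>
        caches[static_cast<int>(Mode::NumModes)];
};

unsigned Decoder::misses = 0;
std::unordered_map<std::uint32_t, Decoded>
    Decoder::caches[static_cast<int>(Mode::NumModes)];
```

Two decoders in the same mode hit the same entry, while a decoder in another mode never sees it. The sketch ignores thread safety and assumes a mode is fully described by one enum value, which may well be too simple for a real ISA.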

Steve
_______________________________________________
m5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/m5-dev
