OK,
it looks like I'm caught in the trap of implementing functionality that
doesn't buy us enough to justify significantly breaking the current M5
style of instruction<->CPU interaction, and similarly isn't useful enough
(considering realistic usage and other architectural inaccuracies) to
warrant a special instruction function (like translate()) to handle this
case. My reasons weren't strong enough to keep that feature, but that's
OK; the discussion cleared up some things for me...

(Steve's spiel about wasting time coding "general" features still haunts my
dreams! ... haha)

In initially designing the model, I was trying to keep the objects within
the CPU model as independent as possible, to give people the flexibility
to add or modify components in the future. One of the reasons O3 can seem
so daunting to new users is that it has so many tightly woven, interacting
objects that it's hard to tweak anything with confidence (though I admit
that modeling an out-of-order simulator is complicated in itself)...

Anyway, I can conform to what people seem to want, which is to just merge
the TLB access with the memory access; no biggie :)

Doing so will be relatively quick to implement, but I plan on holding off
on it a bit; since I'm so close with these regressions, I don't want to
spend too much time just merging upstream code into the patches.

If anyone needs a peek at what I have now for the InOrder model, just let
me know; there are about 10 patches that you can apply to your tree, which
so far get the gzip regression working correctly. Currently, I am debugging
eon and some of the FP instructions that are causing problems.


On Wed, Apr 22, 2009 at 12:47 AM, Steve Reinhardt <[email protected]> wrote:

>
>
> On Tue, Apr 21, 2009 at 6:26 PM, Korey Sewell <[email protected]> wrote:
>
>>
>> That's not what I mean.  What I'm saying is, simulate the timing of a
>>> TLB stage, but do the functional access with the memory stage.  I.e.
>>> split it for timing purposes, but leave it together for functional
>>> reasons.  I'd be surprised if this does not work since the timing of
>>> TLB accesses at that granularity shouldn't have much of an impact on
>>> the program.  I think Steve agreed with me on this one. (right Steve?)
>>
>> Yea, I think we are misunderstanding each other here. I guess I'm not
>> exactly sure what you are getting at or what point you are arguing for
>> (since the conversation got restarted again, I may be lost in translation).
>>
>> For your point about TLB accesses and timing, I thought we had resolved
>> that the timing of TLBs does have an impact on the program, which is why
>> Gabe went through adding a "translateTiming" access for the TLB and then
>> also making SE mode use a TLB.
>>
>
> The timing of a TLB miss definitely has an impact.  For all the ISAs we've
> done prior to x86, TLB misses were handled in software; the translate()
> method either was a hit or just signaled a miss to be handled later, so we
> didn't need to do anything special to model their latency.  For x86, TLB
> misses are handled in hardware, and translate() could encapsulate a
> HW-serviced TLB miss and page table walk.  That's why we needed to add
> translateTiming().
>
> The timing of TLB hits doesn't matter as much; it's a small fixed delay
> like integer ALU accesses or L1 cache hits, and that latency is pretty much
> designed into the pipeline, so as long as your pipeline design accounts for
> that latency properly you don't have to model it very explicitly.
>
> Nate's point is that the spot in the pipeline where you account for the
> latency of a TLB hit doesn't need to be exactly the same spot where you
> functionally do the TLB access.  For hits, I don't think there's any loss of
> accuracy at all.  Likewise for misses on ISAs with SW TLB miss handling,
> since the TLB miss handler won't get invoked until the instruction commits
> anyway.  The only case where there might be a slight inaccuracy is for TLB
> misses on x86, which might get kicked off a cycle or two later than they
> should, but relative to the overall cost of a TLB miss this is very much in
> the noise (and certainly swamped by other inaccuracies we don't even know
> about).
>
>
>> Also, I figure that if there are situations where you don't want to use
>> the TLB, then it makes sense not to continuously access the TLB object.
>>
>
> I'd be willing to bet that there are no longer any interesting platforms
> anywhere that don't use TLBs at all.  Really low-end systems may play some
> tricks with a small number of fixed large pages to eliminate most TLB
> misses, but anything that wants to provide security among multiple processes
> needs virtual memory of some form, and even relatively cheap cell phones
> still let you download java games, so I bet they're using their TLBs.
>
>
>> And lastly, in situations where you have a number of dependent memory
>> accesses waiting, it might be better for them to translate early if there
>> is going to be a latency associated with a TLB miss/hit (a situation that
>> gets exacerbated with more threads on one CPU, I would assume)...
>>
>
> This seems like a pretty unlikely design, but even if you did want to model
> this, you could still just model the timing effects of the early TLB access
> and defer the functional TLB access until later, with the same impact on
> timing accuracy I mentioned above: I believe only x86 TLB misses would see
> any inaccuracy at all, and even then it would be minor.
>
>
>> So that's why I currently have instructions request the TLB and the cache
>> as separate entities, which forced me to add "getMemFlags" and
>> "memAccSize" to the Instruction. The implementation I have now works
>> well, but potentially there's a better, less intrusive solution for the
>> instruction object.
>>
>> Or it sounds like people want to just X out that functionality and always
>> force a memory access to be tied to a TLB access on that same cycle.
>>
>
> It's not that we *want* the TLB access to be tied to the memory access,
> just that what you have now is a significant departure from the way the
> StaticInst objects currently interact with the CPU model, and what we're
> arguing is that these new functions and this additional complication seem
> unnecessary.  If it was the case that it seemed truly important to be able
> to separate the functional TLB access from the functional memory access, or
> that there was a clean way to do that consistent with the way things work
> currently, then I don't think we'd be objecting.
>
> Steve
>
>
>
>
> _______________________________________________
> m5-dev mailing list
> [email protected]
> http://m5sim.org/mailman/listinfo/m5-dev
>
>


-- 
----------
Korey L Sewell
Graduate Student - PhD Candidate
Computer Science & Engineering
University of Michigan