The tricky part here is deciding what main() is and how you get to and
from it. If main() is just one of the steps in the process, then y() is
equivalent to it. If you can get from main() to y() without stalling,
then you can do the same from y() to whatever its subordinate step is,
say z(). That way you can end up going around and around forever,
growing the call stack without bound.
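A minimal sketch of that runaway recursion, using the hypothetical step
names x(), y(), z() from the discussion (none of this is actual M5
code, and the cap stands in for the stack overflow you'd hit for real):

```cpp
#include <cassert>

// Each step calls the next directly whenever it doesn't stall, so the
// cycle x -> y -> z -> x -> ... never unwinds and the call depth grows
// without bound. "limit" is a demo stand-in for blowing the stack.
static int depth = 0;
static const int limit = 9;

void y();
void z();

void x() { if (++depth < limit) y(); }
void y() { if (++depth < limit) z(); }
void z() { if (++depth < limit) x(); }
```

Nothing here ever returns control to an event loop; every frame is
still live when the cap is reached.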
One solution is to put a cap on how deep you go. Say in this example
that's z(). When you get to z(), you never attempt the next thing,
say a(); you simply return to main() and let it continue. That works
really well if z() is the natural end of what you're doing, in this
case the end of the life of the current instruction. The problem here
is that you don't know whether you actually came from main() in the
first place, or from a callback somewhere in the middle. If you came
from a callback, you'd return to it, it would end, and nothing would be
responsible for the next action of the CPU. If you make the callbacks
smarter, then they all start to become mini main()s and the complexity
goes way up.
To solve -this- issue, you can simply keep track of whether you've
gotten where you are from main() or from a callback after it. If you
get to z() and you're not from main(), you run the next step and record
that you are now effectively from main(). If you are from main(), you
return to it. This way, the stack can be at most twice the deepest call
depth of a particular instruction's life cycle, because main() can
never appear in it twice and you always have to start an instruction in
main(). This gets us back to basically what I had the first time
around, except that it's now a flag for the whole CPU rather than for a
single step.
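Under the same hypothetical names, that CPU-wide flag might look
roughly like this (a sketch of the idea, not the actual M5 code):

```cpp
#include <algorithm>
#include <cassert>

// One CPU-wide flag records whether the call stack is rooted in
// main(). z() is the last step of an instruction's life cycle: if the
// stack is already rooted in main(), z() just unwinds; if it was
// reached from a callback, it takes over main()'s looping job exactly
// once. The nesting therefore never exceeds two instructions' worth.
static bool rootedInMain = false;
static int instructionsLeft = 10;
static int depth = 0, maxDepth = 0;

void z();

void y() {                       // some mid-instruction step
    ++depth;
    maxDepth = std::max(maxDepth, depth);
    z();
    --depth;
}

void z() {                       // end of the instruction's life cycle
    if (rootedInMain)
        return;                  // let whoever is acting as main() loop
    rootedInMain = true;         // came from a callback: become main()
    while (instructionsLeft-- > 0)
        y();
    rootedInMain = false;
}

int runFromCallback() {          // a callback drops us into y()
    y();
    return maxDepth;
}
```

Even with ten instructions pending, runFromCallback() never nests more
than two calls to y() at once, which is the "at most twice the deepest
call depth" bound.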
The next problem is dealing with calls to what would be y() in your
example from code that is not part of the CPU. That happens in an
instruction that calls read() or write(). In those bits of code we
can't return whether we're expecting a callback, since the value would
be lost; we instead have to record what's going on somewhere in the CPU
and read it back when we get control again.
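As a sketch of that recording scheme (the names here are made up for
illustration; the real read()/write() signatures in M5 differ):

```cpp
#include <cassert>

// read() is invoked from instruction code that can't propagate its
// return value back to the CPU, so instead of returning "a callback is
// coming", it records that fact in CPU-visible state which the CPU
// inspects once it regains control.
static bool expectingCallback = false;

void read(bool accessMustStall) {
    if (accessMustStall)
        expectingCallback = true;  // a callback will resume the CPU
    // otherwise the access completed immediately; nothing to record
}

bool cpuMayContinue() { return !expectingCallback; }
```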
Now that we're recording whether read() or write() are going to call
back or not, we could also record the fact that they were called in the
first place and defer the work until after initiateAcc entirely. That
leads to more global state, though, and adds complexity. I had been
thinking that was the way to go for a while, but looking back I've
changed my mind.
So in the end, it looks like there need to be two global flags. One to
say whether the call stack is rooted most recently in main() or a
callback, and one to say whether the CPU should perform "the next step"
or if it should wait for a callback to pick things up again.
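Those two flags might be spelled something like this (hypothetical
names, not what the M5 sources actually call them):

```cpp
#include <cassert>

// The two CPU-wide flags the scheme above ends up needing.
struct TimingCpuFlags {
    bool rootedInMain = false; // is the call stack rooted in main(),
                               // or in a callback that ran after it?
    bool doNextStep = false;   // should the CPU perform "the next
                               // step", or wait for a callback?
};
```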
Now we run into another complication, namely that there isn't
necessarily a single "the next step" to go to. After a translation we
might, for instance, need to actually access memory, or we might need to
invoke a fault for whatever reason. Our second flag now has to not only
indicate if the CPU should perform "the next step" but also what that
next step should be.
At this point, we've basically wound up with an enum of what, if you
squint, look like states for the CPU. Those states describe what to do
next, i.e. translate request X, send packet Y to memory, etc. This is
what I was talking about before as far as making the CPU work like a
state machine, although that may not have been clear.
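The enum the second flag grows into might look like this (the state
names are invented for illustration; the real set would come from the
instruction life cycle in the M5 code):

```cpp
#include <cassert>
#include <cstring>

// "The next step" for the CPU, making it an explicit state machine.
enum class NextStep { WaitForCallback, Translate, AccessMemory,
                      Fault, Fetch };

struct Cpu { NextStep next = NextStep::Fetch; };

// What the CPU would do when control returns to it.
const char *describe(const Cpu &cpu) {
    switch (cpu.next) {
    case NextStep::Translate:       return "translate request";
    case NextStep::AccessMemory:    return "send packet to memory";
    case NextStep::Fault:           return "invoke fault";
    case NextStep::Fetch:           return "fetch next instruction";
    case NextStep::WaitForCallback: return "wait for a callback";
    }
    return "";
}
```

Both a callback firing and main() regaining control would dispatch on
the same state, which is what collapses the two flags into one value.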
So is this the way to go, or did I mangle/misinterpret something?
Gabe
Steve Reinhardt wrote:
> I actually looked at the code a bit this time; and I have a hypothesis
> that the problem arises from two similar but fundamentally different
> models of "bypassing" potential event-based delays:
>
> main() {
> x_will_callback = x();
> if (!x_will_callback) y();
> }
>
> x() {
> if (...) { sched_callback(&cb); return true; }
> else { return false; }
> }
>
> cb() { y(); }
>
> as opposed to:
>
> main() { x(); }
>
> x() {
> if (...) { sched_callback(&cb); }
> else { y(); /* or cb(); */ }
> }
>
> cb() { y(); }
>
> Both of these have the overall effect of calling x() and then y(),
> sometimes with a delay and sometimes not. However in the latter case
> y() is called from inside the call to x(), which leads to problems
> when that's not expected... basically this is the root of the
> initiateAcc/completeAcc problem. Also if there's a cycle (like there
> is in our pipeline) where you do x,y,z,x,y,z,x,y,z then as Gabe points
> out you can run into stack overflow problems too.
>
> My hypothesis is that the old TimingSimpleCPU code worked because it
> always did the former, and Gabe has introduced two points that do the
> latter: one in timingTranslate(), and one in fetch(). I think the
> right solution is that for each of these we should either change it
> into the first model or eliminate the bypass option altogether and
> always do a separately scheduled callback.
>
> I think the distinction of having main() call y() directly rather than
> x_cb() is potentially important, as this gives you points where you
> can do slightly different things depending on whether you did the
> event or bypassed it. It also (to me) provides some logical
> separation between "what comes next" (the code in y()) and how you got
> there.
>
> Coming at this from a different angle, while the code is getting
> increasingly messy (or maybe just inherently complex), I'd say a
> significant fraction of the complexity is dealing with
> cache/page-crossing memory operations, which I don't think would be
> significantly improved by a global restructuring. (Let me know if
> anyone thinks otherwise.) Thus I'm not too keen on doing a
> significant restructuring since I think the code will still be messy
> afterward.
>
> On Wed, May 6, 2009 at 11:42 AM, Gabriel Michael Black
> <[email protected] <mailto:[email protected]>> wrote:
>
> The example I mentioned would be if
> you have a microcode loop that doesn't touch memory to, for instance,
> stall until you get an interrupt or a countdown expires for a small
> delay.
>
>
> Although I agree that it's good to avoid this possibility altogether,
> I'd argue that any microcode loop like you describe is broken. If for
> no other reason than power dissipation I don't think you'd ever want
> to busy-wait in a real system, and certainly even if you did we
> wouldn't want to write it that way in m5 for performance reasons.
>
> Steve
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> m5-dev mailing list
> [email protected]
> http://m5sim.org/mailman/listinfo/m5-dev
>