The tricky part here is deciding what main() is and how you get to and
from it. If main() is just one of the steps in the process, then y() is
equivalent to it. If you can get from main() to y() without stalling,
then you can do the same from y() to whatever its subordinate step is,
say z(). That way you can end up going around and around forever,
growing the call stack without bound.
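A minimal sketch of that runaway recursion, using the hypothetical step
names x(), y(), z() from the discussion (none of this is actual M5
code, and the cap stands in for the stack overflow you'd hit for real):

```cpp
#include <cassert>

// Each step calls the next directly whenever it doesn't stall, so the
// cycle x -> y -> z -> x -> ... never unwinds and the call depth grows
// without bound. "limit" is a demo stand-in for blowing the stack.
static int depth = 0;
static const int limit = 9;

void y();
void z();

void x() { if (++depth < limit) y(); }
void y() { if (++depth < limit) z(); }
void z() { if (++depth < limit) x(); }
```

Nothing here ever returns control to an event loop; every frame is
still live when the cap is reached.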
One solution is to put a cap on how deep you go. Say in this example
that's z(). When you get to z(), you never attempt the next thing,
say a(); you simply return to main() and let it continue. That works
really well if z() is the natural end of what you're doing, in this
case the end of the life of the current instruction. The problem here
is that you don't know whether you actually came from main() in the
first place, or from a callback somewhere in the middle. If you came
from a callback, you'd return to it, it would end, and nothing would be
responsible for the next action of the CPU. If you make the callbacks
smarter, then they all start to become mini main()s and the complexity
goes way up.
To solve -this- issue, you can simply keep track of whether you've
gotten where you are from main() or from a callback after it. If you
get to z() and you're not from main(), you run the next step and record
that you are now effectively from main(). If you are from main(), you
return to it. This way, the stack can be at most twice the deepest call
depth of a particular instruction's life cycle, because main() can
never appear in it twice and you always have to start an instruction in
main(). This gets us back to basically what I had the first time
around, except that it's now a flag for the whole CPU rather than for a
single step.
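Under the same hypothetical names, that CPU-wide flag might look
roughly like this (a sketch of the idea, not the actual M5 code):

```cpp
#include <algorithm>
#include <cassert>

// One CPU-wide flag records whether the call stack is rooted in
// main(). z() is the last step of an instruction's life cycle: if the
// stack is already rooted in main(), z() just unwinds; if it was
// reached from a callback, it takes over main()'s looping job exactly
// once. The nesting therefore never exceeds two instructions' worth.
static bool rootedInMain = false;
static int instructionsLeft = 10;
static int depth = 0, maxDepth = 0;

void z();

void y() {                       // some mid-instruction step
    ++depth;
    maxDepth = std::max(maxDepth, depth);
    z();
    --depth;
}

void z() {                       // end of the instruction's life cycle
    if (rootedInMain)
        return;                  // let whoever is acting as main() loop
    rootedInMain = true;         // came from a callback: become main()
    while (instructionsLeft-- > 0)
        y();
    rootedInMain = false;
}

int runFromCallback() {          // a callback drops us into y()
    y();
    return maxDepth;
}
```

Even with ten instructions pending, runFromCallback() never nests more
than two calls to y() at once, which is the "at most twice the deepest
call depth" bound.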
The next problem is dealing with calls to what would be y() in your
example from code that is not part of the CPU. That happens in an
instruction that calls read() or write(). In those bits of code we
can't return whether we're expecting a callback, since the value would
be lost; we instead have to record what's going on somewhere in the CPU
and read it back when we get control again.
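As a sketch of that recording scheme (the names here are made up for
illustration; the real read()/write() signatures in M5 differ):

```cpp
#include <cassert>

// read() is invoked from instruction code that can't propagate its
// return value back to the CPU, so instead of returning "a callback is
// coming", it records that fact in CPU-visible state which the CPU
// inspects once it regains control.
static bool expectingCallback = false;

void read(bool accessMustStall) {
    if (accessMustStall)
        expectingCallback = true;  // a callback will resume the CPU
    // otherwise the access completed immediately; nothing to record
}

bool cpuMayContinue() { return !expectingCallback; }
```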
Now that we're recording whether read() or write() are going to call
back or not, we could also record the fact that they were called in the
first place and defer the work until after initiateAcc entirely. That
leads to more global state, though, and adds complexity. I had been
thinking that was the way to go for a while, but looking back I've
changed my mind.
So in the end, it looks like there need to be two global flags. One to
say whether the call stack is rooted most recently in main() or a
callback, and one to say whether the CPU should perform "the next step"
or if it should wait for a callback to pick things up again.
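Those two flags might be spelled something like this (hypothetical
names, not what the M5 sources actually call them):

```cpp
#include <cassert>

// The two CPU-wide flags the scheme above ends up needing.
struct TimingCpuFlags {
    bool rootedInMain = false; // is the call stack rooted in main(),
                               // or in a callback that ran after it?
    bool doNextStep = false;   // should the CPU perform "the next
                               // step", or wait for a callback?
};
```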
Now we run into another complication, namely that there isn't
necessarily a single "the next step" to go to. After a translation we
might, for instance, need to actually access memory, or we might need to
invoke a fault for whatever reason. Our second flag now has to not only
indicate if the CPU should perform "the next step" but also what that
next step should be.
At this point, we've basically wound up with an enum of what, if you
squint, look like states for the CPU. Those states describe what to do
next, i.e. translate request X, send packet Y to memory, etc. This is
what I was talking about before as far as making the CPU work like a
state machine, although that may not have been clear.
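The enum the second flag grows into might look like this (the state
names are invented for illustration; the real set would come from the
instruction life cycle in the M5 code):

```cpp
#include <cassert>
#include <cstring>

// "The next step" for the CPU, making it an explicit state machine.
enum class NextStep { WaitForCallback, Translate, AccessMemory,
                      Fault, Fetch };

struct Cpu { NextStep next = NextStep::Fetch; };

// What the CPU would do when control returns to it.
const char *describe(const Cpu &cpu) {
    switch (cpu.next) {
    case NextStep::Translate:       return "translate request";
    case NextStep::AccessMemory:    return "send packet to memory";
    case NextStep::Fault:           return "invoke fault";
    case NextStep::Fetch:           return "fetch next instruction";
    case NextStep::WaitForCallback: return "wait for a callback";
    }
    return "";
}
```

Both a callback firing and main() regaining control would dispatch on
the same state, which is what collapses the two flags into one value.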
So is this the way to go, or did I mangle/misinterpret something?
Gabe
Steve Reinhardt wrote:
> I actually looked at the code a bit this time; and I have a hypothesis
> that the problem arises from two similar but fundamentally different
> models of "bypassing" potential event-based delays:
>
> main() {
> x_will_callback = x();
> if (!x_will_callback) y();
> }
>
> x() {
> if (...) { sched_callback(&cb); return true; }
> else { return false; }
> }
>
> cb() { y(); }
>
> as opposed to:
>
> main() { x(); }
>
> x() {
> if (...) { sched_callback(&cb); }
> else { y(); /* or cb(); */ }
> }
>
> cb() { y(); }
>
> Both of these have the overall effect of calling x() and then y(),
> sometimes with a delay and sometimes not. However in the latter case
> y() is called from inside the call to x(), which leads to problems
> when that's not expected... basically this is the root of the
> initiateAcc/completeAcc problem. Also if there's a cycle (like there
> is in our pipeline) where you do x,y,z,x,y,z,x,y,z then as Gabe points
> out you can run into stack overflow problems too.
>
> My hypothesis is that the old TimingSimpleCPU code worked because it
> always did the former, and Gabe has introduced two points that do the
> latter: one in timingTranslate(), and one in fetch(). I think the
> right solution is that for each of these we should either change it
> into the first model or eliminate the bypass option altogether and
> always do a separately scheduled callback.
>
> I think the distinction of having main() call y() directly rather than
> x_cb() is potentially important, as this gives you points where you
> can do slightly different things depending on whether you did the
> event or bypassed it. It also (to me) provides some logical
> separation between "what comes next" (the code in y()) and how you got
> there.
>
> Coming at this from a different angle, while the code is getting
> increasingly messy (or maybe just inherently complex), I'd say a
> significant fraction of the complexity is dealing with
> cache/page-crossing memory operations, which I don't think would be
> significantly improved by a global restructuring. (Let me know if
> anyone thinks otherwise.) Thus I'm not too keen on doing a
> significant restructuring since I think the code will still be messy
> afterward.
>
> On Wed, May 6, 2009 at 11:42 AM, Gabriel Michael Black
> <[email protected] <mailto:[email protected]>> wrote:
>
> The example I mentioned would be if
> you have a microcode loop that doesn't touch memory to, for instance,
> stall until you get an interrupt or a countdown expires for a small
> delay.
>
>
> Although I agree that it's good to avoid this possibility altogether,
> I'd argue that any microcode loop like you describe is broken. If for
> no other reason than power dissipation I don't think you'd ever want
> to busy-wait in a real system, and certainly even if you did we
> wouldn't want to write it that way in m5 for performance reasons.
>
> Steve
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> m5-dev mailing list
> [email protected]
> http://m5sim.org/mailman/listinfo/m5-dev
>