On Fri, Feb 26, 2010 at 6:06 AM, <[email protected]> wrote:
> Quoting Steve Reinhardt <[email protected]>:
>
>> For timing mode, our vision has always been that the interaction
>> between threads would happen inside of certain specially written
>> SimObjects and not at the port interface. As it looks like you're
>> finding out, parallelization has enough complications that it's not
>> clear you can handle it at all (much less efficiently) in a generic
>> fashion between arbitrary objects. For the situation you describe
>> (parallelizing across CPUs with their private caches, talking to a
>> shared cache or memory), you'd want to redo the Bus object to enable
>> parallelism. The good news is that very little if anything outside
>> the Bus should have to change. The basic idea is that the Bus itself
>> would be written to allow the recvTiming methods on its various ports
>> to be called concurrently from different threads, and would schedule
>> events that call sendTiming on the same event queue that's used for
>> the object on the other side of the port so that the event gets
>> processed from the same thread. Thus all interactions across ports
>> occur on a single thread. I'd expect that the recvTiming method would
>> typically just save the sent packet in some internal Bus buffer so
>> that some other thread can look at it later. The ack/nack return
>> value of a sendTiming to the Bus would then depend solely on state
>> that the calling thread could safely examine (e.g., the availability
>> of per-thread buffer inside the bus). Of course there are a lot of
>> details that remain to be fleshed out here, but I still think this
>> basic structure is the best starting place.
>
>
> If I understand correctly, this solution implies that the caches will
> block when calling the bus port and that the bus is rewritten so that
> it can handle asynchronous returns of true or false when such a
> port-calling event is processed.
> This is suboptimal in the sense that the cache would still be waiting
> while its event queue could process other useful events; however, in
> practice it will probably be an acceptable overhead and possibly the
> preferred solution, since it might benefit accuracy as well. I think
> it would have been great if the recvTiming result were not a return
> value but an asynchronously returned nack packet. That way neither
> side would need to block, which could improve performance, and since
> the memory system would likely reply quickly, accuracy wouldn't be a
> problem. Sadly this isn't the case, which complicates matters.

I'm not sure I'm understanding your concern. In the method I'm
proposing, the cache (or more precisely the thread that the cache event
is running on) would not block when it sent a request to the bus. The
call through the port into the bus via sendTiming()/recvTiming() would
happen on the same thread without any (non-trivial) blocking. In the
typical case, the recvTiming() call would record the packet and the
timestamp of its arrival somewhere, probably in some thread-safe data
structure which would require locking (hence possibly some trivial
waiting if the data-structure lock is already held), then return to the
cache. There would be no heavyweight two-way handshake with another
thread though. If there is no storage in the bus object to record the
packet's arrival, the bus would return false and have to schedule a
retry later. (No different than it does today.) Ideally the storage in
the bus would be such that this lack of storage would correspond to
situations in which the bus is architecturally blocked as well so we
wouldn't be introducing anything artificial. (Again, just as it is
today.)

> There is an additional problem that comes to mind. The curTick
> context. Since my research focuses on relaxed synchronization, calls
> will reach the bus out of order and past events might get processed
> after future events. [...]

Yes, this is the really fun part... of course there is a fair amount of
literature on parallel discrete-event simulation that applies, and lots
of known concepts like conservative vs. optimistic synchronization,
lookahead, null messages, etc. I agree there's still room for new
thinking & understanding on accuracy/performance tradeoffs,
particularly since the CMP-on-CMP domain is different from what a lot
of the prior work has looked at.
Our assumption when we've talked about this has been that we'd just do
a conservative simulation with relatively frequent synchronization
(based on the limited lookahead you can get on a bus), and hypothesize
that on a CMP the sync overhead wouldn't be that bad. I think it's
worth doing that as a baseline and then seeing to what extent that
hypothesis is true before diving into a more complex technique.

> A solution could be to forward the scheduling time of past events
> scheduled by a slow root queue on some fast concurrent other queue,
> track all child-events spawning from its processing (together they
> would form a chain of events, actually more like a tree of events),
> while saving the original root time, and then shift back the
> scheduling time of those events in the chain which get scheduled back
> on the original root queue. Similarly, future events originating from
> fast threads would get scheduled in the past. Local time shouldn't
> matter only the time an event chain spent in a certain timezone. I
> hope this makes sense.

I think a basic approach that simply uses synchronization to prevent
fast threads from getting too far ahead is the best place to start
(where how you define "too far ahead" is basically your
performance/accuracy tradeoff).

>> For atomic mode, things are of course very different. [...]
>
> Doesn't this require a sort of transaction-like cache operation which
> could get rolled back when a deadlock gets detected? An easier way
> would be to detect a miss before an atomic call is made to the L1
> cache (by means of functional calls, or some other form of cache
> inspection), that way we could lock before we change L1 state? Is
> there any easy way to detect this with minor modifications to the
> cache implementation?

Yes, that's right; all the code that could get retried would have to be
idempotent. I was ignoring this before because I believe that as far as
the operational cache state goes (tags, etc.)
this is probably already true (though I haven't checked). Now that you
mention it though, I see that that's definitely not true for
statistics. One possible solution is just to have a retry flag that
indicates if this is a retry, and put all the stats updates inside an
"if (!retry)" block so they are counted only on the first attempt. I
think the two-pass thing is less desirable because (1) functional
accesses don't really understand the coherence protocol and (2) most of
your accesses will be L1 hits, and I'd guess that most L1 misses won't
encounter contention, so for performance you really want to
optimistically assume that it's going to be fine the first time and
shift the correctness burden to the less common case of having to
retry.

>> As far as your more specific questions:
>> - Returning 'false' to a sendTiming to handle deadlock won't work in
>> general, as there are some messages (like snoop responses, IIRC) that
>> aren't allowed to be nacked in that fashion for protocol deadlock
>> reasons.
>
> What about always returning true and handling false separately?

I don't follow... you have to have some low-level flow control on the
port, and if the packet has been accepted, it's good to know ASAP that
the sender is done with it. In a sense you can't return true and then
later change your mind to false; you have to always return false (don't
assume it's been accepted) if you're always going to return the same
thing. Maybe in light of the further discussion above you'll agree that
this isn't the crux of the problem.

Steve
_______________________________________________
m5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/m5-dev
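
For readers who want something concrete, the bus structure described in
this message (recvTiming() records the packet and its arrival time under
a lock, and returns false when there is no storage) might be sketched
roughly as below. This is only an illustration and not M5 code:
ParallelBusSketch, Packet, Arrival, popOldest and the per-port slot
count are all hypothetical names and parameters invented here, and the
real port interface has a different signature.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <vector>

using Tick = std::uint64_t;

// Hypothetical stand-in for the simulator's packet type.
struct Packet { int id; };

// A packet plus the sender's local time when it reached the bus.
struct Arrival { Packet pkt; Tick when; };

class ParallelBusSketch {
  public:
    ParallelBusSketch(std::size_t numPorts, std::size_t slotsPerPort)
        : slots(slotsPerPort), ports(numPorts) {}

    // Runs on the sender's thread. The result depends only on this
    // port's own buffer, so no handshake with another thread is needed.
    bool recvTiming(std::size_t port, const Packet &pkt, Tick now) {
        PortState &p = ports[port];
        std::lock_guard<std::mutex> guard(p.lock);
        if (p.pending.size() >= slots) {
            p.needsRetry = true;  // remember that a retry is owed
            return false;         // nack: no storage available
        }
        p.pending.push_back(Arrival{pkt, now});
        return true;
    }

    // Runs on the bus's own thread: removes the buffered packet with
    // the earliest arrival time. Returns false if nothing is buffered.
    // retryPort is set to the drained port if its sender was nacked
    // earlier, otherwise to the number of ports ("no retry owed").
    bool popOldest(Arrival &out, std::size_t &retryPort) {
        std::size_t best = ports.size();
        Tick bestTick = 0;
        for (std::size_t i = 0; i < ports.size(); ++i) {
            std::lock_guard<std::mutex> guard(ports[i].lock);
            if (ports[i].pending.empty()) continue;
            Tick t = ports[i].pending.front().when;
            if (best == ports.size() || t < bestTick) {
                best = i;
                bestTick = t;
            }
        }
        retryPort = ports.size();
        if (best == ports.size()) return false;
        // Only this thread pops, so the chosen front entry is still there.
        std::lock_guard<std::mutex> guard(ports[best].lock);
        out = ports[best].pending.front();
        ports[best].pending.pop_front();
        if (ports[best].needsRetry) {
            ports[best].needsRetry = false;
            retryPort = best;
        }
        return true;
    }

  private:
    struct PortState {
        std::mutex lock;
        std::deque<Arrival> pending;
        bool needsRetry = false;
    };
    std::size_t slots;
    std::vector<PortState> ports;
};
```

With one slot per port, a second recvTiming() on the same port returns
false until the bus thread drains the first packet, at which point
popOldest() reports that the port is owed a retry. The sketch says
nothing about how far apart the senders' timestamps are allowed to
drift; that is the synchronization question discussed in the thread.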

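
The retry-flag idea for statistics in atomic mode could look something
like the following. Again this is a hedged sketch rather than the actual
cache code: L1Sketch, CacheStatsSketch, access() and fill() are
hypothetical names, and the tag store is reduced to a set of addresses.
The point it illustrates is that the tag lookup can be replayed safely,
while the counters are touched only on the first attempt (when the retry
flag is not set).

```cpp
#include <cassert>
#include <cstdint>
#include <set>

using Addr = std::uint64_t;

// Hypothetical statistics block; real caches track far more.
struct CacheStatsSketch {
    unsigned accesses = 0;
    unsigned misses = 0;
};

class L1Sketch {
  public:
    // One attempt at an access. isRetry is true when an earlier attempt
    // at this same access was abandoned and is now being replayed.
    bool access(Addr addr, bool isRetry) {
        bool hit = tags.count(addr) != 0;  // lookup has no side effects
        if (!isRetry) {
            // Count each logical access exactly once.
            ++stats.accesses;
            if (!hit) ++stats.misses;
        }
        return hit;
    }

    // Installing a line is idempotent: inserting twice changes nothing.
    void fill(Addr addr) { tags.insert(addr); }

    CacheStatsSketch stats;

  private:
    std::set<Addr> tags;
};
```

Replaying a missed access with isRetry set leaves the counters at one
access and one miss, which matches the optimistic scheme above: the
common first attempt pays nothing extra, and only the rare retry path
has to carry the flag.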