Hi Stijn,

Glad to hear you're making progress. It's always the case that once you really get into something you run into issues that you hadn't anticipated, so I'm glad you're out there blazing the trail. I hadn't really given atomic mode any thought as far as parallelization, so I'm glad you brought it up. Now that you have, I think timing and atomic modes are sufficiently different that they'll have to be treated separately.

For timing mode, our vision has always been that the interaction between threads would happen inside of certain specially written SimObjects and not at the port interface. As it looks like you're finding out, parallelization has enough complications that it's not clear you can handle it at all (much less efficiently) in a generic fashion between arbitrary objects.

For the situation you describe (parallelizing across CPUs with their private caches, talking to a shared cache or memory), you'd want to redo the Bus object to enable parallelism. The good news is that very little if anything outside the Bus should have to change.

The basic idea is that the Bus itself would be written to allow the recvTiming methods on its various ports to be called concurrently from different threads, and would schedule events that call sendTiming on the same event queue that's used for the object on the other side of the port so that the event gets processed from the same thread. Thus all interactions across ports occur on a single thread. I'd expect that the recvTiming method would typically just save the sent packet in some internal Bus buffer so that some other thread can look at it later. The ack/nack return value of a sendTiming to the Bus would then depend solely on state that the calling thread could safely examine (e.g., the availability of a per-thread buffer inside the bus). Of course there are a lot of details that remain to be fleshed out here, but I still think this basic structure is the best starting place. Another benefit of this approach is that the bus has a latency that can be used to provide some look-ahead to reduce inter-thread synchronization.

For atomic mode, things are of course very different. I think you want to keep the model of having a single thread traversing through the entire transaction across multiple objects; that's what makes atomic mode fast, and if you're going to give up on that then you might as well just do timing mode (in my opinion).

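Going back to the timing-mode Bus for a moment, here is a rough sketch of the structure described above, just to make it concrete. Everything in it is made up for illustration: SketchBus, SketchEventQueue, SketchPacket and the attach/deliver names are hypothetical and are not m5 classes or m5's port API. It assumes each source port is only ever driven by one thread, and it leaves out bus latency and retries entirely.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for a packet (not m5's Packet class).
struct SketchPacket { std::size_t src; std::size_t dest; unsigned long addr; };

// Hypothetical per-thread event queue: any thread may schedule onto it,
// but only the owning thread is expected to call serviceAll().
class SketchEventQueue {
  public:
    void schedule(std::function<void()> ev) {
        std::lock_guard<std::mutex> g(lock);
        pending.push_back(std::move(ev));
    }
    std::size_t serviceAll() {
        std::deque<std::function<void()>> batch;
        { std::lock_guard<std::mutex> g(lock); batch.swap(pending); }
        for (auto &ev : batch) ev();   // events run on the servicing thread
        return batch.size();
    }
  private:
    std::mutex lock;
    std::deque<std::function<void()>> pending;
};

class SketchBus {
  public:
    SketchBus(std::size_t nports, int slots)
        : ports(nports), slotsPerPort(slots) {}

    // 'deliver' stands in for calling sendTiming on the peer of port 'id';
    // 'q' is the event queue of the thread that owns that peer.
    void attach(std::size_t id, SketchEventQueue *q,
                std::function<void(const SketchPacket &)> deliver) {
        ports[id].queue = q;
        ports[id].deliver = std::move(deliver);
    }

    // Called on the sender's thread. The ack/nack depends only on the
    // sender's own buffer count; other threads only ever decrement it,
    // so check-then-increment is safe with one sending thread per port.
    bool recvTiming(const SketchPacket &pkt) {
        Port &src = ports[pkt.src];
        if (src.inFlight.load() >= slotsPerPort)
            return false;                          // nack: buffer full
        src.inFlight.fetch_add(1);
        // Hand off to the destination's queue so delivery happens on the
        // thread that owns the object on the other side of the port.
        ports[pkt.dest].queue->schedule([this, pkt] {
            ports[pkt.dest].deliver(pkt);
            ports[pkt.src].inFlight.fetch_sub(1);  // free the sender's slot
        });
        return true;
    }

  private:
    struct Port {
        SketchEventQueue *queue = nullptr;
        std::function<void(const SketchPacket &)> deliver;
        std::atomic<int> inFlight{0};
    };
    std::vector<Port> ports;
    int slotsPerPort;
};

// Single-threaded walk-through; serviceAll() plays the "memory thread".
int demoDelivered() {
    SketchEventQueue cpuQueue, memQueue;
    SketchBus bus(2, 1);                           // 2 ports, 1 slot each
    int delivered = 0;
    bus.attach(0, &cpuQueue, [](const SketchPacket &) {});
    bus.attach(1, &memQueue, [&delivered](const SketchPacket &) { ++delivered; });
    bool first = bus.recvTiming({0, 1, 0x100});    // accepted
    bool second = bus.recvTiming({0, 1, 0x140});   // nacked, slot still busy
    memQueue.serviceAll();                         // delivery frees the slot
    bool third = bus.recvTiming({0, 1, 0x140});    // accepted this time
    memQueue.serviceAll();
    return (first && !second && third) ? delivered : -1;
}
```

The point of the sketch is only that the sender never blocks waiting on the other thread: it either gets a slot or gets a nack based on state it can safely read. A real version would also need to send a retry to the sender once its slot frees up, and would schedule the delivery event at the current tick plus the bus latency.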
Off the top of my head, the easiest thing here would be to put a lock on every object, and have recvTiming() (or something immediately downstream of that) acquire the lock before proceeding. Some objects might not need locking; for example, if we simplify/relax the way the bus does cycle accounting in atomic mode (which isn't really used anyway), you could probably avoid acquiring a lock just to traverse the bus in the common case.

Unfortunately a naive implementation of this approach would deadlock on snoops, I think, since two CPUs would lock their L1s to do an access, both generate a miss, and then want to lock everyone else's L1 to handle the snoop on the miss. You might be able to solve this with a more sophisticated locking protocol; for example, let's say there's a lock on the bus after all, and if you acquire the lock on your L1 cache but miss and the lock on the bus is already taken then you have to relinquish your L1 cache lock, acquire the bus lock, then restart your L1 access. (You could also suspend the L1 access and continue it after you acquire the bus lock, but then you'd have to deal with interleavings that would get ugly and destroy another advantage of atomic mode, so I'd just start over from the top.)

So a general rule might be that you have to acquire locks starting at the point closest to main memory and work backward from there, and when you find that you need another lock that's further out (and that lock is already held by someone else) you have to release all your other locks and start over. You'd have to generalize that somewhat for NUMA systems that don't have a single centralized main memory.

As far as your more specific questions:

- Returning 'false' to a sendTiming to handle deadlock won't work in general, as there are some messages (like snoop responses, IIRC) that aren't allowed to be nacked in that fashion for protocol deadlock reasons.
- There aren't (or at least shouldn't be) any data pointers that are shared in an uncontrolled fashion between MemObjects. It is legal for a requester to put a pointer into the data field of a packet that points to its own internal storage (so that the responder can deposit it directly there), but the requester shouldn't be looking at that buffer until it gets a response, and the responder shouldn't be holding on to that pointer after it's sent the response. Other than that case, I don't believe there's any way for MemObjects to share pointers at all.

Hope that helps... feel free to continue this discussion; we're all excited about the possibility of parallelizing the simulator and are glad to help.

Steve

On Thu, Feb 25, 2010 at 8:06 AM, <[email protected]> wrote:
> Hello again,
>
> It's been a while since my last email and I've made some progress in getting
> parts of m5 to run in parallel but I've reached a critical phase and have some
> questions of which I'm sure some of the people reading will be able to help me
> with.
>
> My main objective has always been to get CPU cores and their private caches to
> get simulated in parallel as well as the rest of the shared memory. The
> easiest way to do this is to place an element in between port interfaces that
> handles concurrency by basically forwarding member calls to the port on one
> side to the peer-port on the other side by means of a remotely processed
> event. Kind of like a remote procedure call.
>
> The problem is that these procedures have a return value, with the exception
> of functional calls. The only way to generically maintain consistency is to
> block the call on one thread and wait for the return value from the other
> thread. However if two calls from opposing ports are executed at the same time
> you will get a deadlock. This clearly is a big problem since in general you
> can't interrupt one of the calls.
>
> This makes parallel atomic calls pretty much impossible between private caches
> unless I rewrote the port interface and the implementing classes or had some
> guarantee that the state of the objects remained consistent if for example one
> atomic call got executed while another was blocking.
>
> Timing calls are interesting. They return a value as well but it only signals
> point-to-point acceptance. So in theory, in case of a deadlock I could simply
> return false and send a retry a few ticks later after which the call would
> start over. However the semantics of "false" and its effect on the simulation
> are unclear to me. I would like to know if this could have an effect on the
> accuracy or even functional correctness? Maybe it would even be possible to
> return true when a deadlock is detected and handle the retry separately in
> case the remote end would return false, this would be more efficient.
>
> If parallelizing timing calls is also impossible this way then I'm going to
> have to recode some large, complicated chunks of m5 so I'm hoping it won't
> come to that.
>
> Assuming that the former problem has been solved the question remains if parts
> of the memory system can even safely run concurrently since I'm guessing
> pointers to data are shared in between MemObjects. In theory the cache
> coherence protocol should prohibit concurrent, incoherent read/writes but I
> don't know the code that well.
>
> thanks in advance,
>
> Stijn
>
> ----------------------------------------------------------------
> This message was sent using IMP, the Internet Messaging Program.
>
> _______________________________________________
> m5-dev mailing list
> [email protected]
> http://m5sim.org/mailman/listinfo/m5-dev
