Hi Stijn,

Glad to hear you're making progress.  It's always the case that once
you really get into something you run into issues that you hadn't
anticipated, so it's good that you're out there blazing the trail.  I
hadn't really given atomic mode any thought as far as parallelization,
so thanks for bringing it up.  Now that you have, I think timing and
atomic modes are sufficiently different that they'll have to be treated
separately.

For timing mode, our vision has always been that the interaction
between threads would happen inside of certain specially written
SimObjects and not at the port interface.  As it looks like you're
finding out, parallelization has enough complications that it's not
clear you can handle it at all (much less efficiently) in a generic
fashion between arbitrary objects.  For the situation you describe
(parallelizing across CPUs with their private caches, talking to a
shared cache or memory), you'd want to redo the Bus object to enable
parallelism.  The good news is that very little if anything outside
the Bus should have to change.  The basic idea is that the Bus itself
would be written to allow the recvTiming methods on its various ports
to be called concurrently from different threads, and would schedule
events that call sendTiming on the same event queue that's used for
the object on the other side of the port so that the event gets
processed from the same thread.  Thus all interactions across ports
occur on a single thread.  I'd expect that the recvTiming method would
typically just save the sent packet in some internal Bus buffer so
that some other thread can look at it later.  The ack/nack return
value of a sendTiming to the Bus would then depend solely on state
that the calling thread could safely examine (e.g., the availability
of a per-thread buffer inside the bus).  Of course there are a lot of
details that remain to be fleshed out here, but I still think this
basic structure is the best starting place.
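
To make the structure above a bit more concrete, here is a minimal
sketch.  Everything in it is hypothetical and invented for illustration
(Packet, EventQueue, ParallelBus, attach, serviceAll); none of it is the
real m5 Bus or event queue code.  It assumes each port is driven by
exactly one thread, so a sender is the only thread that ever decrements
its own buffer count.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; not the real m5 classes.
struct Packet { std::size_t src; std::size_t dest; int payload; };

// An event queue that any thread may schedule onto, but that only its
// owning simulation thread drains.
class EventQueue {
  public:
    void schedule(std::function<void()> ev) {
        std::lock_guard<std::mutex> guard(mtx);
        events.push_back(std::move(ev));
    }
    // Run everything queued so far; returns the number of events run.
    std::size_t serviceAll() {
        std::deque<std::function<void()>> ready;
        {
            std::lock_guard<std::mutex> guard(mtx);
            ready.swap(events);
        }
        for (auto &ev : ready) ev();
        return ready.size();
    }
  private:
    std::mutex mtx;
    std::deque<std::function<void()>> events;
};

class ParallelBus {
  public:
    using Receiver = std::function<void(const Packet &)>;

    ParallelBus(std::size_t numPorts, int slotsPerPort)
        : queues(numPorts), receivers(numPorts), freeSlots(numPorts) {
        assert(slotsPerPort > 0);
        for (auto &s : freeSlots) s.store(slotsPerPort);
    }

    // Bind a port to the event queue (and thus the thread) of the
    // object on the other side of that port.
    void attach(std::size_t port, EventQueue *eq, Receiver r) {
        queues[port] = eq;
        receivers[port] = std::move(r);
    }

    // May be called concurrently for different source ports.  The
    // ack/nack result depends only on the sender's own buffer count:
    // only the sender's thread decrements it, other threads only
    // increment it, so a nonzero reading cannot become stale.
    bool recvTiming(const Packet &pkt) {
        std::atomic<int> &slots = freeSlots[pkt.src];
        if (slots.load() == 0) return false;   // nack: buffer full
        slots.fetch_sub(1);
        // Deliver on the destination's queue so the receiver only ever
        // runs on its own thread, then free the sender's buffer slot.
        queues[pkt.dest]->schedule([this, pkt] {
            receivers[pkt.dest](pkt);
            freeSlots[pkt.src].fetch_add(1);
        });
        return true;                           // ack
    }

  private:
    std::vector<EventQueue *> queues;
    std::vector<Receiver> receivers;
    std::vector<std::atomic<int>> freeSlots;
};
```

The receiver callback stands in for the event that would call
sendTiming on the destination port.  Retries, bus occupancy and
latency are left out entirely.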

Another benefit of this approach is that the bus has a latency that
can be used to provide some look-ahead to reduce inter-thread
synchronization.

For atomic mode, things are of course very different.  I think you
want to keep the model of having a single thread traversing through
the entire transaction across multiple objects; that's what makes
atomic mode fast, and if you're going to give up on that then you
might as well just do timing mode (in my opinion).  Off the top of my
head, the easiest thing here would be to put a lock on every object,
and have recvAtomic() (or something immediately downstream of that)
acquire the lock before proceeding.  Some objects might not need
locking; for example, if we simplify/relax the way the bus does cycle
accounting in atomic mode (which isn't really used anyway), you could
probably avoid acquiring a lock just to traverse the bus in the common
case.
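
As a rough sketch of the lock-per-object idea, assuming an atomic-mode
entry point called recvAtomic that returns a latency: the class names,
latencies and the runTwoCpus helper below are all made up for
illustration and are not m5 code.  There are no snoops here, so locks
are only ever taken in the order cache then memory and this simple
version cannot deadlock.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <mutex>
#include <thread>
#include <unordered_map>

using Tick = std::uint64_t;

// Hypothetical latencies, chosen only for the example.
constexpr Tick kCacheLatency = 2;
constexpr Tick kMemLatency = 100;

struct AtomicRequest { std::uint64_t addr; bool isWrite; int data; };

// Shared memory: one lock guards all of its state.
class LockedMemory {
  public:
    Tick recvAtomic(AtomicRequest &req) {
        std::lock_guard<std::mutex> guard(lock);
        if (req.isWrite) store[req.addr] = req.data;
        else req.data = store[req.addr];
        ++accesses;
        return kMemLatency;
    }
    int accessCount() {
        std::lock_guard<std::mutex> guard(lock);
        return accesses;
    }
  private:
    std::mutex lock;
    std::unordered_map<std::uint64_t, int> store;
    int accesses = 0;
};

// Private cache that always misses, to keep the sketch short.  The
// calling thread holds the cache lock and carries the whole
// transaction down to memory and back before returning.
class LockedCache {
  public:
    explicit LockedCache(LockedMemory &m) : mem(m) {}
    Tick recvAtomic(AtomicRequest &req) {
        std::lock_guard<std::mutex> guard(lock);
        return kCacheLatency + mem.recvAtomic(req);
    }
  private:
    std::mutex lock;
    LockedMemory &mem;
};

// Two "CPU" threads, each with a private cache, sharing one memory.
// Returns the number of accesses the memory observed.
int runTwoCpus(int writesPerCpu) {
    LockedMemory mem;
    LockedCache l1a(mem), l1b(mem);
    auto cpu = [writesPerCpu](LockedCache &l1, std::uint64_t base) {
        for (int i = 0; i < writesPerCpu; ++i) {
            AtomicRequest req{base + i, true, i};
            l1.recvAtomic(req);
        }
    };
    std::thread t0(cpu, std::ref(l1a), std::uint64_t{0x1000});
    std::thread t1(cpu, std::ref(l1b), std::uint64_t{0x2000});
    t0.join();
    t1.join();
    return mem.accessCount();
}
```

Calling runTwoCpus(n) should report 2*n memory accesses, since every
access is serialized by the memory's lock.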

Unfortunately a naive implementation of this approach would deadlock
on snoops, I think, since two CPUs would lock their L1s to do an
access, both generate a miss, and then want to lock everyone else's L1
to handle the snoop on the miss.  You might be able to solve this with
a more sophisticated locking protocol; for example, let's say there's
a lock on the bus after all, and if you acquire the lock on your L1
cache but miss and the lock on the bus is already taken then you have
to relinquish your L1 cache lock, acquire the bus lock, then restart
your L1 access.  (You could also suspend the L1 access and continue it
after you acquire the bus lock, but then you'd have to deal with
interleavings that would get ugly and destroy another advantage of
atomic mode, so I'd just start over from the top.)  So a general rule
might be that you have to acquire locks starting at the point closest
to main memory and work backward from there, and when you find that
you need another lock that's further out (and that lock is already
held by someone else) you have to release all your other locks and
start over.  You'd have to generalize that somewhat for NUMA systems
that don't have a single centralized main memory.
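
Here is a small sketch of the release-and-restart protocol for the
L1/bus example above.  The types and the atomicAccess function are
hypothetical, the "coherence protocol" is just invalidate-on-miss, and
there is a single bus with no further levels, so treat it as an
illustration of the locking order only.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_set>
#include <vector>

// Hypothetical structures for illustration; not m5 code.
struct L1Cache {
    std::mutex lock;
    std::unordered_set<std::uint64_t> lines;   // addresses present
};

struct System {
    explicit System(std::size_t numCpus) : l1s(numCpus) {}
    std::mutex busLock;
    std::vector<L1Cache> l1s;
    std::atomic<int> restarts{0};   // how often an access started over
};

// Returns true on an L1 hit, false on a miss (after filling the line).
bool atomicAccess(System &sys, std::size_t cpu, std::uint64_t addr) {
    L1Cache &mine = sys.l1s[cpu];
    std::unique_lock<std::mutex> l1(mine.lock);
    if (mine.lines.count(addr)) return true;            // hit: L1 lock only

    // Miss: we need the bus.  Never block on it while holding our L1.
    std::unique_lock<std::mutex> bus(sys.busLock, std::try_to_lock);
    if (!bus.owns_lock()) {
        l1.unlock();        // let the current bus owner snoop our L1
        bus.lock();         // wait for the bus with no other locks held
        l1.lock();          // safe: snoopers must hold the bus, and we do
        ++sys.restarts;
        // Restart the access from the top; our L1 may have changed.
        if (mine.lines.count(addr)) return true;
    }

    // We hold the bus and our own L1.  Any other L1's owner either
    // finishes a hit or backs off as above, so these locks are
    // eventually granted.
    for (std::size_t i = 0; i < sys.l1s.size(); ++i) {
        if (i == cpu) continue;
        std::lock_guard<std::mutex> snoop(sys.l1s[i].lock);
        sys.l1s[i].lines.erase(addr);                   // invalidate
    }
    mine.lines.insert(addr);
    return false;
}
```

The key property is that a thread only ever blocks on an L1 lock while
it holds the bus lock, and only ever blocks on the bus lock while it
holds nothing else.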

As far as your more specific questions:
- Returning 'false' to a sendTiming to handle deadlock won't work in
general, as there are some messages (like snoop responses, IIRC) that
aren't allowed to be nacked in that fashion for protocol deadlock
reasons.
- There aren't (or at least shouldn't be) any data pointers that are
shared in an uncontrolled fashion between MemObjects.  It is legal for
a requester to put a pointer into the data field of a packet that
points to its own internal storage (so that the responder can deposit
it directly there), but the requester shouldn't be looking at that
buffer until it gets a response, and the responder shouldn't be
holding on to that pointer after it's sent the response.  Other than
that case, I don't believe there's any way for MemObjects to share
pointers at all.

Hope that helps... feel free to continue this discussion; we're all
excited about the possibility of parallelizing the simulator and are
glad to help.

Steve

On Thu, Feb 25, 2010 at 8:06 AM,  <[email protected]> wrote:
> Hello again,
>
> It's been a while since my last email and I've made some progress in getting
> parts of m5 to run in parallel but I've reached a critical phase and have some
> questions of which I'm sure some of the people reading will be able to help me
> with.
>
> My main objective has always been to get CPU cores and their private caches to
> get simulated in parallel as well as the rest of the shared memory. The
> easiest way to do this is to place an element in between port interfaces that
> handles concurrency by basically forwarding member calls to the port on one
> side to the peer-port on the other side by means of a remotely processed
> event. Kind of like a remote procedure call.
>
> The problem is that these procedures have a return value, with the exception
> of functional calls. The only way to generically maintain consistency is to
> block the call on one thread and wait for the return value from the other
> thread. However if two calls from opposing ports are executed at the same time
> you will get a deadlock. This clearly is a big problem since in general you
> can't interrupt one of the calls.
>
> This makes parallel atomic calls pretty much impossible between private caches
> unless I rewrote the port interface and the implementing classes or had some
> guarantee that the state of the objects remained consistent if for example one
> atomic call got executed while another was blocking.
>
> Timing calls are interesting. They return a value as well but it only signals
> point-to-point acceptance. So in theory, in case of a deadlock I could simply
> return false and send a retry a few ticks later after which the call would
> start over. However the semantics of "false" and its effect on the simulation
> are unclear to me. I would like to know if this could have an effect on the
> accuracy or even functional correctness? Maybe it would even be possible to
> return true when a deadlock is detected and handle the retry separately in
> case the remote end would return false, this would be more efficient.
>
> If parallelizing timing calls is also impossible this way then I'm going to
> have to recode some large, complicated chunks of m5 so I'm hoping it won't
> come to that.
>
> Assuming that the former problem has been solved the question remains if parts
> of the memory system can even safely run concurrently since I'm guessing
> pointers to data are shared in between MemObjects. In theory the cache
> coherence protocol should prohibit concurrent, incoherent read/writes but I
> don't know the code that well.
>
> thanks in advance,
>
> Stijn
>
> ----------------------------------------------------------------
> This message was sent using IMP, the Internet Messaging Program.
>
> _______________________________________________
> m5-dev mailing list
> [email protected]
> http://m5sim.org/mailman/listinfo/m5-dev
>
