Re: [m5-dev] [Parallelizing m5] memory questions

Stijn . Souffriau Fri, 26 Feb 2010 06:06:21 -0800

Quoting Steve Reinhardt <[email protected]>:

> For timing mode, our vision has always been that the interaction
> between threads would happen inside of certain specially written
> SimObjects and not at the port interface.  As it looks like you're
> finding out, parallelization has enough complications that it's not
> clear you can handle it at all (much less efficiently) in a generic
> fashion between arbitrary objects.  For the situation you describe
> (parallelizing across CPUs with their private caches, talking to a
> shared cache or memory), you'd want to redo the Bus object to enable
> parallelism.  The good news is that very little if anything outside
> the Bus should have to change.  The basic idea is that the Bus itself
> would be written to allow the recvTiming methods on its various ports
> to be called concurrently from different threads, and would schedule
> events that call sendTiming on the same event queue that's used for
> the object on the other side of the port so that the event gets
> processed from the same thread.  Thus all interactions across ports
> occur on a single thread.  I'd expect that the recvTiming method would
> typically just save the sent packet in some internal Bus buffer so
> that some other thread can look at it later.  The ack/nack return
> value of a sendTiming to the Bus would then depend solely on state
> that the calling thread could safely examine (e.g., the availability
> of per-thread buffer inside the bus).  Of course there are a lot of
> details that remain to be fleshed out here, but I still think this
> basic structure is the best starting place.

If I understand correctly, this solution implies that the caches will
block when calling the bus port, and that the bus is rewritten so that
it can handle asynchronous returns of true or false when such a
port-calling event is processed. This is suboptimal in the sense that
the cache would still be waiting while its event queue could process
other useful events. In practice, however, it will probably be an
acceptable overhead, and possibly the preferred solution, since it
might benefit accuracy as well. I think it would have been great if
the recvTiming result were not a return value but an asynchronously
returned nack packet. That way neither side would need to block, which
could improve performance, and since the memory system would likely
reply quickly, accuracy wouldn't be a problem. Sadly this isn't the
case, which complicates matters.
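
To make the asynchronous-nack idea concrete, here is a minimal
single-threaded sketch in Python. It is not m5 code: the Bus and Cache
classes, recv_timing, inbox and the buffer capacity are all
hypothetical names chosen for illustration. In a real parallel setup
the inbox would have to be a thread-safe queue owned by the receiving
thread.

```python
from collections import deque

class Bus:
    """Toy bus: buffers packets per port and reports rejection by message."""
    def __init__(self, num_ports, capacity):
        self.buffers = [deque() for _ in range(num_ports)]
        self.capacity = capacity  # hypothetical per-port buffer size

    def recv_timing(self, port, packet, sender):
        # No return value: the sender never blocks waiting for true/false.
        if len(self.buffers[port]) < self.capacity:
            self.buffers[port].append(packet)
        else:
            # Rejection travels back as an asynchronous nack message.
            sender.inbox.append(("nack", packet))

class Cache:
    """Toy L1 cache that assumes acceptance until a nack arrives."""
    def __init__(self):
        self.inbox = deque()   # messages delivered by other objects
        self.in_flight = []    # packets assumed to be accepted
        self.retry = []        # packets that were nacked and must be resent

    def send(self, bus, port, packet):
        self.in_flight.append(packet)
        bus.recv_timing(port, packet, self)

    def process_inbox(self):
        # Runs later, as an ordinary event on the cache's own queue.
        while self.inbox:
            kind, packet = self.inbox.popleft()
            if kind == "nack":
                self.in_flight.remove(packet)
                self.retry.append(packet)

bus = Bus(num_ports=1, capacity=1)
l1 = Cache()
l1.send(bus, 0, "read A")   # fits in the buffer
l1.send(bus, 0, "read B")   # buffer full, so a nack is queued
l1.process_inbox()
```

After process_inbox(), "read A" is still in flight and "read B" sits in
the retry list, without the cache ever having waited on the bus.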

There is an additional problem that comes to mind: the curTick
context. Since my research focuses on relaxed synchronization, calls
will reach the bus out of order, and past events might get processed
after future events. (Keep in mind that some threads could temporarily
advance faster than the others, if you let them.) In any case, this
leads to tiny, manageable inaccuracies. Furthermore, the curTick
context in which calls from the different L1 caches to the bus are
processed can't simply be that of the calling thread: individual
memory components might save a curTick value at some point, which
could lead to inconsistencies. Calls to ports might also lead to the
creation of chains of events which, given their different time frames,
could lead to inconsistent interleaving and all sorts of nasty side
effects.

Even if this weren't a problem and functional correctness were somehow
guaranteed (which might very well be the case, I have no idea!),
accuracy could still suffer in two ways. First, the processing of past
events, e.g. those scheduled by one lagging L1 cache simulating
thread, could add "unreal" delay to the simulation of future events
scheduled by fast L1 threads. (Example: a faster cache might have to
wait such an unreasonable amount of time for a reply that it would get
the reply in its past as well.) Second, future-event bus locking might
add unreal delay to past-event locking attempts. (Example: assume a
bus call has led to the occupation of the bus, which is scheduled to
be freed by an event a few ticks later. To bus calls by lagging L1
caches, far in the past, it will then appear as if the bus is locked
for a large amount of time. Such a cache could then have to keep
retrying until the freeing event is reached, far into the future. The
unblocking event could also get executed between attempts, but
anything is possible.) I have no idea if this could actually happen in
the current implementation but again, in theory, anything is possible.
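
As a small worked example of the bus-locking case, with made-up tick
values and a hypothetical helper apparent_wait (nothing here is taken
from the m5 sources):

```python
def apparent_wait(caller_tick, bus_free_tick):
    """Ticks a caller believes it must wait before the bus is free."""
    return max(0, bus_free_tick - caller_tick)

# A fast thread occupies the bus at tick 1000; it is freed at tick 1010.
bus_free_tick = 1010

# Another fast cache calling at tick 1002 sees a realistic delay.
fast_wait = apparent_wait(1002, bus_free_tick)

# A lagging cache whose local time is still tick 200 sees the same lock
# as if it lasted hundreds of ticks.
lagging_wait = apparent_wait(200, bus_free_tick)
```

Here fast_wait is 8 ticks while lagging_wait is 810 ticks, although the
bus was only occupied for 10 ticks of simulated time.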

The desired behaviour is that it would appear to the shared memory
objects as if there were only one time frame. That way parallel,
simultaneous calls made by a lagging and a fast L1 cache would get
executed/scheduled as if they were made in the same tick. This would
of course still lead to inaccuracies, but tolerable ones. The worst we
could get is a couple of faulty hits/misses and bus conflicts,
depending on the distance between the times of the different threads,
but no more large delays due to local time differences.

A solution could be to shift forward the scheduling time of past
events that a slow root queue schedules on some faster, concurrent
queue, track all child events spawned by their processing (together
they would form a chain of events, actually more like a tree of
events) while saving the original root time, and then shift back the
scheduling time of those events in the chain which get scheduled back
on the original root queue. Similarly, future events originating from
fast threads would get scheduled in the past. Local time shouldn't
matter, only the time an event chain spent in a certain time zone. I
hope this makes sense.
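
A rough sketch of the shifting arithmetic, in Python and with invented
tick values. The helpers to_remote and to_root are hypothetical; a real
implementation would have to carry the shift along with every child
event in the chain.

```python
def to_remote(when, root_now, remote_now):
    """Translate a tick from the root queue's frame into the remote frame.

    Returns the translated tick and the shift that was applied, so the
    chain can be translated back later.
    """
    shift = remote_now - root_now
    return when + shift, shift

def to_root(when, shift):
    """Translate a tick in the remote frame back into the root frame."""
    return when - shift

BUS_NOW = 1000       # local time of the shared bus queue
BUS_LATENCY = 20     # ticks the chain spends in the bus "time zone"

# Lagging L1 at tick 200 sends a request that should arrive at tick 205.
arrive, shift = to_remote(205, root_now=200, remote_now=BUS_NOW)
slow_reply = to_root(arrive + BUS_LATENCY, shift)

# Fast L1 at tick 1500 does the same; its request is shifted into the past.
arrive_fast, shift_fast = to_remote(1505, root_now=1500, remote_now=BUS_NOW)
fast_reply = to_root(arrive_fast + BUS_LATENCY, shift_fast)
```

Both requests land on the bus queue at tick 1005, and each cache sees
its reply 25 ticks after it issued the request (tick 225 and tick 1525
respectively), regardless of how far its local time is from the bus.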

> For atomic mode, things are of course very different.  I think you
> want to keep the model of having a single thread traversing through
> the entire transaction across multiple objects; that's what makes
> atomic mode fast, and if you're going to give up on that then you
> might as well just do timing mode (in my opinion).  Off the top of my
> head, the easiest thing here would be to put a lock on every object,
> and have recvTiming() (or something immediately downstream of that)
> acquire the lock before proceeding.  Some objects might not need
> locking; for example, if we simplify/relax the way the bus does cycle
> accounting in atomic mode (which isn't really used anyway), you could
> probably avoid acquiring a lock just to traverse the bus in the common
> case.
>

I agree about the exclusive access part, but I fear we would always
have to lock before processing any event that will lead to an atomic
call to the bus.

> Unfortunately a naive implementation of this approach would deadlock
> on snoops, I think, since two CPUs would lock their L1s to do an
> access, both generate a miss, and then want to lock everyone else's L1
> to handle the snoop on the miss.  You might be able to solve this with
> a more sophisticated locking protocol; for example, let's say there's
> a lock on the bus after all, and if you acquire the lock on your L1
> cache but miss and the lock on the bus is already taken then you have
> to relinquish your L1 cache lock, acquire the bus lock, then restart
> your L1 access.  (You could also suspend the L1 access and continue it
> after you acquire the bus lock, but then you'd have to deal with
> interleavings that would get ugly and destroy another advantage of
> atomic mode, so I'd just start over from the top.)  So a general rule
> might be that you have to acquire locks starting at the point closest
> to main memory and work backward from there, and when you find that
> you need another lock that's further out (and that lock is already
> held by someone else) you have to release all your other locks and
> start over.  You'd have to generalize that somewhat for NUMA systems
> that don't have a single centralized main memory.
>

Doesn't this require a sort of transaction-like cache operation which
could get rolled back when a deadlock gets detected? An easier way
would be to detect a miss before an atomic call is made to the L1
cache (by means of functional calls, or some other form of cache
inspection). That way we could lock before we change L1 state. Is
there any easy way to detect this with minor modifications to the
cache implementation?
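
Roughly what I have in mind, as a Python sketch using threading.Lock.
The L1Cache class, probe and atomic_read are hypothetical names, and
the dictionary lookup only stands in for a side-effect-free functional
access; this is not how the m5 cache is written.

```python
import threading

class L1Cache:
    def __init__(self):
        self.lock = threading.Lock()
        self.lines = {}            # addr -> data, stands in for cache state

    def probe(self, addr):
        # Side-effect-free lookup; no cache state is changed here.
        return addr in self.lines

def atomic_read(l1, bus_lock, memory, addr):
    while True:
        if l1.probe(addr):
            # Predicted hit: only the L1 lock is needed.
            with l1.lock:
                if addr in l1.lines:   # re-check, the probe may be stale
                    return l1.lines[addr]
            continue                   # line vanished, probe again
        # Predicted miss: take the bus lock first, then the L1 lock, so
        # locks are always acquired in the order bus -> L1.
        with bus_lock:
            with l1.lock:
                if addr not in l1.lines:
                    l1.lines[addr] = memory[addr]   # fill on miss
                return l1.lines[addr]

memory = {0x40: 7}
bus_lock = threading.Lock()
l1 = L1Cache()
first = atomic_read(l1, bus_lock, memory, 0x40)    # miss, fills the line
second = atomic_read(l1, bus_lock, memory, 0x40)   # hit, L1 lock only
```

The probe result has to be re-checked under the lock, but since nothing
has been modified at that point, starting over is cheap and no rollback
of cache state is needed.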

> As far as your more specific questions:
> - Returning 'false' to a sendTiming to handle deadlock won't work in
> general, as there are some messages (like snoop responses, IIRC) that
> aren't allowed to be nacked in that fashion for protocol deadlock
> reasons.

What about always returning true and handling false separately?

> - There aren't (or at least shouldn't be) any data pointers that are
> shared in an uncontrolled fashion between MemObjects.  It is legal for
> a requester to put a pointer into the data field of a packet that
> points to its own internal storage (so that the responder can deposit
> it directly there), but the requester shouldn't be looking at that
> buffer until it gets a response, and the responder shouldn't be
> holding on to that pointer after it's sent the response.  Other than
> that case, I don't believe there's any way for MemObjects to share
> pointers at all.
>
> Hope that helps... feel free to continue this discussion; we're all
> excited about the possibility of parallelizing the simulator and are
> glad to help.
>
> Steve
>
> On Thu, Feb 25, 2010 at 8:06 AM,  <[email protected]> wrote:
>> Hello again,
>>
>> It's been a while since my last email and I've made some progress in getting
>> parts of m5 to run in parallel but I've reached a critical phase and have some
>> questions of which I'm sure some of the people reading will be able to help me
>> with.
>>
>> My main objective has always been to get CPU cores and their private caches to
>> get simulated in parallel as well as the rest of the shared memory. The
>> easiest way to do this is to place an element in between port interfaces that
>> handles concurrency by basically forwarding member calls to the port on one
>> side to the peer-port on the other side by means of a remotely processed
>> event. Kind of like a remote procedure call.
>>
>> The problem is that these procedures have a return value, with the exception
>> of functional calls. The only way to generically maintain consistency is to
>> block the call on one thread and wait for the return value from the other
>> thread. However if two calls from opposing ports are executed at the same time
>> you will get a deadlock. This clearly is a big problem since in general you
>> can't interrupt one of the calls.
>>
>> This makes parallel atomic calls pretty much impossible between private caches
>> unless I rewrote the port interface and the implementing classes or had some
>> guarantee that the state of the objects remained consistent if for example one
>> atomic call got executed while another was blocking.
>>
>> Timing calls are interesting. They return a value as well but it only signals
>> point-to-point acceptance. So in theory, in case of a deadlock I could simply
>> return false and send a retry a few ticks later after which the call would
>> start over. However the semantics of "false" and its effect on the simulation
>> are unclear to me. I would like to know if this could have an effect on the
>> accuracy or even functional correctness? Maybe it would even be possible to
>> return true when a deadlock is detected and handle the retry separately in
>> case the remote end would return false, this would be more efficient.
>>
>> If parallelizing timing calls is also impossible this way then I'm going to
>> have to recode some large, complicated chunks of m5 so I'm hoping it won't
>> come to that.
>>
>> Assuming that the former problem has been solved the question remains if parts
>> of the memory system can even safely run concurrently since I'm guessing
>> pointers to data are shared in between MemObjects. In theory the cache
>> coherence protocol should prohibit concurrent, incoherent read/writes but I
>> don't know the code that well.
>>
>> thanks in advance,
>>
>> Stijn
>>
>> ----------------------------------------------------------------
>> This message was sent using IMP, the Internet Messaging Program.
>>
>> _______________________________________________
>> m5-dev mailing list
>> [email protected]
>> http://m5sim.org/mailman/listinfo/m5-dev
>>
> _______________________________________________
> m5-dev mailing list
> [email protected]
> http://m5sim.org/mailman/listinfo/m5-dev
>

