Re: [m5-dev] [Parallelizing m5] memory questions

Steve Reinhardt Fri, 26 Feb 2010 07:59:56 -0800

On Fri, Feb 26, 2010 at 6:06 AM,  <[email protected]> wrote:
> Quoting Steve Reinhardt <[email protected]>:
>
>> For timing mode, our vision has always been that the interaction
>> between threads would happen inside of certain specially written
>> SimObjects and not at the port interface.  As it looks like you're
>> finding out, parallelization has enough complications that it's not
>> clear you can handle it at all (much less efficiently) in a generic
>> fashion between arbitrary objects.  For the situation you describe
>> (parallelizing across CPUs with their private caches, talking to a
>> shared cache or memory), you'd want to redo the Bus object to enable
>> parallelism.  The good news is that very little if anything outside
>> the Bus should have to change.  The basic idea is that the Bus itself
>> would be written to allow the recvTiming methods on its various ports
>> to be called concurrently from different threads, and would schedule
>> events that call sendTiming on the same event queue that's used for
>> the object on the other side of the port so that the event gets
>> processed from the same thread.  Thus all interactions across ports
>> occur on a single thread.  I'd expect that the recvTiming method would
>> typically just save the sent packet in some internal Bus buffer so
>> that some other thread can look at it later.  The ack/nack return
>> value of a sendTiming to the Bus would then depend solely on state
>> that the calling thread could safely examine (e.g., the availability
>> of a per-thread buffer inside the bus).  Of course there are a lot of
>> details that remain to be fleshed out here, but I still think this
>> basic structure is the best starting place.
>
>
> If I understand correctly, this solution implies that the caches will
> block when calling the bus port and that the bus is rewritten so that
> it can handle asynchronous returns of true or false when such a
> port-calling event is processed. This is suboptimal in the sense that
> the cache would still be waiting while its event queue could process
> other useful events; however, in practice it will probably be an
> acceptable overhead and possibly the preferred solution, since it might
> benefit accuracy as well. I think it would have been great if the
> recvTiming result had been not a return value but an asynchronously
> returned nack packet. This way neither side would need to block, which
> could improve performance, and since the memory system would likely
> reply quickly, accuracy wouldn't be a problem. Sadly this isn't the
> case, which complicates matters.


I'm not sure I'm understanding your concern.  In the method I'm
proposing, the cache (or more precisely the thread that the cache
event is running on) would not block when it sent a request to the
bus.  The call through the port into the bus via
sendTiming()/recvTiming() would happen on the same thread without any
(non-trivial) blocking.  In the typical case, the recvTiming() call
would record the packet and the timestamp of its arrival somewhere,
probably in some thread-safe data structure which would require
locking (hence possibly some trivial waiting if the data-structure
lock is already held), then return to the cache.  There would be no
heavyweight two-way handshake with another thread though.  If there is
no storage in the bus object to record the packet's arrival, the bus
would return false and have to schedule a retry later.  (No different
than it does today.)  Ideally the storage in the bus would be such
that this lack of storage would correspond to situations in which the
bus is architecturally blocked as well so we wouldn't be introducing
anything artificial.  (Again, just as it is today.)
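
A minimal sketch of the recvTiming() path described above, not the
actual m5 Bus.  Everything here is an assumption made for illustration:
the class name ParallelBusSketch, the Arrival record, the drainOne()
and takeRetry() helpers, and the simplified Packet and Tick types.  It
shows only the per-port buffer with a short lock and the false return
when no slot is free; scheduling sendTiming on the receiver's event
queue is left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <vector>

// Hypothetical stand-ins for illustration; not the real m5 types.
using Tick = std::uint64_t;
struct Packet { int id; };

// A buffered packet together with the sender's timestamp at arrival.
struct Arrival { Packet pkt; Tick when; };

class ParallelBusSketch {
  private:
    struct PortQueue {
        std::mutex lock;              // held only for the brief push/pop
        std::deque<Arrival> pending;  // packets accepted but not yet forwarded
        bool needsRetry = false;      // a sender was refused and awaits a retry
    };
    std::vector<PortQueue> queues;
    std::size_t capacity;

  public:
    ParallelBusSketch(std::size_t numPorts, std::size_t slotsPerPort)
        : queues(numPorts), capacity(slotsPerPort) {}

    // Runs on the sender's thread.  Records the packet and its arrival
    // time, or returns false (and remembers that a retry is owed) when
    // this port's buffer is full.
    bool recvTiming(std::size_t port, const Packet &pkt, Tick now) {
        PortQueue &q = queues[port];
        std::lock_guard<std::mutex> guard(q.lock);
        if (q.pending.size() >= capacity) {
            q.needsRetry = true;
            return false;
        }
        q.pending.push_back({pkt, now});
        return true;
    }

    // Runs on whichever thread forwards packets.  Pops the oldest
    // buffered arrival for the port; returns false if none is waiting.
    bool drainOne(std::size_t port, Arrival &out) {
        PortQueue &q = queues[port];
        std::lock_guard<std::mutex> guard(q.lock);
        if (q.pending.empty())
            return false;
        out = q.pending.front();
        q.pending.pop_front();
        return true;
    }

    // Reports and clears the "retry owed" flag for a port.
    bool takeRetry(std::size_t port) {
        PortQueue &q = queues[port];
        std::lock_guard<std::mutex> guard(q.lock);
        bool owed = q.needsRetry;
        q.needsRetry = false;
        return owed;
    }
};
```

The accept/refuse decision depends only on the state of the caller's
own port queue, so the sender never waits on another thread beyond the
short lock.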

> There is an additional problem that comes to mind. The curTick
> context. Since my research focuses on relaxed synchronization, calls
> will reach the bus out of order and past events might get processed
> after future events. [...]

Yes, this is the really fun part... of course there is a fair amount
of literature on parallel discrete-event simulation that applies, and
lots of known concepts like conservative vs. optimistic
synchronization, lookahead, null messages, etc.  I agree there's still
room for new thinking & understanding on accuracy/performance
tradeoffs, particularly since the CMP-on-CMP domain is different from
what a lot of the prior work has looked at.

Our assumption when we've talked about this has been that we'd just do
a conservative simulation with relatively frequent synchronization
(based on the limited lookahead you can get on a bus), and hypothesize
that on a CMP the sync overhead wouldn't be that bad.  I think it's
worth doing that as a baseline and then seeing to what extent that
hypothesis is true before diving into a more complex technique.

> A solution could be to forward the scheduling time of past events
> scheduled by a slow root queue on some fast concurrent other queue,
> track all child-events spawning from its processing (together they
> would form a chain of events, actually more like a tree of events),
> while saving the original root time, and then shift back the
> scheduling time of those events in the chain which get scheduled back
> on the original root queue. Similarly, future events originating from
> fast threads would get scheduled in the past. Local time shouldn't
> matter, only the time an event chain spent in a certain timezone. I
> hope this makes sense.

I think a basic approach that simply uses synchronization to prevent
fast threads from getting too far ahead is the best place to start
(where how you define "too far ahead" is basically your
performance/accuracy tradeoff).
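
As a rough sketch of that "too far ahead" check (the names safeHorizon,
mayProcess and maxSkew are made up for illustration, and a real
implementation would need the per-queue times to be read safely across
threads): a queue may process an event only if its timestamp is no more
than maxSkew ticks past the slowest queue's current time.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

using Tick = std::uint64_t;  // hypothetical stand-in for the simulator's tick type

// Latest tick any queue may process: the slowest queue's current time
// plus the allowed skew.  Assumes localTimes is non-empty.
Tick safeHorizon(const std::vector<Tick> &localTimes, Tick maxSkew) {
    Tick slowest = *std::min_element(localTimes.begin(), localTimes.end());
    return slowest + maxSkew;
}

// True if an event at eventTime may run now; otherwise the owning
// thread should wait for the slower queues to catch up.
bool mayProcess(Tick eventTime, const std::vector<Tick> &localTimes,
                Tick maxSkew) {
    return eventTime <= safeHorizon(localTimes, maxSkew);
}
```

With maxSkew set to 0 no queue can run past the slowest one; larger
values trade accuracy for less waiting.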

>> For atomic mode, things are of course very different. [...]
>
> Doesn't this require a sort of transaction-like cache operation which
> could get rolled back when a deadlock gets detected? An easier way
> would be to detect a miss before an atomic call is made to the L1
> cache (by means of functional calls, or some other form of cache
> inspection), that way we could lock before we change L1 state? Is
> there any easy way to detect this with minor modifications to the
> cache implementation?

Yes, that's right; all the code that could get retried would have to
be idempotent.  I was ignoring this before because I believe that as
far as the operational cache state goes (tags, etc.) this is probably
already true (though I haven't checked).  Now that you mention it
though, I see that that's definitely not true for statistics.  One
possible solution is just to have a retry flag that indicates if this
is a retry, and put all the stats updates inside an "if (!retry)"
block.
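
Roughly, and only as a sketch (CacheStatsSketch and atomicAccessSketch
are hypothetical names, not existing m5 code; the hit flag stands in
for the real tag lookup), the stats would be counted on the first
attempt only, so a retried access is not double-counted:

```cpp
#include <cstdint>

// Hypothetical counters for illustration; not the real m5 cache stats.
struct CacheStatsSketch {
    std::uint64_t accesses = 0;
    std::uint64_t misses = 0;
};

// Performs one (possibly retried) access.  Stats are updated only when
// this is the first attempt, so repeating the call with isRetry set
// leaves the counters unchanged.
bool atomicAccessSketch(CacheStatsSketch &stats, bool hit, bool isRetry) {
    if (!isRetry) {
        ++stats.accesses;
        if (!hit)
            ++stats.misses;
    }
    return hit;
}
```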

I think the two-pass thing is less desirable because (1) functional
accesses don't really understand the coherence protocol and (2) most
of your accesses will be L1 hits, and I'd guess that most L1 misses
won't encounter contention, so for performance you really want to
optimistically assume that it's going to be fine the first time and
shift the correctness burden to the less common case of having to
retry.

>> As far as your more specific questions:
>> - Returning 'false' to a sendTiming to handle deadlock won't work in
>> general, as there are some messages (like snoop responses, IIRC) that
>> aren't allowed to be nacked in that fashion for protocol deadlock
>> reasons.
>
> What about always returning true and handling false separately?

I don't follow... you have to have some low-level flow control on the
port, and if the packet has been accepted, it's good to know ASAP that
the sender is done with it.  In a sense you can't return true and then
later change your mind to false; if you're always going to return the
same thing, it has to be false (don't assume it's been accepted).
Maybe in light of the further discussion above you'll agree that this
isn't the crux of the problem.

Steve
_______________________________________________
m5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/m5-dev
