On Friday 26 February 2010 04:59:52 pm Steve wrote:
> I'm not sure I'm understanding your concern.  In the method I'm
> proposing, the cache (or more precisely the thread that the cache
> event is running on) would not block when it sent a request to the
> bus.  The call through the port into the bus via
> sendTiming()/recvTiming() would happen on the same thread without any
> (non-trivial) blocking.  In the typical case, the recvTiming() call
> would record the packet and the timestamp of its arrival somewhere,
> probably in some thread-safe data structure which would require
> locking (hence possibly some trivial waiting if the data-structure
> lock is already held), then return to the cache.  There would be no
> heavyweight two-way handshake with another thread though.  If there is
> no storage in the bus object to record the packet's arrival, the bus
> would return false and have to schedule a retry later.  (No different
> than it does today.)  Ideally the storage in the bus would be such
> that this lack of storage would correspond to situations in which the
> bus is architecturally blocked as well so we wouldn't be introducing
> anything artificial.  (Again, just as it is today.)

Thanks for clarifying. I feared the true/false calculation was more 
complicated and would thus be harder to handle concurrently.
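
To make sure I've got it, the recvTiming() side you describe would look 
roughly like the sketch below. InboundQueue and its members are made up, 
not actual M5 code; I'm only assuming PacketPtr, Tick, and the global 
curTick as they exist in M5 today.

    #include <pthread.h>
    #include <deque>

    struct TimedPacket {
        PacketPtr pkt;    // the packet being recorded
        Tick arrival;     // timestamp of its arrival
    };

    class InboundQueue {
        pthread_mutex_t lock;
        std::deque<TimedPacket> buf;
        const size_t capacity;   // models the bus's finite storage
      public:
        InboundQueue(size_t cap) : capacity(cap)
        { pthread_mutex_init(&lock, NULL); }

        // Returns false when storage is exhausted, i.e. when the bus
        // is architecturally blocked and a retry must be scheduled.
        bool tryPush(PacketPtr pkt, Tick when) {
            pthread_mutex_lock(&lock);   // trivial wait at most
            bool ok = buf.size() < capacity;
            if (ok) {
                TimedPacket tp = { pkt, when };
                buf.push_back(tp);
            }
            pthread_mutex_unlock(&lock);
            return ok;
        }
    };

    // No heavyweight handshake: record and return on the same thread.
    bool Bus::recvTiming(PacketPtr pkt)
    {
        return inbound.tryPush(pkt, curTick);   // 'inbound' is a
    }                                           // hypothetical member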

> 
> > There is an additional problem that comes to mind. The curTick
> > context. Since my research focuses on relaxed synchronization, calls
> > will reach the bus out of order and past events might get processed
> > after future events. [...]
> 
> Yes, this is the really fun part... of course there is a fair amount
> of literature on parallel discrete-event simulation that applies, and
> lots of known concepts like conservative vs. optimistic
> synchronization, lookahead, null messages, etc.  I agree there's still
> room for new thinking & understanding on accuracy/performance
> tradeoffs, particularly since the CMP-on-CMP domain is different from
> what a lot of the prior work has looked at.
> 
> Our assumption when we've talked about this has been that we'd just do
> a conservative simulation with relatively frequent synchronization
> (based on the limited lookahead you can get on a bus), and hypothesize
> that on a CMP the sync overhead wouldn't be that bad.  I think it's
> worth doing that as a baseline and then seeing to what extent that
> hypothesis is true before diving into a more complex technique.
> 

Every article I've read that aims for good speedups and near-linear 
scalability disproves this hypothesis (if you know of others, feel free to 
point me to them). Basically, the problem is the short lookahead of cache 
coherence and the inherent unpredictability of a cache line changing to the 
shared state. Depending on the simulator, threads need a couple of hundred 
cycles of space to run freely to achieve good performance. That's why I'm 
not aiming for conservative simulation (nor optimistic, since I've yet to 
find a CMP article that says it would run smoothly). Instead I'm going to 
try allowing tiny errors.
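
Just so we're talking about the same baseline: as I understand it, the 
conservative scheme boils down to a barrier-quantum loop like the sketch 
below (invented code; I'm assuming an EventQueue with empty(), nextTick() 
and serviceOne(), and a Tick type as in M5). The quantum is bounded by the 
bus lookahead, which is exactly the part I expect to hurt:

    #include <pthread.h>

    static pthread_barrier_t syncBarrier;  // initialized for numThreads
    static const Tick quantum = 100;       // <= minimum bus latency

    void *simulateThread(void *arg)
    {
        EventQueue *eq = (EventQueue *) arg;
        Tick horizon = quantum;
        while (!eq->empty()) {
            // Run local events freely, but only up to the horizon,
            // so no thread gets more than one quantum ahead.
            while (!eq->empty() && eq->nextTick() < horizon)
                eq->serviceOne();
            // Conservative synchronization point.
            pthread_barrier_wait(&syncBarrier);
            horizon += quantum;
        }
        return NULL;
    }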

Since timing is obviously more complex and my time is limited, I'm going to 
stick to the locked atomic parallel simulation for now. Maybe if things go 
well I'll take a stab at the timing adventure, which, as you described it, 
shouldn't be that hard, but I don't have that much time either. Atomic has 
the obvious advantage that you don't have to worry about the timing of 
events.

> > A solution could be to forward the scheduling time of past events
> > scheduled by a slow root queue on some fast concurrent other queue,
> > track all child-events spawning from its processing (together they
> > would form a chain of events, actually more like a tree of events),
> > while saving the original root time, and then shift back the
> > scheduling time of those events in the chain which get scheduled back
> > on the original root queue. Similarly, future events originating from
> > fast threads would get scheduled in the past. Local time shouldn't
> > matter, only the time an event chain spent in a certain timezone. I
> > hope this makes sense.
> 
> I think a basic approach that simply uses synchronization to prevent
> fast threads from getting too far ahead is the best place to start
> (where how you define "too far ahead" is basically your
> performance/accuracy tradeoff).
> 

That's not an option for me.
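
To make the time-shifting idea above a little more concrete, this is 
roughly the bookkeeping I have in mind (purely a sketch; none of these 
names exist in M5):

    // When a past event migrates from the slow root queue to a fast
    // queue, remember where it came from and when it entered.
    struct ShiftedEvent {
        Event *ev;
        Tick rootTime;     // original time on the root queue
        Tick forwardedAt;  // local time on entering the fast queue
    };

    // When a child of the chain is scheduled back on the root queue,
    // shift its time so that only the span the chain spent in the
    // fast "timezone" counts, not the absolute local time there.
    Tick shiftBack(const ShiftedEvent &se, Tick localNow, Tick delay)
    {
        Tick spentAway = localNow - se.forwardedAt;
        return se.rootTime + spentAway + delay;
    }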

> >> For atomic mode, things are of course very different. [...]
> >
> > Doesn't this require a sort of transaction-like cache operation which
> > could get rolled back when a deadlock gets detected? An easier way
> > would be to detect a miss before an atomic call is made to the L1
> > cache (by means of functional calls, or some other form of cache
> > inspection), so that we could lock before we change L1 state? Is
> > there any easy way to detect this with minor modifications to the
> > cache implementation?
> 
> Yes, that's right; all the code that could get retried would have to
> be idempotent.  I was ignoring this before because I believe that as
> far as the operational cache state goes (tags, etc.) this is probably
> already true (though I haven't checked).  Now that you mention it
> though, I see that that's definitely not true for statistics.  One
> possible solution is just to have a retry flag that indicates whether
> this is a retry, and put all the stats updates inside an "if (!retry)"
> block so they only run on the first attempt.

Seems easy enough (though somehow it never is). I'm going to check out the 
cache implementation as well. I would appreciate it if anyone could point 
out the danger zones.
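
For the stats, I picture the guard looking something like the sketch 
below. isRetry, the counters and tags->probe() are all invented; I haven't 
checked how the real cache code is organized:

    void Cache::handleAtomicAccess(PacketPtr pkt, bool isRetry)
    {
        bool hit = tags->probe(pkt);   // assumed idempotent tag lookup

        if (!isRetry) {        // non-idempotent stats: count each
            if (hit)           // access exactly once, on the first
                hits++;        // attempt only
            else
                misses++;
        }

        // State changes below must either be idempotent themselves or
        // be deferred until the access can no longer be rolled back.
    }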

> 
> I think the two-pass thing is less desirable because (1) functional
> accesses don't really understand the coherence protocol and (2) most
> of your accesses will be L1 hits, and I'd guess that most L1 misses
> won't encounter contention, so for performance you really want to
> optimistically assume that it's going to be fine the first time and
> shift the correctness burden to the less common case of having to
> retry.
> 

Good point.

> >> As far as your more specific questions:
> >> - Returning 'false' to a sendTiming to handle deadlock won't work in
> >> general, as there are some messages (like snoop responses, IIRC) that
> >> aren't allowed to be nacked in that fashion for protocol deadlock
> >> reasons.
> >
> > What about always returning true and handling false separately?
> 
> I don't follow... you have to have some low-level flow control on the
> port, and if the packet has been accepted, it's good to know ASAP that
> the sender is done with it.  In a sense you can't return true and then
> later change your mind to false; you have to always return false
> (don't assume it's been accepted) if you're always going to return the
> same thing.  Maybe in light of the further discussion above you'll
> agree that this isn't the crux of the problem.
> 

I understand your point. I assumed it would be much harder to get past the 
true/false problem, and that assuming a true return value and resending the 
original packet from a special object, if it did after all turn out to be 
false, would have been a good temporary approximation.
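
Concretely, the stopgap I had in mind was a small proxy along these lines 
(a sketch; ResendProxy is invented, and I'm assuming M5's 
Port::sendTiming()/recvRetry() interface):

    #include <deque>

    class ResendProxy {
        std::deque<PacketPtr> pending;   // packets we still own
      public:
        // The cache calls this instead of port.sendTiming() directly.
        bool sendTiming(Port &port, PacketPtr pkt) {
            if (!port.sendTiming(pkt))
                pending.push_back(pkt);  // bus said no; retry is ours
            return true;                 // caller always sees success
        }
        // Called when the bus signals it can accept packets again.
        void recvRetry(Port &port) {
            while (!pending.empty() && port.sendTiming(pending.front()))
                pending.pop_front();
        }
    };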

> Steve
_______________________________________________
m5-dev mailing list
m5-dev@m5sim.org
http://m5sim.org/mailman/listinfo/m5-dev
