On Fri, Feb 26, 2010 at 11:15 AM, Stijn Souffriau <[email protected]> wrote:
>> Our assumption when we've talked about this has been that we'd just do
>> a conservative simulation with relatively frequent synchronization
>> (based on the limited lookahead you can get on a bus), and hypothesize
>> that on a CMP the sync overhead wouldn't be that bad. I think it's
>> worth doing that as a baseline and then seeing to what extent that
>> hypothesis is true before diving into a more complex technique.
>>
>
> Every article aiming for good speedups and near linear scalability I've read
> disproves this hypothesis (if you know others, feel free to tell me which).

Are there specific articles you can point me to? I haven't really kept
on top of the field lately, so if this hypothesis has already been
disproven I'd really like to learn more of the details.

I agree there's no concrete evidence that it's true either; I just
think it's premature to dismiss it without evidence either way. It's
also a function of how much detail you have; for example, the relative
overhead of syncing every N cycles is much lower if you have a very
detailed out-of-order core model than if you're doing a mostly
functional core simulation.

I'd also point out that "good speedups" and "near linear scalability"
aren't necessarily the same thing; we routinely need many gigabytes
for a simulation, which means we'll be taking over all the DRAM on a
4-8 core machine, and it's just sad that 3-7 of those cores are
sitting idle. Even if I only get 2-3X out of 8 cores it could still
be worthwhile. See
http://www.computer.org/portal/web/csdl/doi/10.1109/2.348002 for a
more thorough (though dated) discussion.

> Basically the problem is the short lookahead of cache coherency and the
> inherent unpredictability of a cacheline changing to shared state. Depending
> on the simulator, threads need a couple of hundred cycles of space to run
> freely to achieve good performance. That's why I'm not aiming for conservative
> simulation (nor optimistic since I've yet to find a CMP article that says this
> would run smoothly). Instead I'm going to try allowing tiny errors.

Don't get me wrong, I think this is a great idea... although 2-3X on 8
cores would make me happy, if I can get 6X with only a slight loss of
accuracy I'll be even happier.

> Since timing is obviously more complex and my time is limited I'm going to
> stick to the locked atomic parallel simulation for now.

OK, sounds reasonable.

> Maybe if things go
> well I'll take a stab at the timing adventure which, as you described it,
> shouldn't be that hard, but I don't have that much time either.
> Atomic has the
> obvious advantage that you don't have to worry about the timing of events.

More specifically, "timing of events outside the CPUs", correct?

>> I think a basic approach that simply uses synchronization to prevent
>> fast threads from getting too far ahead is the best place to start
>> (where how you define "too far ahead" is basically your
>> performance/accuracy tradeoff).
>
> That's not an option for me.

I don't follow... why not? You can still define quanta that are
larger than the lookahead to increase performance at the cost of some
error, but I think you're going to need some synchronization to
prevent the errors from becoming unbounded. Even if you want to do
something more sophisticated in the long run, I'd think this kind of
approach would be good for a baseline to see if your complex
algorithms are really buying you anything.

Steve
_______________________________________________
m5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/m5-dev
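
For concreteness, here is a minimal sketch of the quantum-based baseline discussed above: each simulated core runs freely inside a quantum and then waits at a barrier, so no core can get more than one quantum ahead of another. This is a toy under stated assumptions, not M5 code. The names run_quantum_sim, quantum, and end_tick are hypothetical, and the random step stands in for real per-core simulation work.

```python
import random
import threading


def run_quantum_sim(num_cores=4, quantum=100, end_tick=1000, seed=0):
    """Toy quantum-synchronized simulation (hypothetical, not the M5 API).

    Returns (final_clocks, max_skew), where max_skew is the largest
    difference between any two core clocks that was observed.
    """
    clocks = [0] * num_cores              # local simulated time per core
    max_skew = [0]                        # boxed so worker threads can update it
    lock = threading.Lock()               # protects max_skew
    barrier = threading.Barrier(num_cores)

    def core(cid):
        rng = random.Random(seed + cid)   # per-core stand-in for real work
        boundary = quantum
        while clocks[cid] < end_tick:
            # Run freely, but never past the current quantum boundary.
            while clocks[cid] < boundary:
                step = rng.randint(1, 10)
                clocks[cid] = min(boundary, clocks[cid] + step)
                with lock:
                    skew = max(clocks) - min(clocks)
                    max_skew[0] = max(max_skew[0], skew)
            # Wait for every other core before starting the next quantum.
            barrier.wait()
            boundary += quantum

    threads = [threading.Thread(target=core, args=(i,)) for i in range(num_cores)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return clocks, max_skew[0]


if __name__ == "__main__":
    final_clocks, skew = run_quantum_sim()
    print("final clocks:", final_clocks)
    print("largest skew observed:", skew, "(bounded by the quantum)")
```

Here quantum is the "too far ahead" knob: a value at or below the real lookahead corresponds to the conservative case, while a larger value trades accuracy for fewer barrier waits and still keeps the skew between cores bounded.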
