On Fri, Feb 26, 2010 at 11:15 AM, Stijn Souffriau
<[email protected]> wrote:
>> Our assumption when we've talked about this has been that we'd just do
>> a conservative simulation with relatively frequent synchronization
>> (based on the limited lookahead you can get on a bus), and hypothesize
>> that on a CMP the sync overhead wouldn't be that bad.  I think it's
>> worth doing that as a baseline and then seeing to what extent that
>> hypothesis is true before diving into a more complex technique.
>>
>
> Every article aiming for good speedups and near linear scalability I've read
> disproves this hypothesis (if you know of others, feel free to tell me which).

Are there specific articles you can point me to?  I haven't really
kept on top of the field lately, so if this hypothesis has already
been disproven I'd really like to learn more of the details.  I agree
there's no concrete evidence that it's true either; I just think it's
premature to dismiss it without evidence either way.  It's also a
function of how much detail you have; for example, the relative
overhead of syncing every N cycles is much lower if you have a very
detailed out-of-order core model than if you're doing a mostly
functional core simulation.
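
To make that last point concrete, here is a back-of-the-envelope sketch
in Python. The function name and all of the host-time numbers are
assumptions picked purely for illustration; they are not measurements
from M5 or any other simulator. It only models the fraction of host
time spent in synchronization when syncing every N simulated cycles.

```python
def sync_overhead_fraction(quantum_cycles, host_ns_per_sim_cycle, host_ns_per_sync):
    """Fraction of host time spent synchronizing when syncing once per quantum.

    All three parameters are hypothetical inputs, not measured values.
    """
    work_ns = quantum_cycles * host_ns_per_sim_cycle
    return host_ns_per_sync / (work_ns + host_ns_per_sync)


# Assumed costs: a detailed out-of-order model burns far more host time
# per simulated cycle than a mostly functional one, while the cost of a
# single sync is taken to be the same for both.
N = 100                 # sync every N simulated cycles (assumed)
SYNC_NS = 2000.0        # host ns per synchronization (assumed)

detailed = sync_overhead_fraction(N, 5000.0, SYNC_NS)
functional = sync_overhead_fraction(N, 50.0, SYNC_NS)

print(f"detailed core:   {detailed:.1%} of host time in sync")
print(f"functional core: {functional:.1%} of host time in sync")
```

With the same N and the same per-sync cost, the slower (more detailed)
model amortizes each sync over much more useful work, which is the
effect I'm describing above.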

I'd also point out that "good speedups" and "near linear scalability"
aren't necessarily the same thing; we routinely need many gigabytes
for a simulation, which means we'll be taking over all the DRAM on a
4-8 core machine, and it's just sad that 3-7 of those cores are
sitting idle.  Even if I only get 2-3X out of 8 cores it could still
be worthwhile.  See
http://www.computer.org/portal/web/csdl/doi/10.1109/2.348002 for a
more thorough (though dated) discussion.
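
As a quick arithmetic illustration of that distinction (the helper
name is mine, and the speedups are just the figures mentioned in this
thread, not results):

```python
def parallel_efficiency(speedup, cores):
    """Speedup divided by core count; 1.0 would be perfectly linear scaling."""
    return speedup / cores


CORES = 8
for speedup in (2.0, 3.0, 6.0):
    eff = parallel_efficiency(speedup, CORES)
    print(f"{speedup:.0f}X on {CORES} cores -> efficiency {eff:.3f}")
```

A 2-3X speedup on 8 cores is nowhere near linear scaling, but if the
other cores would otherwise sit idle because one simulation has taken
all the DRAM, it is still a net win in wall-clock time.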

> Basically the problem is the short lookahead of cache coherency and the
> inherent unpredictability of a cacheline changing to shared state. Depending
> on the simulator, threads need a couple of hundred cycles of space to run
> freely to achieve good performance. That's why I'm not aiming for conservative
> simulation (nor optimistic since I've yet to find a CMP article that says this
> would run smoothly). Instead I'm going to try allowing tiny errors.

Don't get me wrong, I think this is a great idea... although 2-3X on 8
cores would make me happy, if I can get 6X with only a slight loss of
accuracy I'll be even happier.

> Since timing is obviously more complex and my time is limited I'm going to
> stick to the locked atomic parallel simulation for now.

OK, sounds reasonable.

> Maybe if things go
> well I'll take a stab at the timing adventure which, as you described it,
> shouldn't be that hard, but I don't have that much time either. Atomic has the
> obvious advantage that you don't have to worry about the timing of events.

More specifically, "timing of events outside the CPUs", correct?

>> I think a basic approach that simply uses synchronization to prevent
>> fast threads from getting too far ahead is the best place to start
>> (where how you define "too far ahead" is basically your
>> performance/accuracy tradeoff).
>
> That's not an option for me.

I don't follow... why not?  You can still define quanta that are
larger than the lookahead to increase performance at the cost of some
error, but I think you're going to need some synchronization to
prevent the errors from becoming unbounded.  Even if you want to do
something more sophisticated in the long run, I'd think this kind of
approach would be good for a baseline to see if your complex
algorithms are really buying you anything.
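
Roughly what I have in mind, as a toy Python sketch rather than
anything resembling real M5 code: each simulated CPU runs in its own
host thread and may run freely inside a quantum, but nobody starts the
next quantum until everyone has finished the current one. The names
(run_quantum_sim, step_sizes, and so on) are made up for this example,
and the "CPU model" is just a counter that advances by a fixed step.

```python
import threading


def run_quantum_sim(num_cpus, quantum, total_cycles, step_sizes):
    """Toy quantum-synchronized simulation (hypothetical, not M5 code).

    Returns the final per-CPU clocks and the largest clock skew any
    thread observed between itself and another CPU.
    """
    clocks = [0] * num_cpus           # simulated time of each CPU
    worst_skew = [0] * num_cpus       # one slot per thread, so no shared writes
    barrier = threading.Barrier(num_cpus)

    def cpu_thread(cpu_id):
        step = step_sizes[cpu_id]
        next_sync = 0
        while clocks[cpu_id] < total_cycles:
            next_sync = min(next_sync + quantum, total_cycles)
            # Run freely, but never past the end of the current quantum.
            while clocks[cpu_id] < next_sync:
                clocks[cpu_id] = min(clocks[cpu_id] + step, next_sync)
                skew = max(abs(clocks[cpu_id] - c) for c in clocks)
                worst_skew[cpu_id] = max(worst_skew[cpu_id], skew)
            # Wait here until every CPU has reached the quantum boundary.
            barrier.wait()

    threads = [threading.Thread(target=cpu_thread, args=(i,))
               for i in range(num_cpus)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return clocks, max(worst_skew)


final_clocks, max_skew = run_quantum_sim(
    num_cpus=4, quantum=100, total_cycles=1000, step_sizes=[1, 3, 7, 10])
print(final_clocks, max_skew)
```

The quantum is the knob: the skew between any two CPUs can never
exceed it, so a larger quantum means fewer barriers and more potential
timing error, while a smaller one means the reverse. Setting it at or
below the real lookahead would give you the conservative case.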

Steve
_______________________________________________
m5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/m5-dev
