On Fri, Feb 26, 2010 at 12:44 PM, Stijn Souffriau
<[email protected]> wrote:
>
> To be fair, when I said disproves your hypothesis, I should have said that
> the assumption that synchronizing every couple of cycles would work fast is
> wrong and that high accuracy can still be obtained with much less
> synchronization.
>
> One of the questions I'm hoping to answer is what the speedup/accuracy
> difference is between a sluggish detailed simulator and a very fast not so
> detailed simulator.
>
> See:
>
> Slacksim
> Graphite

Thanks for the links (they did come through for me the first time).  I
think the Slacksim paper is most relevant (since they're not trying to
parallelize across machines), and Figure 11 is pretty much in line
with what I expected: the Q10 scheme (which is roughly what I had in
mind when I was talking about "conservative simulation") does give
good speedups (for my definition of "good" in my earlier message).
It's also true that relaxing the synchronization gives even better
speedups... not a surprise, and not something I was disputing.  Note
that the difference is smallest when you have a higher ratio of
simulated cores to host cores; they see a bigger gap at 8 host cores
in part because they are only simulating an 8-core target.  I'd be
very interested in what those graphs look like for a 64-core target;
the benefit of relaxing synchronization for 8 host cores will be
smaller.
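
A back-of-the-envelope way to see the ratio argument, as a toy cost model
in Python. The model and every parameter in it (cycle_cost, barrier_cost,
the function names) are assumptions made up for illustration; none of it
comes from the Slacksim paper, and it ignores load imbalance entirely.

```python
import math

def quantum_time(target_cores, host_cores, synchronized,
                 quantum=10, cycle_cost=1.0, barrier_cost=20.0):
    """Host time to simulate one quantum in this toy model.

    Each host core simulates ceil(target/host) target cores back to back.
    A conservative (synchronized) run also pays one barrier per quantum.
    All costs are in arbitrary, made-up units.
    """
    cores_per_host = math.ceil(target_cores / host_cores)
    work = cores_per_host * quantum * cycle_cost
    return work + (barrier_cost if synchronized else 0.0)

def relaxation_gain(target_cores, host_cores):
    """How much faster the idealized barrier-free run is than the conservative one."""
    return (quantum_time(target_cores, host_cores, synchronized=True)
            / quantum_time(target_cores, host_cores, synchronized=False))

gain_8 = relaxation_gain(8, 8)    # 8-core target on 8 host cores
gain_64 = relaxation_gain(64, 8)  # 64-core target on 8 host cores
print("gain from relaxing sync,  8-core target:", gain_8)
print("gain from relaxing sync, 64-core target:", gain_64)
```

With more target cores per host core, the per-quantum barrier is amortized
over more simulation work, so removing it buys less in this model.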

Another consideration that these papers seem to completely overlook is
that conservative simulations can be made deterministic (not always
easy, but doable), while these "under-synchronized" simulations are
inherently non-deterministic.  That has implications for both the
stability of results and debuggability.
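
A minimal sketch of the determinism point, in Python. The ring of cores,
the message scheme, and all names here are hypothetical and chosen only
for illustration; this is not M5 or Slacksim code. The "orders" argument
stands in for whatever order the host happens to run the cores in.

```python
import random

def deliver(state, msg):
    """Apply a (dest, src, value) message to the destination core's state."""
    dest, _src, value = msg
    state[dest] += value

def simulate(orders, num_cores=4, quantum=10, conservative=True):
    """Run one quantum per entry in orders; each entry is a host run order."""
    state = list(range(1, num_cores + 1))
    for order in orders:
        outbox = []
        for core in order:
            for _ in range(quantum):
                state[core] = state[core] * 3 + 1   # placeholder local work
            msg = ((core + 1) % num_cores, core, state[core])
            if conservative:
                outbox.append(msg)    # hold until the quantum boundary
            else:
                deliver(state, msg)   # receiver sees it whenever it arrives
        # Conservative: deliver in a fixed order once every core has finished.
        for msg in sorted(outbox):
            deliver(state, msg)
    return state

rng = random.Random(1)
schedules = [[rng.sample(range(4), 4) for _ in range(3)] for _ in range(5)]
conservative_results = {tuple(simulate(s, conservative=True)) for s in schedules}
relaxed_results = {tuple(simulate(s, conservative=False)) for s in schedules}
print("distinct conservative results:", len(conservative_results))
print("distinct relaxed results:", len(relaxed_results))
```

In the conservative variant each core's work within a quantum depends only
on its state at the start of the quantum, so the host run order cannot
change the outcome. In the relaxed variant the outcome depends on whether
a message lands before or after the receiver has run.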

On a more practical note, I don't think the Slacksim design is that
far off from (or at least not incompatible with) what we were
discussing for timing mode.  The difference is more one of software
engineering; what we would want to do is take some of that mechanism
and bury it inside the bus model, and also decouple threads from
objects.  (I'm really surprised that they have one Pthread per target
core (see p. 4); that's a lot of needless overhead when you're doing 8
target cores on 2 host cores.)  I'll try and sketch out how I'd do a
Slacksim-like model in M5... it'll take a bit, so I'll send that in a
separate message later.
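
As a rough illustration of just the "decouple threads from objects" part,
here is a toy in Python. It is not the M5 design and not Slacksim's code;
TargetCore, run_until, and the parameters are hypothetical names. It shows
the structure only (a fixed pool of host threads serving more target
cores), not a speedup, since Python threads will not run this work in
parallel.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class TargetCore:
    """A simulated core: plain data, not tied to any host thread."""
    def __init__(self, core_id):
        self.core_id = core_id
        self.cycle = 0

    def run_until(self, end_cycle):
        while self.cycle < end_cycle:
            self.cycle += 1            # placeholder for real per-cycle work
        return threading.get_ident()   # report which host thread ran us

def run(num_target_cores=8, num_host_threads=2, quantum=10, num_quanta=5):
    cores = [TargetCore(i) for i in range(num_target_cores)]
    host_threads = set()
    with ThreadPoolExecutor(max_workers=num_host_threads) as pool:
        for q in range(1, num_quanta + 1):
            end = q * quantum
            # update() drains the map iterator, so it doubles as the
            # end-of-quantum barrier: no core starts quantum q+1 early.
            host_threads.update(pool.map(lambda c: c.run_until(end), cores))
    return cores, host_threads

cores, host_threads = run()
print("target cores:", len(cores), "host threads used:", len(host_threads))
```

Each target core appears once per quantum in the work list, so no two host
threads touch the same core at the same time.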

Steve
_______________________________________________
m5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/m5-dev
