Thanks Nilay and Joel for the information.
I've been playing around with this over the past few days and I can't
work out what the point of the flush is. The CacheRecorder already has
a copy of all the data blocks in the trace before the flush starts.
Removing the flush event and the subsequent simulation produces exactly
the same system.ruby.cache.gz file as leaving them in, so I guess it's
safe to remove them.
So, with that out of the way, I can create checkpoints and exit the
simulator correctly. I'm not 100% sure about restoring the checkpoint
though, and it seems a little hacky. Is there a reason it has to
unserialise by inserting memory requests into the event queue - couldn't
it just write the data into the correct locations in the caches?
There's also a question about whether ruby should be recording its state
anyway. Shouldn't it do the same as the classic memory system caches and
implement memWriteback() to flush all dirty data out before
checkpointing happens, so that it doesn't need to trace anything?
(Maybe I'm opening a can of worms, but I thought I'd just ask!)
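To make the writeback idea concrete, here is a minimal sketch of it, assuming a toy cache and backing store. ToyCache, ToyLine, ToyMemory and writebackAll() are hypothetical names made up for illustration. They are not Ruby or classic-memory classes, and this is not how memWriteback() is implemented in gem5.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical toy types, for illustration only (not gem5 classes).
using Addr = std::uint64_t;
using ToyMemory = std::unordered_map<Addr, std::uint64_t>;

struct ToyLine {
    std::uint64_t data;
    bool dirty;
};

class ToyCache {
  public:
    // A store leaves the line dirty: the cache holds the only
    // up-to-date copy until it is written back.
    void write(Addr addr, std::uint64_t value) {
        lines[addr] = ToyLine{value, true};
    }

    // Push every dirty line out to memory and mark it clean.
    // Returns the number of lines written back.
    std::size_t writebackAll(ToyMemory &mem) {
        std::size_t written = 0;
        for (auto &entry : lines) {
            if (entry.second.dirty) {
                mem[entry.first] = entry.second.data;
                entry.second.dirty = false;
                ++written;
            }
        }
        return written;
    }

    std::size_t dirtyCount() const {
        std::size_t count = 0;
        for (const auto &entry : lines)
            count += entry.second.dirty ? 1 : 0;
        return count;
    }

  private:
    std::unordered_map<Addr, ToyLine> lines;
};
```

Once writebackAll() has run, memory alone holds the full state, so a checkpoint of memory would be sufficient in this toy model. The cost is that the caches are cold after restore, which is the case the cache trace is meant to avoid.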
Cheers
Tim
On 13/06/2015 18:03, Joel Hestness wrote:
Hey guys,
I'm pretty sure Tim is correct that the checkpointing bugs were
introduced earlier than the changeset Nilay points to; gem5-gpu is
currently using gem5 rev 10645
<http://repo.gem5.org/gem5/rev/cd95d4d51659>, and we cannot get reliable
checkpoint and restore with it. Note that Tim's bug may not be the only
checkpointing bug that exists right now.
To answer Tim's question: While taking a checkpoint, Ruby commandeers
the event queue to inject flushing memory accesses into the caches. This
is used to generate a trace of cache contents, which can be used to warm
up the caches on checkpoint restore. To take over control of the event
queue, Ruby clears the event at the queue head (I think this assumes
there is only 1 event in the queue? This should probably be checked),
and then adds its own event for the cache flushing operation. After the
caches have been flushed (simulate() call in RubySystem::serialize()),
Ruby restores the head event that was in the queue and rolls back the
current tick.
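The sequence described above can be sketched with a toy event queue. This is only an illustration under stated assumptions: ToyEventQueue, ToyEvent and flushWithCommandeeredQueue() are hypothetical names, not gem5's EventQueue API. The sketch also sets aside everything that is pending rather than a single head event, which sidesteps the open question about there being only one event in the queue.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <utility>

// Hypothetical toy types, for illustration only (not gem5 classes).
using Tick = std::uint64_t;

struct ToyEvent {
    std::function<void()> action;
};

class ToyEventQueue {
  public:
    void schedule(ToyEvent *ev, Tick when) { events.emplace(when, ev); }

    // Hand all pending events to the caller, leaving the queue empty.
    std::multimap<Tick, ToyEvent *> detachAll() {
        return std::exchange(events, {});
    }

    // Put saved events back and roll the current tick back.
    void restore(std::multimap<Tick, ToyEvent *> saved, Tick tick) {
        assert(events.empty());
        events = std::move(saved);
        curTick = tick;
    }

    // Run events in time order until none are left.
    void simulate() {
        while (!events.empty()) {
            auto it = events.begin();
            curTick = it->first;
            ToyEvent *ev = it->second;
            events.erase(it);
            ev->action();
        }
    }

    Tick now() const { return curTick; }
    std::size_t pending() const { return events.size(); }

  private:
    std::multimap<Tick, ToyEvent *> events;
    Tick curTick = 0;
};

// Save the pending work and the tick, run only the flush event,
// then restore both. Returns the tick at which the flush finished.
Tick flushWithCommandeeredQueue(ToyEventQueue &q, ToyEvent &flush,
                                Tick flushLatency) {
    const Tick savedTick = q.now();
    auto saved = q.detachAll();
    q.schedule(&flush, savedTick + flushLatency);
    q.simulate();
    const Tick flushEnd = q.now();
    q.restore(std::move(saved), savedTick);
    return flushEnd;
}
```

In this toy, the previously scheduled events never run during the flush, and the tick seen by the rest of the simulation is unchanged afterwards. Anything that still holds a pointer to a detached event and expects it to be on the queue during the flush would break, though, which is the kind of interaction worth checking for.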
One way to check if this cooldown operation is at fault for
unreliable checkpointing is to simply comment out the event queue
commandeering, and try to take a checkpoint. You may also be able to
test checkpoint restore by commenting the cache warm-up code in
RubySystem::unserialize(). If checkpoint and restore work without the
event queue commandeering, it is likely that the event queue
manipulation is problematic.
I'd also recommend taking a checkpoint and restoring from it with the
gem5 flag --debug-flags=RubyCacheTrace, which will show what the cache
flushing and warm-up are doing, respectively.
Joel
On Sat, Jun 13, 2015 at 9:48 AM, Nilay Vaish <[email protected]> wrote:
Your bisection is not right. You might want to take a look at the
following changeset:
changeset:   10756:f9c0692f73ec
user:        Curtis Dunham <[email protected]>
date:        Mon Mar 23 06:57:36 2015 -0400
summary:     sim: Reuse the same limit_event in simulate()
I suggest that you revert this changeset in your repo while I think
about what needs to be done.
--
Nilay
On Sat, 13 Jun 2015, Timothy M Jones wrote:
Hi again,
Further to this message, I've used hg bisect to find the
revision that breaks checkpointing with ruby. It's revision
10524 that Nilay committed in November that's the first bad
changeset. It fails with the panic() on the missing event that
I wrote about previously.
I've scanned through the diff and can't immediately see any
reason why this would break serialisation, although it does
remove some of the code to serialise ruby state.
Could anyone (Nilay?) give me a hint as to why this might break
checkpointing with ruby?
I've compiled with the MOESI_hammer protocol for x86, then run
with this command line:
./build/X86/gem5.opt --remote-gdb-port=0 -d <outdir>
configs/example/fs.py -n 1 --kernel <my-kernel> --script
configs/boot/hack_back_ckpt.rcS --max-checkpoints 1
--checkpoint-dir <cptdir> --disk-image <my-disk-image>
--cpu-type timing --restore-with timing --ruby
Any help would be appreciated. I don't know ruby at all, so
trying to work out what's going on is slow....
Cheers
Tim
On 11/06/2015 20:48, Timothy M Jones wrote:
Hello,
Could someone tell me why we need to take the head event off the event
queue in RubySystem::serialize() in src/mem/ruby/system/System.cc?
Event* eventq_head = eventq->replaceHead(NULL);
The problem I'm getting is that when simulate() is called a few lines
later, it tries to reschedule the simulate_limit_event, but that causes
a panic because it's no longer on the event queue. This is happening
when trying to take a checkpoint with ruby. I can't work out from the
comments why the head event needs to be taken off in the first place.
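The failure mode described above can be reproduced in miniature. This is a sketch under assumptions, not gem5 code: MiniQueue, LimitEvent and rescheduleFailsAfterDetach() are hypothetical names, and an exception stands in for the panic().

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical toy types, for illustration only (not gem5 classes).
using Tick = std::uint64_t;

struct LimitEvent {
    Tick when = 0;
};

class MiniQueue {
  public:
    void schedule(LimitEvent *ev, Tick when) {
        ev->when = when;
        pending.push_back(ev);
    }

    // Swap in a new pending list and return the old one, loosely
    // modelling the queue contents being taken away by the caller.
    std::vector<LimitEvent *> replaceAll(std::vector<LimitEvent *> next) {
        std::swap(pending, next);
        return next;
    }

    // Moving an event requires that it is currently on the queue.
    void reschedule(LimitEvent *ev, Tick when) {
        auto it = std::find(pending.begin(), pending.end(), ev);
        if (it == pending.end())
            throw std::logic_error("event not found");
        ev->when = when;
    }

  private:
    std::vector<LimitEvent *> pending;
};

// Returns true if rescheduling a reused limit event fails once the
// queue contents have been detached behind its back.
bool rescheduleFailsAfterDetach() {
    MiniQueue q;
    LimitEvent limit;
    q.schedule(&limit, 1000);       // left queued by an earlier run
    auto saved = q.replaceAll({});  // checkpoint code empties the queue
    (void)saved;
    try {
        q.reschedule(&limit, 2000); // the reused event is no longer there
    } catch (const std::logic_error &) {
        return true;
    }
    return false;
}
```

The point of the toy is that reusing one long-lived limit event only works if nothing else removes it from the queue between runs. A fresh event per run, or a schedule-if-absent path, would not hit this.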
This is basically the reason behind the problems in this thread:
https://www.mail-archive.com/[email protected]/msg11701.html
Thanks
Tim
--
Timothy M. Jones
http://www.cl.cam.ac.uk/~tmj32/
_______________________________________________
gem5-dev mailing list
[email protected] <mailto:[email protected]>
http://m5sim.org/mailman/listinfo/gem5-dev
--
Joel Hestness
PhD Candidate, Computer Architecture
Dept. of Computer Science, University of Wisconsin - Madison
http://pages.cs.wisc.edu/~hestness/
--
Timothy M. Jones
http://www.cl.cam.ac.uk/~tmj32/