Thanks Nilay and Joel for the information.

I've been playing around with this over the past few days and I can't work out what the point of the flush is. The CacheRecorder already has a copy of all the data blocks in the trace before the flush starts. Removing the flush event and the subsequent simulation produces exactly the same system.ruby.cache.gz file as leaving them in, so I guess it's safe to remove them.

So, with that out of the way, I can create checkpoints and exit the simulator correctly. I'm not 100% sure about restoring the checkpoint though, and it seems a little hacky. Is there a reason it has to unserialise by inserting memory requests into the event queue - couldn't it just write the data into the correct locations in the caches?

There's also a question about whether Ruby should be recording its state at all. Shouldn't it do the same as the classic memory system caches and implement memWriteback() to flush all dirty data out before checkpointing happens, so that it doesn't need to trace anything? (Maybe I'm opening a can of worms, but I thought I'd just ask!)
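
To illustrate the writeback idea, here is a minimal C++ sketch of flushing dirty lines to memory before a checkpoint, so that memory alone holds the up-to-date state. It is a toy model and not gem5 code: ToyLine, ToyCache, ToyMemory and writebackDirtyLines() are hypothetical names, and the real memWriteback() may well work differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Toy stand-ins for illustration only; these are not gem5 classes.
using Addr = std::uint64_t;

struct ToyLine {
    std::uint64_t data;  // cached value for this address
    bool dirty;          // true if the cache holds a newer value than memory
};

struct ToyCache {
    std::map<Addr, ToyLine> lines;
};

using ToyMemory = std::map<Addr, std::uint64_t>;

// Copy every dirty line back to memory and mark it clean.
// Returns the number of lines written back.
std::size_t writebackDirtyLines(ToyCache &cache, ToyMemory &mem) {
    std::size_t written = 0;
    for (auto &kv : cache.lines) {
        ToyLine &line = kv.second;
        if (line.dirty) {
            mem[kv.first] = line.data;  // memory now holds the latest value
            line.dirty = false;         // the cached copy matches memory again
            ++written;
        }
    }
    return written;
}
```

After such a pass, a checkpoint of memory alone would capture all the data, at the cost of restoring with cold caches rather than warmed ones.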

Cheers
Tim


On 13/06/2015 18:03, Joel Hestness wrote:
Hey guys,
   I'm pretty sure Tim is correct that the checkpointing bugs were
introduced earlier than the changeset Nilay points to; gem5-gpu is
currently using gem5 rev 10645
<http://repo.gem5.org/gem5/rev/cd95d4d51659>, and we cannot get reliable
checkpoint and restore with it. Note that Tim's bug may not be the only
checkpointing bug that exists right now.

   To answer Tim's question: While taking a checkpoint, Ruby commandeers
the event queue to inject flushing memory accesses into the caches. This
is used to generate a trace of cache contents, which can be used to warm
up the caches on checkpoint restore. To take over control of the event
queue, Ruby clears the event at the queue head (I think this assumes
there is only 1 event in the queue? This should probably be checked),
and then adds its own event for the cache flushing operation. After the
caches have been flushed (simulate() call in RubySystem::serialize()),
Ruby restores the head event that was in the queue and rolls back the
current tick.
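
To make the sequence above concrete, here is a minimal, self-contained C++ sketch of the borrow-and-restore pattern. It is a toy model, not gem5 code: ToyEvent, ToyEventQueue, detachAll(), runUntilEmpty() and flushWithBorrowedQueue() are all hypothetical names invented for illustration. Unlike the behaviour described above, the sketch saves every pending event rather than only the head, which is one assumed way of avoiding the single-event assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Toy stand-ins for illustration only; none of these are gem5's real classes.
using Tick = std::uint64_t;

struct ToyEvent {
    std::string name;
    Tick when;
};

class ToyEventQueue {
  public:
    void schedule(const ToyEvent &ev) { events_.emplace(ev.when, ev); }
    std::size_t size() const { return events_.size(); }
    Tick curTick() const { return cur_tick_; }
    void setCurTick(Tick t) { cur_tick_ = t; }

    // Detach every pending event and hand the list back to the caller.
    std::vector<ToyEvent> detachAll() {
        std::vector<ToyEvent> saved;
        for (const auto &kv : events_) saved.push_back(kv.second);
        events_.clear();
        return saved;
    }

    // Service events in time order until the queue drains.
    // Returns the names of the events serviced.
    std::vector<std::string> runUntilEmpty() {
        std::vector<std::string> log;
        while (!events_.empty()) {
            auto it = events_.begin();
            cur_tick_ = it->first;  // time advances to the event's tick
            log.push_back(it->second.name);
            events_.erase(it);
        }
        return log;
    }

  private:
    std::multimap<Tick, ToyEvent> events_;
    Tick cur_tick_ = 0;
};

// Borrow the queue for a cache flush, then put everything back as it was.
std::vector<std::string> flushWithBorrowedQueue(ToyEventQueue &q,
                                                Tick flush_latency) {
    const Tick saved_tick = q.curTick();
    std::vector<ToyEvent> saved = q.detachAll();              // set pending events aside
    q.schedule({"cache_flush", saved_tick + flush_latency});  // inject the flush event
    std::vector<std::string> log = q.runUntilEmpty();         // run only the flush
    for (const auto &ev : saved) q.schedule(ev);              // restore the saved events
    q.setCurTick(saved_tick);                                 // roll back the current tick
    return log;
}
```

Calling flushWithBorrowedQueue(q, 10) on a queue that holds one pending event at tick 500, with the current tick at 100, services only the flush event and then leaves the queue with its original event and the tick back at 100.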

   One way to check if this cooldown operation is at fault for
unreliable checkpointing is to simply comment out the event queue
commandeering, and try to take a checkpoint. You may also be able to
test checkpoint restore by commenting out the cache warm-up code in
RubySystem::unserialize(). If checkpoint and restore work without the
event queue commandeering, it is likely that the event queue
manipulation is problematic.

   I'd also recommend taking a checkpoint and restoring from it with the
gem5 option --debug-flags=RubyCacheTrace, which will show what the cache
flushing and warm-up are doing, respectively.

   Joel



On Sat, Jun 13, 2015 at 9:48 AM, Nilay Vaish <[email protected]> wrote:

    Your bisection is not right.  You might want to take a look at the
    following changeset:


    changeset:   10756:f9c0692f73ec
    user:        Curtis Dunham <[email protected]>
    date:        Mon Mar 23 06:57:36 2015 -0400
    summary:     sim: Reuse the same limit_event in simulate()


    I suggest that you revert this changeset in your repo while I think
    about what needs to be done.

    --
    Nilay



    On Sat, 13 Jun 2015, Timothy M Jones wrote:

        Hi again,

        Further to this message, I've used hg bisect to find the
        revision that breaks checkpointing with ruby.  It's revision
        10524 that Nilay committed in November that's the first bad
        changeset.  It fails with the panic() on the missing event that
        I wrote about previously.

        I've scanned through the diff and can't immediately see any
        reason why this would break serialisation, although it does
        remove some of the code to serialise ruby state.

        Could anyone (Nilay?) give me a hint as to why this might break
        checkpointing with ruby?

        I've compiled with the MOESI_hammer protocol for x86, then run
        with this command line:

        ./build/X86/gem5.opt --remote-gdb-port=0 -d <outdir>
        configs/example/fs.py -n 1 --kernel <my-kernel> --script
        configs/boot/hack_back_ckpt.rcS --max-checkpoints 1
        --checkpoint-dir <cptdir> --disk-image <my-disk-image>
        --cpu-type timing --restore-with timing --ruby

        Any help would be appreciated.  I don't know ruby at all, so
        trying to work out what's going on is slow....

        Cheers
        Tim

        On 11/06/2015 20:48, Timothy M Jones wrote:

              Hello,

            Could someone tell me why we need to take the head event off
            the event queue in RubySystem::serialize() in
            src/mem/ruby/system/System.cc?

            Event* eventq_head = eventq->replaceHead(NULL);

            The problem I'm getting is that when simulate() is called a
            few lines later, it tries to reschedule the
            simulate_limit_event, but that causes a panic because it's no
            longer on the event queue.  This is happening when trying to
            take a checkpoint with ruby.  I can't work out from the
            comments why the head event needs to be taken off in the
            first place.

            This is basically the reason behind the problems in this
            thread:

            https://www.mail-archive.com/[email protected]/msg11701.html

              Thanks
              Tim


        --
        Timothy M. Jones
        http://www.cl.cam.ac.uk/~tmj32/
        _______________________________________________
        gem5-dev mailing list
        [email protected] <mailto:[email protected]>
        http://m5sim.org/mailman/listinfo/gem5-dev






--
   Joel Hestness
   PhD Candidate, Computer Architecture
   Dept. of Computer Science, University of Wisconsin - Madison
http://pages.cs.wisc.edu/~hestness/

--
Timothy M. Jones
http://www.cl.cam.ac.uk/~tmj32/
