Hey guys,
  I'm pretty sure Tim is correct that the checkpointing bugs were
introduced earlier than the changeset Nilay points to; gem5-gpu is
currently using gem5 rev 10645 <http://repo.gem5.org/gem5/rev/cd95d4d51659>,
and we cannot get reliable checkpoint and restore with it. Note that Tim's
bug may not be the only checkpointing bug that exists right now.

  To answer Tim's question: While taking a checkpoint, Ruby commandeers the
event queue to inject flushing memory accesses into the caches. This is
used to generate a trace of cache contents, which can be used to warm up
the caches on checkpoint restore. To take over control of the event queue,
Ruby clears the event at the queue head (I think this assumes there is only
1 event in the queue? This should probably be checked), and then adds its
own event for the cache flushing operation. After the caches have been
flushed (simulate() call in RubySystem::serialize()), Ruby restores the
head event that was in the queue and rolls back the current tick.
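
  To make the sequence above concrete, here is a minimal toy sketch of the
save / flush / restore pattern. It is not gem5 code: ToyEventQueue and
flushWithCommandeeredQueue are hypothetical names made up for illustration,
events are just strings, and the toy detaches every pending event rather
than only the head, so it sidesteps the single-event question raised above.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using Tick = std::uint64_t;

// Toy stand-in for an event queue (hypothetical, not gem5's EventQueue).
// Events are plain names ordered by the tick at which they fire.
struct ToyEventQueue {
    std::multimap<Tick, std::string> events;
    Tick curTick = 0;

    void schedule(Tick when, const std::string &name) {
        events.emplace(when, name);
    }

    // Run until the queue drains, advancing curTick as events fire.
    // Returns the names of the events that ran, in order.
    std::vector<std::string> simulate() {
        std::vector<std::string> ran;
        while (!events.empty()) {
            auto it = events.begin();
            curTick = it->first;
            ran.push_back(it->second);
            events.erase(it);
        }
        return ran;
    }
};

// Hypothetical analogue of the cooldown step described above.
std::vector<std::string> flushWithCommandeeredQueue(ToyEventQueue &q) {
    // 1. Remember the current tick and detach all pending events.
    const Tick savedTick = q.curTick;
    std::multimap<Tick, std::string> saved;
    saved.swap(q.events);
    assert(q.events.empty());  // the queue is now ours alone

    // 2. Schedule our own event for the cache flush.
    q.schedule(q.curTick + 1, "cache_flush");

    // 3. Run the queue; only the flush event can fire.
    std::vector<std::string> ran = q.simulate();

    // 4. Put the original events back and roll back the tick.
    q.events.swap(saved);
    q.curTick = savedTick;
    return ran;
}
```

  The point of the sketch is the invariant: after the call, the pending
events and the current tick should look exactly as they did before it. Any
event that the simulate() step expects to find on the queue, but that was
detached in step 1, is a candidate for the kind of failure Tim describes.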

  One way to check if this cooldown operation is at fault for unreliable
checkpointing is to simply comment out the event queue commandeering, and
try to take a checkpoint. You may also be able to test checkpoint restore
by commenting out the cache warm-up code in RubySystem::unserialize(). If
checkpoint and restore work without the event queue commandeering, it is
likely that the event queue manipulation is problematic.

  I'd also recommend taking a checkpoint and then restoring from it with the
gem5 flag --debug-flags=RubyCacheTrace set on both runs, which will show what
the cache flushing and warm-up are doing, respectively.

  Joel



On Sat, Jun 13, 2015 at 9:48 AM, Nilay Vaish <[email protected]> wrote:

> Your bisection is not right.  You might want to take a look at the
> following changeset:
>
>
> changeset:   10756:f9c0692f73ec
> user:        Curtis Dunham <[email protected]>
> date:        Mon Mar 23 06:57:36 2015 -0400
> summary:     sim: Reuse the same limit_event in simulate()
>
>
> I suggest that you revert this changeset in your repo while I think about
> what needs to be done.
>
> --
> Nilay
>
>
>
> On Sat, 13 Jun 2015, Timothy M Jones wrote:
>
>  Hi again,
>>
>> Further to this message, I've used hg bisect to find the revision that
>> breaks checkpointing with ruby.  It's revision 10524 that Nilay committed
>> in November that's the first bad changeset.  It fails with the panic() on
>> the missing event that I wrote about previously.
>>
>> I've scanned through the diff and can't immediately see any reason why
>> this would break serialisation, although it does remove some of the code to
>> serialise ruby state.
>>
>> Could anyone (Nilay?) give me a hint as to why this might break
>> checkpointing with ruby?
>>
>> I've compiled with the MOESI_hammer protocol for x86, then run with this
>> command line:
>>
>> ./build/X86/gem5.opt --remote-gdb-port=0 -d <outdir>
>> configs/example/fs.py -n 1 --kernel <my-kernel> --script
>> configs/boot/hack_back_ckpt.rcS --max-checkpoints 1 --checkpoint-dir
>> <cptdir> --disk-image <my-disk-image> --cpu-type timing --restore-with
>> timing --ruby
>>
>> Any help would be appreciated.  I don't know ruby at all, so trying to
>> work out what's going on is slow....
>>
>> Cheers
>> Tim
>>
>> On 11/06/2015 20:48, Timothy M Jones wrote:
>>
>>>  Hello,
>>>
>>>  Could someone tell me why we need to take the head event off the event
>>>  queue in RubySystem::serialize() in src/mem/ruby/system/System.cc?
>>>
>>>  Event* eventq_head = eventq->replaceHead(NULL);
>>>
>>>  The problem I'm getting is that when simulate() is called a few lines
>>>  later, it tries to reschedule the simulate_limit_event, but that causes
>>>  a panic because it's no longer on the event queue.  This is happening
>>>  when trying to take a checkpoint with ruby.  I can't work out from the
>>>  comments why the head event needs to be taken off in the first place.
>>>
>>>  This is basically the reason behind the problems in this thread:
>>>
>>>  https://www.mail-archive.com/[email protected]/msg11701.html
>>>
>>>  Thanks
>>>  Tim
>>>
>>>
>> --
>> Timothy M. Jones
>> http://www.cl.cam.ac.uk/~tmj32/
>> _______________________________________________
>> gem5-dev mailing list
>> [email protected]
>> http://m5sim.org/mailman/listinfo/gem5-dev
>>
>>



-- 
  Joel Hestness
  PhD Candidate, Computer Architecture
  Dept. of Computer Science, University of Wisconsin - Madison
  http://pages.cs.wisc.edu/~hestness/
