Hello,

I looked at the output again, and it printed the following
message:
info: m5 checkpoint called with non-zero delay => triggering immediate
checkpoint (at the next sync)

So I looked at the source code that prints that message. Here is the
relevant code snippet:

@ src/dev/net/dist_iface.cc

bool
DistIface::readyToCkpt(Tick delay, Tick period)
{
    bool ret = true;
    DPRINTF(DistEthernet, "DistIface::readyToCkpt() called, delay:%lu "
            "period:%lu\n", delay, period);
    if (master) {
        if (delay == 0) {
            inform("m5 checkpoint called with zero delay => triggering "
                   "collaborative checkpoint\n");
            sync->requestCkpt(ReqType::collective);
        } else {
            inform("m5 checkpoint called with non-zero delay => triggering "
                   "immediate checkpoint (at the next sync)\n");
            sync->requestCkpt(ReqType::immediate);
        }
        if (period != 0)
            inform("Non-zero period for m5_ckpt is ignored in "
                   "distributed gem5 runs\n");
        ret = false;
    }
    return ret;
}

@ src/sim/pseudo_inst.cc

void
m5checkpoint(ThreadContext *tc, Tick delay, Tick period)
{
    DPRINTF(PseudoInst, "PseudoInst::m5checkpoint(%i, %i)\n", delay,
            period);
    if (!tc->getCpuPtr()->params()->do_checkpoint_insts)
        return;

    if (DistIface::readyToCkpt(delay, period)) {
        Tick when = curTick() + delay * SimClock::Int::ns;
        Tick repeat = period * SimClock::Int::ns;
        exitSimLoop("checkpoint", 0, when, repeat);
    }
}
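
To make the control flow of the two functions above easier to follow, here
is a minimal Python sketch of it. This is only a model, not gem5 code: the
function names, the `dist_run` flag (standing in for `master` being set),
the returned tuples, and the value of TICKS_PER_NS are assumptions made for
illustration.

```python
# Hypothetical Python model of the two quoted C++ functions; not gem5 code.

TICKS_PER_NS = 1000  # assumption: 1 tick = 1 ps, so SimClock::Int::ns == 1000


def ready_to_ckpt(dist_run, delay, period):
    """Model DistIface::readyToCkpt(): return (ret, requested_type)."""
    if not dist_run:
        # `master` is not set, so the caller schedules the checkpoint itself.
        return True, None
    # Only zero vs. non-zero matters here; the delay value itself is unused,
    # and a non-zero period is ignored as well.
    request = "collective" if delay == 0 else "immediate"
    return False, request


def m5checkpoint(dist_run, delay_ns, period_ns, cur_tick=0):
    """Model m5checkpoint(): report which path handles the checkpoint."""
    ready, request = ready_to_ckpt(dist_run, delay_ns, period_ns)
    if ready:
        # Non-distributed path: an exit event is scheduled after the delay.
        when = cur_tick + delay_ns * TICKS_PER_NS
        return ("exit_event", when)
    # Distributed path: the request is handed to the sync logic instead.
    return ("dist_sync", request)


if __name__ == "__main__":
    print(m5checkpoint(True, 50000000000000, 0))
    print(m5checkpoint(False, 100, 0))
```

In this model, a distributed run never reaches the branch that adds the
delay to the current tick, which matches the "immediate checkpoint (at the
next sync)" message in the log.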

Since the checkpoint delay is non-zero, it seems to force a checkpoint at
the next sync rather than after the requested delay.
In this simulation I passed '--dist-sync-start=1000000000000t', so I think
a sync will happen every 1 s of simulated time, right?
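
The unit arithmetic behind these numbers can be checked with a short Python
sketch. It assumes a tick resolution of 1 ps (1e12 ticks per simulated
second), which I believe is the gem5 default; the helper names are made up
for illustration. The conversion only says how much simulated time the
values represent. It does not show whether the option sets a repeat
interval or only the time of the first sync.

```python
# Unit-conversion sketch; helper names are hypothetical, not gem5 APIs.

TICKS_PER_SECOND = 10**12  # assumption: 1 tick = 1 ps
NS_PER_SECOND = 10**9


def parse_tick_option(text):
    """Parse an option value such as '1000000000000t' into a tick count."""
    if not text.endswith("t"):
        raise ValueError("expected a value with a 't' (ticks) suffix")
    return int(text[:-1])


def ticks_to_seconds(ticks):
    """Convert ticks to simulated seconds under the 1 ps assumption."""
    return ticks / TICKS_PER_SECOND


def ns_to_seconds(ns):
    """Convert an m5 checkpoint delay given in ns to simulated seconds."""
    return ns / NS_PER_SECOND


if __name__ == "__main__":
    print(ticks_to_seconds(parse_tick_option("1000000000000t")))
    print(ns_to_seconds(50000000000000))
```

Under that assumption, 1000000000000t corresponds to 1 s of simulated time
and the checkpoint delay of 50000000000000 ns corresponds to 50000 s.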

FYI, I added the 'echo' command, but its output was not printed, so I think
the simulation did not reach that point.

Can you explain what exactly is happening in the dist-gem5 checkpoint
routine? Any suggestions or ideas would be appreciated.

Thanks.

Dong Wan Kim


On Mon, Mar 5, 2018 at 6:01 PM, Mohammad Alian <m.alian1...@gmail.com>
wrote:

> Hi,
>
> What you have should work. Are you sure that you start the application
> after the checkpoint command (you don't block anywhere)? E.g., what would
> be the output if you add an echo right before starting the MPI app:
>
> /sbin/m5 checkpoint 50000000000000
>
> /sbin/m5 loadsymbol
>
> /sbin/m5 resetstats
> *echo "start the app"*
> mpiexec -hosts=node1,node2 -np 2 ./cg.S.2
>
>
> Do you see immediate progress in your application if you remove "/sbin/m5
> checkpoint 50000000000000"?
>
> Best,
> Mohammad
>
>
> On Mon, Mar 5, 2018 at 11:59 AM, David Kim <dkim.t...@gmail.com> wrote:
>
>> Hello,
>>
>> I am trying to checkpoint dist-gem5 in the middle of the execution of an
>> application.
>> The following are the script files that I use to run dist-gem5 (with 2
>> nodes) after Linux boots.
>>
>> <for node 1 (node1.rcS)>
>> #!/bin/sh
>>
>> # Set up IP address for node 1
>> /sbin/ifconfig eth0 hw ether 00:90:00:00:00:02
>> /sbin/ifconfig eth0 192.168.0.2 netmask 255.255.255.0 up
>>
>> cd /root/NPB3.3.1/NPB3.3-MPI/bin
>>
>> # checkpoint after delay (in ns, so the below delay represents 50000
>> # seconds! I have also tested 0.1s, 10s, and 100s delay)
>> /sbin/m5 checkpoint 50000000000000
>>
>> /sbin/m5 loadsymbol
>>
>> /sbin/m5 resetstats
>> mpiexec -hosts=node1,node2 -np 2 ./cg.S.2
>> /sbin/m5 exit
>>
>> <for node 2 (node2.rcS)>
>> #!/bin/sh
>>
>> # Set up IP address for node 2
>> /sbin/ifconfig eth0 hw ether 00:90:00:00:00:03
>> /sbin/ifconfig eth0 192.168.0.3 netmask 255.255.255.0 up
>>
>> Here are my command lines for running dist-gem5 (I did not use
>> gem5-dist.sh for some reason; these command lines work well in general).
>>
>> For the switch node:
>>
>> ./build/ARM/gem5.opt -d ./m5out.switch ./configs/dist/sw.py --is-switch \
>>     --dist-size=2 --dist-server-name=localhost --dist-server-port=2200
>>
>> For the compute nodes (here is the one for node1):
>> /build/ARM/gem5.opt -d ./m5out.0 ./configs/example/fs.py \
>>     --machine-type=VExpress_EMM64 \
>>     --disk-image=aarch64-ubuntu-trusty-headless.img \
>>     --kernel=vmlinux.aarch64.20140821 \
>>     --dtb-filename=vexpress.aarch64.20140821.dtb --cpu-type=TimingSimpleCPU \
>>     --num-cpus=1 --caches --l2cache --mem-size=512MB --mem-channels=1 \
>>     --mem-ranks=1 --script=./node1.rcS --dist --dist-rank=0 --dist-size=2 \
>>     --dist-server-name=localhost --dist-server-port=2200 \
>>     --dist-sync-start=1000000000000t
>>
>> I have increased the checkpoint delay to see whether the checkpoint image
>> changes, but the behavior is the same: gem5 waits for that amount of time
>> without running the application and then takes the checkpoint. No progress
>> is displayed on the console until the checkpoint, and after restoring,
>> gem5 prints all of the application output from the beginning.
>>
>> To checkpoint in the middle of an application run, for example 1 billion
>> cycles after the application starts, do I have to use the m5_roi_begin()
>> and m5_roi_end() calls in the application's source code (I have not tested
>> this yet, but I guess it will work)? Or can I just add a delay to the
>> checkpoint command as shown above, and thus leave the application source
>> code unchanged?
>>
>> Any comment will be appreciated.
>>
>> Thanks.
>>
>> Regards,
>> Dong Wan Kim
>>
>> _______________________________________________
>> gem5-users mailing list
>> gem5-users@gem5.org
>> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
>>