Hi,

What you have should work. Are you sure the application starts right after
the checkpoint command, i.e. that the script does not block anywhere? For
example, what is the output if you add an echo right before starting the
MPI app:

/sbin/m5 checkpoint 50000000000000

/sbin/m5 loadsymbol

/sbin/m5 resetstats
*echo "start the app"*
mpiexec -hosts=node1,node2 -np 2 ./cg.S.2


Do you see immediate progress in your application if you remove "/sbin/m5
checkpoint 50000000000000"?
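
In case it helps with the test above, here is a minimal sketch of the delay
arithmetic and the progress marker. It assumes the checkpoint delay is given
in nanoseconds, as the comment in node1.rcS states. The helper names
secs_to_ns and mark are hypothetical and are not part of the m5 utility.

```shell
#!/bin/sh
# Sketch only: secs_to_ns and mark are hypothetical helper names, not m5 commands.

# Convert whole seconds to nanoseconds (assumption: the m5 checkpoint delay
# is expressed in ns, as the comment in node1.rcS says).
secs_to_ns() {
    echo "$(( $1 * 1000000000 ))"
}

# Print a tagged marker so the console shows how far the script has progressed.
mark() {
    echo "[rcS] $*"
}

# Intended use inside node1.rcS (left commented out here because it needs
# /sbin/m5 and mpiexec inside the simulated system):
#   /sbin/m5 checkpoint "$(secs_to_ns 10)"
#   mark "checkpoint command returned, starting the app"
#   mpiexec -hosts=node1,node2 -np 2 ./cg.S.2
```

If the marker shows up on the console long before the checkpoint is taken,
the script is not blocking on the checkpoint command.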

Best,
Mohammad


On Mon, Mar 5, 2018 at 11:59 AM, David Kim <dkim.t...@gmail.com> wrote:

> Hello,
>
> I am trying to checkpoint dist-gem5 in the middle of an application's
> execution.
> The following are the script files I use to run dist-gem5 (with 2 nodes)
> after Linux boots.
>
> < for node 1 (node1.rcS)>
> *#!/bin/sh*
>
> *# Set up IP address for node 1*
> */sbin/ifconfig eth0 hw ether 00:90:00:00:00:02*
> */sbin/ifconfig eth0 192.168.0.2 netmask 255.255.255.0 up*
>
> *cd /root/NPB3.3.1/NPB3.3-MPI/bin*
>
> *# checkpoint after a delay (in ns, so the delay below represents 50000*
> *# seconds! I have also tested 0.1 s, 10 s, and 100 s delays)*
> */sbin/m5 checkpoint 50000000000000*
>
> */sbin/m5 loadsymbol*
>
> */sbin/m5 resetstats*
> *mpiexec -hosts=node1,node2 -np 2 ./cg.S.2*
> */sbin/m5 exit*
>
> < for node 2  (node2.rcS) >
> *#!/bin/sh*
>
>
> *# Set up IP address for node 2*
> */sbin/ifconfig eth0 hw ether 00:90:00:00:00:03*
> */sbin/ifconfig eth0 192.168.0.3 netmask 255.255.255.0 up*
>
> And here are the command lines I use to run dist-gem5 (for certain reasons
> I do not use gem5-dist.sh, and these command lines work well in general).
>
> *For the switch node,*
>
> *./build/ARM/gem5.opt -d ./m5out.switch ./configs/dist/sw.py --is-switch
> --dist-size=2 --dist-server-name=localhost --dist-server-port=2200*
>
> *For the compute nodes (here is the one for node1),*
> *./build/ARM/gem5.opt -d ./m5out.0 ./configs/example/fs.py
> --machine-type=VExpress_EMM64
> --disk-image=aarch64-ubuntu-trusty-headless.img
> --kernel=vmlinux.aarch64.20140821
> --dtb-filename=vexpress.aarch64.20140821.dtb --cpu-type=TimingSimpleCPU
> --num-cpus=1 --caches --l2cache --mem-size=512MB --mem-channels=1
> --mem-ranks=1 --script=./node1.rcS --dist --dist-rank=0 --dist-size=2
> --dist-server-name=localhost --dist-server-port=2200
> --dist-sync-start=1000000000000t*
>
> I have increased the checkpoint delay to see whether the checkpoint image
> changes, but the behavior seems to be the same: gem5 waits for that amount
> of time without running the application and then takes the checkpoint. No
> progress is displayed on the console until the checkpoint, and restoring
> from it prints all the application output from the beginning.
>
> To checkpoint in the middle of a running application, for example 1 billion
> cycles after the application starts, should I only use the m5_roi_begin()
> and m5_roi_end() calls in the application's source code? (I have not tested
> this yet, but I guess it will work.) Is it not possible to simply add a
> delay to the checkpoint command as shown above, and thus avoid changing the
> application's source code?
>
> Any comments would be appreciated.
>
> Thanks.
>
> Regards,
> Dong Wan Kim
>
> _______________________________________________
> gem5-users mailing list
> gem5-users@gem5.org
> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
>
