Dear Josh,

First of all, thank you for your continued attention to this issue.
About the problem: even though I followed your suggestion below, the checkpoint still did not work.

So append this value to your $HOME/.openmpi/mca-params.conf file
#-----------------
mca_base_param_file_prefix=ft-enable-cr
#-----------------

Sincerely,
Thomas

On Mon, Jan 11, 2010 at 2:21 PM, Josh Hursey <jjhur...@open-mpi.org> wrote:

> (Sorry for the delay in replying. I am still sorting through a backlog of
> holiday email buildup).
>
> On Dec 10, 2009, at 7:32 PM, Chang IL Yoon wrote:
>
> Dear Josh.
>>
>> Thank you for keeping attention on this problem.
>>
>> On Wed, Dec 9, 2009 at 8:40 AM, Josh Hursey <jjhur...@open-mpi.org>
>> wrote:
>>
>> On Dec 3, 2009, at 2:01 PM, Chang IL Yoon wrote:
>>
>> Dear Josh and Paul.
>>
>> First of all, thank you very much for your interest in my problem.
>>
>> 1) I tested it again with MPIRUN_CMD as 'mpirun -am ft-enable-cr -np %N
>> %P'
>> But the checkpoint did not work.
>>
>> Is it giving the same error?
>>
>> Can you send me information on how you configured Open MPI on your system?
>>
>> Yes, it gives the same error.
>>
>> When I was installing open-mpi-1.3.3, I used the following
>> configuration.
>>
>> ./configure --enable-ft-thread --with-ft=cr --enable-mpi-threads
>> --with-blcr={BLCR_DIR} --with-blcr-libdir={BLCR_LIBDIR}
>> --prefix={OPENMPI_DIR}
>>
>> What kind of configuration information do you need?
>>
>
> This looks fine to me.
>
>> 2) Here is more information on my MPI configuration.
>> - What version of Open MPI are you using?
>>
>> I am using Open-MPI ver 1.3.3 with BLCR ver 0.8.2
>>
>> - How did you configure Open MPI?
>>
>> ./configure --enable-ft-thread --with-ft=cr --enable-mpi-threads
>> --with-blcr={BLCR_DIR} --with-blcr-libdir={BLCR_LIBDIR}
>> --prefix={OPENMPI_DIR}
>>
>> - What arguments are being passed to 'mpirun' when running with GASNet?
>>
>> mpirun -am ft-enable-cr --machinefile ./machinefile -np 1 ./personal
>>
>> The '-np 1' argument is a bit puzzling to me, don't you want this to be >1
>> normally? GASNet does not use any MPI dynamic process management interfaces
>> (e.g., MPI_Comm_spawn), does it?
>>
>> Sorry, I actually do not know whether GASNet uses MPI dynamic process
>> management or not.
>>
>
> It probably does not (not many applications do), but it could be a problem
> if they do.
>
>>
>> personal is the same program as my-app.c, except that it uses gasnet_init()
>> and gasnet_exit() instead of MPI_Init() and MPI_Finalize().
>>
>> my-app.c is in http://osl.iu.edu/research/ft/ompi-cr/examples.php.
>>
>> gasnet_init() and gasnet_exit() use MPI_Init() and MPI_Finalize().
>>
>> So you are using the program from the SELF checkpoint example? If Open MPI
>> detects that the application has the appropriate function callbacks to use
>> the SELF CRS (which this example does) then it will -not- use the BLCR
>> component, but instead select the SELF component.
>>
>> Try using a simple counting program instead of that particular example.
>> You could also just remove the opal_crs_self_user_* and my_personal_*
>> functions from the example program to reduce it to one.
>>
>> I'm not sure why the checkpoint would not work even with the SELF CRS.
>> I'll have to check on that.
>>
>> Even though I used a simple counting program, the checkpoint did not
>> work.
>>
>
> Humm... Everything seems to be set up correctly, and the application is
> still behaving like it is not getting the '-am ft-enable-cr' parameter. The
> only other thing I can think of to try is to set this value in the
> $HOME/.openmpi/mca-params.conf file. It looks a bit different but if you add
> the following line it should work (as long as $HOME is mounted on all of the
> machines).
>
> So append this value to your $HOME/.openmpi/mca-params.conf file and see if
> that helps.
> #-----------------
> mca_base_param_file_prefix=ft-enable-cr
> #-----------------
>
> If that doesn't work, I'll have to think a bit more about what might be
> going wrong here.
>
> -- Josh
>
>> - Do you have any environment variables/MCA parameters set for Open MPI?
>>
>> yes
>> $HOME/.openmpi/mca-params.conf
>> # Local snapshot directory (not used in this scenario)
>> crs_base_snapshot_dir=${HOME}/temp
>>
>> # Remote snapshot directory (globally mounted file system)
>> snapc_base_global_snapshot_dir=${HOME}/checkpoints
>>
>> - My network interconnect is InfiniBand/OpenIB (IP over IB).
>>
>> These all look fine to me.
>>
>> 3) If there is anything I can do to help solve this problem, please let me
>> know without any hesitation.
>>
>> Thank you again for reading.
>>
>> Sincerely
>>
>> On Tue, Dec 1, 2009 at 1:49 PM, Paul H. Hargrove <phhargr...@lbl.gov>
>> wrote:
>> Thomas,
>>
>> In connection with Josh's question about mpirun arguments, I suggest you
>> try setting
>> MPIRUN_CMD='mpirun -am ft-enable-cr -np %N %P %A'
>> in your environment before launching the GASNet application. This will
>> instruct GASNet's wrapper around mpirun to include the flag Josh mentioned.
>>
>> -Paul
>>
>> Josh Hursey wrote:
>> Thomas,
>>
>> I have not tried to use the checkpoint/restart feature with GASNet over
>> MPI, so I cannot comment directly on how they interact. However, the
>> combination should work as long as the proper arguments (-am ft-enable-cr)
>> are passed along to the mpirun command, and Open MPI is configured properly.
>>
>> The error message that you copied seems to indicate that the local daemon
>> on one of the nodes failed to start a checkpoint of the target application.
>> Often this is caused by one of two things:
>> - Open MPI was not configured with the fault tolerance thread, and the
>> application is waiting for a long time in a computation loop (not entering
>> the MPI library).
>> - The '-am ft-enable-cr' flag was not provided to the mpirun process, so
>> the MPI application did not activate the C/R specific code paths and is
>> therefore denying the request to checkpoint.
>>
>> Can you send me a bit more information:
>> - What version of Open MPI are you using?
>> - How did you configure Open MPI?
>> - What arguments are being passed to 'mpirun' when running with GASNet?
>> - Do you have any environment variables/MCA parameters set for Open MPI?
>>
>> -- Josh
>>
>> On Nov 22, 2009, at 7:13 PM, Thomas CI Yoon wrote:
>>
>> Dear all.
>>
>> Thanks to the developers of OPEN-MPI for Fault-Tolerance, I can use the
>> checkpoint/restart function very well for my MPI applications.
>> But its checkpoint does not work for my GASNet applications which use the
>> MPI conduit.
>> Is there anyone who can help me?
>> I wrote some code with the GASNet API (Global-Address Space Networking:
>> http://gasnet.cs.berkeley.edu/) and used the MPI conduit for my gasnet
>> application, so my program ran well with open-mpirun. Thus I thought that I
>> could also use the transparent checkpoint/restart function supported by BLCR
>> in Open-mpi. Contrary to my expectation, it does not work and shows the
>> following error message.
>> --------------------------------------------------------------------------
>> Error: The process with PID 13896 is not checkpointable.
>> This could be due to one of the following:
>> - An application with this PID doesn't currently exist
>> - The application with this PID isn't checkpointable
>> - The application with this PID isn't an OPAL application.
>> We were looking for the named files:
>> /tmp/opal_cr_prog_write.13896
>> /tmp/opal_cr_prog_read.13896
>> --------------------------------------------------------------------------
>> 1 more process has sent help message help-opal-checkpoint.txt
>> Set MCA parameter "orte_base_help_aggregate" to 0 to see all help
>> 0] 13896) Step 53
>> 0] 15100) Step 53
>> 0] 13896) Step 54
>> 0] 15100) Step 54
>> 0] 13896) Step 55
>>
>> In my application, MPI_Initialized() reports that MPI is initialized.
>>
>> Thank you for reading and have a great day.
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>> --
>> Paul H. Hargrove                          phhargr...@lbl.gov
>> Future Technologies Group                 Tel: +1-510-495-2352
>> HPC Research Department                   Fax: +1-510-486-6900
>> Lawrence Berkeley National Laboratory