Dear all,

Thanks to the Open MPI developers' work on fault tolerance, the checkpoint/restart function works very well for my MPI applications. However, checkpointing does not work for my GASNet applications that use the MPI conduit. Can anyone help me?
I wrote some code with the GASNet API (Global-Address Space Networking: http://gasnet.cs.berkeley.edu/) and used the MPI conduit for my GASNet application, so my program runs well under Open MPI's mpirun. I therefore expected that I could also use the transparent checkpoint/restart function that Open MPI supports through BLCR. Contrary to my expectation, it does not work and shows the following error message:

--------------------------------------------------------------------------
Error: The process with PID 13896 is not checkpointable.
       This could be due to one of the following:
        - An application with this PID doesn't currently exist
        - The application with this PID isn't checkpointable
        - The application with this PID isn't an OPAL application.
       We were looking for the named files:
         /tmp/opal_cr_prog_write.13896
         /tmp/opal_cr_prog_read.13896
--------------------------------------------------------------------------
1 more process has sent help message help-opal-checkpoint.txt
Set MCA parameter "orte_base_help_aggregate" to 0 to see all help
0] 13896) Step 53
0] 15100) Step 53
0] 13896) Step 54
0] 15100) Step 54
0] 13896) Step 55

In my application, MPI_Initialized() reports that MPI is initialized.

Thank you for reading, and have a great day.
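P.S. In case it helps with diagnosis, here is a minimal C sketch (not part of my application) that checks whether the two files named in the error message exist for a given PID. The helper names cr_format_path and cr_files_present are hypothetical. The path pattern is copied from the message above, and I am only assuming that the absence of these files is what the checkpoint tool is complaining about.

```c
#include <stdio.h>
#include <unistd.h>

/* Build "/tmp/opal_cr_prog_<kind>.<pid>", the naming pattern shown in the
 * error message. Returns 0 on success, -1 if the buffer is too small.
 * (Hypothetical helper, not an Open MPI function.) */
static int cr_format_path(char *buf, size_t len, const char *kind, long pid)
{
    int n = snprintf(buf, len, "/tmp/opal_cr_prog_%s.%ld", kind, pid);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Return 1 if both the "write" and "read" files exist for the given PID,
 * 0 otherwise. Assumption: their presence is what the tool looks for. */
static int cr_files_present(long pid)
{
    char path[128];
    const char *kinds[] = { "write", "read" };

    for (int i = 0; i < 2; i++) {
        if (cr_format_path(path, sizeof path, kinds[i], pid) != 0)
            return 0;
        if (access(path, F_OK) != 0)   /* file is missing or unreachable */
            return 0;
    }
    return 1;
}
```

Calling cr_files_present(13896) while the job is running would show whether those files were ever created for that process.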