Greetings! I ran into errors when using Open MPI's checkpoint restart functionality. After debugging the application(ompi-restart) I found few variables overflow when running MPI application with more than 128 processes. I identified the places that cause an overflow and changed the definition of the concerned variables.
I have attached a detailed bug-report with the mail describing the error scenario and changes which I feel should be made. The patch files corresponding to 2 files which need to be changed are attached too. I request the community to review the changes and incorporate them in the code in an appropriate way. Please let me know if more information is need about this. Thank you. Kishor Kharbas
Environment: Open MPI version : openmpi-1.5.3 108 Node cluster. All machines are 2-way SMPs with AMD Opteron 6128 (Magny Core) processors with 8 cores per socket (16 cores per node) CentOS 5 Linux x86_64 Infiniband interconnect Error scenario: 1. Run a MPI application with checkpoint-restart enabled mpirun -n 144 -am ft-enable-cr ~/mpiexample Using pbs, I allocated 9 hosts with 16 slots each. 2. Checkpoint the mpi task using ompi-checkpoint <pid of mpirun> This gives a snapshot reference, example - ompi_global_snapshot_31745.ckpt 3. Restart using the above checkpoint as ompi-restart ompi_global_snapshot_31745.ckpt Output: Running ompi-restart gives error, [compute-0-70:01035] *** Process received signal *** [compute-0-70:01035] Signal: Segmentation fault (11) [compute-0-70:01035] Signal code: Invalid permissions (2) [compute-0-70:01035] Failing at address: 0x2b54efecd0d0 [compute-0-70:01035] [ 0] /lib64/libpthread.so.0 [0x327de0eb10] [compute-0-70:01035] [ 1] /usr/mpi/gcc/openmpi-1.5.1/lib64/libopen-rte.so.1(orte_odls_base_default_construct_child_list+0x64b) [0x2b54edbc0b4b] [compute-0-70:01035] [ 2] /usr/mpi/gcc/openmpi-1.5.1/lib64/openmpi/mca_odls_default.so [0x2b54ef6a4b6f] [compute-0-70:01035] [ 3] /usr/mpi/gcc/openmpi-1.5.1/lib64/libopen-rte.so.1 [0x2b54edbafb44] [compute-0-70:01035] [ 4] /usr/mpi/gcc/openmpi-1.5.1/lib64/libopen-rte.so.1(orte_daemon_cmd_processor+0x495) [0x2b54edbb2075] [compute-0-70:01035] [ 5] /usr/mpi/gcc/openmpi-1.5.1/lib64/libopen-rte.so.1 [0x2b54edbeb93e] [compute-0-70:01035] [ 6] /usr/mpi/gcc/openmpi-1.5.1/lib64/libopen-rte.so.1(orte_daemon+0xb3f) [0x2b54edbadeaf] [compute-0-70:01035] [ 7] orted [0x400899] [compute-0-70:01035] [ 8] /lib64/libc.so.6(__libc_start_main+0xf4) [0x327d61d994] [compute-0-70:01035] [ 9] orted [0x400789] [compute-0-70:01035] *** End of error message *** Changes required in code : The reason for the error is that variables of type int8_t or uint8_t are used to hold unique integer ids assigned to each process which is being restarted(total 144 processes in above case). This causes the variables to overflow, and when used as array indexes lead to invalid memory. reference. uint8_t variables are wrapped back to 0 when number of processes > 256 Following are the places which need to be changed, all them them are 1. Member idx of struct orte_app_context_t may be assigned to values exceeding range of idx(int8_t). This requires increasing the size of member idx of the structure defined in orte/runtime/orte_globals.h Example can be found in file orte/tools/orterun/orterun.c:parse_local() where member idx of struct variable app (declared as orte_app_context_t *app) may overflow. 2. Similar change is required in member app_idx of struct orte_proc_t defined in orte/runtime/orte_globals.h 3. Few local variables used in orte/mca/odls/base/odls_base_default_fns.c. They hold the values from the member variables mentioned in the above lines. I have implemented and tested the changes in the code. I have increased the size of variables from int8_t/uint8_t to int16_t/uint16_t. Ideally it would be desirable to have a special typedef for these kind of variables. But I am not sure which is the right place to include it. I have generated 2 patch files corresponding to the 2 files which need to be changed. I request the community to review them and incorporate them in the code in an appropriate way.
odls_base_default_fns.c.patch
Description: Binary data
orte_globals.h.patch
Description: Binary data