Greetings!

I ran into errors when using Open MPI's checkpoint restart functionality.
After debugging the application(ompi-restart) I found
few variables overflow when running MPI application with more than 128
processes. I identified the places that cause an
overflow and changed the definition of the concerned variables.

I have attached a detailed bug-report with the mail describing the error
scenario and changes which I feel should be made.
The patch files corresponding to 2 files which need to be changed are
attached too.

I request the community to review the changes and incorporate them in the
code in an appropriate way.
Please let me know if more  information is need about this.

Thank you.
Kishor Kharbas
Environment:

Open MPI version : openmpi-1.5.3
108 Node cluster. All machines are 2-way SMPs with AMD Opteron 6128 (Magny 
Core) processors with 8 cores per socket (16 cores per node)
CentOS 5 Linux x86_64
Infiniband interconnect


Error scenario:


1. Run a MPI application with checkpoint-restart enabled
       mpirun -n 144 -am ft-enable-cr ~/mpiexample

   Using pbs, I allocated 9 hosts with 16 slots each.
   

2. Checkpoint the mpi task using
        ompi-checkpoint <pid of mpirun>

   This gives a snapshot reference, example - ompi_global_snapshot_31745.ckpt

3. Restart using the above checkpoint as
 
   ompi-restart ompi_global_snapshot_31745.ckpt


Output:

Running ompi-restart gives error,
        
[compute-0-70:01035] *** Process received signal ***
[compute-0-70:01035] Signal: Segmentation fault (11)
[compute-0-70:01035] Signal code: Invalid permissions (2)
[compute-0-70:01035] Failing at address: 0x2b54efecd0d0
[compute-0-70:01035] [ 0] /lib64/libpthread.so.0 [0x327de0eb10]
[compute-0-70:01035] [ 1] 
/usr/mpi/gcc/openmpi-1.5.1/lib64/libopen-rte.so.1(orte_odls_base_default_construct_child_list+0x64b)
 [0x2b54edbc0b4b]
[compute-0-70:01035] [ 2] 
/usr/mpi/gcc/openmpi-1.5.1/lib64/openmpi/mca_odls_default.so [0x2b54ef6a4b6f]
[compute-0-70:01035] [ 3] /usr/mpi/gcc/openmpi-1.5.1/lib64/libopen-rte.so.1 
[0x2b54edbafb44]
[compute-0-70:01035] [ 4] 
/usr/mpi/gcc/openmpi-1.5.1/lib64/libopen-rte.so.1(orte_daemon_cmd_processor+0x495)
 [0x2b54edbb2075]
[compute-0-70:01035] [ 5] /usr/mpi/gcc/openmpi-1.5.1/lib64/libopen-rte.so.1 
[0x2b54edbeb93e]
[compute-0-70:01035] [ 6] 
/usr/mpi/gcc/openmpi-1.5.1/lib64/libopen-rte.so.1(orte_daemon+0xb3f) 
[0x2b54edbadeaf]
[compute-0-70:01035] [ 7] orted [0x400899]
[compute-0-70:01035] [ 8] /lib64/libc.so.6(__libc_start_main+0xf4) 
[0x327d61d994]
[compute-0-70:01035] [ 9] orted [0x400789]
[compute-0-70:01035] *** End of error message ***


Changes required in code :

The reason for the error is that variables of type int8_t or uint8_t are used 
to hold unique integer ids assigned to each process which is being 
restarted(total 144 processes in above case). This causes the variables to 
overflow, and when used as array indexes lead to invalid memory.    reference. 
uint8_t variables are wrapped back to 0 when number of processes > 256

Following are the places which need to be changed, all them them are 

1. Member idx of struct orte_app_context_t may be assigned to values exceeding 
range of idx(int8_t). This requires increasing the size of member idx of the 
structure defined in orte/runtime/orte_globals.h
Example can be found in file orte/tools/orterun/orterun.c:parse_local() where 
member idx of struct variable app (declared as orte_app_context_t *app) may 
overflow. 

2. Similar change is required in member app_idx of struct orte_proc_t defined 
in orte/runtime/orte_globals.h

3. Few local variables used in orte/mca/odls/base/odls_base_default_fns.c. They 
hold the values from the member variables mentioned in the above lines.


I have implemented and tested the changes in the code. I have increased the 
size of variables from int8_t/uint8_t to int16_t/uint16_t. Ideally it would be 
desirable to have a special typedef for these kind of variables. But I am not 
sure which is the right place to include it.
I have generated 2 patch files corresponding to the 2 files which need to be 
changed. I request the community to review them and incorporate them in the 
code in an appropriate way.

Attachment: odls_base_default_fns.c.patch
Description: Binary data

Attachment: orte_globals.h.patch
Description: Binary data

Reply via email to