Hello Developers,

I am using Open MPI-1.5.3 for performing experiments with checkpoint and
restart.
However when the number of nodes is more than 128, restart fails with an
segmentation fault.

After debugging the code, I found that the cause of this error is that
variables of type int_8 are used at various places
for storing the "id"s of the application to be run on each node.

One example is in orte_odls_base_default_construct_child_list()  in
orte/mca/odls/base/odls_base_default_fns.c.
Here int8_t *app_idx is used as a pointer array of app_ids of the processes
in the job. In my case the app_ids exceed 127 and they are read
as a negative numbers.

I think there are many other places in the code where int8_t is used to hold
the application id.

I tried some tricks like changing configure so that int8_t and uint8_t are
no defined, hence int16_t is used instead.
But I think the function unpack still expects int8_t, looking at the error
which is raised -OPAL dss:unpack: got type 7 when expecting type 8

Can someone provide a solution to this.

Thank you.
Kishor

Reply via email to