On Apr 25, 2011, at 11:21 AM, Kishor Kharbas wrote:

> Hello Developers,
>
> I am using Open MPI 1.5.3 for performing experiments with checkpoint and
> restart.
> However, when the number of nodes is more than 128, restart fails with a
> segmentation fault.
>
> After debugging the code, I found that the cause of this error is that
> variables of type int8_t are used at various places
> for storing the "id"s of the application to be run on each node.
>
> One example is in orte_odls_base_default_construct_child_list() in
> orte/mca/odls/base/odls_base_default_fns.c.
> Here int8_t *app_idx is used as a pointer to an array of app_ids of the
> processes in the job. In my case the app_ids exceed 127 and they are read
> as negative numbers.
>
> I think there are many other places in the code where int8_t is used to hold
> the application id.
>
> I tried some tricks like changing configure so that int8_t and uint8_t are
> not defined, hence int16_t is used instead.
> But I think the unpack function still expects int8_t, looking at the error
> which is raised:
> OPAL dss:unpack: got type 7 when expecting type 8
>
> Can someone provide a solution to this?

Probably won't happen for a while - this is something peculiar to the restart mechanism. I'll make a note to look at it, but it would be a low priority.

>
> Thank you.
> Kishor