On Apr 25, 2011, at 11:21 AM, Kishor Kharbas wrote:

> Hello Developers,
> 
> I am using Open MPI-1.5.3 for performing experiments with checkpoint and 
> restart.
> However, when the number of nodes is more than 128, restart fails with a 
> segmentation fault.
> 
> After debugging the code, I found that the cause of this error is that 
> variables of type int8_t are used in various places
> for storing the "id"s of the application to be run on each node.
> 
> One example is in orte_odls_base_default_construct_child_list()  in 
> orte/mca/odls/base/odls_base_default_fns.c.
> Here int8_t *app_idx is used as a pointer to an array of app_ids of the processes 
> in the job. In my case the app_ids exceed 127 and they are read
> as negative numbers.
> 
> I think there are many other places in the code where int8_t is used to hold 
> the application id.
> 
> I tried some tricks like changing configure so that int8_t and uint8_t are not 
> defined, hence int16_t is used instead.
> But I think the function unpack still expects int8_t, looking at the error 
> which is raised: OPAL dss:unpack: got type 7 when expecting type 8
> 
> Can someone provide a solution to this?

Probably won't happen for a while - this is something peculiar to the restart 
mechanism. I'll make a note to look at it, but it would be a low priority.

> 
> Thank you.
> Kishor
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

