Hi:
        I have a problem detecting connection health. Here is what I do.
Rank 0 and rank 1 set up a QP connection.
Rank 0 waits for a message from rank 1. During this time, rank 0 periodically
sends a heart-beat message to
rank 1 to detect whether the connection is OK or rank 1 has died.

        The heart-beat is a zero-byte RDMA message:

                sr.next = NULL;
                sr.wr_id = (uint64_t)(AULONG)rdmahdr;

                sr.sg_list = &ssg;
                sr.num_sge = 0;         /* zero-byte payload */
                sr.opcode = IBV_WR_RDMA_WRITE;
                /* signaled, so a completion is generated for the heart-beat */
                sr.send_flags = IBV_SEND_INLINE|IBV_SEND_SIGNALED;

        If this heart-beat message completes successfully, I assume the
connection is OK and the peer process is alive.

        However, in one test rank 1 calls fork(), the parent calls exit(), and
the child sleeps for 5 minutes. On rank 0,
the heart-beat message keeps succeeding until I kill rank 1's child.

        In a second test, rank 1 calls fork() and exits, and the child calls
execl("/bin/sleep", "sleep", "300", (char *)0);

        On rank 0, the heart-beat still succeeds until I kill the 'sleep'
process.

        It is easy to understand that if only fork() is called, the child
holds the QP resources inherited from the parent, so rank 0 can NOT detect
anything wrong. But if the child calls exec, everything in rank 1 has been
destroyed, so why can't rank 0 detect that the connection is broken?
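
        To illustrate the kind of resource inheritance I mean, here is a
minimal sketch that uses an ordinary pipe instead of verbs. It is only an
analogy: the helper name peer_looks_alive() and its parameters are made up for
this example, and I have not confirmed that the verbs device descriptor
behaves the same way. The sketch shows that a descriptor without FD_CLOEXEC
stays open across exec, so the other side sees no hang-up until the exec'ed
'sleep' is killed.

```c
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of any verbs API).
 * The child plays "rank 1": it holds the write end of a pipe and then
 * execs /bin/sleep. The parent plays "rank 0": it watches the read end.
 * Returns 1 if no hang-up was seen within timeout_ms (peer looks alive),
 * 0 if the write end was closed, and -1 on error.
 */
int peer_looks_alive(int cloexec, int timeout_ms)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    /* Optionally mark the write end close-on-exec before forking. */
    if (cloexec && fcntl(fds[1], F_SETFD, FD_CLOEXEC) != 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* "rank 1": drop the read end, then replace the process image. */
        close(fds[0]);
        execl("/bin/sleep", "sleep", "300", (char *)0);
        _exit(127); /* only reached if exec failed */
    }

    /* "rank 0": keep only the read end and wait for a hang-up. */
    close(fds[1]);
    struct pollfd p = { .fd = fds[0], .events = POLLIN };
    int n = poll(&p, 1, timeout_ms);

    /* Clean up the sleeping child regardless of the outcome. */
    kill(pid, SIGKILL);
    waitpid(pid, NULL, 0);
    close(fds[0]);

    if (n < 0)
        return -1;
    return n == 0; /* timeout means nobody closed the write end */
}
```

On a typical Linux system I would expect peer_looks_alive(0, 300) to return 1,
because 'sleep' still holds the inherited write end, and
peer_looks_alive(1, 3000) to return 0, because exec closes the descriptor.
Whether the QP in my test stays alive for a similar reason is exactly what I
am unsure about.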


        Thanks for any help.


--CQ Tang, HP-MPI