Hmmm... well, the output indicates both daemons crashed, but it doesn't really indicate where the crash occurs. If you have a core file, perhaps you can get a line number. Are you perhaps trying to send to someone who died?
One nit: in your vprotocol code, you re-use the buffer for both the send and the recv. That's okay, but you need to OBJ_RELEASE the buffer after the send and before calling recv.
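Something like the following is what I had in mind - just an untested sketch built on the function you posted, with the error checks abbreviated. I'm assuming the daemon's reply really does come back on the tag value you chose, and that you don't need the send buffer once send_buffer returns:

    int mca_vprotocol_receiver_request_protector(void)
    {
        orte_daemon_cmd_flag_t command = ORTE_DAEMON_REQUEST_PROTECTOR_CMD;
        opal_buffer_t *buffer;
        int n = 1, rc;

        /* pack and send the request to the local daemon */
        buffer = OBJ_NEW(opal_buffer_t);
        if (ORTE_SUCCESS != (rc = opal_dss.pack(buffer, &command, 1, ORTE_DAEMON_CMD))) {
            ORTE_ERROR_LOG(rc);
            OBJ_RELEASE(buffer);
            return rc;
        }
        if (0 > orte_rml.send_buffer(ORTE_PROC_MY_DAEMON, buffer, ORTE_RML_TAG_DAEMON, 0)) {
            ORTE_ERROR_LOG(ORTE_ERR_COMM_FAILURE);
            OBJ_RELEASE(buffer);
            return ORTE_ERR_COMM_FAILURE;
        }

        /* done with the send buffer - release it before posting the recv */
        OBJ_RELEASE(buffer);

        /* blocking recv into a fresh buffer, on the tag the daemon
         * replies on (your code uses the cmd value as the tag) */
        buffer = OBJ_NEW(opal_buffer_t);
        if (0 > orte_rml.recv_buffer(ORTE_PROC_MY_DAEMON, buffer, ORTE_DAEMON_REQUEST_PROTECTOR_CMD, 0)) {
            ORTE_ERROR_LOG(ORTE_ERR_COMM_FAILURE);
            OBJ_RELEASE(buffer);
            return ORTE_ERR_COMM_FAILURE;
        }

        /* unpack the reply as you already do */
        opal_dss.unpack(buffer, &mca_vprotocol_receiver.protector.jobid, &n, OPAL_UINT32);
        opal_dss.unpack(buffer, &mca_vprotocol_receiver.protector.vpid, &n, OPAL_UINT32);

        orte_process_info.protector.jobid = mca_vprotocol_receiver.protector.jobid;
        orte_process_info.protector.vpid = mca_vprotocol_receiver.protector.vpid;

        OBJ_RELEASE(buffer);
        return OMPI_SUCCESS;
    }
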
On Mar 8, 2011, at 8:45 AM, Hugo Meyer wrote:

> Yes, after the release is a break. I'm sending now all my output, maybe that
> helps more. But the code is basically the one I sent. The normal execution
> reaches the send/receive between the orted_comm and the receiver.
>
> Best regards.
>
> Hugo
>
> 2011/3/8 Ralph Castain <r...@open-mpi.org>
> The comm can most certainly be done - there are other sections of that code
> that also send messages. I can't see the end of your new code section, but I
> assume you ended it properly with a "break"? Otherwise, you'll execute
> whatever lies below it as well.
>
> On Mar 8, 2011, at 8:19 AM, Hugo Meyer wrote:
>
>> Yes, I set the value 31 and it is not duplicated.
>>
>> 2011/3/8 Ralph Castain <r...@open-mpi.org>
>> What value did you set for this new command? Did you look at the cmds in
>> orte/mca/odls/odls_types.h to ensure you weren't using a duplicate value?
>>
>> On Mar 8, 2011, at 6:15 AM, Hugo Meyer wrote:
>>
>>> Hello @ll.
>>>
>>> I've got a problem in a communication between
>>> v_protocol_receiver_component.c and orted_comm.c.
>>>
>>> In mca_vprotocol_receiver_component_init I've added a request that is
>>> received correctly by orte_daemon_process_commands, but when I try to
>>> reply to the sender I get the following error:
>>>
>>> [clus1:15593] [ 0] /lib64/libpthread.so.0 [0x2aaaabb03d40]
>>> [clus1:15593] [ 1] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0 [0x2aaaaad760db]
>>> [clus1:15593] [ 2] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0 [0x2aaaaad75aa4]
>>> [clus1:15593] [ 3] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/openmpi/mca_errmgr_orted.so [0x2aaaae2d2fdd]
>>> [clus1:15593] [ 4] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(orte_odls_base_notify_iof_complete+0x1da) [0x2aaaaad42cb0]
>>> [clus1:15593] [ 5] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(orte_daemon_process_commands+0x1068) [0x2aaaaad19ca6]
>>> [clus1:15593] [ 6] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(orte_daemon_cmd_processor+0x81b) [0x2aaaaad18a55]
>>> [clus1:15593] [ 7] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0 [0x2aaaaad9710e]
>>> [clus1:15593] [ 8] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0 [0x2aaaaad974bb]
>>> [clus1:15593] [ 9] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(opal_event_loop+0x1a) [0x2aaaaad972ad]
>>> [clus1:15593] [10] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(opal_event_dispatch+0xe) [0x2aaaaad97166]
>>> [clus1:15593] [11] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(orte_daemon+0x2322) [0x2aaaaad17556]
>>> [clus1:15593] [12] /home/hmeyer/desarrollo/radic-ompi/binarios/bin/orted [0x4008a3]
>>> [clus1:15593] [13] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2aaaabd2d8a4]
>>> [clus1:15593] [14] /home/hmeyer/desarrollo/radic-ompi/binarios/bin/orted [0x400799]
>>> [clus1:15593] *** End of error message ***
>>>
>>> The code that I've added to v_protocol_receiver_component.c is (the
>>> recv_buffer call is the one that fails):
>>>
>>> int mca_vprotocol_receiver_request_protector(void) {
>>>     orte_daemon_cmd_flag_t command;
>>>     opal_buffer_t *buffer = NULL;
>>>     int n = 1;
>>>
>>>     command = ORTE_DAEMON_REQUEST_PROTECTOR_CMD;
>>>
>>>     buffer = OBJ_NEW(opal_buffer_t);
>>>     opal_dss.pack(buffer, &command, 1, ORTE_DAEMON_CMD);
>>>
>>>     orte_rml.send_buffer(ORTE_PROC_MY_DAEMON, buffer, ORTE_RML_TAG_DAEMON, 0);
>>>
>>>     orte_rml.recv_buffer(ORTE_PROC_MY_DAEMON, buffer, ORTE_DAEMON_REQUEST_PROTECTOR_CMD, 0);
>>>     opal_dss.unpack(buffer, &mca_vprotocol_receiver.protector.jobid, &n, OPAL_UINT32);
>>>     opal_dss.unpack(buffer, &mca_vprotocol_receiver.protector.vpid, &n, OPAL_UINT32);
>>>
>>>     orte_process_info.protector.jobid = mca_vprotocol_receiver.protector.jobid;
>>>     orte_process_info.protector.vpid = mca_vprotocol_receiver.protector.vpid;
>>>
>>>     OBJ_RELEASE(buffer);
>>>
>>>     return OMPI_SUCCESS;
>>> }
>>>
>>> The code that I've added to orted_comm.c is (the send_buffer call is the
>>> one that fails):
>>>
>>> case ORTE_DAEMON_REQUEST_PROTECTOR_CMD:
>>>     if (orte_debug_daemons_flag) {
>>>         opal_output(0, "%s orted_recv: received request protector from local proc %s",
>>>                     ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
>>>                     ORTE_NAME_PRINT(sender));
>>>     }
>>>     /* Define the protector */
>>>     protector = (uint32_t)ORTE_PROC_MY_NAME->vpid + 1;
>>>     if (protector >= (uint32_t)orte_process_info.num_procs) {
>>>         protector = 0;
>>>     }
>>>
>>>     /* Pack the protector data */
>>>     answer = OBJ_NEW(opal_buffer_t);
>>>
>>>     if (ORTE_SUCCESS != (ret = opal_dss.pack(answer, &ORTE_PROC_MY_NAME->jobid, 1, OPAL_UINT32))) {
>>>         ORTE_ERROR_LOG(ret);
>>>         OBJ_RELEASE(answer);
>>>         goto CLEANUP;
>>>     }
>>>     if (ORTE_SUCCESS != (ret = opal_dss.pack(answer, &protector, 1, OPAL_UINT32))) {
>>>         ORTE_ERROR_LOG(ret);
>>>         OBJ_RELEASE(answer);
>>>         goto CLEANUP;
>>>     }
>>>     if (orte_debug_daemons_flag) {
>>>         opal_output(0, "THE PROTECTOR ASSIGNED for %s IS: %d\n",
>>>                     ORTE_NAME_PRINT(sender), protector);
>>>     }
>>>
>>>     /* Send the protector data */
>>>     if (0 > orte_rml.send_buffer(sender, answer, ORTE_DAEMON_REQUEST_PROTECTOR_CMD, 0)) {
>>>         ORTE_ERROR_LOG(ORTE_ERR_COMM_FAILURE);
>>>         ret = ORTE_ERR_COMM_FAILURE;
>>>         OBJ_RELEASE(answer);
>>>         goto CLEANUP;
>>>     }
>>>     OBJ_RELEASE(answer);
>>>
>>> I assume from testing that the error is in the calls marked above, maybe
>>> because I'm missing something when I try to communicate, or maybe this
>>> communication cannot be done. Any help will be appreciated.
>>>
>>> Thanks a lot.
>>>
>>> Hugo Meyer

_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel