Hmmm....well, the output indicates both daemons crashed, but it doesn't really 
indicate where the crash occurs. If you have a core file, perhaps you can get a 
line number from it. Are you perhaps trying to send to a process that has died?
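
For example (assuming the daemons were built with debug symbols; the core file 
name here is only illustrative):

    gdb /home/hmeyer/desarrollo/radic-ompi/binarios/bin/orted core.15593
    (gdb) bt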

One nit: in your vprotocol code, you re-use the same buffer for the send and the 
recv. That's okay, but you need to OBJ_RELEASE the buffer after the send and 
construct a fresh one before calling recv.
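
Something like this (just a sketch of the buffer lifetime, using the calls from 
your own code; untested):

    /* pack and send the request */
    buffer = OBJ_NEW(opal_buffer_t);
    opal_dss.pack(buffer, &command, 1, ORTE_DAEMON_CMD);
    orte_rml.send_buffer(ORTE_PROC_MY_DAEMON, buffer, ORTE_RML_TAG_DAEMON, 0);

    /* release the sent buffer, then construct a fresh one for the reply */
    OBJ_RELEASE(buffer);
    buffer = OBJ_NEW(opal_buffer_t);
    orte_rml.recv_buffer(ORTE_PROC_MY_DAEMON, buffer,
                         ORTE_DAEMON_REQUEST_PROTECTOR_CMD, 0);

    /* ... unpack the reply as before ... */
    OBJ_RELEASE(buffer);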


On Mar 8, 2011, at 8:45 AM, Hugo Meyer wrote:

> Yes, there is a break after the release. I'm sending all my output now; maybe 
> that helps more. The code is basically what I sent. Normal execution reaches 
> the send/receive between orted_comm and the receiver.
> 
> Best regards.
> 
> Hugo
> 
> 2011/3/8 Ralph Castain <r...@open-mpi.org>
> The comm can most certainly be done - there are other sections of that code 
> that also send messages. I can't see the end of your new code section, but I 
> assume you ended it properly with a "break"? Otherwise, you'll execute 
> whatever lies below it as well.
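> 
> For example, the end of the new case should look something like this (a 
> sketch; the important part is the final break):
> 
> case ORTE_DAEMON_REQUEST_PROTECTOR_CMD:
>         /* ... pack and send the protector data ... */
>         OBJ_RELEASE(answer);
>         break;   /* without this, control falls through into the code below */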
> 
> 
> On Mar 8, 2011, at 8:19 AM, Hugo Meyer wrote:
> 
>> Yes, I set the value to 31, and it is not a duplicate.
>> 
>> 
>> 2011/3/8 Ralph Castain <r...@open-mpi.org>
>> What value did you set for this new command? Did you look at the cmds in 
>> orte/mca/odls/odls_types.h to ensure you weren't using a duplicate value?
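>> 
>> For example, the cmd flags in odls_types.h each carry a distinct value, so a 
>> new one would look something like this (the value is only illustrative; make 
>> sure no other ORTE_DAEMON_*_CMD already uses it):
>> 
>> /* orte/mca/odls/odls_types.h */
>> #define ORTE_DAEMON_REQUEST_PROTECTOR_CMD   (orte_daemon_cmd_flag_t) 31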
>> 
>> 
>> On Mar 8, 2011, at 6:15 AM, Hugo Meyer wrote:
>> 
>>> Hello all.
>>> 
>>> I've got a problem with a communication between 
>>> v_protocol_receiver_component.c and orted_comm.c. 
>>> 
>>> In mca_vprotocol_receiver_component_init I've added a request that is 
>>> received correctly by orte_daemon_process_commands, but when I try to 
>>> reply to the sender I get the following error:
>>> 
>>> [clus1:15593] [ 0] /lib64/libpthread.so.0 [0x2aaaabb03d40]
>>> [clus1:15593] [ 1] 
>>> /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0 
>>> [0x2aaaaad760db]
>>> [clus1:15593] [ 2] 
>>> /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0 
>>> [0x2aaaaad75aa4]
>>> [clus1:15593] [ 3] 
>>> /home/hmeyer/desarrollo/radic-ompi/binarios/lib/openmpi/mca_errmgr_orted.so 
>>> [0x2aaaae2d2fdd]
>>> [clus1:15593] [ 4] 
>>> /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(orte_odls_base_notify_iof_complete+0x1da)
>>>  [0x2aaaaad42cb0]
>>> [clus1:15593] [ 5] 
>>> /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(orte_daemon_process_commands+0x1068)
>>>  [0x2aaaaad19ca6]
>>> [clus1:15593] [ 6] 
>>> /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(orte_daemon_cmd_processor+0x81b)
>>>  [0x2aaaaad18a55]
>>> [clus1:15593] [ 7] 
>>> /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0 
>>> [0x2aaaaad9710e]
>>> [clus1:15593] [ 8] 
>>> /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0 
>>> [0x2aaaaad974bb]
>>> [clus1:15593] [ 9] 
>>> /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(opal_event_loop+0x1a)
>>>  [0x2aaaaad972ad]
>>> [clus1:15593] [10] 
>>> /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(opal_event_dispatch+0xe)
>>>  [0x2aaaaad97166]
>>> [clus1:15593] [11] 
>>> /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(orte_daemon+0x2322)
>>>  [0x2aaaaad17556]
>>> [clus1:15593] [12] /home/hmeyer/desarrollo/radic-ompi/binarios/bin/orted 
>>> [0x4008a3]
>>> [clus1:15593] [13] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2aaaabd2d8a4]
>>> [clus1:15593] [14] /home/hmeyer/desarrollo/radic-ompi/binarios/bin/orted 
>>> [0x400799]
>>> [clus1:15593] *** End of error message ***
>>> 
>>> The code that I've added to v_protocol_receiver_component.c is below (the 
>>> failing call is the orte_rml.recv_buffer):
>>> 
>>> int mca_vprotocol_receiver_request_protector(void) {
>>>     orte_daemon_cmd_flag_t command;
>>>     opal_buffer_t *buffer = NULL;
>>>     int n = 1;
>>> 
>>>     command = ORTE_DAEMON_REQUEST_PROTECTOR_CMD;
>>> 
>>>     buffer = OBJ_NEW(opal_buffer_t);
>>>     opal_dss.pack(buffer, &command, 1, ORTE_DAEMON_CMD);
>>> 
>>>     orte_rml.send_buffer(ORTE_PROC_MY_DAEMON, buffer, ORTE_RML_TAG_DAEMON, 0);
>>> 
>>>     orte_rml.recv_buffer(ORTE_PROC_MY_DAEMON, buffer,
>>>                          ORTE_DAEMON_REQUEST_PROTECTOR_CMD, 0);
>>>     opal_dss.unpack(buffer, &mca_vprotocol_receiver.protector.jobid, &n, OPAL_UINT32);
>>>     opal_dss.unpack(buffer, &mca_vprotocol_receiver.protector.vpid, &n, OPAL_UINT32);
>>> 
>>>     orte_process_info.protector.jobid = mca_vprotocol_receiver.protector.jobid;
>>>     orte_process_info.protector.vpid  = mca_vprotocol_receiver.protector.vpid;
>>> 
>>>     OBJ_RELEASE(buffer);
>>> 
>>>     return OMPI_SUCCESS;
>>> }
>>> 
>>> The code that I've added to orted_comm.c is below (the failing call is the 
>>> orte_rml.send_buffer):
>>> 
>>> case ORTE_DAEMON_REQUEST_PROTECTOR_CMD:
>>>         if (orte_debug_daemons_flag) {
>>>             opal_output(0, "%s orted_recv: received request protector from local proc %s",
>>>                         ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), ORTE_NAME_PRINT(sender));
>>>         }
>>>         /* Define the protector */
>>>         protector = (uint32_t)ORTE_PROC_MY_NAME->vpid + 1;
>>>         if (protector >= (uint32_t)orte_process_info.num_procs) {
>>>             protector = 0;
>>>         }
>>> 
>>>         /* Pack the protector data */
>>>         answer = OBJ_NEW(opal_buffer_t);
>>> 
>>>         if (ORTE_SUCCESS != (ret = opal_dss.pack(answer, &ORTE_PROC_MY_NAME->jobid, 1, OPAL_UINT32))) {
>>>             ORTE_ERROR_LOG(ret);
>>>             OBJ_RELEASE(answer);
>>>             goto CLEANUP;
>>>         }
>>>         if (ORTE_SUCCESS != (ret = opal_dss.pack(answer, &protector, 1, OPAL_UINT32))) {
>>>             ORTE_ERROR_LOG(ret);
>>>             OBJ_RELEASE(answer);
>>>             goto CLEANUP;
>>>         }
>>>         if (orte_debug_daemons_flag) {
>>>             opal_output(0, "The protector assigned for %s is: %d\n",
>>>                         ORTE_NAME_PRINT(sender), protector);
>>>         }
>>> 
>>>         /* Send the protector data */
>>>         if (0 > orte_rml.send_buffer(sender, answer, ORTE_DAEMON_REQUEST_PROTECTOR_CMD, 0)) {
>>>             ORTE_ERROR_LOG(ORTE_ERR_COMM_FAILURE);
>>>             ret = ORTE_ERR_COMM_FAILURE;
>>>             OBJ_RELEASE(answer);
>>>             goto CLEANUP;
>>>         }
>>>         OBJ_RELEASE(answer);
>>> 
>>> From my testing I assume the error is in the calls noted above, maybe 
>>> because I'm missing some call when I try to communicate, or maybe this 
>>> communication cannot be done at all. Any help will be appreciated.
>>> 
>>> Thanks a lot.
>>> 
>>> Hugo Meyer
>>> 
> <output1>