Hi Mahesh,

I think only logging is needed, as proposed in the patch, since some services
already handle dropped messages. This logging will help in troubleshooting.
Keeping TIPC_DEST_DROPPABLE set to true will only make TIPC drop messages
silently; the original problem persists and needs investigation, i.e. why the
socket receive buffer is overloaded. One reason may be the MDS poll/receive
loop together with the "big" mutex lock (ticket #520).
Did you check why the MDS message loss mechanism doesn't detect the messages
dropped by TIPC? AMF does detect this, e.g. via "out of sync", "msg id
mismatch" and so on.

/Regards HansN
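
The "msg id mismatch" style of detection mentioned above can be illustrated
with a minimal sketch. It assumes each sender stamps its messages with a
consecutive 32-bit id and that messages are neither duplicated nor re-ordered
in transit; struct rcv_state and rcv_note_msg_id() are hypothetical names used
only for illustration, not actual AMF or MDS code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-sender receive state; not an actual OpenSAF structure. */
struct rcv_state {
    bool     started;      /* true once the first message has been seen */
    uint32_t expected_id;  /* id the next in-order message should carry */
};

/*
 * Record a received message id and return how many messages appear to be
 * missing in front of it (0 means the message arrived in sequence).
 * Assumes no duplicates and no re-ordering. Unsigned subtraction keeps the
 * result correct when the id counter wraps around.
 */
uint32_t rcv_note_msg_id(struct rcv_state *st, uint32_t msg_id)
{
    assert(st != NULL);

    uint32_t missing = 0;
    if (st->started)
        missing = msg_id - st->expected_id;

    st->started = true;
    st->expected_id = msg_id + 1;
    return missing;
}
```

A receiver would call rcv_note_msg_id() for every incoming message and log a
warning whenever the return value is non-zero.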

-----Original Message-----
From: A V Mahesh [mailto:mahesh.va...@oracle.com] 
Sent: den 20 september 2016 12:29
To: Anders Widell <anders.wid...@ericsson.com>; Hans Nordebäck 
<hans.nordeb...@ericsson.com>
Cc: opensaf-devel@lists.sourceforge.net; mathi.naic...@oracle.com
Subject: Re: [PATCH 1 of 1] MDS: Log TIPC dropped messages [#1957]

Hi Anders Widell / HansN,

On 9/16/2016 2:03 PM, Anders Widell wrote:
> The idea was to just log reception of error info messages, for 
> trouble-shooting purposes.
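
Logging the reception of error info messages could look roughly like the
sketch below, which walks the ancillary data of a message obtained with
recvmsg() and extracts the error code from a TIPC_ERRINFO object. The helper
names tipc_errinfo_code() and tipc_err_name() are made up for this example and
are not part of MDS or the patch; the constants come from <linux/tipc.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/tipc.h>

/*
 * Scan the ancillary data attached to a received message and return the TIPC
 * error code carried in a TIPC_ERRINFO object, or -1 if there is none.
 * TIPC_ERRINFO carries two 32-bit values: the error code followed by the
 * length of the returned data. Only the error code is read here.
 */
int tipc_errinfo_code(struct msghdr *msg)
{
    assert(msg != NULL);

    for (struct cmsghdr *c = CMSG_FIRSTHDR(msg); c != NULL;
         c = CMSG_NXTHDR(msg, c)) {
        if (c->cmsg_level == SOL_TIPC && c->cmsg_type == TIPC_ERRINFO &&
            c->cmsg_len >= CMSG_LEN(sizeof(uint32_t))) {
            uint32_t code;
            /* memcpy avoids unaligned access into the control buffer */
            memcpy(&code, CMSG_DATA(c), sizeof(code));
            return (int)code;
        }
    }
    return -1;
}

/* Map a TIPC error code to a printable name for log messages. */
const char *tipc_err_name(int code)
{
    switch (code) {
    case TIPC_OK:            return "TIPC_OK";
    case TIPC_ERR_NO_NAME:   return "TIPC_ERR_NO_NAME";
    case TIPC_ERR_NO_PORT:   return "TIPC_ERR_NO_PORT";
    case TIPC_ERR_NO_NODE:   return "TIPC_ERR_NO_NODE";
    case TIPC_ERR_OVERLOAD:  return "TIPC_ERR_OVERLOAD";
    case TIPC_CONN_SHUTDOWN: return "TIPC_CONN_SHUTDOWN";
    default:                 return "unknown";
    }
}
```

A caller would pass the msghdr filled in by recvmsg() and log
tipc_err_name(code) whenever the returned code is not -1.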

After multiple attempts, I managed to simulate the TIPC_ERR_OVERLOAD
error. Once the TIPC_ERR_OVERLOAD error is hit, the cluster goes into an
unrecoverable state, because the send buffers are full.

So we have two options:

1)  Set TIPC_DEST_DROPPABLE to false, log the TIPC_ERR_OVERLOAD error and then
      exit the sender gracefully, which allows the remaining nodes to survive.

2)  Keep the current configuration as it is (TIPC_DEST_DROPPABLE set to true).
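
Switching between the two options comes down to one setsockopt() call on the
TIPC socket. A minimal sketch, assuming an already opened AF_TIPC socket
descriptor; tipc_set_dest_droppable() is a hypothetical helper name, while
SOL_TIPC and TIPC_DEST_DROPPABLE come from <linux/tipc.h>.

```c
#include <assert.h>
#include <sys/socket.h>
#include <linux/tipc.h>

/*
 * Enable (1) or disable (0) TIPC_DEST_DROPPABLE on socket sd.
 * Returns 0 on success, or -1 with errno set by setsockopt() on failure.
 */
int tipc_set_dest_droppable(int sd, int droppable)
{
    assert(droppable == 0 || droppable == 1);

    int val = droppable;
    return setsockopt(sd, SOL_TIPC, TIPC_DEST_DROPPABLE, &val, sizeof(val));
}
```

With droppable set to 0, an undeliverable message is returned to the sender
instead of being discarded, so the sender can see and log the error.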

=================================================================================================================
Sep 20 15:14:09 SC-1 osafamfd[3759]: NO Received node_up from 2040f: msg_id 1
Sep 20 15:14:09 SC-1 osafamfd[3759]: NO Node 'PL-4' joined the cluster
Sep 20 15:14:09 SC-1 osafimmnd[3695]: NO Implementer connected: 19 (MsgQueueService132111) <0, 2040f>
*Sep 20 15:16:59 SC-1 osafimmd[3684]: 77 MDTM: undelivered message condition ancillary data: TIPC_ERR_OVERLOAD*
Sep 20 15:17:00 SC-1 osafimmnd[3695]: WA Director Service in NOACTIVE state - fevs replies pending:1 fevs highest processed:218744
Sep 20 15:17:00 SC-1 osafamfnd[3773]: NO 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'
Sep 20 15:17:00 SC-1 osafamfnd[3773]: ER safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast
Sep 20 15:17:00 SC-1 osafamfnd[3773]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60
Sep 20 15:17:00 SC-1 osafimmnd[3695]: WA DISCARD DUPLICATE FEVS message:218744
Sep 20 15:17:00 SC-1 osafimmnd[3695]: WA Error code 2 returned for message type 82 - ignoring
Sep 20 15:17:00 SC-1 opensaf_reboot: Rebooting local node; timeout=60
Sep 20 15:17:00 SC-1 osafimmnd[3695]: WA SC Absence IS allowed:900 IMMD service is DOWN
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO IMMD SERVICE IS DOWN, HYDRA IS CONFIGURED => UNREGISTERING IMMND form MDS
Sep 20 15:17:00 SC-1 osafntfimcnd[3742]: NO saImmOiDispatch() Fail SA_AIS_ERR_BAD_HANDLE (9)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:20002010f sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 1 <2, 2010f> (safLogService)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:d0d0002010f sv_id:26
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:100002010f sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 2 <16, 2010f> (@safLogService_appl)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:130002010f sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 3 <19, 2010f> (@OpenSafImmReplicatorA)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:140002010f sv_id:26
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:150002010f sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 4 <21, 2010f> (safClmService)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:1a0002010f sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 5 <26, 2010f> (safAmfService)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:1b0002010f sv_id:26
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:5bc0002010f sv_id:26
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:5bd0002010f sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 6 <1469, 2010f> (MsgQueueService131343)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:5c00002010f sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 10 <1472, 2010f> (safEvtService)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:5c40002010f sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 8 <1476, 2010f> (safSmfService)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:5c60002010f sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 9 <1478, 2010f> (safLckService)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:5c70002010f sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 7 <1479, 2010f> (safMsgGrpService)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:5cc0002010f sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:5ce0002010f sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 12 <1486, 2010f> (safCheckPointService)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 13 <0, 2020f(down)> (MsgQueueService131599)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 14 <0, 2020f(down)> (@OpenSafImmReplicatorB)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 15 <0, 2020f(down)> (@safAmfService2020f)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Impl Discarded node 2020f
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 16 <0, 2030f(down)> (MsgQueueService131855)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Impl Discarded node 2030f
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 19 <0, 2040f(down)> (MsgQueueService132111)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Impl Discarded node 2040f
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO MDS unregisterede. sleeping ...
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO Sleep done registering IMMND with MDS
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: dest 2010fe8fa0043 already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: dest 2010fdcb60040 already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: dest 2010fdcb6002e already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: dest 2010fdcb60037 already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: dest 2010fdcb60028 already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: dest 2010fdcb6003d already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: dest 2010fdcb6002b already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: dest 2010fdcb6001c already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: dest 2010fdcb60019 already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: dest 2010fdcba0012 already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: dest 2010fdcb60028 already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: dest 2010fdcb60019 already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO SUCCESS IN REGISTERING IMMND WITH MDS
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO Re-introduce-me highestProcessed:218744 highestReceived:218744
Sep 20 15:17:03 SC-1 kernel: [ 1794.198381] md: stopping all md devices.
Sep 20 15:17:03 SC-1 osafntfimcnd[8997]: WA ntfimcn_imm_init saImmOiInitialize_2() returned SA_AIS_ERR_TIMEOUT (5)
Sep 20 15:18:00 SC-1 syslog-ng[1221]: syslog-ng starting up; version='2.0.9'
=================================================================================================================

-AVM

On 9/16/2016 2:03 PM, Anders Widell wrote:
>
> I don't think we need (or even should) inform the sender when MDS 
> receives an error information message from TIPC. Note that these error 
> information messages are received asynchronously, when the sender has 
> already received an OK return code from the MDS send call. The idea 
> was to just log reception of error info messages, for trouble-shooting 
> purposes. We already have a mechanism in MDS that informs the receiver 
> about lost MDS messages. If we wish to inform the sender we would need 
> to introduce a second mechanism in MDS, and at this point I don't 
> think it is needed. Another approach we could consider is that MDS 
> retransmits the message transparently without informing the sender.
> This would require MDS to internally store sent messages for a while, 
> so that they can be retransmitted. It would also require the receiver 
> to re-order received messages, since a retransmitted message will be 
> received out of sequence.
>
> regards,
>
> Anders Widell
>
>
> On 09/16/2016 06:40 AM, A V Mahesh wrote:
>> Hi HansN,
>>
>> I managed to create TIPC_ERRINFO/TIPC_RETDATA error cases (not the
>> TIPC_ERR_OVERLOAD error) with normal messages. I observed that with
>> TIPC_DEST_DROPPABLE set to true, not even the TIPC_ERRINFO error is
>> notified (which implies the same for TIPC_ERR_OVERLOAD); with
>> TIPC_DEST_DROPPABLE set to false, the TIPC_ERRINFO/TIPC_RETDATA errors
>> are notified.
>>
>> Next I will check the implications of setting TIPC_DEST_DROPPABLE to false
>> for multicast and broadcast messages. Based on that, we can decide under
>> which conditions to set TIPC_DEST_DROPPABLE to false; for an agent with
>> `i_msg_loss_indication = true`, MDS can return the same
>> TIPC_ERR_OVERLOAD error to the agent.
>>
>> TIPC_DEST_DROPPABLE to false:
>>
>> ==================================================================
>>
>> Sep 15 16:10:39 SC-1 osafimmnd[32051]: NO Implementer disconnected 13 <0, 2040f> (MsgQueueService132111)
>> Sep 15 16:10:39 SC-1 osafimmd[32040]:  777 MDTM: undelivered message condition ancillary data: TIPC_ERRINFO abort err : 2
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message condition ancillary data: TIPC_RETDATA
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: NO MDS event from svc_id 25 (change:4, dest:567413369208836)
>> Sep 15 16:10:39 SC-1 osafimmd[32040]:  777 MDTM: undelivered message condition ancillary data: TIPC_ERRINFO abort err : 2
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message condition ancillary data: TIPC_RETDATA
>> Sep 15 16:10:39 SC-1 osafimmd[32040]:  777 MDTM: undelivered message condition ancillary data: TIPC_ERRINFO abort err : 2
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message condition ancillary data: TIPC_RETDATA
>> Sep 15 16:10:39 SC-1 osafimmd[32040]:  777 MDTM: undelivered message condition ancillary data: TIPC_ERRINFO abort err : 2
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message condition ancillary data: TIPC_RETDATA
>> Sep 15 16:10:39 SC-1 osafimmd[32040]:  777 MDTM: undelivered message condition ancillary data: TIPC_ERRINFO abort err : 2
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message condition ancillary data: TIPC_RETDATA
>> Sep 15 16:10:39 SC-1 osafimmd[32040]:  777 MDTM: undelivered message condition ancillary data: TIPC_ERRINFO abort err : 2
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message condition ancillary data: TIPC_RETDATA
>> Sep 15 16:10:39 SC-1 osafimmd[32040]:  777 MDTM: undelivered message condition ancillary data: TIPC_ERRINFO abort err : 2
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message condition ancillary data: TIPC_RETDATA
>> Sep 15 16:10:39 SC-1 osafimmd[32040]:  777 MDTM: undelivered message condition ancillary data: TIPC_ERRINFO abort err : 2
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message condition ancillary data: TIPC_RETDATA
>> Sep 15 16:10:39 SC-1 osafimmd[32040]:  777 MDTM: undelivered message condition ancillary data: TIPC_ERRINFO abort err : 2
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message condition ancillary data: TIPC_RETDATA
>> Sep 15 16:10:39 SC-1 osafimmd[32040]:  777 MDTM: undelivered message condition ancillary data: TIPC_ERRINFO abort err : 2
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message condition ancillary data: TIPC_RETDATA
>> Sep 15 16:10:39 SC-1 osafamfd[32114]: NO Node 'PL-4' left the cluster
>>
>> ==================================================================
>>
>> TIPC_DEST_DROPPABLE to true:
>>
>> ==================================================================
>>
>> Sep 15 15:59:55 SC-1 osafimmnd[26461]: NO Implementer disconnected 13 <0, 2040f> (MsgQueueService132111)
>> Sep 15 15:59:55 SC-1 osafimmd[26450]: NO MDS event from svc_id 25 (change:4, dest:567412923957252)
>> Sep 15 15:59:55 SC-1 osafimmnd[26461]: NO Global discard node received for nodeId:2040f pid:410
>> Sep 15 15:59:55 SC-1 osafamfd[28810]: NO Node 'PL-4' left the cluster
>> Sep 15 15:59:58 SC-1 kernel: [ 5147.648737] tipc: Resetting link <1.1.1:eth0-1.1.4:eth0>, peer not responding
>> Sep 15 15:59:58 SC-1 kernel: [ 5147.648756] tipc: Lost link <1.1.1:eth0-1.1.4:eth0> on network plane A
>> Sep 15 15:59:58 SC-1 kernel: [ 5147.648771] tipc: Lost contact with <1.1.4>
>>
>> ==================================================================
>>
>> -AVM
>>
>>
>> On 9/1/2016 10:59 AM, Hans Nordebäck wrote:
>>> Hi Mahesh,
>>>
>>> I have not tested this, but the following should work:
>>>
>>> - Set BSRsock TIPC_IMPORTANCE to TIPC_LOW_IMPORTANCE
>>>
>>> - set socket receive buffer to a small value:
>>>
>>>   optval = "small socket receive buffer size" , 5000 ?
>>>
>>>   setsockopt(tipc_cb.BSRsock, SOL_SOCKET, SO_RCVBUF, &optval, optlen)
>>>
>>> -  sysctl -w net.tipc.tipc_rmem="5000 40000000 68240400" (or smaller
>>> values)
>>>
>>> - add some delays when processing messages in 
>>> mdtm_process_recv_events(), to provoke overloading the socket 
>>> receive buffer.
>>>
>>> We experience dropped packets in a 75-node system; as a
>>> workaround, increasing the default socket receive buffer size seems
>>> to work for that setup.
>>>
>>> /Thanks HansN
>>>
>>> On 09/01/2016 05:50 AM, A V Mahesh wrote:
>>>> Hi HansN,
>>>>
>>>> Do you have any tips on how to create an overload case?
>>>>
>>>> I would like to test and observe the TIPC_DEST_DROPPABLE enabled and
>>>> disabled cases.
>>>>
>>>> -AVM
>>>>
>>>>
>>>> On 9/1/2016 9:12 AM, A V Mahesh wrote:
>>>>> Hi HansN,
>>>>>
>>>>> Sorry for the delay.
>>>>>
>>>>> I will test it and get back to you soon.
>>>>>
>>>>> -AVM
>>>>>
>>>>>
>>>>> On 8/31/2016 4:29 PM, Hans Nordebäck wrote:
>>>>>> Hi Mahesh,
>>>>>> Any updates on this?
>>>>>>
>>>>>> /Regards HansN
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Anders Widell
>>>>>> Sent: den 25 augusti 2016 13:11
>>>>>> To: A V Mahesh <mahesh.va...@oracle.com>; Hans Nordebäck 
>>>>>> <hans.nordeb...@ericsson.com>; mathi.naic...@oracle.com
>>>>>> Cc: opensaf-devel@lists.sourceforge.net
>>>>>> Subject: Re: [PATCH 1 of 1] MDS: Log TIPC dropped messages 
>>>>>> [#1957]
>>>>>>
>>>>>> Hi!
>>>>>>
>>>>>> This is what the TIPC user documentation says about
>>>>>> TIPC_DEST_DROPPABLE:
>>>>>> "This option governs the handling of messages sent by the socket 
>>>>>> if the message cannot be delivered to its destination, either 
>>>>>> because the receiver is congested or because the specified 
>>>>>> receiver does not exist.
>>>>>> If enabled, the message is discarded; otherwise the message is 
>>>>>> returned to the sender."
>>>>>>
>>>>>> This is what the TIPC user documentation says about the return 
>>>>>> value from the recvmsg() system call: "When used with a 
>>>>>> connectionless socket, a return value of 0 indicates the arrival 
>>>>>> of a returned data message that was originally sent by this socket."
>>>>>>
>>>>>> I think the documentation is pretty clear. If you set 
>>>>>> TIPC_DEST_DROPPABLE to true, the receiver can discard messages 
>>>>>> e.g. when the receive buffer is full. The sender will not be 
>>>>>> notified in this case. If TIPC_DEST_DROPPABLE is set to false, 
>>>>>> the message will be returned to the sender in case of a full 
>>>>>> receive buffer. The sender knows that it has received such a 
>>>>>> returned message when the recvmsg() call returns zero.
>>>>>>
>>>>>> regards,
>>>>>> Anders Widell
>>>>>>
>>>>>> On 08/25/2016 11:30 AM, A V Mahesh wrote:
>>>>>>> Hi HansN,
>>>>>>>
>>>>>>>
>>>>>>> On 8/23/2016 5:22 PM, Hans Nordebäck wrote:
>>>>>>>
>>>>>>>> Hi Mahesh,
>>>>>>>>
>>>>>>>> Yes, this is my understanding too: if TIPC_DROPPABLE = true,
>>>>>>>> TIPC may drop messages silently when the receive socket buffer
>>>>>>>> is full, and does not return any ancillary message.
>>>>>>>> If TIPC_DROPPABLE = false, TIPC may drop the message but will send
>>>>>>>> an ancillary message to inform about TIPC_ERR_OVERLOAD.
>>>>>>> [AVM]
>>>>>>>
>>>>>>> My observations and understanding are different. Based on the TIPC
>>>>>>> code and the Linux TIPC 2.0 Programmer's Guide, the
>>>>>>> TIPC_ERR_OVERLOAD error is returned when TIPC is unable to enqueue
>>>>>>> an incoming message on the receiving socket's receive queue,
>>>>>>> irrespective of whether TIPC_DEST_DROPPABLE is enabled or disabled.
>>>>>>>
>>>>>>> The only difference between TIPC_DEST_DROPPABLE enabled and
>>>>>>> disabled is this: if TIPC_DEST_DROPPABLE is enabled, the message is
>>>>>>> discarded, the size returned by recvmsg() is zero, and the
>>>>>>> application gets errors; if TIPC_DEST_DROPPABLE is disabled, the
>>>>>>> message is returned to the sender, which means the size returned by
>>>>>>> recvmsg() is the size of the data the user sent, and the
>>>>>>> application gets errors.
>>>>>>>
>>>>>>> I checked the TIPC code and documentation and have not found
>>>>>>> any evidence that the TIPC_ERR_OVERLOAD error code is sent
>>>>>>> only if TIPC_DEST_DROPPABLE = false.
>>>>>>>
>>>>>>> Even while testing #1227
>>>>>>> (https://sourceforge.net/p/opensaf/mailman/message/33207717/), my
>>>>>>> observation and understanding was that an individual TIPC socket is
>>>>>>> only allowed to queue up OVERLOAD_LIMIT_BASE/2 messages of the
>>>>>>> lowest importance level before it starts rejecting them.
>>>>>>> Once a socket's receive queue length exceeds the maximum limit,
>>>>>>> the receiving socket sends out a reject message with the
>>>>>>> TIPC_ERR_OVERLOAD error code and cmsg_type
>>>>>>> TIPC_ERRINFO/TIPC_RETDATA; the TIPC code and the Linux TIPC 2.0
>>>>>>> Programmer's Guide confirm this.
>>>>>>>
>>>>>>> tipc/socket.c
>>>>>>> =======================================================
>>>>>>> /* Reject message if there isn't room to queue it */
>>>>>>>
>>>>>>> recv_q_len = (u32)atomic_read(&tipc_queue_size);
>>>>>>> if (unlikely(recv_q_len >= OVERLOAD_LIMIT_BASE)) {
>>>>>>>      if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE))
>>>>>>>          return TIPC_ERR_OVERLOAD;
>>>>>>> }
>>>>>>> recv_q_len = skb_queue_len(&sk->sk_receive_queue);
>>>>>>> if (unlikely(recv_q_len >= (OVERLOAD_LIMIT_BASE / 2))) {
>>>>>>>      if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE / 2))
>>>>>>>          return TIPC_ERR_OVERLOAD;
>>>>>>> }
>>>>>>> =======================================================
>>>>>>>
>>>>>>>
>>>>>>> 2.1.17. setsockopt() of TIPC 2.0 Programmer's Guide
>>>>>>> =======================================================
>>>>>>> TIPC_DEST_DROPPABLE
>>>>>>> This option governs the handling of messages sent by the socket 
>>>>>>> if the message cannot be delivered to its destination, either 
>>>>>>> because the receiver is congested or because the specified 
>>>>>>> receiver does not exist. If enabled, the message is discarded; 
>>>>>>> otherwise the message is returned to the sender.
>>>>>>>
>>>>>>> By default, this option is disabled for SOCK_SEQPACKET and 
>>>>>>> SOCK_STREAM socket types, and enabled for SOCK_RDM and 
>>>>>>> SOCK_DGRAM. This arrangement ensures proper teardown of failed
>>>>>>> connections when connection-oriented data transfer is used, 
>>>>>>> without increasing the complexity of connectionless data 
>>>>>>> transfer.
>>>>>>>
>>>>>>> TIPC_SRC_DROPPABLE
>>>>>>> This option governs the handling of messages sent by the socket 
>>>>>>> if link congestion occurs. If enabled, the message is discarded; 
>>>>>>> otherwise the system queues the message for later transmission.
>>>>>>> By default, this option is disabled for SOCK_SEQPACKET, 
>>>>>>> SOCK_STREAM, and SOCK_RDM socket types (resulting in "reliable" 
>>>>>>> data transfer), and enabled for SOCK_DGRAM (resulting in 
>>>>>>> "unreliable" data transfer).
>>>>>>> =======================================================
>>>>>>>
>>>>>>> I will now try to create the OVERLOAD case and update you soon with
>>>>>>> my latest observations.
>>>>>>>
>>>>>>> -AVM
>>>>>>>
>>>>>>>> Correcting this and adding an abort is not backward compatible,
>>>>>>>> as some services already handle flow control in some way; only
>>>>>>>> log when packets are dropped.
>>>>>>>> Regarding ticket #1960, there are other solutions than
>>>>>>>> introducing flow control in MDS, e.g. exposing an option that lets
>>>>>>>> the service choose connection-oriented or connectionless.
>>>>>>>> In one case, the problem with dropped messages seems to be related
>>>>>>>> to intensive logging by MDS.
>>>>>>>>
>>>>>>>> /Thanks HansN
>>>>>>>> -----Original Message-----
>>>>>>>> From: A V Mahesh [mailto:mahesh.va...@oracle.com]
>>>>>>>> Sent: den 23 augusti 2016 11:27
>>>>>>>> To: Hans Nordebäck <hans.nordeb...@ericsson.com>; Anders Widell 
>>>>>>>> <anders.wid...@ericsson.com>; mathi.naic...@oracle.com
>>>>>>>> Cc: opensaf-devel@lists.sourceforge.net
>>>>>>>> Subject: Re: [PATCH 1 of 1] MDS: Log TIPC dropped messages 
>>>>>>>> [#1957]
>>>>>>>>
>>>>>>>> Hi HansN,
>>>>>>>>
>>>>>>>> It seems I am missing something; please help me understand.
>>>>>>>>
>>>>>>>> If I understand your observation correctly:
>>>>>>>>
>>>>>>>> With the current OpenSAF code (this #1957 patch NOT applied),
>>>>>>>> TIPC_DROPPABLE=true by default. When TIPC_ERR_OVERLOAD occurs
>>>>>>>> while running OpenSAF with that binary, TIPC does not deliver the
>>>>>>>> TIPC_ERRINFO or TIPC_RETDATA errors, and the following code in
>>>>>>>> recvfrom_connectionless() is never hit. Is my
>>>>>>>> understanding right?
>>>>>>>>
>>>>>>>> =====================================================================
>>>>>>>>
>>>>>>>> *if (anc->cmsg_type == TIPC_ERRINFO) {*
>>>>>>>>        /* TIPC_ERRINFO - TIPC error code associated with a returned data message or a connection termination message  so abort */
>>>>>>>>        m_MDS_LOG_CRITICAL("MDTM: undelivered message condition ancillary data: TIPC_ERRINFO abort err :%s", strerror(errno) );
>>>>>>>> *abort();*
>>>>>>>> *} else if (anc->cmsg_type == TIPC_RETDATA) {*
>>>>>>>>        /* If we set TIPC_DEST_DROPPABLE off messge (configure TIPC to return rejected messages to the sender )
>>>>>>>>           we will hit this when we implement MDS retransmit lost messages abort can be replaced with flow control logic*/
>>>>>>>>        for (i = anc->cmsg_len - sizeof(*anc); i > 0; i--) {
>>>>>>>>            m_MDS_LOG_DBG("MDTM: returned byte 0x%02x\n", *cptr);
>>>>>>>>            cptr++;
>>>>>>>>        }
>>>>>>>>        /* TIPC_RETDATA -The contents of a returned data message so abort */
>>>>>>>>        m_MDS_LOG_CRITICAL("MDTM: undelivered message condition ancillary data: TIPC_RETDATA abort err :%s", strerror(errno) );
>>>>>>>> *abort();*
>>>>>>>> }
>>>>>>>>
>>>>>>>> =====================================================================
>>>>>>>>
>>>>>>>>
>>>>>>>> -AVM
>>>>>>>>
>>>>>>>>
>>>>>>>> On 8/23/2016 1:08 PM, Hans Nordebäck wrote:
>>>>>>>>> Hi Mahesh,
>>>>>>>>>
>>>>>>>>> Please see response below with [HansN] /Thanks HansN
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: A V Mahesh [mailto:mahesh.va...@oracle.com]
>>>>>>>>> Sent: den 23 augusti 2016 08:25
>>>>>>>>> To: Hans Nordebäck <hans.nordeb...@ericsson.com>; Anders 
>>>>>>>>> Widell <anders.wid...@ericsson.com>; mathi.naic...@oracle.com
>>>>>>>>> Cc: opensaf-devel@lists.sourceforge.net
>>>>>>>>> Subject: Re: [PATCH 1 of 1] MDS: Log TIPC dropped messages 
>>>>>>>>> [#1957]
>>>>>>>>>
>>>>>>>>> Hi HansN
>>>>>>>>>
>>>>>>>>> Please see response below with [AVM]
>>>>>>>>>
>>>>>>>>> -AVM
>>>>>>>>>
>>>>>>>>> On 8/23/2016 11:41 AM, Hans Nordebäck wrote:
>>>>>>>>>> Hi Mahesh,
>>>>>>>>>>
>>>>>>>>>> please see comments below.
>>>>>>>>>>
>>>>>>>>>> /Thanks HansN
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 08/23/2016 07:21 AM, A V Mahesh wrote:
>>>>>>>>>>> Hi HansN,
>>>>>>>>>>>
>>>>>>>>>>> Let us first discuss the error handling and abort; then we
>>>>>>>>>>> can come back to the interpretation of whether TIPC currently
>>>>>>>>>>> does or does not permit an application to send a
>>>>>>>>>>> multicast message with the "destination droppable" setting
>>>>>>>>>>> disabled.
>>>>>>>>>>>
>>>>>>>>>>> Let us disable TIPC_DEST_DROPPABLE, so that TIPC will try to
>>>>>>>>>>> return an undelivered multicast message to its sender and we
>>>>>>>>>>> can determine whether the issue is caused by TIPC_ERR_OVERLOAD.
>>>>>>>>>>> This helps in debugging, so that the application may increase
>>>>>>>>>>> SO_SNDBUF/SO_RCVBUF to reduce the problem.
>>>>>>>>>>>
>>>>>>>>>>> But we still need to abort(). The reason is that the current
>>>>>>>>>>> MDS implementation has no flow control logic (no
>>>>>>>>>>> retry on error), so an application like AMF can go
>>>>>>>>>>> wrong and the cluster will go into an unstable/unrecoverable state.
>>>>>>>>>>>
>>>>>>>>>> [HansN] In the current implementation messages are dropped 
>>>>>>>>>> silently and no abort is done.
>>>>>>>>> [AVM]  I can see abort(); in the current code. Do you mean abort();
>>>>>>>>> is not working and the application (amf) is not exiting?
>>>>>>>>> [HansN] In case of TIPC_DROPPABLE=true and messages are
>>>>>>>>> dropped
>>>>>>>>> (TIPC_ERR_OVERLOAD), no abort is performed; e.g. amfd
>>>>>>>>> detects this in the msg sanity chk and logs "invalid msg id
>>>>>>>>> ..."
>>>>>>>>> ====================================================================
>>>>>>>>>
>>>>>>>>> if (anc->cmsg_type == TIPC_ERRINFO) {
>>>>>>>>>         /* TIPC_ERRINFO - TIPC error code associated with a returned data message or a connection termination message so abort */
>>>>>>>>>         m_MDS_LOG_CRITICAL("MDTM: undelivered message condition ancillary data: TIPC_ERRINFO abort err :%s", strerror(errno) );
>>>>>>>>> *abort();*
>>>>>>>>> } else if (anc->cmsg_type == TIPC_RETDATA) {
>>>>>>>>>         /* If we set TIPC_DEST_DROPPABLE off messge (configure TIPC to return rejected messages to the sender )
>>>>>>>>>            we will hit this when we implement MDS retransmit lost messages abort can be replaced with flow control logic*/
>>>>>>>>>         for (i = anc->cmsg_len - sizeof(*anc); i > 0; i--) {
>>>>>>>>>             m_MDS_LOG_DBG("MDTM: returned byte 0x%02x\n", *cptr);
>>>>>>>>>             cptr++;
>>>>>>>>>         }
>>>>>>>>>         /* TIPC_RETDATA -The contents of a returned data message  so abort */
>>>>>>>>>         m_MDS_LOG_CRITICAL("MDTM: undelivered message condition ancillary data: TIPC_RETDATA abort err :%s", strerror(errno) );
>>>>>>>>> *abort();*
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> ====================================================================
>>>>>>>>>> This patch enables logging
>>>>>>>>>> when packets are dropped to help in debugging. I don't agree
>>>>>>>>>> that we should also introduce abort, but instead:
>>>>>>>>>> 1) Implement a solution to handle dropped packets, ticket
>>>>>>>>>> #1960
>>>>>>>>> [AVM]  This is nothing but flow control implementation in MDS, 
>>>>>>>>> this is future enhancement
>>>>>>>>>
>>>>>>>>>> 2) Investigate why packets may be dropped; the receiving MDS
>>>>>>>>>> thread is a real-time thread and should be able to consume a
>>>>>>>>>> large number of incoming messages.
>>>>>>>>>> E.g. is the receiving MDS thread "live hanging" due to locks,
>>>>>>>>>> file I/O etc?
>>>>>>>>>>> This was the reason we did not go for it while addressing
>>>>>>>>>>> ticket #1227
>>>>>>>>>>> (https://sourceforge.net/p/opensaf/mailman/message/33207717/).
>>>>>>>>>>> So currently we gain no advantage from disabling
>>>>>>>>>>> TIPC_DEST_DROPPABLE and not allowing multicast messages.
>>>>>>>>>>>
>>>>>>>>>>> -AVM
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 8/18/2016 2:43 PM, Hans Nordeback wrote:
>>>>>>>>>>>> osaf/libs/core/mds/mds_dt_tipc.c |  32 +++++++++++++++++++++++++-------
>>>>>>>>>>>>  1 files changed, 25 insertions(+), 7 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/osaf/libs/core/mds/mds_dt_tipc.c b/osaf/libs/core/mds/mds_dt_tipc.c
>>>>>>>>>>>> --- a/osaf/libs/core/mds/mds_dt_tipc.c
>>>>>>>>>>>> +++ b/osaf/libs/core/mds/mds_dt_tipc.c
>>>>>>>>>>>> @@ -320,6 +320,15 @@ uint32_t mdtm_tipc_init(NODE_ID nodeid,
>>>>>>>>>>>>                  m_MDS_LOG_INFO("MDTM: Successfully set default socket option TIPC_IMP = %d", TIPCIMPORTANCE);
>>>>>>>>>>>>          }
>>>>>>>>>>>>  
>>>>>>>>>>>> +        int droppable = 0;
>>>>>>>>>>>> +        if (setsockopt(tipc_cb.BSRsock, SOL_TIPC, TIPC_DEST_DROPPABLE, &droppable, sizeof(droppable)) != 0) {
>>>>>>>>>>>> +                LOG_ER("MDTM: Can't set TIPC_DEST_DROPPABLE to zero err :%s\n", strerror(errno));
>>>>>>>>>>>> +                m_MDS_LOG_ERR("MDTM: Can't set TIPC_DEST_DROPPABLE to zero err :%s\n", strerror(errno));
>>>>>>>>>>>> +                osafassert(0);
>>>>>>>>>>>> +        } else {
>>>>>>>>>>>> +                m_MDS_LOG_NOTIFY("MDTM: Successfully set TIPC_DEST_DROPPABLE to zero");
>>>>>>>>>>>> +        }
>>>>>>>>>>>> +
>>>>>>>>>>>>          return NCSCC_RC_SUCCESS;
>>>>>>>>>>>>  }
>>>>>>>>>>>>  
>>>>>>>>>>>> @@ -563,6 +572,8 @@ ssize_t recvfrom_connectionless (int sd,
>>>>>>>>>>>>          unsigned char *cptr;
>>>>>>>>>>>>          int i;
>>>>>>>>>>>>          int has_addr;
>>>>>>>>>>>> +    int anc_data[2];
>>>>>>>>>>>> +
>>>>>>>>>>>>          ssize_t sz;
>>>>>>>>>>>>  
>>>>>>>>>>>>          has_addr = (from != NULL) && (addrlen != NULL);
>>>>>>>>>>>> @@ -591,19 +602,26 @@ ssize_t recvfrom_connectionless (int sd,
>>>>>>>>>>>>                     if the message was sent using a TIPC name or name sequence as the
>>>>>>>>>>>>                     destination rather than a TIPC port ID So abort for TIPC_ERRINFO and TIPC_RETDATA*/
>>>>>>>>>>>>                  if (anc->cmsg_type == TIPC_ERRINFO) {
>>>>>>>>>>>> -                /* TIPC_ERRINFO - TIPC error code associated with a returned data message or a connection termination message  so abort */
>>>>>>>>>>>> -                m_MDS_LOG_CRITICAL("MDTM: undelivered message condition ancillary data: TIPC_ERRINFO abort err :%s", strerror(errno) );
>>>>>>>>>>>> -                abort();
>>>>>>>>>>>> +                anc_data[0] = *((unsigned int*)(CMSG_DATA(anc) + 0));
>>>>>>>>>>>> +                if (anc_data[0] == TIPC_ERR_OVERLOAD) {
>>>>>>>>>>>> +                    LOG_CR("MDTM: undelivered message condition ancillary data: TIPC_ERR_OVERLOAD");
>>>>>>>>>>>> +                    m_MDS_LOG_CRITICAL("MDTM: undelivered message condition ancillary data: TIPC_ERR_OVERLOAD");
>>>>>>>>>>>> +                } else {
>>>>>>>>>>>> +                    /* TIPC_ERRINFO - TIPC error code associated with a returned data message or a connection termination message so abort */
>>>>>>>>>>>> +                    LOG_CR("MDTM: undelivered message condition ancillary data: TIPC_ERRINFO abort err : %d", anc_data[0]);
>>>>>>>>>>>> +                    m_MDS_LOG_CRITICAL("MDTM: undelivered message condition ancillary data: TIPC_ERRINFO abort err : %d", anc_data[0]);
>>>>>>>>>>>> +                }
>>>>>>>>>>>>                  } else if (anc->cmsg_type == TIPC_RETDATA) {
>>>>>>>>>>>> -                /* If we set TIPC_DEST_DROPPABLE off messge (configure TIPC to return rejected messages to the sender )
>>>>>>>>>>>> +                /* If we set TIPC_DEST_DROPPABLE off message (configure TIPC to return rejected messages to the sender )
>>>>>>>>>>>>                         we will hit this when we implement MDS retransmit lost messages  abort can be replaced with flow control logic*/
>>>>>>>>>>>>                      for (i = anc->cmsg_len - sizeof(*anc); i > 0; i--) {
>>>>>>>>>>>> -                    m_MDS_LOG_DBG("MDTM: returned byte 0x%02x\n", *cptr);
>>>>>>>>>>>> +                    LOG_CR("MDTM: returned byte 0x%02x\n", *cptr);
>>>>>>>>>>>> +                    m_MDS_LOG_CRITICAL("MDTM: returned byte 0x%02x\n", *cptr);
>>>>>>>>>>>>                          cptr++;
>>>>>>>>>>>>                      }
>>>>>>>>>>>>                      /* TIPC_RETDATA -The contents of a returned data message  so abort */
>>>>>>>>>>>> -                m_MDS_LOG_CRITICAL("MDTM: undelivered message condition ancillary data: TIPC_RETDATA abort err :%s", strerror(errno) );
>>>>>>>>>>>> -                abort();
>>>>>>>>>>>> +                LOG_CR("MDTM: undelivered message condition ancillary data: TIPC_RETDATA");
>>>>>>>>>>>> +                m_MDS_LOG_CRITICAL("MDTM: undelivered message condition ancillary data: TIPC_RETDATA");
>>>>>>>>>>>>                  } else if (anc->cmsg_type == TIPC_DESTNAME) {
>>>>>>>>>>>>                      if (sz == 0) {
>>>>>>>>>>>>                          m_MDS_LOG_DBG("MDTM: recd bytes=0 on received on sock, abnormal/unknown condition. Ignoring");
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
>


------------------------------------------------------------------------------
_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
