Re: [devel] [PATCH 1/9] mds: Add README for solution of TIPC buffer overflow at MDS [#1960]

Minh Hon Chau Mon, 16 Sep 2019 04:40:54 -0700

Hi Vu,

Thanks for taking the time to review the patches. The question is interesting.

At the moment, with normal traffic load, the resources towards the new standby (old active) are not released and will be reused if the standby switches back to active. The reason is that mds won't start the "tx probation" again to confirm flow control support, as mds already knows it has enabled flow control on this port id. The messages towards the new active are sent on another port id, so they run on a different flow control counter. The multiple-switchover test looks ok so far.

However, a problem probably occurs with overloaded traffic during a failover/switchover (I haven't tested this case). The messages pending under the overload state for the old active won't be sent to the new active. I guess the mds user would get TIMEOUT and try again to send the message to the new active, which at least corresponds to the legacy behavior. This could be looked at as an improvement: we have the pending messages and we know the new active, so we could send the pending messages to the new active. Another question, though, is whether the existing users expect to receive these pending messages according to their current logic.


Regards,

Minh

On 16/9/19 5:34 pm, Nguyen Minh Vu wrote:

Hi Minh,
I have just finished my review of your MDS patches, and I have a question:
With 2N services, suppose the active is having a TIPC overload issue;
it will do some memory allocations, and probably start a timer there too.
Then, what happens if that active service is changed to the standby role?
Shall the allocated memory/timer be freed up, and is there any impact on the subsequent messages sent to the new active?
Regards, Vu

On 8/14/19 1:38 PM, Minh Chau wrote:
---
 src/mds/README | 221 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  1 file changed, 221 insertions(+)
  create mode 100644 src/mds/README

diff --git a/src/mds/README b/src/mds/README
new file mode 100644
index 0000000..1b94632
--- /dev/null
+++ b/src/mds/README
@@ -0,0 +1,221 @@
+/*      -*- OpenSAF  -*-
+ *
+ * (C) Copyright 2019 The OpenSAF Foundation
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+ * or FITNESS FOR A PARTICULAR PURPOSE. This file and program are licensed
+ * under the GNU Lesser General Public License Version 2.1, February 1999.
+ * The complete license can be accessed from the following location:
+ * http://opensource.org/licenses/lgpl-license.php
+ * See the Copying file included with the OpenSAF distribution for full
+ * licensing terms.
+ *
+ * Author(s): Ericsson AB
+ *
+ */
+Background
+==========
+If OpenSAF configures TIPC as transport, the MDS library today uses the
+TIPC SOCK_RDM socket for message distribution in the cluster. The SOCK_RDM
+datagram socket can encounter buffer overflow at the receiver end, which
+has been documented in tipc.io[1]. A temporary solution for this buffer
+overflow issue is to increase the socket buffer size to a larger
+number. However, if the cluster continues either scaling out or adding more
+components, the system will be under-dimensioned, and the TIPC buffer
+overflow can occur again.
+
+MDS's solution for TIPC buffer overflow
+=======================================
+If MDS disables TIPC_DEST_DROPPABLE, TIPC will return an ancillary message
+when the original message fails to be delivered. On this event, if the message
+has been saved in the queue, MDS at the sender side can look it up and
+retransmit it to the receivers.
+Once the messages in the sender's queue have been delivered successfully, MDS
+needs to remove them. MDS introduces its internal ACK message as an
+acknowledgment from receivers so that the senders can remove the messages
+from the queue.
+Also, in such a situation of buffer overflow at the receivers, the
+retransmission may not succeed, or may even make things worse at the receiver
+ends (the more retransmissions, the more overflow occurs). MDS imitates the
+sliding window in TCP[2] to control the flow of data messages towards the
+receivers.
+
+Legacy MDS data message, new (data + ACK) MDS message, and upgradability
+------------------------------------------------------------------------
+Below is the MDS legacy message format that has been used till OpenSAF 5.19.07
+
+oct 0  message length
+oct 1
+------------------------------------------
+oct 2  sequence number: incremented for every message sent out to all destined
+...       tipc portid.
+oct 5
+------------------------------------------
+oct 6  fragment number: a message with same sequence number can be fragmented,
+oct 7  identified by this fragment number.
+------------------------------------------
+oct 8  length check: cross check with message length(oct0,1), NOT USED.
+oct 9
+------------------------------------------
+oct 10 protocol version: (MDS_PROT:0xA0 | MDS_VERSION:0x08) = 0xA8, NOT USED
+------------------------------------------
+oct 11 mds length: length of mds header and mds data, starting from oct13
+oct 12
+------------------------------------------
+oct 13 mds header and data
+...
+------------------------------------------
+
+The current sequence number/fragment number are used in MDS for all
+messages sent to all discovered tipc portid(s), meaning that every time a
+message is sent to any tipc portid, the sequence/fragment number is increased.
+The flow control needs its own sequence number sliding between two tipc
+portid(s) so that receivers can detect message drops due to buffer overload.
+Therefore, oct8 and oct9 are now reused as the flow control sequence number.
+The oct10, protocol version, has the new value of 0xB8. The format of the new
+data message is as below:
+
+oct 0  same
+...
+oct 7
+------------------------------------------
+oct 8  flow control sequence number
+oct 9
+------------------------------------------
+oct 10 protocol version: (MDS_PROT_TIPC_FCTRL:0xB0 | MDS_VERSION:0x08) = 0xB8
+------------------------------------------
+oct 11 same
+...
+------------------------------------------
+
+The ACK message is introduced to acknowledge one data message or a chunk of
+accumulated data messages. The ACK message format:
+
+oct 0  message length
+oct 1
+------------------------------------------
+oct 2  8 bytes, NOT USED
+....
+oct 9
+------------------------------------------
+oct 10 protocol version: (MDS_PROT_TIPC_FCTRL:0xB0 | MDS_VERSION:0x08) = 0xB8
+------------------------------------------
+oct 11 protocol identifier: MDS_PROT_FCTRL_ID
+......
+oct 14
+------------------------------------------
+oct 15 flow control message type: CHUNKACK
+------------------------------------------
+oct 16 service id: service id of data messages to be acknowledged
+oct 17
+------------------------------------------
+oct 18 acknowledged sequence
+oct 19
+------------------------------------------
+oct 20 chunk size
+oct 21
+------------------------------------------
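
As a rough illustration of the ACK layout above, the following Python sketch packs and unpacks the 22 octets. The README does not give the numeric values of MDS_PROT_FCTRL_ID and CHUNKACK, so the constants below are placeholders. Network byte order, and a length field that carries the total frame size, are also assumptions rather than facts taken from the patch.

```python
import struct

# Layout from the README table: oct0-1 length, oct2-9 unused, oct10 version,
# oct11-14 protocol identifier, oct15 message type, oct16-17 service id,
# oct18-19 acknowledged sequence, oct20-21 chunk size.
# Assumption: network byte order ("!"), which also disables padding.
ACK_FORMAT = "!H8xBIBHHH"
ACK_SIZE = struct.calcsize(ACK_FORMAT)  # 22 octets

PROT_VERSION_FCTRL = 0xB0 | 0x08        # 0xB8, from the README
MDS_PROT_FCTRL_ID = 0x4D445346          # placeholder value, not from the README
CHUNKACK = 1                            # placeholder value, not from the README


def pack_chunk_ack(service_id, acked_seq, chunk_size):
    """Build one ACK frame for the given service id."""
    # Assumption: the length field carries the total frame size.
    return struct.pack(ACK_FORMAT, ACK_SIZE, PROT_VERSION_FCTRL,
                       MDS_PROT_FCTRL_ID, CHUNKACK,
                       service_id, acked_seq, chunk_size)


def unpack_chunk_ack(frame):
    """Decode an ACK frame, rejecting anything that is not a CHUNKACK."""
    (_length, version, prot_id, msg_type,
     service_id, acked_seq, chunk_size) = struct.unpack(
         ACK_FORMAT, frame[:ACK_SIZE])
    if (version != PROT_VERSION_FCTRL or prot_id != MDS_PROT_FCTRL_ID
            or msg_type != CHUNKACK):
        raise ValueError("not a CHUNKACK frame")
    return {"service_id": service_id,
            "acked_seq": acked_seq,
            "chunk_size": chunk_size}
```

The `8x` pad in the format string covers oct2 to oct9, which the table marks as NOT USED.
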
+
+Pseudo code illustrating the data message handling at MDS:
+
+if protocol version is 0xB8 then
+  if protocol identifier is MDS_PROT_FCTRL_ID then
+    this is mds flow control message.
+    if message type is CHUNKACK, then
+      this is ACK message for successfully delivered data message(s).
+  else
+    this is data message within flow control.
+    decode oct8,9 as flow control sequence number.
+else
+  this is legacy data message.
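
The pseudo code above could be made concrete roughly as follows. This is a sketch only: the values of MDS_PROT_FCTRL_ID and CHUNKACK are placeholders, network byte order is assumed, and it assumes the protocol identifier is chosen so that it cannot collide with oct11 to oct14 of a normal data message (mds length plus the start of the mds header).

```python
import struct

PROT_VERSION_FCTRL = 0xB8        # from the README
MDS_PROT_FCTRL_ID = 0x4D445346   # placeholder value, not from the README
CHUNKACK = 1                     # placeholder value, not from the README


def classify(frame):
    """Return (kind, detail) for a received frame, following the pseudo code."""
    if len(frame) < 11:
        raise ValueError("frame too short to carry a protocol version")
    if frame[10] != PROT_VERSION_FCTRL:
        # Anything that is not 0xB8 is handled as a legacy data message.
        return ("legacy_data", None)
    if len(frame) >= 16:
        (prot_id,) = struct.unpack_from("!I", frame, 11)
        if prot_id == MDS_PROT_FCTRL_ID:
            msg_type = frame[15]
            if msg_type == CHUNKACK:
                return ("chunk_ack", None)
            return ("fctrl_other", msg_type)
    # Data message within flow control: oct8,9 hold the flow control sequence.
    (fctrl_seq,) = struct.unpack_from("!H", frame, 8)
    return ("fctrl_data", fctrl_seq)
```

A legacy sender never sets oct10 to 0xB8, so its frames always take the first branch, which is what makes the upgrade path transparent.
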
+
+Because the legacy MDS does not use oct8,9,10, it can communicate transparently
+with the new MDS regardless of the presence of flow control. Therefore, the
+upgrade will not be affected.
+
+In case the receiver end is at a legacy MDS version, the new MDS has a timer
+mechanism to recognize that the receiver does not support flow control. This
+timer, namely the tx-probation timer, is implemented at the sender so that MDS
+will stop queuing messages for a receiver whose MDS has no flow control.
+
+MDS's sliding window
+--------------------
+One important factor that needs to be documented in the implementation of MDS's
+sliding window is that "TIPC's link layer delivery guarantee, the only limiting
+factor for datagram delivery is the socket receive buffer size" [1][3][4].
+Therefore, MDS at the sender side does not have to implement a retransmission
+timer. Also, if MDS at the sender side anticipates the buffer overflow at the
+receiver ends, or receives the first ancillary message, MDS starts queuing
+messages until the buffer overflow is resolved, and then resumes data message
+transmission.
+
+(1) Sender sequence window
+
+acked_: last sequence that has been acked by the receiver
+send_:  next sequence to be sent
+nacked_space_: total bytes that are sent but not acked
+
+Example:
+   1     2     3     4     5     6     7     8
+|-----|-----|-----|-----|-----|-----|-----|-----|
+            acked_                  send_
+If acked_:3, send_:7, then
+The messages with sequence 1,2,3 have been acked.
+The 4,5,6 are sent but not acked, and are still in queue.
+The 7 is not sent yet.
+The nacked_space_: bytes_of(4,5,6)
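
The sender-side bookkeeping in this example can be sketched as below, reusing the README's names (acked_, send_, nacked_space_). Two assumptions are made here that the README does not state: a chunk ack (acked_seq, chunk_size) is taken to cover the chunk_size sequences ending at acked_seq, and 16-bit sequence wrap-around is ignored for brevity.

```python
from collections import OrderedDict


class SenderWindow:
    """Sender-side window using the README's field names."""

    def __init__(self):
        self.acked_ = 0               # last sequence acked by the receiver
        self.send_ = 1                # next sequence to be sent
        self.nacked_space_ = 0        # bytes sent but not yet acked
        self.queue_ = OrderedDict()   # seq -> payload, kept until acked

    def send(self, payload):
        """Record a message as sent and keep it for a possible retransmit."""
        seq = self.send_
        self.queue_[seq] = payload
        self.nacked_space_ += len(payload)
        self.send_ += 1
        return seq

    def on_chunk_ack(self, acked_seq, chunk_size):
        """Drop acknowledged messages from the queue."""
        # Assumption: the ack covers acked_seq - chunk_size + 1 .. acked_seq.
        for seq in range(acked_seq - chunk_size + 1, acked_seq + 1):
            payload = self.queue_.pop(seq, None)
            if payload is not None:
                self.nacked_space_ -= len(payload)
        self.acked_ = max(self.acked_, acked_seq)
```

Sending six messages and then receiving a chunk ack for sequences 1 to 3 leaves acked_ at 3, send_ at 7 and messages 4, 5, 6 in the queue, which is the situation drawn above.
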
+
+(2) Receiver sequence window
+
+acked_: last sequence that has been acked to the sender
+rcv_: last sequence that has been received
+nacked_space_: total bytes that have not been acked
+Example:
+   1     2     3     4     5     6     7     8     9     10
+|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
+            acked_                        rcv_
+If acked_:3, rcv_:8
+The messages with sequence 1,2,3 have been acked.
+The 4,5,6,7,8 are received but not acked, and are still in the sender's queue.
+The 9,10 are not received yet.
+The nacked_space_: bytes_of(4,5,6,7,8)
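
A matching receiver-side sketch is shown below. It only illustrates how a per-portid sequence lets the receiver notice a drop. What MDS actually does when it sees a gap is not described in this README, so the sketch just reports it. Sequence wrap-around is again ignored.

```python
class ReceiverWindow:
    """Receiver-side window using the README's field names."""

    def __init__(self):
        self.acked_ = 0          # last sequence acked to the sender
        self.rcv_ = 0            # last sequence received
        self.nacked_space_ = 0   # bytes received but not yet acked

    def on_data(self, seq, size):
        """Return 'ok', 'duplicate' or 'gap' for an incoming data message."""
        if seq <= self.rcv_:
            return "duplicate"
        if seq != self.rcv_ + 1:
            # Something between rcv_ and seq never arrived.
            return "gap"
        self.rcv_ = seq
        self.nacked_space_ += size
        return "ok"

    def make_ack(self):
        """Return (acked_seq, chunk_size) for everything not yet acked."""
        chunk_size = self.rcv_ - self.acked_
        if chunk_size == 0:
            return None
        self.acked_ = self.rcv_
        self.nacked_space_ = 0
        return (self.rcv_, chunk_size)
```

With acked_ at 3 and rcv_ at 8, the next make_ack() call yields (8, 5), one ACK covering the five messages 4 to 8.
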
+
+TIPC portid state machine and its transition
+--------------------------------------------
+kDisabled, // no flow control support at this state
+kStartup,  // a newly published portid starts at this state
+kTxProb, // txprob timer is running to confirm if the flow control is supported
+kEnabled, // flow control support is confirmed, data flow is controlled
+kRcvBuffOverflow // anticipating (or experienced) the receiver's buffer overflow
+
+  kDisabled <--- kStartup --------
+     /|\             |           |
+      |              |           |
+      |              V           V
+      -----------kTxProb ---> kEnabled <---> kRcvBuffOverflow
+
+At the kRcvBuffOverflow state, the messages that MDS's users request to send
+will be enqueued at the sender side. When the state returns to kEnabled,
+the queued messages will be transmitted, and the transmission is open again
+for MDS's users.
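
The diagram and the queuing rule can be put into a small sketch like the one below. The transition table is read off the diagram as drawn. The PortId class and its transmit callback are hypothetical names for illustration and are not taken from the patch.

```python
from enum import Enum


class State(Enum):
    kDisabled = 0
    kStartup = 1
    kTxProb = 2
    kEnabled = 3
    kRcvBuffOverflow = 4


# Edges as drawn in the state diagram of the README.
ALLOWED = {
    State.kStartup: {State.kDisabled, State.kTxProb, State.kEnabled},
    State.kTxProb: {State.kDisabled, State.kEnabled},
    State.kEnabled: {State.kRcvBuffOverflow},
    State.kRcvBuffOverflow: {State.kEnabled},
    State.kDisabled: set(),
}


class PortId:
    """Hypothetical per-portid object holding state and pending messages."""

    def __init__(self, transmit):
        self.state = State.kStartup
        self.pending = []
        self.transmit = transmit   # callable that puts one message on the wire

    def change_state(self, new_state):
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"{self.state.name} -> {new_state.name} not allowed")
        old, self.state = self.state, new_state
        if old is State.kRcvBuffOverflow and new_state is State.kEnabled:
            # Overflow resolved: flush what was queued, in order.
            pending, self.pending = self.pending, []
            for msg in pending:
                self.transmit(msg)

    def send(self, msg):
        if self.state is State.kRcvBuffOverflow:
            self.pending.append(msg)   # hold back while the receiver is full
        else:
            self.transmit(msg)
```

Keeping the edges in one table makes it easy to compare the code against the diagram when a transition is added or removed.
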
+
+At this version, MDS changes to the kRcvBuffOverflow state if the TIPC_RETDATA
+event is returned, which is known as loss-based buffer overflow detection.
+Another approach is that MDS can utilize the TIPC_USED_RCV_BUFF TIPC socket
+option so that the senders can periodically get updates of the receiver's TIPC
+sock buffer utilization. In that way, the senders can anticipate the buffer
+overflow in advance, which is called loss-less detection in MDS's context.
+
+Configuration
+=============
+ChunkAckTimeout timer: the receivers send the ACK message if this timer expires.
+If this ChunkAckTimeout is set too large, the round trip of data message
+acknowledgment increases, and data messages stay too long in the queue.
+
+ChunkAckSize: the number of messages to be acknowledged in an ACK message. If
+this ChunkAckSize is too small, a large number of ACK messages will be sent
+between the two ends, which adds overhead to MDS's message handling.
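
The trade-off between the two settings can be seen with a little arithmetic. The sketch below assumes the receiver sends an ACK either when ChunkAckSize messages have accumulated or when ChunkAckTimeout expires, whichever comes first. The timeout part is stated above, while the size trigger is an interpretation of the ChunkAckSize description. The function names and the numbers used in the note are illustrative only.

```python
import math


def should_send_ack(unacked, elapsed_ms, chunk_ack_size, chunk_ack_timeout_ms):
    """Decide whether the receiver should emit an ACK now."""
    if unacked == 0:
        return False                      # nothing to acknowledge
    return unacked >= chunk_ack_size or elapsed_ms >= chunk_ack_timeout_ms


def acks_for_burst(num_messages, chunk_ack_size):
    """ACK messages needed for a burst when only the size trigger fires."""
    return math.ceil(num_messages / chunk_ack_size)
```

For a burst of 1000 data messages, a chunk size of 1 costs 1000 ACKs, a chunk size of 3 costs 334, and a chunk size of 100 costs 10. In the other direction, a lone message that never fills a chunk stays in the sender's queue until the timeout fires.
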
+
+References
+==========
+[1] http://tipc.io/programming.html, 1.3.1. Datagram Messaging
+[2] https://tools.ietf.org/html/rfc793, page 20.
+[3] http://www.tipc.io/protocol.html#anchor71, 7.2.5. Sequence Control and Retransmission
+[4] http://tipc.sourceforge.net/protocol.html, 4.2. Link


