--- src/mds/README | 221 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 221 insertions(+) create mode 100644 src/mds/README
diff --git a/src/mds/README b/src/mds/README new file mode 100644 index 0000000..1b94632 --- /dev/null +++ b/src/mds/README @@ -0,0 +1,221 @@ +/* -*- OpenSAF -*- + * + * (C) Copyright 2019 The OpenSAF Foundation + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY + * or FITNESS FOR A PARTICULAR PURPOSE. This file and program are licensed + * under the GNU Lesser General Public License Version 2.1, February 1999. + * The complete license can be accessed from the following location: + * http://opensource.org/licenses/lgpl-license.php + * See the Copying file included with the OpenSAF distribution for full + * licensing terms. + * + * Author(s): Ericsson AB + * + */ +Background +========== +If OpenSAF configures TIPC as transport, the MDS library today will use +TIPC SOCK_RDM socket for message distribution in the cluster. The SOCK_RDM +datagram socket possibly encounters buffer overflow at receiver ends which +has been documented in tipc.io[1]. A temporary solution for this buffer +overflow issue is that the socket buffer size can be increased to a larger +number. However, if the cluster continues either scaling out or adding more +components, the system will be under dimensioned, thus the TIPC buffer +overflow can occur again. + +MDS's solution for TIPC buffer overflow +======================================= +If MDS disables TIPC_DEST_DROPPABLE, TIPC will return the ancillary message +when the original message is failed to deliver. By this event, if the message +has been saved in queue, MDS at sender sides can search and retransmit this +message to the receivers. +Once the messages in the sender's queue has been delivered successfully, MDS +needs to remove them. MDS introduces its internal ACK message as an +acknowledgment from receivers so that the senders can remove the messages +out of the queue. +Also, as such situation of buffer overflow at receivers, the retransmission may +not succeed or even become worse at receiver ends (the more retransmission, +the more overflow to occur). MDS imitates the sliding window in TCP[2] to +control the flow of data message towards the receivers. + +Legacy MDS data message, new (data + ACK) MDS message, and upgradability +------------------------------------------------------------------------ +Below is the MDS legacy message format that has been used till OpenSAF 5.19.07 + +oct 0 message length +oct 1 +------------------------------------------ +oct 2 sequence number: incremented for every message sent out to all destined +... tipc portid. +oct 5 +------------------------------------------ +oct 6 fragment number: a message with same sequence number can be fragmented, +oct 7 identified by this fragment number. +------------------------------------------ +oct 8 length check: cross check with message length(oct0,1), NOT USED. +oct 9 +------------------------------------------ +oct 10 protocol version: (MDS_PROT:0xA0 | MDS_VERSION:0x08) = 0xA8, NOT USED +------------------------------------------ +oct 11 mds length: length of mds header and mds data, starting from oct13 +oct 12 +------------------------------------------ +oct 13 mds header and data +... +------------------------------------------ + +The current sequence number/fragment number are being used in MDS for all +messages sent to all discovered tipc portid(s), meaning that every message is sent +to any tipc portid, the sequence/fragment number is increased. The flow control +needs its own sequence number sliding between two tipc porid(s) so that receivers +can detect message drop due to buffer overload. Therefore, the oct8 and oct9 are +now reused as flow control sequence number. The oct10, protocol version, has new +value of 0xB8. The format of new data message as below: + +oct 0 same +... +oct 7 +------------------------------------------ +oct 8 flow control sequence number +oct 9 +------------------------------------------ +oct 10 protocol version: (MDS_PROT_TIPC_FCTRL:0xB0 | MDS_VERSION:0x08) = 0xB8 +------------------------------------------ +oct 11 same +... +------------------------------------------ + +The ACK message is introduced to acknowledge one data message or a chunk of +accumulative data message. The ACK message format: + +oct 0 message length +oct 1 +------------------------------------------ +oct 2 8 bytes, NOT USED +.... +oct 9 +------------------------------------------ +oct 10 protocol version: (MDS_PROT_TIPC_FCTRL:0xB0 | MDS_VERSION:0x08) = 0xB8 +------------------------------------------ +oct 11 protocol identifier: MDS_PROT_FCTRL_ID +...... +oct 14 +------------------------------------------ +oct 15 flow control message type: CHUNKACK +------------------------------------------ +oct 16 service id: service id of data messages to be acknowledged +oct 17 +------------------------------------------ +oct 18 acknowledged sequence +oct 19 +------------------------------------------ +oct 20 chunk size +oct 21 +------------------------------------------ + +Psuedo code illustrates the data message handling at MDS. + +if protocol version is 0xB8 then + if protocol identifier is MDS_PROT_FCTRL_ID then + this is mds flow control message. + if message type is CHUNKACK, then + this is ACK message for successfully delivered data message(s). + else + this is data message within flow control. + decode oct8,9 as flow control sequence number. +else + this is legacy data message. + +Because the legacy MDS does not use oct8,9,10, so it can communicate transparently +to the new MDS in regard to the presence of flow control. Therefore, the upgrade +will not be affected. + +In case that the receiver end is at legacy MDS version, the new MDS has a timer +mechanism to recognize if the receiver has no flow control supported. This timer +is implemented at the sender so that MDS will stop message queuing for a non-flow +-control MDS at receivers, namely tx-probation timer. + +MDS's sliding window +-------------------- +One important factor that needs to be documented in the implementation of MDS's +sliding window, is that "TIPC's link layer delivery guarantee, the only limiting +factor for datagram delivery is the socket receive buffer size" [1][3][4]. Therefore, +MDS at sender side does not have to implement the retransmission timer. Also, if +MDS at sender side anticipates the buffer overflow at receiver ends, or receives +the first ancillary message, MDS starts queuing messages till the buffer overflow +is resolved to resume data message transmission. + +(1) Sender sequence window + +acked_: last sequence has been acked by receiver +send_: next sequence to be sent +nacked_space_: total bytes are sent but not acked + +Example: + 1 2 3 4 5 6 7 8 +|-----|-----|-----|-----|-----|-----|-----|-----| + acked_ send_ +If acked_:3, send_:7, then +The message with sequence 1,2,3 have been acked. +The 4,5,6 are sent but not acked, and are still in queue. +The 7 is not sent yet. +The nacked_space_: bytes_of(4,5,6) + +(2) Receiver sequence window + +acked_: last sequence has been acked to sender +rcv_: last sequence has been received +nacked_space_: total bytes has not been acked +Example: + 1 2 3 4 5 6 7 8 9 10 +|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----| + acked_ rcv_ +If acked_:3, rcv_:8 +The message with sequence 1,2,3: has been acked. +The 4,5,6,7,8 are received by not acked, still in sender's queue. +The 9,10 are not received yet +The nacked_space_: bytes_of(4,5,6,7,8) + +TIPC portid state machine and its transition +-------------------------------------------- +kDisabled, // no flow control support at this state +kStartup, // a newly published portid starts at this state +kTxProb, // txprob timer is running to confirm if the flow control is supported +kEnabled // flow control support is confirmed, data flow is controlled +kRcvBuffOverflow // anticipating (or experienced) the receiver's buffer overflow + + kDisabled <--- kStartup -------- + /|\ | | + | | | + | V V + -----------kTxProb ---> kEnabled <---> kRcvBuffOverflow + +At the kRcvBuffOverflow state, the messages are being requested to send by MDS's +users will be enqueued at sender sides. When the state returns back to kEnabled, +the queued messages will be transmitted, the transmission is open for MDS's users. + +At this version, MDS changes to kRcvBuffOverflow state if the TIPC_RETDATA event +is returned, which is known as loss-based buffer overflow detection. Another +approach is that MDS can utilize the TIPC_USED_RCV_BUFF TIPC socket option +so that the senders can periodically get update of the receiver's TIPC sock buffer +utilization. In that way, the senders can anticipate the buffer overflow in advance, +which is called in MDS's context as a loss-less detection. + +Configuration +============= +ChunkAckTimeout timer: the receivers send the ACK message if this timer expires. +If this ChunkAckTimeout is set too large, the round trip of data message +acknowledgment increases, data message stays too long in the queue. + +ChunkAckSize: The number of message to be acknowledged in an ACK message. If +this ChunkAckSize is too small, there will be a plentiful number of ACK messages +sent across two ends, which causes the overhead cost to MDS's message handling. + +References +========== +[1] http://tipc.io/programming.html, 1.3.1. Datagram Messaging +[2] https://tools.ietf.org/html/rfc793, page 20. +[3] http://www.tipc.io/protocol.html#anchor71, 7.2.5. Sequence Control and Retransmission +[4] http://tipc.sourceforge.net/protocol.html, 4.2. Link -- 2.7.4 _______________________________________________ Opensaf-devel mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/opensaf-devel
