I'm submitting this fast-track for Cathy Zhou, timing out on 06/14/2010. The design document referenced below is contained in the materials directory.

Problem Area
============

   TCP is widely used for reliable end-to-end network communication,
   yet users must often perform tedious manual optimizations simply
   to achieve acceptable performance. One important task among these
   is to determine and set an appropriate socket buffer size that
   achieves better performance within the limits of the available
   memory. This manual task usually requires a very experienced
   administrator or application developer. The dynamic nature of the
   networking environment makes the task even more complicated.

   Therefore, we have identified the need for an automatic buffer
   sizing algorithm in our TCP/IP stack that adjusts the buffer size
   based on the current state of each specific connection. The goal
   is to achieve desirable transfer rates on each connection out of
   the box, without manual intervention.

Details
=======

   After investigating several existing algorithms and doing some
   experiments with the prototypes, we decided to deploy the DRS
   (Dynamic Right-Sizing)[1] receive buffer auto-sizing algorithm in
   Solaris.

   Basically, DRS lets the receiver estimate the sender's congestion
   window size (cwnd) by measuring the amount of data received
   during one round-trip time (the BDP measurement), and then uses
   that measurement to dynamically change the size of the receiver's
   receive buffer.

   Specifically in Solaris, there are several aspects to the DRS
   algorithm:

   - RTT measurement

     * High-resolution time

       In today's Solaris, the TCP timestamps option is filled in with
       a low-resolution time unit (llbolt), which has a precision of
       10ms. This makes the RTT measurement imprecise, which
       ultimately results in over-estimation of the receive buffer
       size.

       Therefore, the high-resolution time (i.e., the gethrtime()
       function) will be used instead to get the current timestamps in
       our TCP/IP stack. Since gethrtime() returns a 64-bit result
       expressed in nanoseconds, the high-resolution time will be
       right-shifted by 20 bits (2^20 ns is about 1.05ms) and
       truncated to fit into the 32-bit timestamps option. This gives
       a time precision of about 1ms (1ms is the finest precision
       allowed by PAWS[2]), and hence prevents overbooking of the
       receive buffer.
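
       As a rough illustration of the conversion described above, the
       sketch below derives a 32-bit timestamp value from a
       nanosecond clock. The helper names (hrtime_to_tsval and so on)
       and the TS_SHIFT constant are hypothetical, chosen for this
       example only; they are not the names used in the Solaris
       source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constant: 2^20 ns = 1,048,576 ns, roughly 1ms per tick. */
#define TS_SHIFT 20

/*
 * Convert a 64-bit nanosecond time, such as the value returned by
 * gethrtime(), into a 32-bit timestamp value. The upper bits are
 * discarded, so the result wraps around.
 */
static uint32_t
hrtime_to_tsval(int64_t hrtime_ns)
{
    assert(hrtime_ns >= 0);
    return ((uint32_t)((uint64_t)hrtime_ns >> TS_SHIFT));
}

/* Elapsed ticks between two timestamps; unsigned subtraction is wrap-safe. */
static uint32_t
tsval_elapsed(uint32_t now, uint32_t then)
{
    return (now - then);
}

/* Convert a tick count back to microseconds (one tick is 2^20 ns). */
static uint64_t
tsval_ticks_to_usec(uint32_t ticks)
{
    return (((uint64_t)ticks << TS_SHIFT) / 1000);
}
```

       Note that one second of high-resolution time corresponds to
       953 ticks rather than 1000, so any code converting ticks back
       to time has to scale by 2^20 ns per tick.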

     * Send-side RTT measurement

       Solaris today already has send-side RTT measurement, based
       either on the timestamps option or on observing the time
       between a packet and its acknowledgment. Note that the
       send-side averaged RTT will only be used as one source of the
       RTT measurement once there are more than
       tcp_xmit_rtt_autosize_start (a new TCP property, default 20)
       RTT samples, that is, when it is statistically useful.

     * Receive-side RTT measurement

       Besides the existing send-side RTT measurement, we will add the
       receive-side RTT measurement as well. We will get the
       receive-side RTT samples by observing:

       - the time difference when a timestamp is reflected back from
         the peer if the timestamps option is enabled;

       - the time between the acknowledgment of sequence number S
         which announces receive window W and the receipt of a data
         segment that contains a sequence number of at least S + W
         + 1.

       Since the above receive-side RTT sampling acts only as an
       upper-bound on the RTT, each time we get a sample, the
       receive-side RTT will be updated to be the minimum of the old
       receive-side RTT value and the current RTT measurement.

     Both the receive-side RTT and the send-side RTT will be used as
     sources of the RTT estimate when they are available. The final
     RTT will be set to the average of the two values.
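
     The way the two RTT sources are combined can be sketched as
     follows. This is a minimal model, not the Solaris code: the
     rtt_state_t structure, the function names, and the use of 0 to
     mean "no sample yet" are all assumptions made for illustration.
     The fallback to a single source when only one is available is
     also an assumption, since the text above only defines the case
     where both exist.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RTT_UNSET 0 /* assumption: 0 means "no sample yet" */

typedef struct rtt_state {
    uint32_t rx_rtt_us;      /* receive-side RTT: running minimum */
    uint32_t tx_rtt_us;      /* send-side averaged RTT */
    uint32_t tx_samples;     /* send-side samples collected so far */
    uint32_t tx_min_samples; /* plays the role of tcp_xmit_rtt_autosize_start */
} rtt_state_t;

/* Each receive-side sample is an upper bound, so keep the minimum. */
static void
rx_rtt_update(rtt_state_t *rs, uint32_t sample_us)
{
    assert(rs != NULL);
    if (rs->rx_rtt_us == RTT_UNSET || sample_us < rs->rx_rtt_us)
        rs->rx_rtt_us = sample_us;
}

/* Average both sources when both are usable, else use whichever exists. */
static uint32_t
rtt_estimate(const rtt_state_t *rs)
{
    assert(rs != NULL);
    int tx_ok = rs->tx_samples > rs->tx_min_samples &&
        rs->tx_rtt_us != RTT_UNSET;
    int rx_ok = rs->rx_rtt_us != RTT_UNSET;

    if (tx_ok && rx_ok)
        return ((uint32_t)(((uint64_t)rs->tx_rtt_us + rs->rx_rtt_us) / 2));
    if (rx_ok)
        return (rs->rx_rtt_us);
    if (tx_ok)
        return (rs->tx_rtt_us);
    return (RTT_UNSET);
}
```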

   - Receive buffer auto-sizing

     At the start of the connection, if receive buffer auto-sizing is
     enabled for this connection, the receive buffer will be
     initialized to tcp_recv_autosize_initmss * MSS
     (tcp_recv_autosize_initmss is another new TCP property, default
     15).  This is an attempt to ensure that the sender is not
     receive-window limited during the first RTT.

     Then each time a packet is received, the current time will be
     compared to the last measurement time for that connection. If
     more than the current estimated RTT has passed, the highest
     sequence number seen is compared to the next sequence number
     expected at the beginning of the measurement (BDP
     measurement). The receive buffer size for this connection is then
     updated so that the advertised receive window (rwnd) is 3.5
     times the sequence number difference.
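
     The initialization and per-packet measurement steps above can be
     sketched roughly as below. The drs_state_t structure and both
     function names are hypothetical. Two behaviours are assumptions
     of this sketch rather than statements from the text: the buffer
     is only ever grown, never shrunk, and both the initial size and
     the computed size are clamped to the maximum buffer size.

```c
#include <assert.h>
#include <stdint.h>

typedef struct drs_state {
    uint64_t last_meas_us; /* time the current measurement started */
    uint32_t start_seq;    /* next expected sequence number at that time */
    uint32_t rcvbuf;       /* current receive buffer size in bytes */
    uint32_t max_rcvbuf;   /* upper limit, playing the role of tcp_max_buf */
} drs_state_t;

/* Start with initmss * MSS so the sender is not limited in the first RTT. */
static void
drs_init(drs_state_t *ds, uint32_t initmss, uint32_t mss,
    uint32_t max_rcvbuf, uint32_t rcv_nxt, uint64_t now_us)
{
    assert(mss > 0 && max_rcvbuf > 0);
    uint64_t initial = (uint64_t)initmss * mss;

    ds->rcvbuf = (initial > max_rcvbuf) ? max_rcvbuf : (uint32_t)initial;
    ds->max_rcvbuf = max_rcvbuf;
    ds->start_seq = rcv_nxt;
    ds->last_meas_us = now_us;
}

/*
 * Called for each received packet. Returns 1 if the buffer was
 * resized, 0 otherwise. A new measurement interval starts whenever
 * more than one estimated RTT has elapsed.
 */
static int
drs_on_receive(drs_state_t *ds, uint64_t now_us, uint32_t rtt_us,
    uint32_t rcv_nxt, uint32_t highest_seq)
{
    if (now_us - ds->last_meas_us <= rtt_us)
        return (0);

    /* Bytes seen during roughly one RTT; unsigned math is wrap-safe. */
    uint32_t delta = highest_seq - ds->start_seq;
    uint64_t want = ((uint64_t)delta * 7) / 2; /* 3.5 * delta */

    if (want > ds->max_rcvbuf)
        want = ds->max_rcvbuf;

    ds->last_meas_us = now_us;
    ds->start_seq = rcv_nxt;

    if (want <= ds->rcvbuf)
        return (0); /* assumption: never shrink the buffer */
    ds->rcvbuf = (uint32_t)want;
    return (1);
}
```

     With the default of 15 and an MSS of 1448 bytes, the initial
     buffer in this sketch is 21720 bytes.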

   - Window Scaling and loopback connections

    Window scaling is necessary for the receive buffer auto-sizing
    algorithm in order to advertise a large receive window. Because
    the window scale is negotiated when a connection is set up and
    cannot be changed afterwards, the scale must be carefully chosen
    to be the smallest window scale value that can represent the
    maximum receive buffer size (defined by tcp_max_buf).
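
    Choosing the smallest sufficient window scale can be sketched as
    a short loop. The function and macro names here are hypothetical;
    the limits themselves (a 16-bit window field and a maximum shift
    of 14) come from RFC 1323[2].

```c
#include <assert.h>
#include <stdint.h>

#define WIN_MAX_UNSCALED 65535u /* largest value of the 16-bit window field */
#define WSCALE_MAX       14     /* largest shift count allowed by RFC 1323 */

/*
 * Return the smallest shift count ws such that a window of max_buf
 * bytes fits in (65535 << ws), capped at the protocol maximum.
 */
static int
pick_wscale(uint32_t max_buf)
{
    int ws = 0;

    while (ws < WSCALE_MAX &&
        ((uint64_t)WIN_MAX_UNSCALED << ws) < max_buf)
        ws++;
    assert(ws >= 0 && ws <= WSCALE_MAX);
    return (ws);
}
```

    For a maximum buffer of 1048576 bytes this yields a shift of 5,
    since 65535 << 4 falls just short of 1MB.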

    If the window scaling negotiation fails, the TCP buffer
    auto-sizing will be disabled over this specific connection.

    In the case of loopback TCP connections, thread scheduling affects
    performance more than the receive buffer size. To avoid
    unnecessary code complexities (especially for the TCP fusion code
    path), the receive buffer auto-sizing is disabled for loopback
    connections as well.

Interface Changes
=================

   - Relevant TCP/IP protocol properties

     1. tcp_recvbuf_autosize (new)

        A new tcp_recvbuf_autosize property will be introduced. The
        currently possible values are "off" and "drs". In the future,
        other values may be introduced when more auto-sizing
        algorithms are implemented.

        If tcp_recvbuf_autosize is set to "drs", DRS receive buffer
        auto-sizing will be attempted on all non-loopback
        connections. If it is set to "off", auto-sizing will be
        disabled.

        By default, this property will be set to "drs".

        As with other ipadm properties, the 'sys_ip_config' privilege
        is required to configure this property.

     2. tcp_recv_autosize_initmss (new)

        If the receive buffer auto-sizing is enabled, a TCP
        connection's initial receive buffer size will be set as
        tcp_recv_autosize_initmss * MSS.

        The default value of tcp_recv_autosize_initmss will be set to
        15.

     3. tcp_xmit_rtt_autosize_start (new)

        As discussed above, the send-side RTT estimate only becomes
        statistically useful when there are enough samples: once
        there are more than tcp_xmit_rtt_autosize_start send-side RTT
        measurement samples, we will start to use the averaged
        send-side RTT as one source of the RTT estimate of the
        connection.

        The default value of tcp_xmit_rtt_autosize_start will be set
        to 20.

     All 3 new properties will be Consolidation Private for now, and
     are only expected to be used for diagnostic purposes.  They will
     only be made public if they prove to be useful for customers.

     4. tcp_max_buf

        The existing tcp_max_buf property will be used as the maximum
        receive buffer size that can be set (or auto-sized to). The
        same value will be used to determine the window scale value of
        the connection if the auto-sizing is enabled.

     5. recv_maxbuf

        This is an existing property that is common to transport
        protocols. For TCP specifically, today it defines the default
        receive buffer size for a connection. After this project is
        integrated, if auto-sizing is enabled for a specific TCP
        connection, the recv_maxbuf property will become irrelevant,
        since the receive buffer size will be automatically adjusted
        based on network conditions.

   - Socket options

     1. SO_RCVBUF

        In today's Solaris, this socket option is used to set/get the
        current receive buffer size of a specific TCP connection. In
        the future, the receive buffer size will be changed
        dynamically during the lifetime of the connection when the
        auto-sizing is enabled.

        After this project, the semantics of SO_RCVBUF will change.
        If auto-sizing is disabled, the SO_RCVBUF socket option will
        still be used to set/get the current receive buffer size. If
        auto-sizing is enabled, on the other hand, the SO_RCVBUF
        socket option will be used to set the maximum receive buffer
        size that a connection can be auto-sized up to.

     2. TCP_RCVBUF_AUTOSIZE (new)

        A new IPPROTO_TCP level TCP_RCVBUF_AUTOSIZE socket option will
        be introduced to enable or disable the receive buffer
        auto-sizing for a specific TCP connection. It can be set to
        either "TCP_AUTOSIZE_OFF" (disable) or "TCP_AUTOSIZE_DRS"
        (enable the DRS algorithm) for now.

        When the TCP_RCVBUF_AUTOSIZE option is used to disable receive
        buffer auto-sizing, the current receive buffer size will be
        set to the value specified by the recv_maxbuf property, the
        same as the current Solaris behavior.

        When the TCP_RCVBUF_AUTOSIZE option is used to enable receive
        buffer auto-sizing, the maximum receive buffer size will be
        set to the value specified by the tcp_max_buf property. The
        current receive buffer size will initially stay the same and
        will then be automatically adjusted based on the current
        network conditions.

     3. TCP_CUR_RCVBUF (new)

        A new IPPROTO_TCP level TCP_CUR_RCVBUF socket option will be
        introduced to get the current receive buffer size of a
        specific TCP connection. This option will be read-only and
        its return value may change during the lifetime of a TCP
        connection.

     Both TCP_RCVBUF_AUTOSIZE and TCP_CUR_RCVBUF socket options will
     be Consolidation Private interfaces for now and will only be made
     public if they prove to be useful for customers.
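
     The interaction between the three socket options can be
     summarized with a small user-space model. This is not kernel
     code and does not call setsockopt(); conn_model_t, the model_*
     functions and the AUTOSIZE_* constants are hypothetical stand-ins
     for the connection state and for TCP_AUTOSIZE_OFF and
     TCP_AUTOSIZE_DRS. Rejecting SO_RCVBUF values above tcp_max_buf
     with an error is an assumption of the sketch.

```c
#include <assert.h>
#include <stdint.h>

enum { AUTOSIZE_OFF = 0, AUTOSIZE_DRS = 1 }; /* hypothetical stand-ins */

typedef struct conn_model {
    int autosize;         /* AUTOSIZE_OFF or AUTOSIZE_DRS */
    uint32_t cur_rcvbuf;  /* current receive buffer size */
    uint32_t max_rcvbuf;  /* limit that auto-sizing may grow to */
    uint32_t recv_maxbuf; /* value of the recv_maxbuf property */
    uint32_t tcp_max_buf; /* value of the tcp_max_buf property */
} conn_model_t;

/* Model of setting SO_RCVBUF: meaning depends on the auto-sizing state. */
static int
model_set_so_rcvbuf(conn_model_t *c, uint32_t val)
{
    if (val > c->tcp_max_buf)
        return (-1); /* assumption: over-limit requests fail */
    if (c->autosize == AUTOSIZE_OFF)
        c->cur_rcvbuf = val; /* sets the current size */
    else
        c->max_rcvbuf = val; /* sets the auto-sizing ceiling */
    return (0);
}

/* Model of setting TCP_RCVBUF_AUTOSIZE. */
static void
model_set_autosize(conn_model_t *c, int mode)
{
    assert(mode == AUTOSIZE_OFF || mode == AUTOSIZE_DRS);
    c->autosize = mode;
    if (mode == AUTOSIZE_OFF)
        c->cur_rcvbuf = c->recv_maxbuf; /* back to the default size */
    else
        c->max_rcvbuf = c->tcp_max_buf; /* current size is left as is */
}

/* Model of reading TCP_CUR_RCVBUF (read-only). */
static uint32_t
model_get_cur_rcvbuf(const conn_model_t *c)
{
    return (c->cur_rcvbuf);
}
```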

   - Observability

     1. pfiles(1)

        pfiles(1) will be extended to show whether receive buffer
        auto-sizing is enabled and the current receive buffer size
        for a TCP connection. We will also make several other minor
        changes to the output format so that it reads more clearly:

            9: S_IFSOCK mode:0666 dev:337,0 ino:10285 uid:0 gid:0 size:0
               O_RDWR|O_NONBLOCK
               SOCK_STREAM
               send buffer: 49152 bytes
               receive buffer: 21720 bytes (maximum: 1048576 bytes)
               auto-size: enabled
               sockname: AF_INET 129.146.104.83 port: 65506
               peername: AF_INET 192.18.34.10 port: 5001

     2. DTrace probes

        sdt DTrace probes will be provided to observe the RTT
        measurement process, the BDP calculation process and the
        current receive buffer size.

Interface Table
===============

   +-----------------------------+-----------------------+----------------+
   |          Interface          |     Stability         |  Description   |
   +-----------------------------+-----------------------+----------------+
   | TCP_RCVBUF_AUTOSIZE         | Consolidation Private | socket option  |
   | TCP_CUR_RCVBUF              | Consolidation Private | socket option  |
   +-----------------------------+-----------------------+----------------+
   | tcp_recvbuf_autosize        | Consolidation Private | ipadm property |
   | tcp_recv_autosize_initmss   | Consolidation Private | ipadm property |
   | tcp_xmit_rtt_autosize_start | Consolidation Private | ipadm property |
   +-----------------------------+-----------------------+----------------+
   | sdt probes for auto-sizing  | Project Private       | DTrace probes  |
   +-----------------------------+-----------------------+----------------+

References
==========

[1] M. Fisk, W. Feng, "Dynamic Right-Sizing in TCP", Oct. 2001.

[2] RFC 1323, "TCP Extensions for High Performance"

[3] Receive buffer auto-sizing design document