I'm submitting this fast-track for Cathy Zhou, timing out on 06/14/2010.
The design document referenced below is contained in the materials
directory.
Problem Area
============
Although TCP is widely used for reliable end-to-end network
communication, users must often perform tedious manual tuning
simply to achieve acceptable performance. One important tuning
task is choosing a socket buffer size that gives good performance
within the limits of the available memory. This task usually
requires a very experienced administrator or application
developer, and the dynamic nature of the networking environment
makes it even more complicated.
Therefore, we see the need for an auto-sizing algorithm in our
TCP/IP stack that automatically adjusts the buffer size based on
the current state of each connection. The goal is to achieve good
transfer rates on each connection out of the box, without manual
intervention.
Details
=======
After investigating several existing algorithms and doing some
experiments with the prototypes, we decided to deploy the DRS
(Dynamic Right-Sizing)[1] receive buffer auto-sizing algorithm in
Solaris.
Basically, DRS lets the receiver estimate the sender's congestion
window size (cwnd) by measuring the amount of data received during
one round-trip time (the BDP measurement), and then uses that
measurement to dynamically change the size of the receiver's
receive buffer.
Specifically in Solaris, there are several aspects of the DRS
algorithm:
- RTT measurement
* High-resolution time
In today's Solaris, the TCP timestamps option is filled in with
a low-resolution time unit (llbolt), which has a precision of
10ms. This makes the RTT measurement imprecise, which
ultimately results in over-estimation of the receive buffer
size.
Therefore, high-resolution time (i.e., the gethrtime()
function) will be used instead to get the current timestamp in
our TCP/IP stack. Since gethrtime() returns a 64-bit result
expressed in nanoseconds, the high-resolution time will be
right-shifted by 20 bits to fit into the 32-bit timestamps
option. This gives a time precision of about 1ms (2^20 ns; 1ms
is the finest granularity allowed by PAWS[2]), and hence
prevents overbooking of the receive buffer.
* Send-side RTT measurement
Solaris today already has send-side RTT measurement, based
either on the timestamps option or on the observed time between
a packet and its acknowledgment. Note that the averaged
send-side RTT will only be used as one source of the RTT
measurement once there are more than
tcp_xmit_rtt_autosize_start (a new TCP property, default 20)
RTT samples, that is, once it is statistically useful.
* Receive-side RTT measurement
Besides the existing send-side RTT measurement, we will add the
receive-side RTT measurement as well. We will get the
receive-side RTT samples by observing:
- the time taken for a timestamp to be reflected back from
the peer, if the timestamps option is enabled;
- the time between the acknowledgment of sequence number S,
which announces receive window W, and the receipt of a data
segment that contains a sequence number of at least S + W
+ 1.
Since the above receive-side RTT sampling acts only as an
upper-bound on the RTT, each time we get a sample, the
receive-side RTT will be updated to be the minimum of the old
receive-side RTT value and the current RTT measurement.
Both the receive-side RTT and the send-side RTT will be used as
sources of the RTT estimate when they are available. When both
are available, the final RTT will be the average of the two.
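The timestamp conversion and the RTT rules above can be sketched
roughly as follows. This is only an illustration, not the project's
code: the names hrtime_to_tcp_ts(), rtt_state_t, rx_rtt_sample() and
rtt_estimate() are hypothetical, the nanosecond input stands in for
a gethrtime() value, and returning 0 when no RTT source is available
yet is an assumption of the sketch.

```c
#include <assert.h>
#include <stdint.h>

#define	TS_SHIFT	20	/* one timestamp tick = 2^20 ns, about 1.05ms */

/* Hypothetical helper: nanoseconds (as from gethrtime()) to a 32-bit tick. */
static uint32_t
hrtime_to_tcp_ts(int64_t hrtime_ns)
{
	assert(hrtime_ns >= 0);
	/* The cast keeps only the low 32 bits, so the tick value wraps. */
	return ((uint32_t)((uint64_t)hrtime_ns >> TS_SHIFT));
}

typedef struct rtt_state {
	int		rx_valid;	/* nonzero once a receive-side sample exists */
	uint32_t	rx_rtt;		/* receive-side RTT, in ticks */
	uint32_t	tx_rtt;		/* averaged send-side RTT, in ticks */
	uint32_t	tx_samples;	/* number of send-side samples so far */
} rtt_state_t;

/* Each receive-side sample is an upper bound, so keep the minimum. */
static void
rx_rtt_sample(rtt_state_t *rs, uint32_t sample)
{
	if (!rs->rx_valid || sample < rs->rx_rtt) {
		rs->rx_rtt = sample;
		rs->rx_valid = 1;
	}
}

/*
 * Combine the available sources. The send-side value counts only once
 * there are more than tx_start samples (tcp_xmit_rtt_autosize_start).
 * Returns 0 if no source is available (an assumption of this sketch).
 */
static uint32_t
rtt_estimate(const rtt_state_t *rs, uint32_t tx_start)
{
	int tx_ok = (rs->tx_samples > tx_start);

	if (rs->rx_valid && tx_ok)
		return ((uint32_t)(((uint64_t)rs->rx_rtt + rs->tx_rtt) / 2));
	if (rs->rx_valid)
		return (rs->rx_rtt);
	return (tx_ok ? rs->tx_rtt : 0);
}
```

With a 20-bit shift, one second of high-resolution time maps to 953
ticks, and the 32-bit tick value wraps after roughly 52 days.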
- Receive buffer auto-sizing
At the start of the connection, if receive buffer auto-sizing
is enabled for this connection, the receive buffer will be
initialized to tcp_recv_autosize_initmss * MSS
(tcp_recv_autosize_initmss is another new TCP property, default
15). This is an attempt to ensure that the sender is not
receive-window limited during the first RTT.
Then, each time a packet is received, the current time will be
compared to the last measurement time for that connection. If
more than the current estimated RTT has passed, the highest
sequence number seen is compared to the next sequence number
expected at the beginning of the measurement (the BDP
measurement). The receive buffer size for this connection is then
updated to advertise a receive window (rwnd) that is 3.5 times
the sequence number difference.
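As a rough sketch of the per-packet measurement described above
(not the actual implementation), the update could look like the
following. The names drs_state_t, drs_init() and drs_on_receive()
are hypothetical. Clamping to a maximum such as tcp_max_buf follows
the property description in this proposal, while never shrinking the
buffer is purely an assumption of the sketch, since the text above
does not say whether the buffer may shrink.

```c
#include <assert.h>
#include <stdint.h>

typedef struct drs_state {
	uint32_t	rcvbuf;		/* current receive buffer size, bytes */
	uint32_t	max_rcvbuf;	/* upper bound, e.g. tcp_max_buf */
	uint32_t	meas_ts;	/* time the current measurement began */
	uint32_t	meas_seq;	/* next expected seq at that time */
} drs_state_t;

/* Start with initmss * MSS (initmss is tcp_recv_autosize_initmss). */
static void
drs_init(drs_state_t *ds, uint32_t mss, uint32_t initmss,
    uint32_t max_rcvbuf, uint32_t now, uint32_t rcv_nxt)
{
	uint64_t init = (uint64_t)initmss * mss;

	assert(mss > 0);
	ds->max_rcvbuf = max_rcvbuf;
	ds->rcvbuf = (init > max_rcvbuf) ? max_rcvbuf : (uint32_t)init;
	ds->meas_ts = now;
	ds->meas_seq = rcv_nxt;
}

/* Called for every received packet; now and rtt use the same time unit. */
static void
drs_on_receive(drs_state_t *ds, uint32_t now, uint32_t rtt,
    uint32_t highest_seq)
{
	uint32_t bytes;
	uint64_t want;

	/* Unsigned subtraction stays correct across clock wrap-around. */
	if ((uint32_t)(now - ds->meas_ts) <= rtt)
		return;			/* less than one RTT has passed */

	bytes = highest_seq - ds->meas_seq;	/* modular sequence arithmetic */
	want = ((uint64_t)bytes * 7) / 2;	/* 3.5 times the data seen */
	if (want > ds->max_rcvbuf)
		want = ds->max_rcvbuf;
	if (want > ds->rcvbuf)			/* assumption: grow only */
		ds->rcvbuf = (uint32_t)want;

	ds->meas_ts = now;			/* begin the next measurement */
	ds->meas_seq = highest_seq;
}
```

For example, with an MSS of 1460 and the default of 15, the initial
buffer is 21900 bytes; if 20000 bytes then arrive within one
measurement interval, the buffer grows to 70000 bytes.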
- Window Scaling and loopback connections
Window scaling is necessary for the receive buffer auto-sizing
algorithm in order to advertise a large receive window. Because
the window scale is negotiated when a connection is set up and
cannot be changed afterwards, it must be carefully chosen to be
the smallest window scale value that can represent the maximum
receive buffer size (defined by tcp_max_buf).
If the window scaling negotiation fails, the TCP buffer
auto-sizing will be disabled over this specific connection.
In the case of loopback TCP connections, thread scheduling affects
performance more than the receive buffer size does. To avoid
unnecessary code complexity (especially in the TCP fusion code
path), receive buffer auto-sizing is disabled for loopback
connections as well.
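Choosing the smallest window scale that can represent the maximum
receive buffer size can be sketched as below. The function and macro
names are hypothetical; the 16-bit window field and the maximum shift
count of 14 come from RFC 1323[2].

```c
#include <assert.h>
#include <stdint.h>

#define	SKETCH_MAXWIN		65535u	/* largest unscaled 16-bit window */
#define	SKETCH_MAX_WSCALE	14	/* largest shift allowed by RFC 1323 */

/*
 * Return the smallest shift ws such that (65535 << ws) >= max_buf,
 * capped at 14. max_buf stands in for the tcp_max_buf property.
 */
static int
pick_wscale(uint32_t max_buf)
{
	int ws = 0;

	/* 65535 << 14 still fits in 32 bits, so the shift cannot overflow. */
	while (ws < SKETCH_MAX_WSCALE && (SKETCH_MAXWIN << ws) < max_buf)
		ws++;
	assert(ws >= 0 && ws <= SKETCH_MAX_WSCALE);
	return (ws);
}
```

With a maximum buffer of 1048576 bytes (1MB), this yields a window
scale of 5, since 65535 << 4 is just below 1MB.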
Interface Changes
=================
- Relevant TCP/IP protocol properties
1. tcp_recvbuf_autosize (new)
A new tcp_recvbuf_autosize property will be introduced. The
currently possible values are "off" and "drs". In the
future, other values may be introduced when more auto-sizing
algorithms are implemented.
If tcp_recvbuf_autosize is set to "drs", DRS receive
buffer auto-sizing will be attempted on all non-loopback
connections; if it is set to "off", auto-sizing will be
disabled.
By default, this property will be set to "drs".
As with other ipadm properties, the 'sys_ip_config'
privilege is required to configure this property.
2. tcp_recv_autosize_initmss (new)
If the receive buffer auto-sizing is enabled, a TCP
connection's initial receive buffer size will be set as
tcp_recv_autosize_initmss * MSS.
The default value of tcp_recv_autosize_initmss will be set to
15.
3. tcp_xmit_rtt_autosize_start (new)
As discussed above, the send-side RTT estimate only becomes
statistically useful when there are enough samples: once there
are more than tcp_xmit_rtt_autosize_start send-side RTT
measurement samples, we will start to use the averaged
send-side RTT as one source of the RTT estimate for the
connection.
The default value of tcp_xmit_rtt_autosize_start will be set
to 20.
All three new properties will be Consolidation Private
properties for now, and are expected to be used only for
diagnostic purposes. They will be made public only if they prove
to be useful for customers.
4. tcp_max_buf
The existing tcp_max_buf property will be used as the maximum
receive buffer size that can be set (or auto-sized to). The
same value will be used to determine the window scale value of
the connection if the auto-sizing is enabled.
5. recv_maxbuf
This is an existing property that is common to the
transport protocols. For TCP specifically, today it
defines the default receive buffer size for a
connection. After this project is integrated, if auto-sizing
is enabled for a specific TCP connection, the recv_maxbuf
property will become irrelevant, since the receive buffer size
will be automatically adjusted based on network conditions.
- Socket options
1. SO_RCVBUF
In today's Solaris, this socket option is used to set/get the
current receive buffer size of a specific TCP connection. With
this project, the receive buffer size will change
dynamically during the lifetime of the connection when
auto-sizing is enabled.
The semantics of SO_RCVBUF will therefore change. If
auto-sizing is disabled, the SO_RCVBUF socket option will still
be used to set/get the current receive buffer size. If
auto-sizing is enabled, the SO_RCVBUF socket option will instead
be used to set the maximum receive buffer size that the buffer
can be auto-sized up to for a connection.
2. TCP_RCVBUF_AUTOSIZE (new)
A new IPPROTO_TCP level TCP_RCVBUF_AUTOSIZE socket option will
be introduced to enable or disable the receive buffer
auto-sizing for a specific TCP connection. It can be set to
either "TCP_AUTOSIZE_OFF" (disable) or "TCP_AUTOSIZE_DRS"
(enable the DRS algorithm) for now.
When the TCP_RCVBUF_AUTOSIZE option is used to disable receive
buffer auto-sizing, the current receive buffer size will be
set to the value specified by the recv_maxbuf property, the
same as the current Solaris behavior.
When the TCP_RCVBUF_AUTOSIZE option is used to enable receive
buffer auto-sizing, the maximum receive buffer size will be
set to the value specified by the tcp_max_buf property, and
the current receive buffer size will stay the same and will be
automatically adjusted based on the current network condition.
3. TCP_CUR_RCVBUF (new)
A new IPPROTO_TCP level TCP_CUR_RCVBUF socket option will be
introduced to get the current receive buffer size of a
specific TCP connection. This option will be read-only and
its return value may change during the lifetime of a TCP
connection.
Both the TCP_RCVBUF_AUTOSIZE and TCP_CUR_RCVBUF socket options
will be Consolidation Private interfaces for now and will be made
public only if they prove to be useful for customers.
- Observability
1. pfiles(1)
pfiles(1) will be extended to show whether receive buffer
auto-sizing is enabled and the current receive buffer size
for a TCP connection. We will also make several other
minor changes to the output format so that it reads more
clearly:
   9: S_IFSOCK mode:0666 dev:337,0 ino:10285 uid:0 gid:0 size:0
      O_RDWR|O_NONBLOCK
        SOCK_STREAM
        send buffer: 49152 bytes
        receive buffer: 21720 bytes (maximum: 1048576 bytes)
        auto-size: enabled
        sockname: AF_INET 129.146.104.83 port: 65506
        peername: AF_INET 192.18.34.10 port: 5001
2. DTrace probes
sdt DTrace probes will be provided to observe the RTT
measurement process, the BDP calculation process and the
current receive buffer size.
Interface Table
===============
+-----------------------------+-----------------------+----------------+
| Interface | Stability | Description |
+-----------------------------+-----------------------+----------------+
| TCP_RCVBUF_AUTOSIZE | Consolidation Private | socket option |
| TCP_CUR_RCVBUF | Consolidation Private | socket option |
+-----------------------------+-----------------------+----------------+
| tcp_recvbuf_autosize | Consolidation Private | ipadm property |
| tcp_recv_autosize_initmss | Consolidation Private | ipadm property |
| tcp_xmit_rtt_autosize_start | Consolidation Private | ipadm property |
+-----------------------------+-----------------------+----------------+
| sdt probes for auto-sizing | Project Private | DTrace probes |
+-----------------------------+-----------------------+----------------+
References
==========
[1] M. Fisk, W. Feng, "Dynamic Right-Sizing in TCP", Oct. 2001.
[2] "TCP Extensions for High Performance", RFC 1323.
[3] Receive buffer auto-sizing design document.