Dear colleagues,

I am writing my master's thesis in a project that uses the raw API of
lwIP 2.0.3. Although my implementation works, I would like to understand
a certain interaction between the Nagle algorithm and the way I call (or
don't call) tcp_output, and I am not quite sure what is happening.

When a TCP write is requested, the sender function is invoked as
sender(data[], size, send_now). Its pseudocode is:
----------------------------------------------------------------------
sender(data, size, send_now)
{
    left = size        // bytes still to be queued
    pos  = 0           // current read position in data[]
    err  = ERR_OK      // result of the last tcp_write attempt

    while (left > 0) {
        do {
            available = tcp_sndbuf(pcb)  // free space in the send buffer

            if ((left <= available) && (available > 0)) {
                // everything fits: queue it and optionally push it out
                err = tcp_write(pcb, &data[pos], left, TCP_WRITE_FLAG_COPY)
                if (err == ERR_OK) {
                    left = 0
                    if (send_now) {
                        tcp_output(pcb)
                    }
                } else { // err == ERR_MEM
                    block until the tcp_sent callback signals that sent
                    data was ACKed
                }
            } else if ((left > available) && (available > 0)) {
                // queue what fits and mark that more data follows
                err = tcp_write(pcb, &data[pos], available,
                                TCP_WRITE_FLAG_MORE | TCP_WRITE_FLAG_COPY)
                if (err == ERR_OK) {
                    left = left - available
                    pos  = pos + available
                } else { // err == ERR_MEM
                    block until the tcp_sent callback signals that sent
                    data was ACKed
                }
            } else { // available == 0
                block until the tcp_sent callback signals that sent
                data was ACKed
            }
        } while (err == ERR_MEM)
    }
}
----------------------------------------------------------------------
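
The "block until ..." steps rely on the tcp_sent callback. For context,
here is a minimal sketch of how that callback is wired with the raw API
(tx_wakeup() is a hypothetical OS primitive, e.g. a semaphore post, that
unblocks the sender() task):

----------------------------------------------------------------------
#include "lwip/tcp.h"

/* Hypothetical OS primitive that unblocks the sender() task so it
   retries tcp_write(). */
extern void tx_wakeup(void);

/* Invoked by lwIP when previously sent data has been ACKed. */
static err_t on_sent(void *arg, struct tcp_pcb *tpcb, u16_t len)
{
    LWIP_UNUSED_ARG(arg);
    LWIP_UNUSED_ARG(tpcb);
    LWIP_UNUSED_ARG(len);
    tx_wakeup();
    return ERR_OK;
}

/* Registered once, e.g. right after tcp_new()/tcp_connect(): */
static void setup_tx_callbacks(struct tcp_pcb *pcb)
{
    tcp_arg(pcb, NULL);
    tcp_sent(pcb, on_sent);
}
----------------------------------------------------------------------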

Initially I didn't have a send_now option, so I was always calling
tcp_output whenever there was enough space available to queue all the
remaining data. The option was requested because the team writing the
application needs two specific TCP segments to be sent together, even
when they call sender twice in a row. Therefore, I have to give the
application control over when tcp_output is called: with the Nagle
algorithm enabled, a segment is still sent immediately "if the window
size >= MSS and available data is >= MSS" [1], or if there is no
unconfirmed data still in the pipe. The application hits the second
case: it sends data smaller than the MSS while there is no unconfirmed
data in flight.
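
In code form, the rule from [1] is roughly the following (a simplified
sketch with abstract inputs, not lwIP's actual implementation, which
lives in tcp_output() and is more involved):

----------------------------------------------------------------------
#include "lwip/arch.h"

/* Simplified paraphrase of Nagle's decision rule from [1].
   All inputs are abstract placeholders, not lwIP state. */
static int nagle_send_now(u32_t available, u32_t window,
                          u32_t mss, int unacked_in_flight)
{
    if ((available >= mss) && (window >= mss)) {
        return 1; /* a full-sized segment can go out immediately */
    }
    if (!unacked_in_flight) {
        return 1; /* pipe is empty: even a small segment goes out */
    }
    return 0;     /* otherwise: queue and wait for an ACK */
}
----------------------------------------------------------------------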

Then I added the send_now control option, letting tcp_output be called
by lwIP itself when send_now = 0. I've searched all the references to
tcp_output in the lwIP code. From what I understood, leaving aside
retransmissions, connect/close, etc., tcp_output is called from the TCP
slow timer and at the end of tcp_input (with the comment /* Try to send
something out. */). That makes sense: whenever an ACK is received, lwIP
tries to flush the TCP TX queue.
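
For completeness, the raw API also provides tcp_poll() as a periodic
hook driven by that same slow timer; a minimal sketch of flushing from
there (the interval is counted in slow-timer ticks, ~500 ms each by
default, so 2 means roughly once per second):

----------------------------------------------------------------------
#include "lwip/tcp.h"

/* Periodic poll callback: just tries to flush whatever is queued,
   so data eventually goes out even when send_now stays 0. */
static err_t on_poll(void *arg, struct tcp_pcb *tpcb)
{
    LWIP_UNUSED_ARG(arg);
    tcp_output(tpcb);
    return ERR_OK;
}

static void setup_poll(struct tcp_pcb *pcb)
{
    tcp_poll(pcb, on_poll, 2); /* every 2 slow-timer ticks (~1 s) */
}
----------------------------------------------------------------------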

OK, this strategy seems to work fine for our purposes. But I would like
to understand the behavior during throughput tests. In the throughput
test, the application is a client: it connects to a TCP server and sends
a defined amount of data every 1 ms period. This amount of data is
derived from the throughput setpoint I configure for the test. I ran the
test for the 4 scenarios s = (Nagle, send_now), each with 6 different
throughput setpoints, i.e. 24 tests. The server runs on a PC and is
implemented with Python sockets. The nodes are directly connected. The
network layer is IPv6, so I configure the MSS as 1500 (Ethernet MTU) -
40 (IPv6 header) - 20 (TCP header) = 1440 bytes.
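
For reference, this is roughly how the two knobs are set (TCP_MSS goes
in lwipopts.h; the per-pcb Nagle toggle uses lwIP's tcp_nagle_enable()/
tcp_nagle_disable() macros; configure_nagle is just my illustrative
helper name):

----------------------------------------------------------------------
/* --- lwipopts.h --- */
/* MSS sized for IPv6 over Ethernet:
   1500 (MTU) - 40 (IPv6 header) - 20 (TCP header) = 1440 */
#define TCP_MSS 1440

/* --- application code --- */
#include "lwip/tcp.h"

/* Per-scenario Nagle toggle using lwIP's per-pcb macros. */
static void configure_nagle(struct tcp_pcb *pcb, int nagle_on)
{
    if (nagle_on) {
        tcp_nagle_enable(pcb);  /* Nagle active (lwIP default) */
    } else {
        tcp_nagle_disable(pcb); /* sets TF_NODELAY on this pcb */
    }
}
----------------------------------------------------------------------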


----------------------------------------------------------------------
Test summary - task period = 1 ms - MSS = 1440 (IPv6)
----------------------------------------------------------------------

throughput_setpoint = 1 Mbps (125 bytes / period)
test_id = 1 : s = (0,0) -- throughput below setpoint and floating, RTT between 170 ms and 200 ms
test_id = 2 : s = (0,1) -- throughput_measured = throughput_setpoint, stable, RTT close to 1 ms
test_id = 3 : s = (1,0) -- throughput below setpoint and floating, RTT between 170 ms and 200 ms
test_id = 4 : s = (1,1) -- throughput_measured = throughput_setpoint, stable

throughput_setpoint = 10 Mbps (1250 bytes / period)
test_id = 5 : s = (0,0) -- throughput below setpoint and floating, RTT between 1 ms and 200 ms (a bit better)
test_id = 6 : s = (0,1) -- throughput_measured = throughput_setpoint, stable, RTT close to 1 ms
test_id = 7 : s = (1,0) -- throughput below setpoint and floating, RTT between 1 ms and 200 ms (a bit better)
test_id = 8 : s = (1,1) -- throughput_measured = throughput_setpoint, stable, RTT close to 1 ms

throughput_setpoint = 25 Mbps (3125 bytes / period)
test_id = 9 : s = (0,0) -- throughput below setpoint and floating, RTT between 1 ms and 200 ms (a bit better)
test_id = 10 : s = (0,1) -- throughput_measured = throughput_setpoint, stable, RTT close to 1 ms
test_id = 11 : s = (1,0) -- throughput below setpoint and floating, RTT between 1 ms and 200 ms (a bit better)
test_id = 12 : s = (1,1) -- throughput_measured = throughput_setpoint, stable, RTT close to 1 ms

throughput_setpoint = 35 Mbps (4375 bytes / period)
test_id = 13 : s = (0,0) -- throughput_measured = throughput_setpoint, stable, RTT close to 1 ms (so, for this amount of data the RTT drops and the setpoint is reached)
test_id = 14 : s = (0,1) -- throughput_measured = throughput_setpoint, stable, RTT close to 1 ms
test_id = 15 : s = (1,0) -- throughput below setpoint and floating, RTT between 1 ms and 200 ms (a bit better)
test_id = 16 : s = (1,1) -- throughput_measured = throughput_setpoint, stable, RTT close to 1 ms

throughput_setpoint = 45 Mbps (5625 bytes / period)
test_id = 17 : s = (0,0) -- throughput_measured = throughput_setpoint, stable, RTT close to 1 ms (so, for this amount of data the RTT drops and the setpoint is reached)
test_id = 18 : s = (0,1) -- throughput_measured = throughput_setpoint, stable, RTT close to 1 ms
test_id = 19 : s = (1,0) -- throughput below setpoint and floating, RTT between 1 ms and 200 ms (a bit better)
test_id = 20 : s = (1,1) -- throughput_measured = throughput_setpoint, stable, RTT close to 1 ms

throughput_setpoint = 49 Mbps (6125 bytes / period)
test_id = 21 : s = (0,0) -- throughput_measured = throughput_setpoint, stable, RTT close to 1 ms (so, for this amount of data the RTT drops and the setpoint is reached)
test_id = 22 : s = (0,1) -- throughput_measured = throughput_setpoint, stable, RTT close to 1 ms
test_id = 23 : s = (1,0) -- throughput_measured = throughput_setpoint, stable, RTT close to 1 ms (so, for this amount of data the RTT drops and the setpoint is reached)
test_id = 24 : s = (1,1) -- throughput_measured = throughput_setpoint, stable, RTT close to 1 ms
----------------------------------------------------------------------

For the (0,1) and (1,1) cases (send_now always 1):
It doesn't matter whether Nagle is on or off; by Wireshark measurements
I always reach the throughput setpoint, up to 49 Mbps. That is the
maximum value we can achieve, because the OS is message-passing based
and we've limited the maximum message length in the OS, thereby limiting
how much data the application can enqueue.

For the (0,0) cases (Nagle = 0 and send_now = 0):
If I write 3125 or more bytes per period (tests 13, 17 and 21), then the
RTT decreases and the setpoint is reached. I don't understand why things
change at exactly this point. I thought it could be related to delayed
ACKs from the server, so I changed the advertised window size (on the
server side) to smaller values and also set the socket's TCP_NODELAY
option to 1, but the overall behavior stayed the same.

For the (1,0) cases (Nagle = 1 and send_now = 0):
The behavior is similar to (0,0): the RTT starts very high and improves
as more data is sent per period, but it only gets back to around 1 ms in
test_id = 23, when we send 6125 bytes per period. Obviously, the
throughput suffers from the RTT.

Therefore, I would like to understand what causes the RTT to reach such
high values when send_now is 0, and then to drop dramatically once a
certain amount of data is sent per period. What else could it be, if not
delayed ACKs? And why do we see the change from test_id = 13 onwards
(for Nagle = 0) but only at test_id = 23 (for Nagle = 1)? For what it's
worth, the turning points correspond to writing at least two full-sized
segments per period with Nagle off (3125 bytes ~ 2.17 x 1440) and at
least four with Nagle on (6125 bytes ~ 4.25 x 1440).

--------
Attachments:
Can be downloaded from: https://github.com/vitorroriz/lwip-tests
* lwipopts.h
* test_description_table: summarizes the test configurations.
* test_wireshark_tracefiles:
Wireshark trace files named testX_Na_SNb_Ty, where X is the test ID,
"Na" is N0 (Nagle off) or N1 (Nagle on), "SNb" is SN0 (send_now = 0) or
SN1 (send_now = 1), and "Ty" is the throughput setpoint y. The Toyota
device is the client in the trace files.
--------
Refs:
 [1]  https://en.wikipedia.org/wiki/Nagle%27s_algorithm


Sorry for the long email, but I think this set of tests with behavior
analysis can be quite useful to future developers, since complete
throughput tests are not easy to find in the forums.
Thank you very much!

Kind regards,
Vitor