>I want to state again that using 536 bytes as a magic value makes no
    sense to me.

 >autotuning might be delayed by one RTT, this does not match numbers
 >given by Mohamed (flows stuck in low speed)

  >autotuning is an heuristic, and because it has one RTT latency, it is
   >crucial to get proper initial rcvmem values.

   >People using MTU=9000 should know they have to tune tcp_rmem[1]
   >accordingly, especially when using drivers consuming one page per
   >+incoming MSS.



The magic number would be 10*rcv_mss=5360 not 536 and in my opinion it's a big 
amount of data to be sent in security attack so if we are talking about DDos 
attack triggering Autotuning at 5360 bytes I'd say he will also be able to 
trigger it sending 64KB but I totally agree that it would be easier with lower 
rcvq_space.space, it's always a tradeoff between security and performance.

Other options would be to either consider the configured MTU in the rcv_wnd 
calculation or probably check the MTU before calculating the initial rcvspace. 
We have to make sure that initial receive space is lower than initial receive 
window so Autotuning would work regardless the configured MTU on the receiver 
and only people using Jumbo frames will be paying the price if we agreed that 
it's expected for Jumbo frame users to have machines with more memory,  I'd say 
something as below should work:

void tcp_init_buffer_space(struct sock *sk)
{
        int tcp_app_win = sock_net(sk)->ipv4.sysctl_tcp_app_win;
        struct inet_connection_sock *icsk = inet_csk(sk);
        struct tcp_sock *tp = tcp_sk(sk);
        int maxwin;

        if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK))
                tcp_sndbuf_expand(sk);
        if(tp->advmss < 6000)
                tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * 
tp->advmss);
        else
                tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * 
icsk->icsk_ack.rcv_mss);
        tcp_mstamp_refresh(tp);
        tp->rcvq_space.time = tp->tcp_mstamp;
        tp->rcvq_space.seq = tp->copied_seq;



I don't think that we should rely on Admins manually tuning this tcp_rmem[1] 
with Jumbo frame in use also Linux users shouldn't expect performance 
degradation after kernel upgrade. although [1] is the only public reporting of 
this issue, I am pretty sure we will see more users reporting this with Linux 
Main distributions moving to kernel 5.4 as stable version. In Summary we should 
come up with something either the proposed patch or something else to avoid 
admins doing the manual job.


Links

[1] https://github.com/kubernetes/kops/issues/10206

On 07/12/2020, 17:08, "Eric Dumazet" <[email protected]> wrote:

    CAUTION: This email originated from outside of the organization. Do not 
click links or open attachments unless you can confirm the sender and know the 
content is safe.



    On Mon, Dec 7, 2020 at 5:34 PM Neal Cardwell <[email protected]> wrote:
    >
    > On Mon, Dec 7, 2020 at 11:23 AM Eric Dumazet <[email protected]> wrote:
    > >
    > > On Mon, Dec 7, 2020 at 5:09 PM Mohamed Abuelfotoh, Hazem
    > > <[email protected]> wrote:
    > > >
    > > >     >Since I can not reproduce this problem with another NIC on x86, I
    > > >     >really wonder if this is not an issue with ENA driver on PowerPC
    > > >     >perhaps ?
    > > >
    > > >
    > > > I am able to reproduce it on x86 based EC2 instances using ENA  or  
Xen netfront or Intel ixgbevf driver on the receiver so it's not specific to 
ENA, we were able to easily reproduce it between 2 VMs running in virtual box 
on the same physical host considering the environment requirements I mentioned 
in my first e-mail.
    > > >
    > > > What's the RTT between the sender & receiver in your reproduction? 
Are you using bbr on the sender side?
    > >
    > >
    > > 100ms RTT
    > >
    > > Which exact version of linux kernel are you using ?
    >
    > Thanks for testing this, Eric. Would you be able to share the MTU
    > config commands you used, and the tcpdump traces you get? I'm
    > surprised that receive buffer autotuning would work for advmss of
    > around 6500 or higher.

    autotuning might be delayed by one RTT, this does not match numbers
    given by Mohamed (flows stuck in low speed)

    autotuning is an heuristic, and because it has one RTT latency, it is
    crucial to get proper initial rcvmem values.

    People using MTU=9000 should know they have to tune tcp_rmem[1]
    accordingly, especially when using drivers consuming one page per
    incoming MSS.


    (mlx4 driver only uses ome 2048 bytes fragment for a 1500 MTU packet.
    even with MTU set to 9000)

    I want to state again that using 536 bytes as a magic value makes no
    sense to me.


    For the record, Google has increased tcp_rmem[1] when switching to a bigger 
MTU.

    The reason is simple : If we intend to receive 10 MSS, we should allow
    for 90000 bytes of payload, or tcp_rmem[1] set to 180,000
    Because of autotuning latency, doubling the value is advised : 360000

    Another problem with kicking autotuning too fast is that it might
    allow bigger sk->sk_rcvbuf values even for small flows, opening more
    surface to malicious attacks.

    I _think_ that if we want to allow admins to set high MTU without
    having to tune tcp_rmem[], we need something different than current
    proposal.




Amazon Web Services EMEA SARL, 38 avenue John F. Kennedy, L-1855 Luxembourg, 
R.C.S. Luxembourg B186284

Amazon Web Services EMEA SARL, Irish Branch, One Burlington Plaza, Burlington 
Road, Dublin 4, Ireland, branch registration number 908705


Reply via email to