Hi Andre,

On 04/24/18 10:01 AM, Dave Watson wrote:
> On 04/22/18 11:21 PM, Andre Tomt wrote:
> > The kernel seems to get increasingly unstable as I load it up with client
> > connections. At about 9Gbps and 700 connections, it is okay at least for a
> > while - it might run fine for say 45 minutes. Once it gets to 20 - 30Gbps,
> > the kernel will usually start spewing OOPSes within minutes and the traffic
> > drops.
> >
> > Some bad interaction between mlx4 and kTLS?

I tried to repro, but wasn't able to - of course I don't have an mlx4
test setup. If I manually add a tls_write_space call after
do_tcp_sendpages, I get a similar stack though.

Something like the following should work, can you test?

Thanks

diff --git a/include/net/tls.h b/include/net/tls.h
index 8c56809..ee78f33 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -187,6 +187,7 @@ struct tls_context {
 	struct scatterlist *partially_sent_record;
 	u16 partially_sent_offset;
 	unsigned long flags;
+	bool in_tcp_sendpages;
 
 	u16 pending_open_record_frags;
 	int (*push_pending_record)(struct sock *sk, int flags);
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 3aafb87..095af65 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -114,6 +114,7 @@ int tls_push_sg(struct sock *sk,
 	size = sg->length - offset;
 	offset += sg->offset;
 
+	ctx->in_tcp_sendpages = 1;
 	while (1) {
 		if (sg_is_last(sg))
 			sendpage_flags = flags;
@@ -148,6 +149,8 @@ int tls_push_sg(struct sock *sk,
 	}
 
 	clear_bit(TLS_PENDING_CLOSED_RECORD, &ctx->flags);
+	ctx->in_tcp_sendpages = 0;
+	ctx->sk_write_space(sk);
 
 	return 0;
 }
@@ -217,6 +220,9 @@ static void tls_write_space(struct sock *sk)
 {
 	struct tls_context *ctx = tls_get_ctx(sk);
 
+	if (ctx->in_tcp_sendpages)
+		return;
+
 	if (!sk->sk_write_pending && tls_is_pending_closed_record(ctx)) {
 		gfp_t sk_allocation = sk->sk_allocation;
 		int rc;
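
The guard added by the patch can be illustrated outside the kernel. What follows is only a minimal userspace sketch, not the kernel code: `struct demo_ctx`, `demo_push`, `demo_transport_send` and `demo_write_space` are hypothetical stand-ins for `struct tls_context`, `tls_push_sg`, `do_tcp_sendpages` and `tls_write_space`. It assumes the problem is the write-space callback re-entering the push path while a send is still in progress, which is what the manual reproduction above suggests.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct tls_context; not the kernel type. */
struct demo_ctx {
	bool in_send;   /* plays the role of in_tcp_sendpages */
	bool pending;   /* a record is still waiting to be pushed */
	int depth;      /* current nesting depth of demo_push() */
	int max_depth;  /* deepest nesting observed so far */
	int wakeups;    /* calls forwarded to the saved write-space callback */
};

static void demo_write_space(struct demo_ctx *ctx);

/* Stand-in for do_tcp_sendpages(): it may signal write space mid-send. */
static void demo_transport_send(struct demo_ctx *ctx)
{
	demo_write_space(ctx);
}

/* Stand-in for tls_push_sg(): sets the guard around the transport call. */
static int demo_push(struct demo_ctx *ctx)
{
	ctx->depth++;
	if (ctx->depth > ctx->max_depth)
		ctx->max_depth = ctx->depth;

	ctx->in_send = true;          /* guard set before sending */
	demo_transport_send(ctx);     /* nested write-space call is ignored */
	ctx->pending = false;
	ctx->in_send = false;         /* guard cleared after sending */

	/* Deliver the notification that was suppressed while sending. */
	ctx->wakeups++;

	ctx->depth--;
	return 0;
}

/* Stand-in for tls_write_space(): bail out if a send is in progress. */
static void demo_write_space(struct demo_ctx *ctx)
{
	if (ctx->in_send)
		return;

	if (ctx->pending)
		demo_push(ctx);
	else
		ctx->wakeups++;
}
```

Calling `demo_write_space()` on a context with `pending` set enters `demo_push()` once; the nested callback from `demo_transport_send()` returns early instead of pushing again, and the deferred notification is delivered after the flag is cleared.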