On Fri, Aug 15, 2025 at 1:40 AM <chia-yu.ch...@nokia-bell-labs.com> wrote: > > From: Ilpo Järvinen <i...@kernel.org> > > The Accurate ECN allows echoing back the sum of bytes for > each IP ECN field value in the received packets using > AccECN option. This change implements AccECN option tx & rx > side processing without option send control related features > that are added by a later change. > > Based on specification: > https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt > (Some features of the spec will be added in the later changes > rather than in this one). > > A full-length AccECN option is always attempted but if it does > not fit, the minimum length is selected based on the counters > that have changed since the last update. The AccECN option > (with 24-bit fields) often ends in odd sizes so the option > write code tries to take advantage of some nop used to pad > the other TCP options. > > The delivered_ecn_bytes pairs with received_ecn_bytes similar > to how delivered_ce pairs with received_ce. In contrast to > ACE field, however, the option is not always available to update > delivered_ecn_bytes. For ACK w/o AccECN option, the delivered > bytes calculated based on the cumulative ACK+SACK information > are assigned to one of the counters using an estimation > heuristic to select the most likely ECN byte counter. Any > estimation error is corrected when the next AccECN option > arrives. It may occur that the heuristic gets too confused > when there are enough different byte counter deltas between > ACKs with the AccECN option in which case the heuristic just > gives up on updating the counters for a while. > > tcp_ecn_option sysctl can be used to select option sending > mode for AccECN: TCP_ECN_OPTION_DISABLED, TCP_ECN_OPTION_MINIMUM, > and TCP_ECN_OPTION_FULL. > > This patch increases the size of tcp_info struct, as there is > no existing holes for new u32 variables. Below are the pahole > outcomes before and after this patch: > > [BEFORE THIS PATCH] > struct tcp_info { > [...] > __u32 tcpi_total_rto_time; /* 244 4 */ > > /* size: 248, cachelines: 4, members: 61 */ > } > > [AFTER THIS PATCH] > struct tcp_info { > [...] > __u32 tcpi_total_rto_time; /* 244 4 */ > __u32 tcpi_received_ce; /* 248 4 */ > __u32 tcpi_delivered_e1_bytes; /* 252 4 */ > __u32 tcpi_delivered_e0_bytes; /* 256 4 */ > __u32 tcpi_delivered_ce_bytes; /* 260 4 */ > __u32 tcpi_received_e1_bytes; /* 264 4 */ > __u32 tcpi_received_e0_bytes; /* 268 4 */ > __u32 tcpi_received_ce_bytes; /* 272 4 */ > > /* size: 280, cachelines: 5, members: 68 */ > } > > This patch uses the existing 1-byte holes in the tcp_sock_write_txrx > group for new u8 members, but adds a 4-byte hole in tcp_sock_write_rx > group after the new u32 delivered_ecn_bytes[3] member. Therefore, the > group size of tcp_sock_write_rx is increased from 96 to 112. Below > are the pahole outcomes before and after this patch: > > [BEFORE THIS PATCH] > struct tcp_sock { > [...] > u8 received_ce_pending:4; /* 2522: 0 1 */ > u8 unused2:4; /* 2522: 4 1 */ > /* XXX 1 byte hole, try to pack */ > > [...] > u32 rcv_rtt_last_tsecr; /* 2668 4 */ > > [...] > __cacheline_group_end__tcp_sock_write_rx[0]; /* 2728 0 */ > > [...] > /* size: 3200, cachelines: 50, members: 167 */ > } > > [AFTER THIS PATCH] > struct tcp_sock { > [...] > u8 received_ce_pending:4;/* 2522: 0 1 */ > u8 unused2:4; /* 2522: 4 1 */ > u8 accecn_minlen:2; /* 2523: 0 1 */ > u8 est_ecnfield:2; /* 2523: 2 1 */ > u8 unused3:4; /* 2523: 4 1 */ > > [...] > u32 rcv_rtt_last_tsecr; /* 2668 4 */ > u32 delivered_ecn_bytes[3];/* 2672 12 */ > /* XXX 4 bytes hole, try to pack */ > > [...] > __cacheline_group_end__tcp_sock_write_rx[0]; /* 2744 0 */ > > [...] > /* size: 3200, cachelines: 50, members: 171 */ > } > > Signed-off-by: Ilpo Järvinen <i...@kernel.org> > Signed-off-by: Neal Cardwell <ncardw...@google.com> > Co-developed-by: Chia-Yu Chang <chia-yu.ch...@nokia-bell-labs.com> > Signed-off-by: Chia-Yu Chang <chia-yu.ch...@nokia-bell-labs.com> > > --- > v13 > - Move TCP_ACCECN_E1B_INIT_OFFSET, TCP_ACCECN_E0B_INIT_OFFSET, and > TCP_ACCECN_CEB_INIT_OFFSET to this patch > - Use static array lookup in tcp_accecn_optfield_to_ecnfield() > - Return false when WARN_ON_ONCE() is true in tcp_accecn_process_option() > - Make synack_ecn_bytes as static const array and use const u32 pointer > in tcp_options_write() > - Use ALIGN() and ALIGN_DOWN() in tcp_options_fit_accecn() to pad TCP AccECN > option to dword > > v10 > - Add documentation of tcp_ecn_option in ip-sysctl.rst to this patch > - Remove the global variable u32 synack_ecn_bytes[3] > - Add READ_ONCE() over every reads of sysctl > > v9: > - Restruct the code in the for loop of tcp_accecn_process_option() > - Remove ecn_bytes and add use_synack_ecn_bytes flag in tcp_out_options > struct to identify whether syn_ack_bytes or received_ecn_bytes is used > - Replace leftover_bytes and leftover_size with leftover_highbyte and > leftover_lowbyte and add comments in tcp_options_write() > > v8: > - Reset leftover_size to 2 once leftover_bytes is used > --- > Documentation/networking/ip-sysctl.rst | 19 ++ > .../networking/net_cachelines/tcp_sock.rst | 3 + > include/linux/tcp.h | 9 +- > include/net/netns/ipv4.h | 1 + > include/net/tcp.h | 13 ++ > include/net/tcp_ecn.h | 89 +++++++++- > include/uapi/linux/tcp.h | 7 + > net/ipv4/sysctl_net_ipv4.c | 9 + > net/ipv4/tcp.c | 15 +- > net/ipv4/tcp_input.c | 94 +++++++++- > net/ipv4/tcp_ipv4.c | 1 + > net/ipv4/tcp_output.c | 165 +++++++++++++++++- > 12 files changed, 412 insertions(+), 13 deletions(-) >
minlen); > + } > } > } > > @@ -263,6 +347,9 @@ static inline void tcp_accecn_init_counters(struct > tcp_sock *tp) > tp->received_ce = 0; > tp->received_ce_pending = 0; > __tcp_accecn_init_bytes_counters(tp->received_ecn_bytes); > + __tcp_accecn_init_bytes_counters(tp->delivered_ecn_bytes); > + tp->accecn_minlen = 0; > + tp->est_ecnfield = 0; > } > > /* Used for make_synack to form the ACE flags */ > diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h > index bdac8c42fa82..53e0e85b52be 100644 > --- a/include/uapi/linux/tcp.h > +++ b/include/uapi/linux/tcp.h > @@ -316,6 +316,13 @@ struct tcp_info { > * in milliseconds, including any > * unfinished recovery. > */ > + __u32 tcpi_received_ce; /* # of CE marks received */ > + __u32 tcpi_delivered_e1_bytes; /* Accurate ECN byte counters */ > + __u32 tcpi_delivered_e0_bytes; > + __u32 tcpi_delivered_ce_bytes; > + __u32 tcpi_received_e1_bytes; > + __u32 tcpi_received_e0_bytes; > + __u32 tcpi_received_ce_bytes; > }; > We do not add more fields to tcp_info, unless added fields are a multiple of 64 bits. Otherwise a hole is added and can not be recovered.