I'm encountering an issue during a high-concurrency short-lived connection
stress test in a distributed database system. The system under test is running
on a single machine. The server sometimes sends a TCP RST after completing the
three-way handshake, and while the client receives the RST and closes the
connection, the server-side socket remains in ESTABLISHED state.
System Information
Linux systest104 5.15.0-305.176.4.el9uek.x86_64 #2 SMP Tue Jan 28 20:15:04 PST
2025 x86_64 x86_64 x86_64 GNU/Linux
Problem Description
During the short-lived connection stress test (many TCP connections being rapidly
established and closed), the server (PostgreSQL) sends an unexpected RST after
the handshake. Despite the RST and client-side closure, the server socket stays
in ESTABLISHED. This behavior repeats under high load.
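For reference, the load pattern can be approximated with a short Python sketch. This is not the actual test harness: start_listener and hammer are hypothetical names, and the throwaway listener only stands in for the PostgreSQL server so the snippet is self-contained.

```python
import socket
import threading


def start_listener(host="127.0.0.1"):
    """Start a throwaway listener on an ephemeral port (stand-in for the server)."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, 0))  # port 0 lets the kernel pick a free port
    srv.listen(128)

    def accept_loop():
        while True:
            try:
                conn, _ = srv.accept()
            except OSError:
                return  # listener socket was closed
            conn.close()  # close immediately: a short-lived connection

    threading.Thread(target=accept_loop, daemon=True).start()
    return srv, srv.getsockname()[1]


def hammer(host, port, count, timeout=2.0):
    """Open and immediately close `count` connections; return (ok, failed)."""
    ok = failed = 0
    for _ in range(count):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                ok += 1
        except OSError:
            failed += 1
    return ok, failed
```

Running hammer from several threads or processes against the real server port would get closer to the high-concurrency case described above.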
netstat Output
[root@systest104 tools]# netstat -atpn | grep 45129 | grep LI
tcp 0 0 10.13.8.104:45129 0.0.0.0:* LISTEN 3961360/postgres:
tcp 0 0 10.13.8.104:45129 10.13.8.104:45052 ESTABLISHED 3961360/postgres:
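The same socket states can also be read from /proc/net/tcp without netstat. The following is a minimal sketch, assuming IPv4 and a little-endian host such as x86_64; decode_addr and sockets_on_port are hypothetical helper names, not part of any tool.

```python
import socket
import struct

# State codes as printed in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def decode_addr(hex_addr):
    """Turn '68080D0A:B049' into ('10.13.8.104', 45129).

    Assumes a little-endian host, where the IPv4 address bytes
    appear reversed in the hex dump.
    """
    ip_hex, port_hex = hex_addr.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def sockets_on_port(proc_net_tcp_text, port):
    """Return (local, remote, state) for every socket whose local port matches."""
    rows = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 4:
            continue
        local = decode_addr(fields[1])
        remote = decode_addr(fields[2])
        state = TCP_STATES.get(fields[3], fields[3])
        if local[1] == port:
            rows.append((local, remote, state))
    return rows
```

On the affected machine this would be called as sockets_on_port(open("/proc/net/tcp").read(), 45129).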
TCP Traffic Capture (tcpdump)
From server (port 45129) to client (port 45052):
02:58:03.972859 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6),
length 60)
10.13.8.104.45129 > 10.13.8.104.45052: Flags [S.], cksum 0x2518 (incorrect
-> 0xe388), seq 175553476, ack 2894832976, win 65535, options [mss
65495,sackOK,TS val 1670382316 ecr 1670382316,nop,wscale 11], length 0
02:58:04.218997 IP (tos 0x0, ttl 64, id 9377, offset 0, flags [DF], proto TCP
(6), length 52)
10.13.8.104.45129 > 10.13.8.104.45052: Flags [.], cksum 0x2510 (incorrect
-> 0x0a6e), seq 1, ack 42, win 32, options [nop,nop,TS val 1670382564 ecr
1670382522], length 0
02:58:04.979788 IP (tos 0x0, ttl 64, id 9378, offset 0, flags [DF], proto TCP
(6), length 52)
10.13.8.104.45129 > 10.13.8.104.45052: Flags [F.], cksum 0x2510 (incorrect
-> 0x0776), seq 1, ack 42, win 32, options [nop,nop,TS val 1670383323 ecr
1670382522], length 0
02:58:06.494295 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6),
length 60)
10.13.8.104.45129 > 10.13.8.104.45052: Flags [S.], cksum 0x2518 (incorrect
-> 0x7ac8), seq 214961739, ack 2894898555, win 65535, options [mss
65495,sackOK,TS val 1670384838 ecr 1670384838,nop,wscale 11], length 0
02:58:06.497830 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6),
length 40)
10.13.8.104.45129 > 10.13.8.104.45052: Flags [R], cksum 0x0f95 (correct),
seq 214961740, win 0, length 0
...
From client (port 45052) to server (port 45129):
10.13.8.104.45052 > 10.13.8.104.45129: Flags [S], cksum 0x2518 (incorrect
-> 0x9bdf), seq 2894898554, win 65535, options [mss 65495,sackOK,TS val
1670384838 ecr 1670383323,nop,wscale 11], length 0
02:58:06.496318 IP (tos 0x0, ttl 64, id 15240, offset 0, flags [DF], proto TCP
(6), length 52)
10.13.8.104.45052 > 10.13.8.104.45129: Flags [.], cksum 0x2510 (incorrect
-> 0xa39b), seq 65579, ack 39408264, win 32, options [nop,nop,TS val
1670384839 ecr 1670384838], length 0
02:58:06.496321 IP (tos 0x0, ttl 64, id 15241, offset 0, flags [DF], proto TCP
(6), length 84)
10.13.8.104.45052 > 10.13.8.104.45129: Flags [P.], cksum 0x2530 (incorrect
-> 0x2b50), seq 65579:65611, ack 39408264, win 32, options [nop,nop,TS val
1670384839 ecr 1670384838], length 32
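As a sanity check on the capture, the sequence numbers can be compared with a few lines of Python. This is only a sketch using the values from the packets above; seq_after is a hypothetical helper, and the reading of the relative numbers in the last check is an assumption, not something the capture proves.

```python
MOD = 2 ** 32  # TCP sequence numbers wrap at 32 bits


def seq_after(seq, consumed=1):
    """Sequence number after `consumed` units of sequence space (a SYN uses 1)."""
    return (seq + consumed) % MOD


# Values copied from the capture.
first_synack_seq = 175553476
first_synack_ack = 2894832976
client_syn_seq = 2894898554    # second SYN from port 45052
second_synack_seq = 214961739  # second SYN-ACK from port 45129
second_synack_ack = 2894898555
rst_seq = 214961740            # RST from port 45129

# The second SYN-ACK acknowledges the second SYN.
syn_acked = second_synack_ack == seq_after(client_syn_seq)

# The RST's sequence number directly follows the second SYN-ACK.
rst_follows_synack = rst_seq == seq_after(second_synack_seq)

# The relative numbers in the client's final ACK (seq 65579, ack 39408264)
# equal the distance between the second and the first handshake's numbers.
# Assumption: tcpdump kept computing relative values against the earlier
# connection on the same port pair.
rel_seq = (second_synack_ack - first_synack_ack) % MOD
rel_ack = (seq_after(second_synack_seq) - first_synack_seq) % MOD

print(syn_acked, rst_follows_synack, rel_seq, rel_ack)
```

All of these are consistent with the RST belonging to the second handshake on the same port pair, rather than to the first connection.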
Kernel Socket State Transitions
ffff9de27cb78000 3499722 postgres 10.13.8.104 45052 -> 45129
SYN_RECV -> ESTABLISHED 0.003s
ffff9dd18eb25580 3961360 postgres 10.13.8.104 45129 -> 45052
SYN_SENT -> ESTABLISHED 1.165s
ffff9dd18eb25580 353 ksoftirqd 10.13.8.104 45129 -> 45052
ESTABLISHED -> CLOSE 4.154s
Summary
The client (45052) initiates a connection to the server (45129).
The connection completes (ESTABLISHED).
Then a new SYN is seen, followed shortly by an RST from the server.
However, the server socket does not transition out of ESTABLISHED despite
sending the RST.
The connection appears "stuck" in the ESTABLISHED state from the server's
perspective.
Question
What could cause the kernel to send an RST but still leave the server-side
socket in ESTABLISHED state? Shouldn't the kernel remove the connection after
sending the RST?
Any insight into what may be going wrong here (socket leak, improper closure,
application issue, kernel bug) would be appreciated.
Thanks, ZHAO