[ https://issues.apache.org/jira/browse/TS-3299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305461#comment-14305461 ]
ASF subversion and git services commented on TS-3299:
-----------------------------------------------------
Commit bb79d6334e445aee4200f0d5f58953f42c107c6a in trafficserver's branch refs/heads/master from [~sudheerv]
[ https://git-wip-us.apache.org/repos/asf?p=trafficserver.git;h=bb79d63 ]
[TS-3299] Fix leaking vc's during inactivitycop, due to null continuation
> InactivityCop broken
> --------------------
>
> Key: TS-3299
> URL: https://issues.apache.org/jira/browse/TS-3299
> Project: Traffic Server
> Issue Type: Bug
> Components: Core
> Affects Versions: 5.2.0
> Reporter: Sudheer Vinukonda
> Fix For: 5.3.0
>
>
> The patch in TS-3196 seems to result in an fd leak in our production
> hosts. There are a bunch of hung sockets (in CLOSE_WAIT state) that remain
> stuck forever (leaked for days after stopping the traffic). Debugging
> further, it seems that the InactivityCop is broken by this patch. [NOTE: We
> have SPDY enabled in prod, but I am not entirely sure whether this bug
> affects only SPDY connections.]
> Some info below on the leaked sockets (in CLOSE_WAIT state):
> {code}
> $ ss -s ; sudo traffic_line -r proxy.process.net.connections_currently_open; sudo traffic_line -r proxy.process.http.current_client_connections; sudo traffic_line -r proxy.process.http.current_server_connections; sudo ls -l /proc/$(pidof traffic_server)/fd/ 3>/dev/null | wc -l
> Total: 29367 (kernel 29437)
> TCP: 78235 (estab 5064, closed 46593, orphaned 0, synrecv 0, timewait 15/0),
> ports 918
> Transport Total IP IPv6
> * 29437 - -
> RAW 0 0 0
> UDP 16 13 3
> TCP 31642 31637 5
> INET 31658 31650 8
> FRAG 0 0 0
> Password:
> 27689
> 1
> 1
> 27939
> A snippet from "lsof -p $(pidof traffic_server)"
> [ET_NET 10024 nobody 240u IPv4 2138575429 0t0 TCP 67.195.33.62:443->66.87.139.114:29381 (CLOSE_WAIT)
> [ET_NET 10024 nobody 241u IPv4 2137093945 0t0 TCP 67.195.33.62:443->201.122.209.180:49274 (CLOSE_WAIT)
> [ET_NET 10024 nobody 243u IPv4 2136018789 0t0 TCP 67.195.33.62:443->173.225.111.9:38133 (CLOSE_WAIT)
> [ET_NET 10024 nobody 245u IPv4 2135996293 0t0 TCP 67.195.33.62:443->172.243.79.180:52701 (CLOSE_WAIT)
> [ET_NET 10024 nobody 248u IPv4 2136468896 0t0 TCP 67.195.33.62:443->173.225.111.82:42273 (CLOSE_WAIT)
> [ET_NET 10024 nobody 253u IPv4 2140213864 0t0 TCP 67.195.33.62:443->174.138.185.120:34936 (CLOSE_WAIT)
> [ET_NET 10024 nobody 259u IPv4 2137861176 0t0 TCP 67.195.33.62:443->76.199.250.133:60631 (CLOSE_WAIT)
> [ET_NET 10024 nobody 260u IPv4 2139081493 0t0 TCP 67.195.33.62:443->187.74.154.214:58800 (CLOSE_WAIT)
> [ET_NET 10024 nobody 261u IPv4 2134948565 0t0 TCP 67.195.33.62:443->23.242.49.117:4127 (CLOSE_WAIT)
> [ET_NET 10024 nobody 262u IPv4 2135708046 0t0 TCP 67.195.33.62:443->66.241.71.243:50318 (CLOSE_WAIT)
> [ET_NET 10024 nobody 263u IPv4 2138896897 0t0 TCP 67.195.33.62:443->73.35.151.106:52414 (CLOSE_WAIT)
> [ET_NET 10024 nobody 264u IPv4 2135589029 0t0 TCP 67.195.33.62:443->96.251.12.27:62426 (CLOSE_WAIT)
> [ET_NET 10024 nobody 265u IPv4 2134930235 0t0 TCP 67.195.33.62:443->207.118.3.196:50690 (CLOSE_WAIT)
> [ET_NET 10024 nobody 267u IPv4 2137837515 0t0 TCP 67.195.33.62:443->98.112.195.98:52028 (CLOSE_WAIT)
> [ET_NET 10024 nobody 269u IPv4 2135272855 0t0 TCP 67.195.33.62:443->24.1.230.25:57265 (CLOSE_WAIT)
> [ET_NET 10024 nobody 270u IPv4 2135820802 0t0 TCP 67.195.33.62:443->24.75.122.66:14345 (CLOSE_WAIT)
> [ET_NET 10024 nobody 271u IPv4 2135475042 0t0 TCP 67.195.33.62:443->65.102.35.112:49188 (CLOSE_WAIT)
> [ET_NET 10024 nobody 272u IPv4 2135328974 0t0 TCP 67.195.33.62:443->209.242.195.252:54890 (CLOSE_WAIT)
> [ET_NET 10024 nobody 273u IPv4 2137542791 0t0 TCP 67.195.33.62:443->76.79.183.188:47048 (CLOSE_WAIT)
> [ET_NET 10024 nobody 274u IPv4 2134806135 0t0 TCP 67.195.33.62:443->189.251.149.36:58106 (CLOSE_WAIT)
> [ET_NET 10024 nobody 275u IPv4 2140126017 0t0 TCP 67.195.33.62:443->68.19.173.44:1397 (CLOSE_WAIT)
> [ET_NET 10024 nobody 276u IPv4 2134636089 0t0 TCP 67.195.33.62:443->67.44.192.72:22112 (CLOSE_WAIT)
> [ET_NET 10024 nobody 278u IPv4 2134708339 0t0 TCP 67.195.33.62:443->107.220.216.155:51242 (CLOSE_WAIT)
> [ET_NET 10024 nobody 279u IPv4 2134580888 0t0 TCP 67.195.33.62:443->50.126.116.209:59432 (CLOSE_WAIT)
> [ET_NET 10024 nobody 281u IPv4 2134868131 0t0 TCP 67.195.33.62:443->108.38.255.44:4612 (CLOSE_WAIT)
> [ET_NET 10024 nobody 282u IPv4 2139275601 0t0 TCP 67.195.33.62:443->168.99.78.5:63345 (CLOSE_WAIT)
> [ET_NET 10024 nobody 283u IPv4 2138019644 0t0 TCP 67.195.33.62:443->209.242.204.18:47547 (CLOSE_WAIT)
> [ET_NET 10024 nobody 284u IPv4 2134873211 0t0 TCP 67.195.33.62:443->99.47.247.0:58672 (CLOSE_WAIT)
> [ET_NET 10024 nobody 285u IPv4 2137679393 0t0 TCP 67.195.33.62:443->174.138.184.227:43908 (CLOSE_WAIT)
> [ET_NET 10024 nobody 287u IPv4 2137417894 0t0 TCP 67.195.33.62:443->50.46.200.8:4880 (CLOSE_WAIT)
> [ET_NET 10024 nobody 288u IPv4 2135541215 0t0 TCP 67.195.33.62:443->216.201.190.60:41016 (CLOSE_WAIT)
> [ET_NET 10024 nobody 289u IPv4 2137315400 0t0 TCP 67.195.33.62:443->98.180.195.225:53754 (CLOSE_WAIT)
> [ET_NET 10024 nobody 291u IPv4 2136931822 0t0 TCP 67.195.33.62:443->12.199.206.210:7271 (CLOSE_WAIT)
> [ET_NET 10024 nobody 292u IPv4 2137892940 0t0 TCP 67.195.33.62:443->166.173.60.100:16766 (CLOSE_WAIT)
> [ET_NET 10024 nobody 293u IPv4 2136627603 0t0 TCP 67.195.33.62:443->71.131.5.198:51552 (CLOSE_WAIT)
> [ET_NET 10024 nobody 294u IPv4 2135059554 0t0 TCP 67.195.33.62:443->209.95.183.181:39707 (CLOSE_WAIT)
> {code}
> Below are traces showing how the inactivity cop is stuck in a loop,
> resetting the timer every time it expires instead of closing the vc (a
> sketch of the loop follows the trace).
> {code}
> [Jan 13 23:08:32.120] Server {0x2ae55fc3b700} DEBUG: (inactivity_cop_verbose) vc: 0x2ae574f13730 timeout at: 1421190525 timeout in: 300
> [Jan 13 23:08:33.164] Server {0x2ae55fc3b700} DEBUG: (inactivity_cop_verbose) vc: 0x2ae574f13730 timeout at: 1421190525 timeout in: 300
> .....
> [Jan 13 23:13:46.833] Server {0x2ae55fc3b700} DEBUG: (inactivity_cop_verbose) vc: 0x2ae574f13730 timeout at: 1421190827 timeout in: 300
> [Jan 13 23:13:47.845] Server {0x2ae55fc3b700} DEBUG: (inactivity_cop_verbose) vc: 0x2ae574f13730 timeout at: 1421190827 timeout in: 300
> [Jan 13 23:08:47.639] Server {0x2ae55fc3b700} DEBUG: (inactivity_cop) vc: 0x2ae574f13730 inactivity timeout not set, setting a default of 300
> [Jan 13 23:08:48.666] Server {0x2ae55fc3b700} DEBUG: (inactivity_cop_verbose) vc: 0x2ae574f13730 timeout at: 1421190827 timeout in: 300
> ........
> {code}
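> In other words, each pass of the cop appears to boil down to roughly the
> following per vc. This is a rough sketch with assumed names and simplified
> types, paraphrasing the debug lines above; it is not the actual UnixNet.cc
> code:
> {code}
> #include <ctime>
>
> // Hypothetical stand-in for UnixNetVConnection; only the fields needed
> // to illustrate the re-arming loop are shown.
> struct VC {
>   time_t next_inactivity_timeout_at = 0;  // 0 means "not set"
>
>   void set_inactivity_timeout(time_t secs) {
>     next_inactivity_timeout_at = time(nullptr) + secs;
>   }
>
>   void mainEvent() { /* fires the timeout; see the gdb session below */ }
> };
>
> // One cop pass over a single vc, as suggested by the log lines above.
> void cop_check(VC &vc, time_t now, time_t default_timeout = 300) {
>   if (vc.next_inactivity_timeout_at == 0) {
>     // "(inactivity_cop) ... inactivity timeout not set, setting a default of 300"
>     vc.set_inactivity_timeout(default_timeout);
>   } else if (vc.next_inactivity_timeout_at <= now) {
>     // Expired: mainEvent() is expected to close the vc, but with a NULL
>     // read continuation it never does, so the timer is just cleared and
>     // re-armed with the default on the next pass - forever.
>     vc.mainEvent();
>   }
> }
> {code}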
> I attached gdb to track what's resetting the inactivity timer. It seems
> that when the inactivity timer expires for these connections, the cop
> correctly generates a signal for the vc's continuation. However, since the
> patch in TS-3196 reset the read continuation to NULL, the event is simply
> ignored (in {{read_signal_and_update()}}) while the inactivity timer is
> reset at the same time. This loops forever and these inactive connections
> are never released. Below is the debug output from gdb, followed by a
> sketch of the failing path.
> {code}
> (gdb) cont
> Continuing.
> Hardware watchpoint 22: *(0x2ae564011a28)
> Old value = 0
> New value = -794113713
> UnixNetVConnection::set_inactivity_timeout (this=0x2ae564011830, timeout=300000000000) at P_UnixNetVConnection.h:324
> 324 P_UnixNetVConnection.h: No such file or directory.
> in P_UnixNetVConnection.h
> (gdb) cont
> Continuing.
> Hardware watchpoint 22: *(0x2ae564011a28)
> Old value = -794113713
> New value = 0
> UnixNetVConnection::mainEvent (this=0x2ae564011830, event=1, e=0x1f37c90) at UnixNetVConnection.cc:1060
> 1060 UnixNetVConnection.cc: No such file or directory.
> in UnixNetVConnection.cc
> (gdb) break 1068
> Breakpoint 23 at 0x738358: file UnixNetVConnection.cc, line 1068.
> Breakpoint 23, UnixNetVConnection::mainEvent (this=0x2ae575679c90, event=1, e=0x1f34a50) at UnixNetVConnection.cc:1068
> 1068 in UnixNetVConnection.cc
> (gdb) p closed
> $24 = 0
> (gdb) p read.vio
> $25 = {_cont = 0x0, nbytes = 0, ndone = 0, op = 1, buffer = {mbuf = 0x0, entry = 0x0}, vc_server = 0x2ae575679c90, mutex = {m_ptr = 0x2ae6241a3720}}
> Breakpoint 24, read_signal_and_update (event=105, vc=0x2ae5758ffab0) at UnixNetVConnection.cc:137
> 137 in UnixNetVConnection.cc
> (gdb) p vc->read.vio
> $33 = {_cont = 0x0, nbytes = 0, ndone = 0, op = 1, buffer = {mbuf = 0x0, entry = 0x0}, vc_server = 0x2ae5758ffab0, mutex = {m_ptr = 0x2ae6100a3a20}}
> (gdb) cont
> Continuing.
> [Switching to Thread 0x2ae55e928700 (LWP 10088)]
> {code}
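> For reference, the failing path can be reduced to the following sketch
> (simplified, hypothetical types - not the actual UnixNetVConnection code).
> The fix presumably has to close or otherwise reclaim the vc when there is
> no continuation left to signal, instead of silently dropping the timeout
> event:
> {code}
> #include <ctime>
>
> struct VIO;
>
> // Hypothetical stand-ins for the real ATS types, just to show the drop.
> struct Continuation {
>   virtual int handleEvent(int event, VIO *vio) = 0;
>   virtual ~Continuation() = default;
> };
>
> struct VIO {
>   Continuation *_cont = nullptr;  // left NULL by the TS-3196 patch
> };
>
> struct VC {
>   VIO read;
>   time_t next_inactivity_timeout_at = 0;
>
>   // Paraphrase of UnixNetVConnection::mainEvent() handling the timeout:
>   void mainEvent(int event) {
>     next_inactivity_timeout_at = 0;  // the gdb watchpoint above: value -> 0
>     read_signal_and_update(event, this);
>   }
>
>   static void read_signal_and_update(int event, VC *vc) {
>     if (vc->read._cont != nullptr) {
>       vc->read._cont->handleEvent(event, &vc->read);  // normally closes the vc
>     }
>     // With _cont == 0x0 (as in the gdb dump above), the timeout event is
>     // swallowed here; the cop then re-arms the default timeout on its next
>     // pass and the vc is never closed - hence the CLOSE_WAIT pile-up.
>   }
> };
> {code}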