Continuing the investigation, I reduced the listen section to just this:
listen proxy-test-tcp
    bind *:81
    option tcplog
    server test1 192.168.1.101:80
And the difference between 1.7 and 1.8, tracing a process that receives
only one request, is that in 1.8 the shutdown of the socket that received
the request fails with ENOTCONN; the connection then stays in CLOSE_WAIT
for a while, whereas in 1.7 it goes to TIME_WAIT as usual.
1.8:
recvfrom(3, "GET / HTTP/1.1\r\nUser-Agent: curl"..., 15360, 0, NULL,
NULL) = 92
...
shutdown(3, SHUT_WR) = 0
...
shutdown(3, SHUT_WR) = -1 ENOTCONN (Transport
endpoint is not connected)
close(3) = 0
1.7:
recvfrom(1, "GET / HTTP/1.1\r\nUser-Agent: curl"..., 15360, 0, NULL,
NULL) = 92
...
shutdown(1, SHUT_WR) = 0
close(1) = 0
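The second shutdown failing is reproducible outside haproxy. A minimal
Python sketch of my own (plain loopback TCP on Linux, nothing
haproxy-specific): a second SHUT_WR on a connection that both sides have
already closed fails with ENOTCONN, just like in the 1.8 trace:

```python
import errno
import socket
import time

srv = socket.socket()                 # loopback listener
srv.bind(("127.0.0.1", 0))
srv.listen(1)

cli = socket.create_connection(srv.getsockname())
conn, _ = srv.accept()

cli.shutdown(socket.SHUT_WR)          # first shutdown: returns 0
conn.close()                          # peer closes its side too
assert cli.recv(16) == b""            # we read EOF, like recvfrom() = 0
time.sleep(0.2)                       # let the FIN/ACK exchange finish

err = None
try:
    cli.shutdown(socket.SHUT_WR)      # second shutdown on a dead connection
except OSError as e:
    err = e.errno
print(errno.errorcode.get(err))       # ENOTCONN on Linux

cli.close()
srv.close()
```

Once both FINs have been exchanged the kernel considers the socket no
longer connected, so any further shutdown() gets ENOTCONN; by itself that
is harmless, the question is why 1.8 issues it twice and holds the
connection open.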
Here are the full traces:
1.8:
Process 15707 attached
epoll_wait(0, {}, 200, 1000) = 0
epoll_wait(0, {{EPOLLIN, {u32=5, u64=5}}}, 200, 1000) = 1
accept4(5, {sa_family=AF_INET, sin_port=htons(59086),
sin_addr=inet_addr("192.168.1.117")}, [16], SOCK_NONBLOCK) = 3
setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
accept4(5, 0x7fff91adea10, [128], SOCK_NONBLOCK) = -1 EAGAIN (Resource
temporarily unavailable)
recvfrom(3, "GET / HTTP/1.1\r\nUser-Agent: curl"..., 15360, 0, NULL,
NULL) = 92
recvfrom(3, 0xfd4e90, 15268, 0, 0, 0) = -1 EAGAIN (Resource
temporarily unavailable)
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 7
fcntl(7, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
setsockopt(7, SOL_TCP, TCP_NODELAY, [1], 4) = 0
connect(7, {sa_family=AF_INET, sin_port=htons(80),
sin_addr=inet_addr("192.168.1.101")}, 16) = -1 EINPROGRESS (Operation
now in progress)
epoll_ctl(0, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLRDHUP, {u32=3, u64=3}}) = 0
epoll_wait(0, {}, 200, 0) = 0
sendto(7, "GET / HTTP/1.1\r\nUser-Agent: curl"..., 92, MSG_DONTWAIT|
MSG_NOSIGNAL, NULL, 0) = 92
epoll_wait(0, {}, 200, 0) = 0
recvfrom(7, 0xfd0e14, 16384, 0, 0, 0) = -1 EAGAIN (Resource
temporarily unavailable)
epoll_ctl(0, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLRDHUP, {u32=7, u64=7}}) = 0
epoll_wait(0, {{EPOLLIN, {u32=7, u64=7}}}, 200, 1000) = 1
recvfrom(7, "HTTP/1.1 301 Moved Permanently\r\n"..., 16384, 0, NULL,
NULL) = 515
sendto(3, "HTTP/1.1 301 Moved Permanently\r\n"..., 515, MSG_DONTWAIT|
MSG_NOSIGNAL, NULL, 0) = 515
epoll_wait(0, {{EPOLLIN|EPOLLRDHUP, {u32=3, u64=3}}}, 200, 1000) = 1
recvfrom(3, "", 16384, 0, NULL, NULL) = 0
epoll_ctl(0, EPOLL_CTL_DEL, 3, 7f8513ecc7e0) = 0
epoll_wait(0, {}, 200, 1000) = 0
epoll_wait(0, {}, 200, 1000) = 0
epoll_wait(0, {{EPOLLIN|EPOLLRDHUP, {u32=7, u64=7}}}, 200, 1000) = 1
recvfrom(7, "", 16384, 0, NULL, NULL) = 0
close(7) = 0
shutdown(3, SHUT_WR) = 0
sendmsg(6, {msg_name(110)={sa_family=AF_LOCAL, sun_path="/dev/log"},
msg_iov(8)=[{"<174>Dec 22 12:09:25 ", 21}, {"haproxy", 7}, {"[", 1},
{"15707", 5}, {"]: ", 3}, {"", 0}, {"192.168.1.117:59086
[22/Dec/2017"..., 132}, {"\n", 1}], msg_controllen=0, msg_flags=0},
MSG_DONTWAIT|MSG_NOSIGNAL) = 170
shutdown(3, SHUT_WR) = -1 ENOTCONN (Transport
endpoint is not connected)
close(3) = 0
epoll_wait(0, {}, 200, 1000) = 0
epoll_wait(0, Process 15707 detached
<detached ...>
1.7:
Process 10408 attached
epoll_wait(0, {}, 200, 1000) = 0
epoll_wait(0, {}, 200, 1000) = 0
epoll_wait(0, {{EPOLLIN, {u32=7, u64=7}}}, 200, 1000) = 1
accept4(7, {sa_family=AF_INET, sin_port=htons(35835),
sin_addr=inet_addr("192.168.1.117")}, [16], SOCK_NONBLOCK) = 1
setsockopt(1, SOL_TCP, TCP_NODELAY, [1], 4) = 0
accept4(7, 0x7ffdfeb80e10, [128], SOCK_NONBLOCK) = -1 EAGAIN (Resource
temporarily unavailable)
recvfrom(1, "GET / HTTP/1.1\r\nUser-Agent: curl"..., 15360, 0, NULL,
NULL) = 92
recvfrom(1, 0x1016c60, 15268, 0, 0, 0) = -1 EAGAIN (Resource
temporarily unavailable)
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 2
fcntl(2, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
setsockopt(2, SOL_TCP, TCP_NODELAY, [1], 4) = 0
connect(2, {sa_family=AF_INET, sin_port=htons(80),
sin_addr=inet_addr("192.168.1.101")}, 16) = -1 EINPROGRESS (Operation
now in progress)
epoll_ctl(0, EPOLL_CTL_ADD, 1, {EPOLLIN|EPOLLRDHUP, {u32=1, u64=1}}) = 0
epoll_wait(0, {}, 200, 0) = 0
sendto(2, "GET / HTTP/1.1\r\nUser-Agent: curl"..., 92, MSG_DONTWAIT|
MSG_NOSIGNAL, NULL, 0) = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(0, EPOLL_CTL_ADD, 2, {EPOLLOUT, {u32=2, u64=2}}) = 0
epoll_wait(0, {{EPOLLOUT, {u32=2, u64=2}}}, 200, 1000) = 1
sendto(2, "GET / HTTP/1.1\r\nUser-Agent: curl"..., 92, MSG_DONTWAIT|
MSG_NOSIGNAL, NULL, 0) = 92
epoll_ctl(0, EPOLL_CTL_DEL, 2, 75ed50) = 0
epoll_wait(0, {}, 200, 0) = 0
recvfrom(2, 0x1012be4, 16384, 0, 0, 0) = -1 EAGAIN (Resource
temporarily unavailable)
epoll_ctl(0, EPOLL_CTL_ADD, 2, {EPOLLIN|EPOLLRDHUP, {u32=2, u64=2}}) = 0
epoll_wait(0, {{EPOLLIN, {u32=2, u64=2}}}, 200, 1000) = 1
recvfrom(2, "HTTP/1.1 301 Moved Permanently\r\n"..., 16384, 0, NULL,
NULL) = 515
sendto(1, "HTTP/1.1 301 Moved Permanently\r\n"..., 515, MSG_DONTWAIT|
MSG_NOSIGNAL, NULL, 0) = 515
epoll_wait(0, {{EPOLLIN|EPOLLRDHUP, {u32=1, u64=1}}}, 200, 1000) = 1
recvfrom(1, "", 16384, 0, NULL, NULL) = 0
shutdown(2, SHUT_WR) = 0
epoll_ctl(0, EPOLL_CTL_DEL, 1, 75ed50) = 0
epoll_wait(0, {{EPOLLIN|EPOLLHUP|EPOLLRDHUP, {u32=2, u64=2}}}, 200,
1000) = 1
recvfrom(2, "", 16384, 0, NULL, NULL) = 0
close(2) = 0
shutdown(1, SHUT_WR) = 0
close(1) = 0
sendmsg(6, {msg_name(110)={sa_family=AF_LOCAL, sun_path="/dev/log"},
msg_iov(8)=[{"<174>Dec 22 12:09:45 ", 21}, {"haproxy", 7}, {"[", 1},
{"10408", 5}, {"]: ", 3}, {"", 0}, {"192.168.1.117:35835
[22/Dec/2017"..., 129}, {"\n", 1}], msg_controllen=0, msg_flags=0},
MSG_DONTWAIT|MSG_NOSIGNAL) = 167
epoll_wait(0, {}, 200, 1000) = 0
epoll_wait(0, {}, 200, 1000) = 0
epoll_wait(0, {}, 200, 1000) = 0
epoll_wait(0, {}, 200, 1000) = 0
epoll_wait(0, Process 10408 detached
<detached ...>
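To quantify the CLOSE_WAIT buildup mentioned in the quoted mail below,
the per-state socket counts can also be read straight out of
/proc/net/tcp. A small sketch (the helper name and the hard-coded port
8080, the proxy-tcp frontend port, are mine):

```python
def count_tcp_state(port, state_hex, path="/proc/net/tcp"):
    """Count sockets with the given local port in the given state.

    State codes come from the kernel's tcp_states.h:
    06 = TIME_WAIT, 08 = CLOSE_WAIT.
    """
    want = ":%04X" % port                  # local port, uppercase hex
    n = 0
    with open(path) as f:
        next(f)                            # skip the header line
        for line in f:
            fields = line.split()
            # fields[1] = local "ADDR:PORT" (hex), fields[3] = state
            if fields[1].endswith(want) and fields[3] == state_hex:
                n += 1
    return n

print("CLOSE_WAIT on :8080 =", count_tcp_state(8080, "08"))
```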
On Thu, 2017-12-21 at 17:37 +0100, Ricardo Fraile wrote:
>
> Well, I isolated the service on a load balancer server with a minimal
> configuration; let me detail the problem.
>
> Two identical (cloned) Debian load balancers with the 3.16.7-ckt9 kernel,
> both running keepalived and sharing the IP address of the proxy-tcp
> service (192.168.1.100). Server A runs HAProxy 1.8.1 and server B 1.7.4;
> both have the following configuration (in 1.7 only the "master-worker"
> line is removed):
>
> global
>     master-worker
>     node balancer
>     log /dev/log local1 info
>     stats socket /var/run/haproxy.sock mode 0660 user testuser1 group testuser1 level admin
>     pidfile /var/run/haproxy.pid
>     maxconn 32000
>     nbproc 1
>
> defaults
>     mode tcp
>     log global
>     retries 3
>     option redispatch
>
>     maxconn 100000
>     fullconn 100000
>
>     timeout connect 5s
>     timeout server 50s
>     timeout client 50s
>     timeout http-keep-alive 60s
>
>     default-server on-marked-down shutdown-sessions inter 5s fastinter 1s downinter 1s rise 2 fall 2
>
> listen proxy-stats
>     bind *:80
>     mode http
>     stats enable
>     stats show-legends
>     stats uri /
>     stats realm Haproxy\ Statistics
>     stats refresh 5s
>
> listen proxy-tcp
>     bind 192.168.1.100:8080
>     option tcplog
>     balance roundrobin
>
>     option httpchk GET /test
>     http-check expect string ok
>
>     server server1 192.168.1.101:8080 check
>     server server2 192.168.1.102:8080 check
>
> After switching the traffic between the two servers with keepalived,
> the results are:
>
> 1.7:
> Session rate: 861
> Sessions: 220
>
> 1.8:
> Session rate: 835
> Sessions: 31243
>
> This is the evidence on the 1.8 server:
>
> - The ulimit-n is 64034 and Sessions Max reaches 31999 in both the
> Frontend and Backend of the proxy-tcp listen; I suppose the limit is
> reached as a consequence of the issue.
>
> - The system log reports a lot of "TCP: too many orphaned sockets" and
> some lines like "net_ratelimit: 2094 callbacks suppressed".
>
> - The HAProxy log shows that the total time elapsed between the accept
> and the last close equals the 50s assigned to the server and client
> timeouts.
>
> - The termination state is OK in 99% of them. Yesterday I said it was
> "sD", but today I checked and that is actually very rare; I include one
> such line here only as an example.
>
> Dec 21 15:38:09 balancer haproxy[8094]: 192.168.1.55:58674
> [21/Dec/2017:15:37:19.358] proxy-tcp proxy-tcp/server2 1/0/50001 1087 --
> 30211/30210/30209/15106/0 0/0
> Dec 21 15:38:09 balancer haproxy[8094]: 192.168.1.42:51027
> [21/Dec/2017:15:37:19.356] proxy-tcp proxy-tcp/server2 1/0/50008 5345 sD
> 30214/30213/30210/15106/0 0/0
> Dec 21 15:38:09 balancer haproxy[8094]: 192.168.1.55:40442
> [21/Dec/2017:15:37:19.364] proxy-tcp proxy-tcp/server1 1/0/50003 694 --
> 30216/30215/30211/15104/0 0/0
>
> - Of 30522 TCP sockets to the proxy-tcp address, 30160 are in
> CLOSE_WAIT state on local address 192.168.1.100:8080. Yesterday I said
> they were on the backend side, but I was wrong: all of them are on the
> frontend side.
>
>
> I have the output of "show sess all" and "show fd", but as it contains
> a lot of private information, I will send it to you another way. If
> there is any clear evidence in it, I will take the time to anonymize
> and share.
>
>
> Thanks,
>
> On Wed, 2017-12-20 at 18:19 +0100, Willy Tarreau wrote:
> > Hello Ricardo,
> >
> > On Wed, Dec 20, 2017 at 05:00:33PM +0100, Ricardo Fraile wrote:
> > > Hello,
> > >
> > > After upgrading from 1.7.4 to 1.8.1, basically with the configuration
> > > snippet at the end of this mail, the sessions started to grow, for example:
> > >
> > > 1.7.4:
> > > Active sessions: ~161
> > > Active sessions rate: ~425
> > >
> > > 1.8.1:
> > > Active sessions: ~6700
> > > Active sessions rate: ~350
> >
> > Ah that's not good :-(
> >
> > > Looking at the Linux (3.16.7) server, there is a high number of
> > > CLOSE_WAIT connections from the bind address of the listen service to
> > > the backend nodes.
> >
> > Strange, I don't understand well what type of traffic could cause this
> > except a loop, that sounds a bit unusual.
> >
> > > System logs reported "TCP: too many orphaned sockets", but after
> > > increasing the net.ipv4.tcp_max_orphans value the messages stop,
> > > though nothing else changes.
> >
> > Normally orphans correspond to closed sockets for which there are still
> > data in the system's buffers so this should be unrelated to the CLOSE_WAIT,
> > unless there's a loop somewhere where a backend reconnects to the frontend,
> > which can explain both situations at once when the timeout strikes.
> >
> > > The HAProxy logs report the "sD" termination state for that listen,
> > > but only with 1.8.
> >
> > Thus a server timeout during the end of the transfer. That doesn't make
> > much sense either.
> >
> > > Any ideas to dig into the issue?
> >
> > It would be very useful that you share your configuration (please remove
> > any sensitive info like stats passwords or IP addresses you prefer to keep
> > private). When running 1.8, it would be useful to issue the following
> > commands on the CLI and capture the output to a file :
> > - "show sess all"
> > - "show fd"
> >
> > Warning, the first one will reveal a lot of info (internal addresses etc)
> > so you may want to send it privately and not to the list if this is the
> > case (though it takes longer to diagnose it :-)).
> >
> > If you think you can reproduce this on a test machine out of production,
> > that would be extremely useful.
> >
> > We have not noticed any single such issue on haproxy.org which has delivered
> > about 100 GB and 2 million requests over the last 2 weeks, with this exact
> > same version, so that makes me think that either the config or the type of
> > traffic count a lot there to trigger the problem you are observing.
> >
> > Regards,
> > Willy