Re: [PATCH] [MEDIUM] Improve "no free ports" error case

2017-03-15 Thread Krishna Kumar (Engineering)
Hi Willy,

I am facing one problem when using the system port range.

Distro: Ubuntu 16.04.1, kernel: 4.4.0-53-generic

When I set the range to 50000-50999, the kernel allocates ports only in
the range 50000-50499; the remaining 500 ports never seem to get
allocated, despite running a few thousand connections in parallel. A
simple test program that I wrote, which binds to an IP and then
connects, uses all 1000 ports. Quickly checking the TCP code, I noticed
that the kernel tries to allocate an odd port for bind(), leaving the
even ports for connect(). Any idea why I don't get the full port range
with bind? I am specifying the server with something like:
 server abcd google.com:80 source e1.e2.e3.e4
and with the following sysctl:
 sysctl -w net.ipv4.ip_local_port_range="50000 50999"

I hope it is OK to add an unrelated feature question to this thread:

Is it possible to tell haproxy to use one backend for a request (GET),
and if the response is 404 (Not Found), to try another backend? The
resource may be present in the second backend; is there any way to try
it there after getting a 404 from the first?

Thanks,
- Krishna


On Thu, Mar 9, 2017 at 2:22 PM, Krishna Kumar (Engineering) <
krishna...@flipkart.com> wrote:

> Hi Willy,
>
> Excellent, I will try this idea, it should definitely help!
> Thanks for the explanations.
>
> Regards,
> - Krishna
>
>
> On Thu, Mar 9, 2017 at 1:37 PM, Willy Tarreau  wrote:
>
>> On Thu, Mar 09, 2017 at 12:50:16PM +0530, Krishna Kumar (Engineering)
>> wrote:
>> > 1. About 'retries', I am not sure if it works for connect() failing
>> > synchronously on the
>> > local system (as opposed to getting a timeout/refused via callback).
>>
>> Yes it normally does. I've been using it for the same purpose in certain
>> situations (eg: binding to a source port range while some daemons are
>> later bound into that range).
>>
>> > The
>> > document
>> > on retries says:
>> >
>> > "  is the number of times a connection attempt should be
>> retried
>> > on
>> >   a server when a connection either is refused or times
>> out. The
>> >   default value is 3.
>> > "
>> >
>> > The two conditions above don't fall in our use case.
>>
>> It's still a refused connection :-)
>>
>> > The way I understood was that
>> > retries happens during the callback handler. Also I am not sure if
>> there is
>> > any way to circumvent the "1 second" gap for a retry.
>>
>> Hmmm I have to check. In fact when the LB algorithm is not determinist
>> we immediately retry on another server. If we're supposed to end up only
>> on the same server we indeed apply the delay. But if it's a synchronous
>> error, I don't know. And I think it's one case (especially -EADDRNOTAVAIL)
>> where we should immediately retry.
>>
>> > 2. For nolinger, it was not recommended in the document,
>>
>> It's indeed strongly recommended against, mainly because we've started
>> to see it in configs copy-pasted from blogs without understanding the
>> impacts.
>>
>> > and also I wonder if any data
>> > loss can happen if the socket is not lingered for some time beyond the
>> FIN
>> > packet that
>> > the remote server sent for doing the close(), delayed data packets, etc.
>>
>> The data loss happens only with outgoing data, so for HTTP it's data
>> sent to the client which are at risk. Data coming from the server are
>> properly consumed. In fact, when you configure "http-server-close",
>> the nolinger is automatically enabled in your back so that haproxy
>> can close the server connection without accumulating time-waits.
>>
>> > 3. Ports: Actually each HAProxy process has 400 ports limitation to a
>> > single backend,
>> > and there are many haproxy processes on this and other servers. The
>> ports
>> > are split per
>> > process and per system. E.g. system1 has 'n' processes and each have a
>> > separate port
>> > range from each other, system2 has 'n' processes and a completely
>> different
>> > port range.
>> > For infra reasons, we are restricting the total port range. The unique
>> > ports for different
>> > haproxy processes running on same system is to avoid attempting to use
>> the
>> > same port
>> > (first port# in the range) by two processes and failing in connect, when
>> > attempting to
>> > connect to the same remote server. Hope I explained that clearly.
>>
>> Yep I clearly see the use case. That's one of the rare cases where it's
>> interesting to use SNAT between your haproxy nodes and the internet. This
>> way you'll use a unified ports pool for all your nodes and will not have
>> to reserve port ranges per system and per process. Each process will then
>> share the system's local source ports, and each system will have a
>> different
>> address. Then the SNAT will convert these IP1..N:port1..N to the public IP
>> address and an available port. This will offer you more flexibility to add
>> or remove nodes/processes etc. Maybe your total traffic cannot pass

Re: [PATCH] [MEDIUM] Improve "no free ports" error case

2017-03-09 Thread Krishna Kumar (Engineering)
Hi Willy,

Excellent, I will try this idea, it should definitely help!
Thanks for the explanations.

Regards,
- Krishna


On Thu, Mar 9, 2017 at 1:37 PM, Willy Tarreau  wrote:

> On Thu, Mar 09, 2017 at 12:50:16PM +0530, Krishna Kumar (Engineering)
> wrote:
> > 1. About 'retries', I am not sure if it works for connect() failing
> > synchronously on the
> > local system (as opposed to getting a timeout/refused via callback).
>
> Yes it normally does. I've been using it for the same purpose in certain
> situations (eg: binding to a source port range while some daemons are
> later bound into that range).
>
> > The
> > document
> > on retries says:
> >
> > "  is the number of times a connection attempt should be
> retried
> > on
> >   a server when a connection either is refused or times out.
> The
> >   default value is 3.
> > "
> >
> > The two conditions above don't fall in our use case.
>
> It's still a refused connection :-)
>
> > The way I understood was that
> > retries happens during the callback handler. Also I am not sure if there
> is
> > any way to circumvent the "1 second" gap for a retry.
>
> Hmmm I have to check. In fact when the LB algorithm is not determinist
> we immediately retry on another server. If we're supposed to end up only
> on the same server we indeed apply the delay. But if it's a synchronous
> error, I don't know. And I think it's one case (especially -EADDRNOTAVAIL)
> where we should immediately retry.
>
> > 2. For nolinger, it was not recommended in the document,
>
> It's indeed strongly recommended against, mainly because we've started
> to see it in configs copy-pasted from blogs without understanding the
> impacts.
>
> > and also I wonder if any data
> > loss can happen if the socket is not lingered for some time beyond the
> FIN
> > packet that
> > the remote server sent for doing the close(), delayed data packets, etc.
>
> The data loss happens only with outgoing data, so for HTTP it's data
> sent to the client which are at risk. Data coming from the server are
> properly consumed. In fact, when you configure "http-server-close",
> the nolinger is automatically enabled in your back so that haproxy
> can close the server connection without accumulating time-waits.
>
> > 3. Ports: Actually each HAProxy process has 400 ports limitation to a
> > single backend,
> > and there are many haproxy processes on this and other servers. The ports
> > are split per
> > process and per system. E.g. system1 has 'n' processes and each have a
> > separate port
> > range from each other, system2 has 'n' processes and a completely
> different
> > port range.
> > For infra reasons, we are restricting the total port range. The unique
> > ports for different
> > haproxy processes running on same system is to avoid attempting to use
> the
> > same port
> > (first port# in the range) by two processes and failing in connect, when
> > attempting to
> > connect to the same remote server. Hope I explained that clearly.
>
> Yep I clearly see the use case. That's one of the rare cases where it's
> interesting to use SNAT between your haproxy nodes and the internet. This
> way you'll use a unified ports pool for all your nodes and will not have
> to reserve port ranges per system and per process. Each process will then
> share the system's local source ports, and each system will have a
> different
> address. Then the SNAT will convert these IP1..N:port1..N to the public IP
> address and an available port. This will offer you more flexibility to add
> or remove nodes/processes etc. Maybe your total traffic cannot pass through
> a single SNAT box though in which case I understand that you don't have
> much choice. However you could then at least not force each process' port
> range and instead fix the system's local port range so that you know that
> all processes of a single machine share a same port range. That's already
> better because you won't be forcing to assign ports from unfinished
> connections.
>
> Willy
>


Re: [PATCH] [MEDIUM] Improve "no free ports" error case

2017-03-09 Thread Willy Tarreau
On Thu, Mar 09, 2017 at 12:50:16PM +0530, Krishna Kumar (Engineering) wrote:
> 1. About 'retries', I am not sure if it works for connect() failing
> synchronously on the
> local system (as opposed to getting a timeout/refused via callback).

Yes it normally does. I've been using it for the same purpose in certain
situations (eg: binding to a source port range while some daemons are
later bound into that range).

> The
> document
> on retries says:
> 
> "  is the number of times a connection attempt should be retried
> on
>   a server when a connection either is refused or times out. The
>   default value is 3.
> "
> 
> The two conditions above don't fall in our use case.

It's still a refused connection :-)

> The way I understood was that
> retries happens during the callback handler. Also I am not sure if there is
> any way to circumvent the "1 second" gap for a retry.

Hmmm, I have to check. In fact, when the LB algorithm is not deterministic
we immediately retry on another server. If we're supposed to end up only
on the same server, we indeed apply the delay. But if it's a synchronous
error, I don't know. And I think it's one case (especially -EADDRNOTAVAIL)
where we should retry immediately.

> 2. For nolinger, it was not recommended in the document,

It's indeed strongly recommended against, mainly because we've started
to see it in configs copy-pasted from blogs without understanding the
impacts.

> and also I wonder if any data
> loss can happen if the socket is not lingered for some time beyond the FIN
> packet that
> the remote server sent for doing the close(), delayed data packets, etc.

The data loss happens only with outgoing data, so for HTTP it's the data
sent to the client that is at risk. Data coming from the server is
properly consumed. In fact, when you configure "http-server-close",
nolinger is automatically enabled on your backend so that haproxy
can close the server connection without accumulating TIME_WAIT sockets.

> 3. Ports: Actually each HAProxy process has 400 ports limitation to a
> single backend,
> and there are many haproxy processes on this and other servers. The ports
> are split per
> process and per system. E.g. system1 has 'n' processes and each have a
> separate port
> range from each other, system2 has 'n' processes and a completely different
> port range.
> For infra reasons, we are restricting the total port range. The unique
> ports for different
> haproxy processes running on same system is to avoid attempting to use the
> same port
> (first port# in the range) by two processes and failing in connect, when
> attempting to
> connect to the same remote server. Hope I explained that clearly.

Yep, I clearly see the use case. That's one of the rare cases where it's
interesting to use SNAT between your haproxy nodes and the internet. This
way you'll use a unified port pool for all your nodes and will not have
to reserve port ranges per system and per process. Each process will then
share the system's local source ports, and each system will have a
different address. The SNAT will then convert these IP1..N:port1..N to the
public IP address and an available port. This offers you more flexibility
to add or remove nodes/processes, etc. Maybe your total traffic cannot pass
through a single SNAT box, though, in which case I understand that you don't
have much choice. However, you could then at least not force each process'
port range, and instead fix the system's local port range so that you know
all processes of a single machine share the same port range. That's already
better, because you won't be forced to assign ports from unfinished
connections.

Willy



Re: [PATCH] [MEDIUM] Improve "no free ports" error case

2017-03-08 Thread Krishna Kumar (Engineering)
Hi Willy,

Thanks for your comments.

1. About 'retries', I am not sure if it works for connect() failing
synchronously on the local system (as opposed to getting a
timeout/refused via callback). The documentation on retries says:

"<retries> is the number of times a connection attempt should be
 retried on a server when a connection either is refused or times
 out. The default value is 3."

The two conditions above don't fall in our use case. The way I
understood it was that retries happen from the callback handler. Also,
I am not sure if there is any way to circumvent the "1 second" gap
between retries.

2. For nolinger, it was not recommended in the documentation, and I also
wonder whether any data loss can happen if the socket does not linger for
some time beyond the FIN packet that the remote server sent for its
close(), delayed data packets, etc.

3. Ports: Actually each HAProxy process is limited to 400 ports per
backend, and there are many haproxy processes on this and other servers.
The ports are split per process and per system; e.g. system1 has 'n'
processes, each with a port range separate from the others, and system2
has 'n' processes with a completely different port range. For infra
reasons, we are restricting the total port range. The unique ports for
different haproxy processes running on the same system are to avoid two
processes attempting to use the same port (the first port# in the range)
and failing in connect() when connecting to the same remote server. Hope
I explained that clearly.

Thanks,
- Krishna


On Thu, Mar 9, 2017 at 12:19 PM, Willy Tarreau  wrote:

> Hi Krishna,
>
> On Thu, Mar 09, 2017 at 12:03:19PM +0530, Krishna Kumar (Engineering)
> wrote:
> > Hi Willy,
> >
> > We use HAProxy as a Forward Proxy (I know this is not the intended
> > application for HAProxy) to access outside world from within the DC, and
> > this requires setting a source port range for return traffic to reach the
> > correct
> > box from which a connection was established. On our production boxes, we
> > see around 500 "no free ports" errors per day, but this could increase to
> > about 120K errors during big sale events. The reason for this is due to
> > connect getting a EADDRNOTAVAIL error, since an earlier closed socket
> > may be in last-ack state, as it may take some time for the remote server
> to
> > send the final ack.
> >
> > The attached patch reduces the number of errors by attempting more ports,
> > if they are available.
> >
> > Please review, and let me know if this sounds reasonable to implement.
>
> Well, while the patch looks clean I'm really not convinced it's the correct
> approach. Normally you should simply be using the "retries" parameter to
> increase the amount of connect retries. There's nothing wrong with setting
> it to a really high value if needed. Doesn't it work in your case ?
>
> Also a few other points :
>   - when the remote server sends the FIN with the last segment, your
> connection ends up in CLOSE_WAIT state. Haproxy then closes as
> well, sending a FIN and your socket ends up in LAST_ACK waiting
> for the server to respond. You may instead ask haproxy to close
> with an RST by setting "option nolinger" in the backend. The port
> will then always be free locally. The side effect is that if the
> RST is lost, the SYN of a new outgoing connection may get an ACK
> instead of a SYN-ACK as a reply and will respond to it with an
> RST and try again. This will result in all connections working,
> some taking slightly longer a time (typically 1 second).
>
>   - 500 outgoing ports is a very low value. You should keep in mind
> that nowadays most servers use 60 seconds FIN_WAIT/TIME_WAIT
> delays (the remote server remains in FIN_WAIT1 while waiting for
> your ACK, then enters TIME_WAIT when receiving your FIN). So with
> only 500 ports, you can *safely* support only 500/60 = 8 connections
> per second. Fortunately in practice it doesn't work like this
> since most of the time connections are correctly closed. But if
> you start to enter big trouble, you need to understand that you
> can very quickly reach some limits. And 500 outgoing ports means
> you don't expect to support more than 500 concurrent conns per
> proxy, which seems quite low.
>
> Thus normally what you're experiencing should only be dealt with
> using configuration :
>   - increase retries setting
>   - possibly enable option nolinger (backend only, never on a frontend)
>   - try to increase the available source port ranges.
>
> Regards,
> Willy
>


Re: [PATCH] [MEDIUM] Improve "no free ports" error case

2017-03-08 Thread Willy Tarreau
Hi Krishna,

On Thu, Mar 09, 2017 at 12:03:19PM +0530, Krishna Kumar (Engineering) wrote:
> Hi Willy,
> 
> We use HAProxy as a Forward Proxy (I know this is not the intended
> application for HAProxy) to access outside world from within the DC, and
> this requires setting a source port range for return traffic to reach the
> correct
> box from which a connection was established. On our production boxes, we
> see around 500 "no free ports" errors per day, but this could increase to
> about 120K errors during big sale events. The reason for this is due to
> connect getting a EADDRNOTAVAIL error, since an earlier closed socket
> may be in last-ack state, as it may take some time for the remote server to
> send the final ack.
> 
> The attached patch reduces the number of errors by attempting more ports,
> if they are available.
> 
> Please review, and let me know if this sounds reasonable to implement.

Well, while the patch looks clean, I'm really not convinced it's the
correct approach. Normally you should simply use the "retries" parameter
to increase the number of connect retries. There's nothing wrong with
setting it to a really high value if needed. Doesn't that work in your
case?

Also a few other points :
  - when the remote server sends the FIN with the last segment, your
    connection ends up in CLOSE_WAIT state. Haproxy then closes as
    well, sending a FIN, and your socket ends up in LAST_ACK waiting
    for the server to respond. You may instead ask haproxy to close
    with an RST by setting "option nolinger" in the backend. The port
    will then always be free locally. The side effect is that if the
    RST is lost, the SYN of a new outgoing connection may get an ACK
    instead of a SYN-ACK as a reply, and will respond to it with an
    RST and try again. The result is that all connections work, with
    some taking slightly longer (typically 1 second).

  - 500 outgoing ports is a very low value. You should keep in mind
    that nowadays most servers use 60-second FIN_WAIT/TIME_WAIT
    delays (the remote server remains in FIN_WAIT1 while waiting for
    your ACK, then enters TIME_WAIT when receiving your FIN). So with
    only 500 ports, you can *safely* support only 500/60 = 8 connections
    per second. Fortunately, in practice it doesn't work like this,
    since most of the time connections are closed correctly. But if
    you start to run into big trouble, you need to understand that you
    can very quickly reach some limits. Also, 500 outgoing ports means
    you don't expect to support more than 500 concurrent connections
    per proxy, which seems quite low.

Thus normally what you're experiencing should be dealt with purely
through configuration:
  - increase the retries setting
  - possibly enable option nolinger (backend only, never on a frontend)
  - try to enlarge the available source port ranges.

Regards,
Willy



[PATCH] [MEDIUM] Improve "no free ports" error case

2017-03-08 Thread Krishna Kumar (Engineering)
Hi Willy,

We use HAProxy as a forward proxy (I know this is not the intended
application for HAProxy) to access the outside world from within the DC,
and this requires setting a source port range for return traffic to reach
the correct box from which a connection was established. On our production
boxes, we see around 500 "no free ports" errors per day, but this can
increase to about 120K errors during big sale events. The reason is that
connect() gets an EADDRNOTAVAIL error, since an earlier closed socket may
still be in LAST-ACK state, as it may take some time for the remote server
to send the final ACK.

The attached patch reduces the number of errors by attempting more ports,
if they are available.

Please review, and let me know if this sounds reasonable to implement.

Thanks,
- Krishna
From 2946ac0b9ba5567284d5364445ad1f9102365e38 Mon Sep 17 00:00:00 2001
From: Krishna Kumar 
Date: Thu, 9 Mar 2017 11:24:06 +0530
Subject: [PATCH] [MEDIUM] Improve "no free ports" error.

 When a source IP and source port range are specified, HAProxy sometimes
 fails to connect to a server and prints a "no free ports" message. This
 happens when HAProxy recently closed a socket with the same port#, but
 the socket was not yet completely closed in the kernel. To fix this,
 attempt a few more connects with different port numbers, based on the
 available ports in the port_range.

Following are some log lines with this patch, and when running out of ports:

Early Connect() failed for backend bk-noports: no free ports.
Connect(port=50116) failed for backend bk-noports, socket not fully closed, RETRYING... (max_attempts = 2)
Connect(port=50227) failed for backend bk-noports, socket not fully closed, RETRYING... (max_attempts = 1)
Connect(port=50116) failed for backend bk-noports: no free ports.

When running #parallel wgets == #source-port-range, the following messages
were printed, but none of the connects failed (though they could fail if
the ports were completely exhausted, e.g. max_attempts = 0):

Connect(port=50241) failed for backend bk-noports, socket not fully closed, RETRYING... (max_attempts = 2)
Connect(port=50226) failed for backend bk-noports, socket not fully closed, RETRYING... (max_attempts = 2)
---
 include/proto/port_range.h |  18 ++
 include/proto/proto_tcp.h  |  19 ++
 src/proto_tcp.c| 496 +++--
 3 files changed, 338 insertions(+), 195 deletions(-)

diff --git a/include/proto/port_range.h b/include/proto/port_range.h
index 8c63fac..7a64caa 100644
--- a/include/proto/port_range.h
+++ b/include/proto/port_range.h
@@ -55,6 +55,24 @@ static inline void port_range_release_port(struct port_range *range, int port)
 		range->put = 0;
 }
 
+/*
+ * Return the maximum number of ports that can be used to attempt a connect().
+ * This is to handle the following problem:
+ *	- haproxy closes a port (kernel does TCP close on socket).
+ *	- haproxy allocates the same port.
+ *	- haproxy attempts to connect, but fails as the kernel has not
+ *	  finished the close.
+ *
+ * To handle this, we attempt to connect() at most 'range->avail' times,
+ * as this guarantees a different free port each time. Beyond 'avail', we
+ * would recycle the same ports, which are likely to fail again, and hence
+ * is not useful. The caller must ensure that range is not NULL.
+ */
+static inline int port_range_avail(struct port_range *range)
+{
+	return range->avail;
+}
+
 /* return a new initialized port range of N ports. The ports are not
  * filled in, it's up to the caller to do it.
  */
diff --git a/include/proto/proto_tcp.h b/include/proto/proto_tcp.h
index 13d7a78..05c2b0e 100644
--- a/include/proto/proto_tcp.h
+++ b/include/proto/proto_tcp.h
@@ -40,6 +40,25 @@ int tcp_drain(int fd);
 /* Export some samples. */
 int smp_fetch_src(const struct arg *args, struct sample *smp, const char *kw, void *private);
 
+
+/*
+ * The maximum number of attempts to try to bind to a free source port. This
+ * is required in case some other process has bound to the same IP/port#.
+ */
+#define MAX_BIND_ATTEMPTS		10
+
+/*
+ * The maximum number of attempts to connect to a server. This is
+ * required when the haproxy configuration file contains a directive to
+ * bind to a source IP and port range. In this case, haproxy selects a
+ * port that it thinks is free, binds to it (which works even if the
+ * socket was not fully closed, due to SO_REUSEADDR), but fails in
+ * connect() as the socket tuple may not be fully closed in the kernel,
+ * e.g. it may be in LAST-ACK state. These retries try to avoid getting
+ * an EADDRNOTAVAIL error during the socket connect.
+ */
+#define MAX_CONNECT_ATTEMPTS		3
+
 #endif /* _PROTO_PROTO_TCP_H */
 
 /*
diff --git a/src/proto_tcp.c b/src/proto_tcp.c
index 4741651..a09fd66 100644
--- a/src/proto_tcp.c
+++ b/src/proto_tcp.c
@@ -244,6 +244,139 @@ static int create_server_socket(struct connection *conn)
 }
 
 /*
+ * This internal function should not be called directly.