Re: [PATCH] MINOR: enable IP_BIND_ADDRESS_NO_PORT on backend connections

2016-09-20 Thread Willy Tarreau
Hi Pavlos,

On Wed, Sep 14, 2016 at 11:01:36PM +0200, Pavlos Parissis wrote:
> In our setup, where haproxy in the PoPs forwards traffic to haproxy servers
> in the main data centers, I am planning to address the ephemeral port
> exhaustion symptom by having the frontends in the data centers listen on
> multiple IPs, so I can list the same server multiple times in the backend at
> the PoP.
> 
> backend data_center_haproxies
>   server1_on_ip1 1.1.1.1
>   server1_on_ip2 1.1.1.2
> 
> With our system inventory/Puppet infrastructure, assigning multiple IPs to
> servers at the PoP isn't that simple; I know it sounds weird.

Note that you can also make your servers listen on multiple ports, or use
multiple addresses on haproxy for this. I tend to prefer having multiple
ports because it multiplies the allocatable port ranges without adding IP
addresses anywhere.
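
As a rough illustration of why extra destination ports help, the sketch below
computes an upper bound on concurrent outgoing connections from the number of
distinct 4-tuples. It is not haproxy code: the function name is hypothetical,
and the bound only holds when the kernel can pick the source port at connect()
time, so that a source port may be shared across different destinations.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only: every connection needs a unique
 * (source IP, source port, destination IP, destination port) tuple, so the
 * upper bound is the product of the four counts. */
uint64_t max_outgoing_connections(uint64_t source_ips,
                                  uint64_t ports_per_source_ip,
                                  uint64_t dest_ips,
                                  uint64_t dest_ports)
{
    /* A TCP port number is 16 bits wide, so neither count can exceed 65535. */
    assert(ports_per_source_ip <= 65535 && dest_ports <= 65535);
    return source_ips * ports_per_source_ip * dest_ips * dest_ports;
}
```

Assuming an ephemeral range of 28232 ports (32768-60999, used here only as an
example figure), max_outgoing_connections(1, 28232, 1, 1) gives 28232, while
listening on four ports on the server side raises the bound to 112928 without
adding any IP address.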

Another point to note is that if you're running out of source ports due
to idle keep-alive connections between haproxy and the servers, you can
enable http-reuse to significantly improve the situation. It will also
remove one round-trip for the connect() and will reduce the memory usage
on the server side, so there are benefits everywhere.

Regards,
Willy



Re: [PATCH] MINOR: enable IP_BIND_ADDRESS_NO_PORT on backend connections

2016-09-14 Thread Pavlos Parissis
On 14/09/2016 06:26 PM, Lukas Tribus wrote:
> Hi Pavlos,
> 
> 
> On 14.09.2016 at 16:01, Pavlos Parissis wrote:
>> The Linux kernel commit mentions:
>> """
>> The port will be automatically chosen at connect() time, in a way
>> that allows sharing a source port as long as the 4-tuples are unique.
>> """
>> 
>> This confused me a bit, as it says that the same source port can be used as
>> long as the 4-tuples are unique, which implies that without this option we
>> cannot have the following two sockets:
>> 
>> 2.2.2.2 + 3232 + 1.1.1.1 + 80
>> 3.3.3.3 + 3232 + 1.1.1.1 + 80
>> 
>> My understanding is that the ephemeral port limit of 65K is per unique
>> socket and not across all possible sockets.
>> 
>> Am I missing something here?
> 
> 
> You are right, but when the application has to force a source IP, it has to
> call bind() before connect(). In this case, even though the bind() call
> (port 0) lets the kernel decide which source port to use, the kernel does not
> know the destination IP and port yet (they are only passed to the kernel in
> the subsequent connect() call). But bind() has to pick a source port
> immediately, which means it needs to assign a source port that is not used
> for anything else, because the destination IP and port are unknown at this
> point.
> 
> So because the kernel, when having to decide which free port to use, does not
> have the destination IP and destination port information, it cannot base its
> decision on the 4-tuple, as only the source IP is known at this point.
> 
> The IP_BIND_ADDRESS_NO_PORT option tells the kernel to delay the source-port
> assignment until the connect() call provides the kernel with the rest of the
> information (destination IP and port), so the decision can be based on the
> full tuple.
> 
> 

Got it.

> But this is only a problem when the application has to bind to a source IP,
> because only in that case is the bind() call necessary. If you just connect
> without a specific source IP, you don't need this.
> 
> 
> So speaking in haproxy configuration terms, the following configuration would
> benefit from this new feature:
> 
> server s1 10.0.0.3:80 source 10.0.0.55
> 
> While without "source" keyword:
> server s1 10.0.0.3:80
> 
> or a source port pool handled by haproxy instead of the kernel:
> server s1 10.0.0.3:80 source 10.0.0.55:2-3
> 
> would not benefit from this feature.
> 
> 

In our setup, where haproxy in the PoPs forwards traffic to haproxy servers in
the main data centers, I am planning to address the ephemeral port exhaustion
symptom by having the frontends in the data centers listen on multiple IPs, so
I can list the same server multiple times in the backend at the PoP.

backend data_center_haproxies
  server1_on_ip1 1.1.1.1
  server1_on_ip2 1.1.1.2

With our system inventory/Puppet infrastructure, assigning multiple IPs to
servers at the PoP isn't that simple; I know it sounds weird.

Thanks a lot for this very detailed explanation; it is very much appreciated.
Maybe some of it could be added to the doc.

Cheers,
Pavlos





Re: [PATCH] MINOR: enable IP_BIND_ADDRESS_NO_PORT on backend connections

2016-09-14 Thread Lukas Tribus


On 14.09.2016 at 18:26, Lukas Tribus wrote:
> would not benefit from this feature.

This should have been "it is not even necessary" instead of "no benefit".


Lukas



Re: [PATCH] MINOR: enable IP_BIND_ADDRESS_NO_PORT on backend connections

2016-09-14 Thread Lukas Tribus

Hi Pavlos,


On 14.09.2016 at 16:01, Pavlos Parissis wrote:
> The Linux kernel commit mentions:
> """
> The port will be automatically chosen at connect() time, in a way
> that allows sharing a source port as long as the 4-tuples are unique.
> """
> 
> This confused me a bit, as it says that the same source port can be used as
> long as the 4-tuples are unique, which implies that without this option we
> cannot have the following two sockets:
> 
> 2.2.2.2 + 3232 + 1.1.1.1 + 80
> 3.3.3.3 + 3232 + 1.1.1.1 + 80
> 
> My understanding is that the ephemeral port limit of 65K is per unique socket
> and not across all possible sockets.
> 
> Am I missing something here?



You are right, but when the application has to force a source IP, it has
to call bind() before connect(). In this case, even though the bind()
call (port 0) lets the kernel decide which source port to use, the kernel
does not know the destination IP and port yet (they are only passed to
the kernel in the subsequent connect() call). But bind() has to pick a
source port immediately, which means it needs to assign a source port
that is not used for anything else, because the destination IP and port
are unknown at this point.


So because the kernel, when having to decide which free port to use,
does not have the destination IP and destination port information, it
cannot base its decision on the 4-tuple, as only the source IP is known
at this point.


The IP_BIND_ADDRESS_NO_PORT option tells the kernel to delay the
source-port assignment until the connect() call provides the kernel with
the rest of the information (destination IP and port), so the decision
can be based on the full tuple.



But this is only a problem when the application has to bind to a source 
IP, because only in that case the bind() call is necessary. If you just 
connect without a specific source IP, you don't need this.



So speaking in haproxy configuration terms, the following configuration 
would benefit from this new feature:

server s1 10.0.0.3:80 source 10.0.0.55

While without "source" keyword:
server s1 10.0.0.3:80

or a source port pool handled by haproxy instead of the kernel:
server s1 10.0.0.3:80 source 10.0.0.55:2-3

would not benefit from this feature.


So unless you have the "source" keyword configured in your 
backends/servers, you don't have to worry about this.
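
The difference in bind() behaviour described above can be observed with a
small standalone program. The following is a minimal sketch, not haproxy
code: the helper name port_after_bind is hypothetical, it assumes a Linux
system, and the fallback value 24 is the one given in the patch description
for a libc that does not define the option yet.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fallback for a libc older than 2.23; 24 is the value quoted in the patch. */
#ifndef IP_BIND_ADDRESS_NO_PORT
#define IP_BIND_ADDRESS_NO_PORT 24
#endif

/* Hypothetical helper: bind a TCP socket to src_ip with port 0 and return the
 * source port reported by getsockname() right after bind(), i.e. before any
 * connect(). Returns -1 on any error, including a kernel that rejects the
 * option. */
int port_after_bind(const char *src_ip, int delay_port_choice)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int one = 1;
    int port = -1;
    int fd;

    assert(src_ip != NULL);

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(0);                /* port 0: kernel chooses */
    if (inet_pton(AF_INET, src_ip, &addr.sin_addr) != 1)
        goto out;

    /* Ask the kernel to postpone the source-port choice until connect(). */
    if (delay_port_choice &&
        setsockopt(fd, IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT,
                   &one, sizeof(one)) < 0)
        goto out;

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;
    if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0)
        goto out;

    port = ntohs(addr.sin_port);
out:
    close(fd);
    return port;
}
```

Calling port_after_bind("127.0.0.1", 0) should report a non-zero port, because
bind() reserved one immediately. With the second argument set to 1 on Linux
4.2 or later, the reported port is expected to still be 0, since the choice is
left to the later connect() call.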



Lukas




Re: [PATCH] MINOR: enable IP_BIND_ADDRESS_NO_PORT on backend connections

2016-09-14 Thread Pavlos Parissis
On 13/09/2016 11:51 AM, Lukas Tribus wrote:
> Enable IP_BIND_ADDRESS_NO_PORT on backend connections when the source
> address is specified without port or port ranges. This is supported
> since Linux 4.2/libc 2.23.
> 
> 

I am going to hijack this thread to ask something related to ephemeral port
exhaustion when HAProxy opens connections to servers.

A single haproxy process can open up to 65K connections to a single server
since those 65K connections are unique quadruple combinations of
source port + source IP + dst IP + dst port.

If we want more connections to the same dst IP, then we need more source IPs.

What improvements in the context of ephemeral port exhaustion does this new bind
option bring?

The Linux kernel commit mentions:
"""
The port will be automatically chosen at connect() time, in a way
that allows sharing a source port as long as the 4-tuples are unique.
"""

This confused me a bit, as it says that the same source port can be used as
long as the 4-tuples are unique, which implies that without this option we
cannot have the following two sockets:

2.2.2.2 + 3232 + 1.1.1.1 + 80
3.3.3.3 + 3232 + 1.1.1.1 + 80

My understanding is that the ephemeral port limit of 65K is per unique socket
and not across all possible sockets.

Am I missing something here?
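
For illustration, the two sockets listed above can be compared as 4-tuples.
This is only a sketch with hypothetical type and function names; it checks
tuple equality and says nothing about how the kernel reserves ports at bind()
time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical type: the 4-tuple that identifies a TCP connection. */
struct four_tuple {
    const char *src_ip;
    uint16_t    src_port;
    const char *dst_ip;
    uint16_t    dst_port;
};

/* Two connections collide only if all four fields are equal. Addresses are
 * compared as text, which is enough for this illustration. */
bool tuples_conflict(const struct four_tuple *a, const struct four_tuple *b)
{
    assert(a != NULL && b != NULL);
    return a->src_port == b->src_port &&
           a->dst_port == b->dst_port &&
           strcmp(a->src_ip, b->src_ip) == 0 &&
           strcmp(a->dst_ip, b->dst_ip) == 0;
}
```

With the values from the question, {2.2.2.2, 3232, 1.1.1.1, 80} and
{3.3.3.3, 3232, 1.1.1.1, 80} do not conflict even though they share source
port 3232, because the source IPs differ.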

Cheers,
Pavlos






Re: [PATCH] MINOR: enable IP_BIND_ADDRESS_NO_PORT on backend connections

2016-09-13 Thread Willy Tarreau
On Tue, Sep 13, 2016 at 09:51:15AM +, Lukas Tribus wrote:
> Enable IP_BIND_ADDRESS_NO_PORT on backend connections when the source
> address is specified without port or port ranges. This is supported
> since Linux 4.2/libc 2.23.
> 
> If the kernel supports it but the libc doesn't, we can define it at
> build time:
> make [...] DEFINE=-DIP_BIND_ADDRESS_NO_PORT=24
> 
> For more information about this feature, see Linux commit 90c337da

Merged, thank you Lukas!
willy



[PATCH] MINOR: enable IP_BIND_ADDRESS_NO_PORT on backend connections

2016-09-13 Thread Lukas Tribus
Enable IP_BIND_ADDRESS_NO_PORT on backend connections when the source
address is specified without port or port ranges. This is supported
since Linux 4.2/libc 2.23.

If the kernel supports it but the libc doesn't, we can define it at
build time:
make [...] DEFINE=-DIP_BIND_ADDRESS_NO_PORT=24

For more information about this feature, see Linux commit 90c337da
---
Testing was limited to strace, confirming that we only set it when we specify
the source IP, and only when the port is set to 0:

## no source address set ##
setsockopt(6, SOL_TCP, TCP_NODELAY, [1], 4) = 0
connect(6, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("10.0.0.3")}, 16) = -1 EINPROGRESS (Operation now in progress)

## source address set, port is non-zero  ##
setsockopt(6, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(6, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(6, {sa_family=AF_INET, sin_port=htons(2), sin_addr=inet_addr("10.0.0.55")}, 16) = 0
connect(6, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("10.0.0.3")}, 16) = -1 EINPROGRESS (Operation now in progress)

## source address set, port is zero (IP_BIND_ADDRESS_NO_PORT is 0x18) ##
setsockopt(6, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(6, SOL_IP, 0x18 /* IP_??? */, [1], 4) = 0
setsockopt(6, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(6, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.0.0.55")}, 16) = 0
connect(6, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("10.0.0.3")}, 16) = -1 EINPROGRESS (Operation now in progress)

---
 doc/configuration.txt | 3 +++
 src/proto_tcp.c       | 4 ++++
 2 files changed, 7 insertions(+)

diff --git a/doc/configuration.txt b/doc/configuration.txt
index 52e6cf4..dc43003 100644
--- a/doc/configuration.txt
+++ b/doc/configuration.txt
@@ -10936,6 +10936,9 @@ source <addr>[:<pl>[-<ph>]] [interface <name>] ...
   total concurrent connections. The limit will then reach 64k connections per
   server.
 
+  Since Linux 4.2/libc 2.23 IP_BIND_ADDRESS_NO_PORT is set for connections
+  specifying the source address without port(s).
+
   Supported in default-server: No
 
 ssl
diff --git a/src/proto_tcp.c b/src/proto_tcp.c
index 91d6688..424731a 100644
--- a/src/proto_tcp.c
+++ b/src/proto_tcp.c
@@ -467,6 +467,10 @@ int tcp_connect_server(struct connection *conn, int data, int delack)
 			} while (ret != 0); /* binding NOK */
 		}
 		else {
+#ifdef IP_BIND_ADDRESS_NO_PORT
+			static int bind_address_no_port = 1;
+			setsockopt(fd, SOL_IP, IP_BIND_ADDRESS_NO_PORT, (const void *) &bind_address_no_port, sizeof(int));
+#endif
 			ret = tcp_bind_socket(fd, flags, &src->source_addr, &conn->addr.from);
 			if (ret != 0)
 				conn->err_code = CO_ER_CANT_BIND;
-- 
1.9.1