RFC: receive side scaling, need help with approach to port ranges

2019-07-16 Thread Richard Russo
Here are my current patches for comments.

-- 
  Richard Russo
  to...@enslaves.us

Attachments:
  0001-Allow-for-binding-listen-sockets-to-a-provided-RSS-b.patch
  0002-Revert-BUG-MEDIUM-port_range-Make-the-ring-buffer-lo.patch
  0003-add-port_range-locking-to-protect-against-concurrent.patch
  0004-refill-port-ranges-when-addresses-change.patch
  0005-Allow-for-RSS-aligned-port-selection-for-outgoing-co.patch

receive side scaling, need help with approach to port ranges

2019-07-05 Thread Richard Russo
Hi,

I've been experimenting with Receive Side Scaling (RSS) for a TCP proxy 
application. The basic idea with RSS is that by configuring the NICs, kernel, 
and application to use the same CPU for a given socket, cross-CPU locking and 
communication is eliminated or at least significantly reduced. On my system, 
configuring RSS allowed me to handle about three times as many sessions before 
reaching CPU saturation. The remaining bottleneck seems to be kernel processing 
around socket creation and closing, which requires cross-CPU coordination.

Aligning the incoming sockets is very simple: setting a socket option 
(IP_RSS_LISTEN_BUCKET) on the listen socket restricts the accepted sockets to 
that bucket, and that's straightforward to add to the TCP listener code and 
configuration.

Aligning outgoing sockets is trickier -- there's no kernel help with a socket 
option or otherwise. An application has to run the hash (Toeplitz) on the 
4-tuple of {local ip, local port, remote ip, remote port} and only use an 
outgoing port if the hash matches. I've had trouble finding a good approach to 
handle this.
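
For illustration only (this is not taken from the patches), here is a minimal 
Python sketch of the hash-and-check step for IPv4/TCP. Everything named here is 
an assumption: RSS_KEY is a placeholder and would have to be the key the NIC 
and kernel actually use; the input ordering assumes the common convention of 
hashing a received packet's source address, destination address, source port, 
and destination port (so the remote end comes first); and the bucket is assumed 
to be the low bits of the hash with a power-of-two bucket count.

```python
import socket
import struct

# Placeholder 40-byte key (assumption). The real key must be the one the NIC
# and kernel are configured with, otherwise the computed buckets are wrong.
RSS_KEY = bytes(range(1, 41))


def toeplitz_hash(key: bytes, data: bytes) -> int:
    """32-bit Toeplitz hash of data; requires len(key) >= len(data) + 4."""
    if len(key) < len(data) + 4:
        raise ValueError("key too short for this input")
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    result = 0
    for i in range(len(data) * 8):
        # Test bit i of the input, most significant bit first.
        if data[i // 8] & (0x80 >> (i % 8)):
            # XOR in the 32-bit window of the key that starts at bit i.
            result ^= (key_int >> (key_bits - 32 - i)) & 0xFFFFFFFF
    return result


def rss_input(local_ip: str, local_port: int,
              remote_ip: str, remote_port: int) -> bytes:
    """Hash input as seen on received packets: the remote end is the source."""
    return (socket.inet_aton(remote_ip) + socket.inet_aton(local_ip)
            + struct.pack("!HH", remote_port, local_port))


def rss_bucket(local_ip: str, local_port: int, remote_ip: str,
               remote_port: int, nbuckets: int, key: bytes = RSS_KEY) -> int:
    """Bucket for a connection, assuming bucket = low bits of the hash."""
    if nbuckets <= 0 or nbuckets & (nbuckets - 1):
        raise ValueError("nbuckets must be a power of two")
    data = rss_input(local_ip, local_port, remote_ip, remote_port)
    return toeplitz_hash(key, data) & (nbuckets - 1)
```

A candidate source port would then be usable for a given bucket only when 
rss_bucket(local_ip, port, remote_ip, remote_port, nbuckets) equals that 
bucket.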

The simplest thing would be to run the hash when a port is assigned by 
port_range and put the port back in the range if it hashes to the wrong bucket; 
but if you've already used all the acceptable ports for that port range, you 
spend a lot of time hashing the ports that are still in the range without 
making any progress.
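
The cost of that simplest approach can be seen in a small Python sketch. It is 
only an illustration: the ring is a plain deque rather than the real 
port_range structure, take_port_naive is a hypothetical name, and bucket_of is 
a CRC32-based stand-in for the Toeplitz bucket computation, used just to keep 
the example short.

```python
import zlib
from collections import deque


def bucket_of(port: int, remote: tuple, nbuckets: int) -> int:
    # Stand-in for the real Toeplitz-based bucket computation (assumption).
    data = f"{port}:{remote[0]}:{remote[1]}".encode()
    return zlib.crc32(data) % nbuckets


def take_port_naive(ring: deque, remote: tuple, want: int, nbuckets: int):
    """Try each free port once; return (port or None, hashes computed)."""
    hashes = 0
    for _ in range(len(ring)):
        port = ring.popleft()
        hashes += 1
        if bucket_of(port, remote, nbuckets) == want:
            return port, hashes
        ring.append(port)  # wrong bucket: hand it straight back to the range
    return None, hashes


ring = deque(range(10000, 10064))
remote = ("192.0.2.10", 443)
taken = []
while True:
    port, hashes = take_port_naive(ring, remote, want=0, nbuckets=4)
    if port is None:
        break
    taken.append(port)
# The failing call hashed every port left in the ring and found nothing.
wasted = hashes
```

Once the matching ports are all in use, every further attempt hashes the whole 
remaining ring (wasted equals len(ring) here) and still returns nothing.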

If you have a port range per RSS bucket, you could hash on port assignment and 
not return the ports that hash to a wrong bucket; but if the remote IP changes, 
because you've configured the server to use DNS or because you change the IP 
via "set server addr", the previously computed hashes are no longer valid -- 
you would really want to try all the ports again.

What I ended up with was a lock on port ranges (instead of atomics as used in 
07425de71777b688e77a9c70a7088c13e66e41e9 BUG/MEDIUM: port_range: Make the ring 
buffer lock-free), adding a revision counter to the port range, and resetting 
the port range whenever the server IP changed. To avoid running the hash during 
steady state, and because checking all the ports whenever the range needs to be 
filled is expensive, I also made port range filling incremental.
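
To make the moving parts concrete, here is a rough Python sketch of a locked 
port range with a revision counter and incremental filling. It is not the code 
from the patches: the class, its methods, the batch size, and the in_use set 
are all hypothetical, and bucket_of is again a CRC32-based stand-in for the 
Toeplitz bucket computation.

```python
import threading
import zlib
from collections import deque


def bucket_of(port: int, remote: tuple, nbuckets: int) -> int:
    # Stand-in for the real Toeplitz-based bucket computation (assumption).
    return zlib.crc32(f"{port}:{remote[0]}:{remote[1]}".encode()) % nbuckets


class PortRange:
    def __init__(self, low, high, remote, bucket, nbuckets, batch=16):
        self.low, self.high = low, high
        self.remote = remote            # (ip, port) of the server
        self.bucket, self.nbuckets = bucket, nbuckets
        self.batch = batch              # max ports hashed per fill step
        self.lock = threading.Lock()
        self.revision = 0               # bumped on every address change
        self.free = deque()             # ports known to match the bucket
        self.in_use = set()
        self.cursor = low               # next port not yet hashed

    def _matches(self, port):
        return bucket_of(port, self.remote, self.nbuckets) == self.bucket

    def set_remote(self, remote):
        """Address changed: old hashes are invalid, so start scanning again."""
        with self.lock:
            self.remote = remote
            self.revision += 1
            self.free.clear()
            self.cursor = self.low

    def _fill_some(self):
        # Hash only the next batch of candidates instead of the whole range.
        end = min(self.cursor + self.batch, self.high + 1)
        for port in range(self.cursor, end):
            if port not in self.in_use and self._matches(port):
                self.free.append(port)
        self.cursor = end

    def take(self):
        """Return (port, revision), or None when the range is exhausted."""
        with self.lock:
            while not self.free and self.cursor <= self.high:
                self._fill_some()
            if not self.free:
                return None
            port = self.free.popleft()
            self.in_use.add(port)
            return port, self.revision

    def release(self, port, revision):
        with self.lock:
            self.in_use.discard(port)
            if revision == self.revision:
                self.free.append(port)  # hash still valid, no rehash needed
            elif port < self.cursor and self._matches(port):
                self.free.append(port)  # stale, already scanned past: rehash
            # Otherwise a later fill step will reach this port by itself.
```

In steady state, take() and release() only move ports in and out of the free 
queue; hashing happens when the queue runs dry or when a port from an older 
revision comes back after the scan has already passed it.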

This approach works, but it feels complicated, and it made my config much more 
verbose -- I had to duplicate my frontend sections, one for each RSS bucket, 
each sending to a corresponding duplicated backend for that bucket; the 
backends had additional configuration to indicate the RSS bucket (and the 
number of buckets). Incidentally, because each RSS bucket has a distinct set of 
ports, and because my use case doesn't use any features which benefit from 
coordination within HAProxy (such as stick tables, etc.), this makes it 
possible to run in process mode rather than threaded mode without running into 
the many "port already in use" warnings/errors that would otherwise happen when 
sharing a port range.

If it's helpful for the discussion, I can share my patches as-is, but if there 
are better ideas on how to structure this, I'd rather try to get the changes 
done in a nice way before sharing.

Thanks!

-- 
  Richard Russo
  to...@enslaves.us