I've been experimenting with Receive Side Scaling (RSS) for a TCP proxy 
application. The basic idea with RSS is that by configuring the NICs, kernel, 
and application to use the same CPU for a given socket, cross-CPU locking and 
communication are eliminated or at least significantly reduced. On my system, 
configuring RSS allowed me to handle about three times as many sessions before 
reaching CPU saturation, with the remaining bottleneck seeming to be kernel 
processing around socket creation and closing which requires cross cpu 

Aligning the incoming sockets is very simple: setting a socket option 
(IP_RSS_LISTEN_BUCKET) on the listen socket restricts the accepted sockets to 
that bucket, and that's straightforward to add to the TCP listener code, and 

Aligning outgoing sockets is trickier -- there's no kernel help with a socket 
option or otherwise; an application has to run the hash (Toeplitz) on the 
4-tuple of {local IP, local port, remote IP, remote port} and only use an 
outgoing port if the hash matches. I've had trouble finding a good approach to 
handle this.
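
For reference, the per-port check can be sketched as below. This is a minimal
sketch in Python, not code from my patches. Assumptions: the key is the widely
published 40-byte sample key from Microsoft's RSS documentation (the real key
must be whatever the NIC and kernel are configured with), the input is a
TCP/IPv4 tuple in the order seen on received packets, and the bucket is taken
as hash modulo the bucket count (real systems map the low hash bits through an
indirection table). The names toeplitz_hash and rss_bucket are hypothetical.

```python
import socket
import struct

# Assumption: sample key from Microsoft's RSS documentation. Replace with
# the key your NIC/kernel actually uses.
RSS_KEY = bytes.fromhex(
    "6d5a56da255b0ec24167253d43a38fb0"
    "d0ca2bcbae7b30b477cb2da38030f20c"
    "6a42b73bbeac01fa"
)


def toeplitz_hash(key: bytes, data: bytes) -> int:
    """Return the 32-bit Toeplitz hash of data under key."""
    if len(key) < len(data) + 4:
        raise ValueError("key must be at least 4 bytes longer than the input")
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    result = 0
    for i, byte in enumerate(data):
        for bit in range(8):
            if byte & (0x80 >> bit):
                # XOR in the 32-bit window of the key that starts at the
                # same bit offset as this set input bit.
                shift = key_bits - 32 - (i * 8 + bit)
                result ^= (key_int >> shift) & 0xFFFFFFFF
    return result


def rss_bucket(local_ip, local_port, remote_ip, remote_port, nbuckets,
               key=RSS_KEY):
    """Bucket for a TCP/IPv4 connection, hashed as the NIC sees inbound
    packets: source is the remote end, destination is the local end.
    Assumption: bucket = hash % nbuckets."""
    data = (socket.inet_aton(remote_ip) + socket.inet_aton(local_ip)
            + struct.pack("!HH", remote_port, local_port))
    return toeplitz_hash(key, data) % nbuckets
```

A candidate source port would be accepted only when rss_bucket(...) for the
resulting 4-tuple equals the bucket of the thread or process that will own the
connection.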

The simplest thing would be to run the hash when a port is assigned by 
port_range and return the port if it hashes to the wrong bucket; but if you've 
already used all the acceptable ports for that port range, you spend a lot of 
time hashing the ports that are still in the range, without making any progress.
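
To make the problem concrete, here is a rough sketch of that simple approach.
It is illustrative only: take_port_naive is a hypothetical name, the deque
stands in for the port_range ring, and bucket_of is any callable that maps a
port to its RSS bucket for the current 4-tuple.

```python
from collections import deque


def take_port_naive(free_ports, bucket_of, want_bucket):
    """Pop ports until one maps to want_bucket; mismatches go back."""
    for _ in range(len(free_ports)):
        port = free_ports.popleft()
        if bucket_of(port) == want_bucket:
            return port
        free_ports.append(port)  # wrong bucket: return it to the range
    # Every remaining port was hashed once and none matched.
    return None
```

Once the matching ports are all in use, every call hashes each port left in
the range and still comes back empty-handed, which is the wasted work
described above.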

If you have a port range per RSS bucket, you could hash on port assignment, and 
not return the ports in case they hash to a wrong bucket; but in the case that 
the remote IP changes because you've configured it to use DNS or if you change 
the IP via "set server addr", the previously computed hashes are no longer 
valid -- you would really want to try all the ports again.

What I ended up with was a lock on port ranges (instead of atomics as used in 
07425de71777b688e77a9c70a7088c13e66e41e9 BUG/MEDIUM: port_range: Make the ring 
buffer lock-free), adding a revision counter to the port range, and resetting 
the port range whenever the server IP changed. To avoid running the hash during 
steady state, and because checking all the ports at once when the range needs 
to be filled is expensive, I also made port range filling incremental.
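
Roughly, the shape of that design is sketched below. This is a simplified
Python illustration and not my actual patch: BucketPortRange, fill_step, and
the method names are all hypothetical, and bucket_of is assumed to be a
callable taking (remote address, port) and returning an RSS bucket.

```python
import threading
from collections import deque


class BucketPortRange:
    """Hypothetical per-bucket port range; not HAProxy's port_range API."""

    def __init__(self, low, high, bucket_of, want_bucket, fill_step=64):
        self.low, self.high = low, high
        self.bucket_of = bucket_of      # callable: (remote, port) -> bucket
        self.want_bucket = want_bucket
        self.fill_step = fill_step      # max ports hashed per fill pass
        self.lock = threading.Lock()    # plain lock instead of atomics
        self.revision = 0
        self.remote = None
        self.free = deque()
        self.next_unscanned = low

    def reset(self, remote):
        """Server address changed: old hashes are invalid, start over."""
        with self.lock:
            self.revision += 1
            self.remote = remote
            self.free.clear()
            self.next_unscanned = self.low

    def _fill_some(self):
        # Caller holds the lock. Hash at most fill_step unscanned ports.
        end = min(self.next_unscanned + self.fill_step, self.high + 1)
        for port in range(self.next_unscanned, end):
            if self.bucket_of(self.remote, port) == self.want_bucket:
                self.free.append(port)
        self.next_unscanned = end

    def get(self):
        """Return (port, revision), or None when the range is exhausted."""
        with self.lock:
            while not self.free and self.next_unscanned <= self.high:
                self._fill_some()
            if not self.free:
                return None
            return self.free.popleft(), self.revision

    def put(self, port, revision):
        """Release a port; ports from an older revision are dropped."""
        with self.lock:
            if revision == self.revision:
                self.free.append(port)
```

In steady state, get() and put() only move already-validated ports around, so
no hashing happens. The sketch does not track ports still held by connections
to the old address after a reset; a real implementation has to handle that.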

This approach works, but it feels complicated, and it made my config much more 
verbose: I had to duplicate my frontend sections, one for each RSS bucket, 
each of which sends to a corresponding duplicated backend for that bucket; the 
backends had additional configuration to indicate the RSS bucket (and the 
number of buckets). Incidentally, because each RSS bucket has a distinct set of 
ports, and because my use case doesn't use any features which benefit from 
coordination within HAProxy (such as stick tables, etc.), this makes it 
possible to run in process mode rather than threaded mode without running into 
a lot of "port already in use" warnings/errors that would happen otherwise when 
sharing a port range.

If it's helpful for the discussion, I can share my patches as-is, but if there 
are better ideas on how to structure this, I'd rather try to get the changes 
done in a nice way before sharing.


  Richard Russo
