Good weekend, everyone!
Let me start by describing the problem and then move on to the proposed 
solution.

Problem:
Currently we have a number of PoPs (Points-of-Presence) around the world with 
Linux/nginx doing TCP/TLS/HTTP termination. There we re-encrypt traffic and 
proxy_pass it to an upstream block with a HUGE set of servers. The whole idea of 
those PoP nginxes is to maintain a pool of keepalive connections with enormous 
TCP windows to the upstreams.
In reality, though, we cannot use any of nginx’s connection balancing methods, 
because they almost never reuse connections: our upstream list is so large that 
any given pick almost never lands on a peer that already has a cached 
connection. On top of that, each worker has its own keepalive pool, which makes 
the situation even worse. Of course we could generate per-server config files 
and give each server in each PoP a different (and small) set of upstream 
servers, but that solution sounds awfully “clunky”.

Solution:
IPVS, for example, has among its numerous job scheduling modes Locality-Based 
Least-Connection Scheduling [1], which looks quite close to what we want. The 
only problem is that if all the worker processes on all our boxes around the 
world used the same list of upstreams, they would quickly overload the first 
upstream, then the second, and so on. Therefore I’ve added a randomized mode in 
which each worker starts filling upstreams from some random starting point. That 
should give good locality for TCP connection reuse and, as the law of large 
numbers implies, good enough load distribution across upstreams globally.
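
To make the “random starting point” part concrete, here is a simplified 
standalone C sketch of how a worker could pick its offset into the shared peer 
list at startup (illustrative only; the seeding scheme and names here are not 
taken from the actual patch):

    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    /* One offset per worker process: every worker (on every box) starts
     * filling connections at a different place in the shared upstream list,
     * so the first upstreams in the list are not hammered by everyone. */
    static size_t worker_start_offset;

    static void
    init_worker_offset(size_t n_upstreams)
    {
        /* seed per worker, e.g. from pid and time (an assumption here,
         * not necessarily what the PoC does) */
        srandom((unsigned) getpid() ^ (unsigned) time(NULL));

        worker_start_offset = (size_t) random() % n_upstreams;
    }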

Implementation:
PoC:
        coloured: https://gist.github.com/SaveTheRbtz/d6a505555cd02cb6aee6
        raw: https://gist.githubusercontent.com/SaveTheRbtz/d6a505555cd02cb6aee6/raw/5aba3b0709777d2a6e99217bd3e06e2178846dc4/least_conn_locality_randomized.diff

It basically tries to find the first not-fully-loaded peer (starting from a 
per-worker random offset in the randomized variant), and if that fails it falls 
back to normal least_conn.
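
For clarity, the same selection logic as a standalone C model (the peer struct, 
the “cap” threshold and the function name below are illustrative assumptions; 
the real diff of course hooks into nginx’s least_conn balancer):

    #include <stddef.h>

    typedef struct {
        unsigned conns;   /* currently active connections to this peer */
        unsigned cap;     /* threshold at which we call it "fully loaded" */
    } peer_t;

    /*
     * Starting from the per-worker random offset, take the first peer that
     * is not fully loaded: this keeps each worker on the same few peers, so
     * keepalive connections actually get reused.  If every peer is at its
     * cap, fall back to plain least_conn, i.e. the peer with the fewest
     * active connections.
     */
    static size_t
    pick_peer(const peer_t *peers, size_t n, size_t start)
    {
        size_t i, idx, best = 0;

        for (i = 0; i < n; i++) {
            idx = (start + i) % n;
            if (peers[idx].conns < peers[idx].cap) {
                return idx;
            }
        }

        for (i = 1; i < n; i++) {
            if (peers[i].conns < peers[best].conns) {
                best = i;
            }
        }

        return best;
    }

In the randomized variant “start” is the per-worker offset; without it the scan 
simply begins at the first peer in the list.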

Followup questions:
Does anyone in the community have similar use cases? CloudFlare maybe?
Is Nginx Inc interested in incorporating a patch like that, or is it too 
specific to our workflow? Should I prettify the PoC, or should I just throw the 
ball your way?

Alternative solution:
The original upstream keepalive module [2] had a “single” keyword that would 
also suit our needs, though it was removed. To quote Maxim Dounin on why:
        The original idea was to optimize edge cases in case of interchangeable
        backends, i.e. don't establish a new connection if we have any one
        cached.  This causes more harm than good though, as it screws up
        underlying balancer's idea about backends used and may result in
        various unexpected problems.

[1] http://kb.linuxvirtualserver.org/wiki/Locality-Based_Least-Connection_Scheduling
[2] http://mdounin.ru/hg/ngx_http_upstream_keepalive/
