Lukas, Thomas, thank you so much for your insights and advice.
Here's the update so far:
1) While there are considerably more TIME_WAIT connections with
tw_recycle = 0 (I turned it off on all 3 servers),
it doesn't seem to affect performance so far. I'll check again later
when the traffic peaks, but for now it seems fine.
2) Indeed, the network connection was saturated during the test, and
that explains the 5% of requests that were taking 3+ seconds on the live
server, while this was not happening on the test server.
Thank you for pointing this out; it looks like I may need a 1 Gbit
connection :D
3) Not sure if it is related: I noticed that on the frontend some 10% of
the responses are 4xx, while on the backend there are hardly any.
Is there an easy way to find out exactly what generates these 4xx errors?
Thank you,
Alex