Hi Simon,

On Wed, Jun 08, 2011 at 09:55:21AM +0900, Simon Horman wrote:
(...)
> I should note that I'm not entirely sure about the performance implications
> of looping through all sessions like this. Perhaps a per-server list is
> needed.
That's exactly what I wanted to say in response to your mail :-)

I always refused to implement that until we have a per-server session list,
but I also know there is increasing demand for this feature due to
long-running sessions (eg: WebSockets). There are sites running with more
than 100k concurrent connections, and the cost of scanning the whole table
can be huge there. One could argue that a site running at 100k conns will
have decent hardware which will not take too long to scan the table though.

I think that what could be done is to document the feature as possibly very
expensive when servers go down or when servers are flapping. If we explain
all the risks in the doc, users will be able to check the impact for their
usage.

Another point is that a flapping server can definitely make a service
totally unusable, and that must be indicated too. You can check any stats
page you have access to, and consider that for each occurrence of a Down
event on a server, all of that server's users will be kicked out. Right now,
when checks have too short timeouts, it can happen that a server goes down
for a short period without much impact: it does not get any new connections
for a few seconds but still serves the current ones. With this feature, all
connections will be killed on each such event. Flapping servers are a common
thing in environments where servers are overloaded or support too few
concurrent connections. I think a warning about this in the doc is
absolutely necessary too.

> I'm also unsure of the performance implications of waking up a potentially
> large number of sessions.

Yes, that will cost a lot too. And it will flood the logs at extreme speed.
Consider that you're possibly sending 10-100k logs within a few hundred
milliseconds, which can amount to up to one million log messages per second.
Most likely, the time spent in sprintf() will noticeably extend the
processing time (eg: at about 100k messages per second, that means a freeze
of up to one second) and most of the logs will be lost.
That can be documented too :-)

> Perhaps their priority should be reduced?

That's a good idea. At first read I was against it because there is no way
to restore the priority afterwards, but since the tasks are dying anyway,
it's not an issue :-)

Well, maybe we should still consider the possibility of only killing the
connection on the server side. In the current state of affairs it will not
change anything, but once we later support server-side keep-alive and,
later, connection pools, it will make it easier to retry the request over
another connection.

Oh, BTW, I think you'd better use TASK_WOKEN_OTHER rather than
TASK_WOKEN_RES, because the latter is meant to indicate that a previously
unavailable resource has become available (which is why you see it in queue
processing).

Another point: please check that this works with servers put into
maintenance mode, as I'm not sure that code path calls set_server_down().

Thanks!
Willy

