Hi Simon,

On Wed, Jun 08, 2011 at 09:55:21AM +0900, Simon Horman wrote:
(...)
> I should note that I'm not entirely sure about the performance implications
> of looping through all sessions like this. Perhaps a per-server list is 
> needed.

That's exactly the point I wanted to raise in response to your mail :-)
I have always refused to implement that until we have a per-server session
list, but I also know there is increasing demand for this feature due to
long-running sessions (eg: WebSockets).

There are sites running with more than 100k concurrent connections, and
the cost of scanning the whole table can be huge there. One could argue
that a site running at 100k conns will have decent hardware which will
not take too long to scan the whole table, though.

I think that what could be done is to document the feature as possibly
very expensive when servers go down or when servers are flapping. If we
explain all the risks in the doc, users will be able to check the impact
for their usage.

Another point is that a flapping server can definitely make a service
totally unusable and that must be indicated too. You can check any stats
page you have access to, and consider that for each occurrence of a Down
event on a server, all of that server's users will be kicked out. Right now,
when checks have too short timeouts, it can happen that a server goes down
for a short period without much impact: it does not get any new connections
for a few seconds but still serves the current ones. With this feature, all
connections will be killed for each such event. Flapping servers are a
common thing in environments where servers are overloaded or support too
few concurrent connections. I think that a warning about this in the doc
is absolutely necessary too.

> I'm also unsure of the performance implications of waking up a potentially
> large number of sessions.

Yes it will cost a lot too. And it will flood the logs at extreme speed.
Think that you're possibly sending 10-100k logs within a few hundred
milliseconds, which can amount to up to one million log messages per second.
Most likely, the time spent in sprintf() will noticeably extend the
processing time (eg: at about 100k messages per second, that means a freeze
of up to one second) and most of the logs will be lost. That can be
documented too :-)

> Perhaps their priority should be reduced?

That's a good idea. At first read I was against this because there is no
way to restore the priority after that, but since the tasks are dying,
it's not an issue :-)

Well, maybe we should still consider the possibility of only killing the
connection on the server side. In the current state of affairs, it will not
change anything, but when we later support server-side keep-alive and,
later, connection pools, it will make it easier to retry the request over
another connection.

Oh, BTW, I think you'd better use TASK_WOKEN_OTHER rather than
TASK_WOKEN_RES, because the latter is meant to indicate that a previously
unavailable resource has become available (which is why you see it in queue
processing).

Another point: please check that it works with servers put into maintenance
mode; I'm not sure that code path calls set_server_down().

Thanks!
Willy
