On Thu, 2010-04-01 at 11:23 +1300, Tim Beale wrote:
> Hi Steve,
>
> End-to-end flow control is something I'd really love to see. It sounds like
> your proposal won't fix all the problems we're seeing with flow control
> though.
>
> A problem we've seen is a kind of permanent congestion - the receiver gets
> a burst of several hundred CPG messages queued up and never really
> recovers. The sender continues sending enough CPG messages that the
> receiver never clears out its queue, but doesn't run out of memory either.
> The receiver's queue can hover in this state indefinitely. On our setup, a
> healthcheck mechanism detects that the receiver has locked up (some
> operations are blocking due to flow control congestion) and eventually
> restarts the process.
> (As an interim workaround on our setup, I fudged the token backlog
> calculation to gradually force the sender to back off, so the sender's
> totem message queue fills up and it starts getting TRY_AGAIN errors.)
>
> I was wondering whether end-to-end flow control at the CPG group level is a
> possible/feasible option that would solve both this case and the OOM one?
> E.g. the CPG library code could send an internal message to notify the rest
> of the CPG group whenever the flow control status for an application
> changes?
Tim,

I attempted the proposal I suggested, and it turns out that sending the
message really doesn't fix the memory allocation problem (the system still
OOMs), because messages can be out of date with respect to the current flow
control state.

Angus and I have brainstormed this problem for quite some time (over 6
months..:), and I rediscovered a patch he created a long time ago, but I
wasn't sure I wanted to introduce it. Essentially his patch holds on to the
token when a node's IPC queues are congested. I don't like this, though,
because the token is used for recovery and healthchecking, so altering its
behavior is problematic.

I did take this idea as the basis for some work in one of my dev trees.
Here's a brief rundown of how it works currently: coroipcs keeps a count of
how much memory is allocated by dispatch messages. If the memory allocated
grows greater than a maximum (currently 128MB), coroipcs tells totem to muck
with the flow control parameters in the token, which stops regular messages
from being ordered. If the node's memory allocation drops below a minimum
(64MB), coroipcs tells totem to stop mucking with the flow control values.
This mechanism allows us to limit corosync to a specific memory footprint
(since coroipcs dispatch messages are one of the few allocators of memory
during normal runtime).

To correct for applications that block for too long (usually because they
have failed in some kind of deadlock), I am planning for each IPC connection
to register a timeout value, at which point the IPC connection will be
terminated (the application gets back ERR_LIBRARY).

Could you send me your backlog backoff calculation code (or, preferably, a
tarball of your source tree)? I'd like to see what you have.

Also, I found some type of lockup bug in IPC dispatch under really heavy
load that is fixed in my current rework. I would send you a patch, but it's
against a really old version and I haven't rebased the work against current
trunk yet.
Regards
-steve

> Regards,
> Tim
> _______________________________________________
> Openais mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/openais
