Hi,
we've run across at least two scenarios in which the MCP becomes rather
unhappy.
The first one is where a network-facing send process gets stuck. This
happens, for example, when the NIC loses its carrier: the sendto() call
blocks, the IPC queue from the MCP to the child eventually fills up, and
the old code then busy-loops (should_send_block = TRUE) until the queue
has room again. It never does, so the MCP is terminally stuck.
My initial attempt was to set the MSG_DONTWAIT flag on the sendto()
call. That is probably alright for 1.2.x (against which the bug was
actually reported), but not for 2.x: our traffic pattern is very bursty,
so the socket buffers fill up very quickly, at which point a
non-blocking send simply fails.
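
To make the first attempt concrete, here is a rough sketch of what a
non-blocking send looks like. This is not the actual heartbeat code;
try_send_nowait() is a made-up helper name. It only shows the standard
behaviour of sendto() with MSG_DONTWAIT, which fails with EAGAIN or
EWOULDBLOCK instead of blocking when the socket buffer is full:

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper, not part of heartbeat.
 * Returns  1 if the packet was handed to the kernel,
 *          0 if it was dropped because the send would have blocked,
 *         -1 on any other error (errno is left set by sendto()). */
static int
try_send_nowait(int fd, const void *buf, size_t len,
                const struct sockaddr *to, socklen_t tolen)
{
    ssize_t rc;

    /* Retry only if a signal interrupted the call. */
    do {
        rc = sendto(fd, buf, len, MSG_DONTWAIT, to, tolen);
    } while (rc < 0 && errno == EINTR);

    if (rc >= 0)
        return 1;
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return 0;   /* buffer full: drop the packet instead of blocking */
    return -1;
}
```

With bursty traffic the 0 case is hit often, which is exactly the
problem described above for 2.x.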
The next attempt was to set should_send_block = FALSE for the network
writers. However, when the IPC queue eventually fills up (see above:
very bursty traffic), the send fails and glib kicks the channel from
the main loop. (And because of the brilliant error handling for the
network processes, heartbeat never recovers from this; it requires a
restart.)
I think this can probably be "fixed" by introducing a new flag that
makes send fail without an error when the queue is full - but
ultimately it would be better if the MCP rate-limited the traffic
somehow.
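
Roughly, the flag idea could look like the sketch below. It is a toy
bounded queue, not the real IPC code: struct msg_queue, q_send(),
q_recv() and the drop_when_full flag are all made-up names for
illustration. The point is only that a full queue yields a distinct
"dropped" result that the caller need not treat as a channel failure:

```c
#include <stddef.h>

#define Q_CAPACITY 4            /* tiny on purpose, for illustration */

enum q_status { Q_OK = 0, Q_DROPPED = 1, Q_FAIL = -1 };

/* Hypothetical queue, standing in for the MCP-to-child IPC queue. */
struct msg_queue {
    const void   *slots[Q_CAPACITY];
    size_t        head;           /* index of the oldest queued message */
    size_t        len;            /* number of queued messages */
    int           drop_when_full; /* the proposed new flag (assumed name) */
    unsigned long dropped;        /* count of silently dropped messages */
};

static enum q_status
q_send(struct msg_queue *q, const void *msg)
{
    if (q->len == Q_CAPACITY) {
        if (q->drop_when_full) {
            q->dropped++;
            return Q_DROPPED;   /* not an error: the channel stays usable */
        }
        return Q_FAIL;          /* hard failure when the flag is off */
    }
    q->slots[(q->head + q->len) % Q_CAPACITY] = msg;
    q->len++;
    return Q_OK;
}

/* Returns the oldest message, or NULL if the queue is empty. */
static const void *
q_recv(struct msg_queue *q)
{
    const void *msg;

    if (q->len == 0)
        return NULL;
    msg = q->slots[q->head];
    q->head = (q->head + 1) % Q_CAPACITY;
    q->len--;
    return msg;
}
```

The dropped counter is there so the loss is at least visible; whether
dropping is acceptable at all depends on the retransmit logic above it.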
Alan, any comments?
(Ultimately, the sender should be throttled if it sends faster than the
available network bandwidth - this might be something to be implemented
in the client lib?)
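
One common way to throttle a sender is a token bucket. The sketch below
is only an assumption about how such a limiter might look, not
something that exists in heartbeat or the client lib; struct
token_bucket, tb_init() and tb_consume() are hypothetical names, and the
caller passes in the current time in milliseconds:

```c
#include <stdint.h>

/* Hypothetical rate limiter, for illustration only. */
struct token_bucket {
    uint64_t rate_bps;  /* refill rate in bytes per second */
    uint64_t burst;     /* bucket capacity in bytes */
    uint64_t tokens;    /* bytes currently available */
    uint64_t last_ms;   /* time of the last refill, in milliseconds */
};

static void
tb_init(struct token_bucket *tb, uint64_t rate_bps, uint64_t burst,
        uint64_t now_ms)
{
    tb->rate_bps = rate_bps;
    tb->burst    = burst;
    tb->tokens   = burst;   /* start full, so an initial burst is allowed */
    tb->last_ms  = now_ms;
}

/* Returns 1 if len bytes may be sent now, 0 if the sender should back
 * off. Sub-token remainders are rounded away on each refill. */
static int
tb_consume(struct token_bucket *tb, uint64_t len, uint64_t now_ms)
{
    if (now_ms > tb->last_ms) {
        uint64_t refill = (now_ms - tb->last_ms) * tb->rate_bps / 1000;

        if (refill > 0) {
            /* Add the refill, capped at the bucket capacity. */
            if (refill >= tb->burst - tb->tokens)
                tb->tokens = tb->burst;
            else
                tb->tokens += refill;
            tb->last_ms = now_ms;
        }
    }
    if (len > tb->tokens)
        return 0;
    tb->tokens -= len;
    return 1;
}
```

The burst size would let the bursty traffic through up to a point,
while the rate keeps the long-term average below what the network can
carry.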
The second one is where a child process falls behind - for example,
the CIB is too slow to process updates. It becomes disconnected and
apparently does not recover.
I think this suggests we need larger queues and probably a way for the
children to recover - but a restart would be fatal to the node ...?
(Again, we actually would want to throttle the sender if it keeps
sending faster than the receivers can keep up.)
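
For the receiver side, throttling could be driven by the backlog of the
slow child, with a high and a low watermark so the sender does not flap
between paused and running. Again just a sketch under assumptions:
struct flow_ctl and flow_may_send() are made-up names, and how the
backlog would be measured is left open:

```c
#include <stddef.h>

/* Hypothetical watermark-based flow control, for illustration only. */
struct flow_ctl {
    size_t high_wm; /* pause the sender at or above this backlog */
    size_t low_wm;  /* resume once the backlog is at or below this */
    int    paused;  /* current state: 1 = sender is held back */
};

/* Feed in the receiver's current backlog (queued messages).
 * Returns 1 if the sender may keep sending, 0 if it should wait. */
static int
flow_may_send(struct flow_ctl *fc, size_t backlog)
{
    if (!fc->paused && backlog >= fc->high_wm)
        fc->paused = 1;
    else if (fc->paused && backlog <= fc->low_wm)
        fc->paused = 0;
    return !fc->paused;
}
```

The gap between the two watermarks gives a slow child such as the CIB
time to drain its queue before more updates arrive.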
Regards,
Lars
--
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/