Hello everyone, I realize this is primarily an OpenAIS issue, but let's discuss it here anyway to share some thoughts.

In Heartbeat-based clusters, we've always advised customers to use redundant network communication links. Since most of the clusters we build are DRBD based, we practically always have a second network link (the dedicated DRBD replication link) available for this purpose.

In Heartbeat, when links get interrupted it's actually somewhat nontrivial to notice (which sucks), but links recover automatically when they are re-established (which is good).

Now in OpenAIS, when we configure RRP and a link breaks, OpenAIS complains very loudly (which is good), but eventually the link settles in a faulty state from which it can only be re-enabled using "openais-cfgtool -r". Clearly this breaks the concept of a self-healing system.

This discussion has been had before over on the openais list (http://www.mail-archive.com/open...@lists.linux-foundation.org/msg01205.html), but AFAICS it hasn't come to any reasonable conclusion.

So my question is, what is the best practice for redundant network setups that should be included in the Pacemaker docs?

1. Set rrp_problem_count_timeout and/or rrp_problem_count_threshold ridiculously high so the ring status never goes to faulty. (It seems that RRP "problem counting" can't be disabled altogether.)

2. Have package maintainers include some magic that does "openais-cfgtool -r" every time a network link changes its status to UP (where the network management subsystem permits this).

3. Instruct users to install cron jobs that do "openais-cfgtool -r" at specified intervals, causing OpenAIS to re-check the link status periodically.

4. Something else I haven't thought about.

Thoughts? Comments?

Cheers,
Florian

_______________________________________________
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
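
To make option 2 a bit more concrete, here is a rough sketch of a watcher that polls the link state and runs "openais-cfgtool -r" whenever a ring interface changes to UP. It is an illustration under assumptions, not a tested solution: the interface names eth0 and eth1 are hypothetical, the link state is read from the Linux sysfs file /sys/class/net/<iface>/operstate (some drivers report "unknown" there even when the link is fine), and all function names are made up for this example. Only the "openais-cfgtool -r" command itself comes from the discussion above.

```python
import subprocess
import time
from pathlib import Path

# Assumption: the interfaces that carry the two rings; adjust to the real setup.
RING_INTERFACES = ["eth0", "eth1"]


def read_operstate(iface, sysfs_root="/sys/class/net"):
    """Return the kernel's link state for iface ("up", "down", ...), or "unknown"."""
    try:
        return (Path(sysfs_root) / iface / "operstate").read_text().strip()
    except OSError:
        return "unknown"


def links_that_came_up(previous, current):
    """Return the interfaces whose state changed from anything else to "up"."""
    return sorted(
        iface
        for iface, state in current.items()
        if state == "up" and previous.get(iface) != "up"
    )


def reset_rings():
    """Ask OpenAIS to re-enable rings it has marked faulty."""
    subprocess.run(["openais-cfgtool", "-r"], check=False)


def watch_links(interfaces, reset=reset_rings, read_state=read_operstate,
                poll_seconds=5, max_polls=None):
    """Poll the link state and call reset() once per poll in which a link came up.

    max_polls=None means "run forever"; a number is handy for dry runs.
    """
    previous = {iface: read_state(iface) for iface in interfaces}
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(poll_seconds)
        current = {iface: read_state(iface) for iface in interfaces}
        if links_that_came_up(previous, current):
            reset()
        previous = current
        polls += 1
```

Calling watch_links(RING_INTERFACES) would run the loop indefinitely. Polling is used here only because it needs no hooks into the network management subsystem; where such hooks exist, calling reset_rings() from a link-up event would avoid the polling delay. Compared with the cron approach in option 3, this only resets the rings when a link has actually come back.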