Here's one more scenario: a high load test where my active content check fails. Instead of a recursive fork bomb, I slowly added CPU-intensive processes to the system until I arrived at around 260 load on a single-CPU box.
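(Roughly along these lines -- a sketch of the kind of gradual ramp described, not the exact commands used:)

  #!/bin/bash
  # Start one pure-CPU busy loop per minute so the load average climbs
  # gradually instead of exploding the way a fork bomb does.  Note the
  # PIDs it prints and kill them when the test is over.
  for i in $(seq 1 300); do
      ( while :; do :; done ) &
      echo "spinner $i started, pid $!"
      sleep 60
  done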
My memcached status check was returning:  "INFO: memcached is down"

But this is what the cluster status looked like:

Online: [ mem1 mem2 mem3 ]

 cluster-ip-mem2    (ocf::heartbeat:IPaddr2):   Started mem2
 cluster-ip-mem1    (ocf::heartbeat:IPaddr2):   Started mem1 (unmanaged) FAILED
 Clone Set: memcached_clone [memcached]
     Started: [ mem3 mem2 ]
     Stopped: [ memcached:2 ]

Failed actions:
    memcached:2_start_0 (node=mem1, call=21, rc=-2, status=Timed Out): unknown exec error
    cluster-ip-mem1_stop_0 (node=mem1, call=22, rc=-2, status=Timed Out): unknown exec error

Why wouldn't my failover to mem3 happen if it timed out stopping the cluster IP?

Thank you,

--Cal

On Thu, Jul 26, 2012 at 4:09 PM, Cal Heldenbrand <c...@fbsdata.com> wrote:

> A few more questions, as I test various outage scenarios:
>
> My memcached OCF script appears to give a false positive occasionally,
> and pacemaker restarts the service.  Under the hood, it uses netcat to
> localhost with a 3 second connection timeout.  I've run my script
> manually in a loop and it never seems to time out.
>
> My primitive looks like this:
>
>   primitive memcached ocf:fbs:memcached \
>     meta is-managed="true" target-role="Started" \
>     op monitor interval="1s" timeout="5s"
>
> I've played around with the primitive's interval and timeout.  All that
> seems to do is decrease how often the false positive happens.  Is there
> any way to add logic to the monitor to say "restart the service only if
> 3 failures in a row happen"?
>
> Also, I've tried to create a massive load failure by using a fork bomb.
> A few of the outages we've had on our memcache servers appear to be
> heavy loads -- the machine responds to ICMP on the ethernet card, but
> doesn't respond on ssh.  A fork bomb pretty much recreates the same
> problem.  When I fire off a fork bomb on my test machine, it seems to
> take 5 minutes or more to actually trigger the failover event.  It's
> difficult for me to make sense of all the logging going on, but these
> two timeout values seem interesting:
>
>   crmd: error: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped in state S_ELECTION! (120000ms)
>   crmd: error: crm_timer_popped: Integration Timer (I_INTEGRATED) just popped in state S_INTEGRATION! (180000ms)
>
> Can those values be adjusted?  Or is there a common configuration change
> to be more responsive to an active content check like I'm doing?
>
> For reference, please see my attached script *memcached*.
>
> Thanks!
>
> --Cal
>
>
> On Thu, Jul 26, 2012 at 1:35 PM, Phil Frost <p...@macprofessionals.com> wrote:
>
>> On 07/26/2012 02:16 PM, Cal Heldenbrand wrote:
>>
>>> That seems very handy -- and I don't need to specify 3 clones?  Once
>>> my memcached OCF script reports a downed service, one of them will
>>> automatically transition to the current failover node?
>>
>> There are options on the clone for how many instances of the cloned
>> resource to create, but they default to the number of nodes in the
>> cluster.  See:
>> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/ch10s02s02.html
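(For reference, pinning the instance count explicitly would look something like this in the crm shell -- just a sketch with example values, building on the memcached primitive above; since clone-max defaults to the number of nodes, it normally isn't needed:)

  clone memcached_clone memcached \
          meta clone-max="3" clone-node-max="1"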
>>> Is there any reason you specified just a single memcache_clone, instead
>>> of both the memcache primitive and memcached_clone?  I might not be
>>> understanding exactly how a clone works.  Is it like... maybe a
>>> "symbolic link" to a primitive, with the ability to specify different
>>> metadata and parameters?
>>
>> Once you make a clone, the underlying primitive isn't referenced
>> anywhere else (that I can think of).  If you want to stop memcache, you
>> don't stop the primitive; you add a location constraint forbidding the
>> clone from running on the node where you want to stop memcache ("crm
>> resource migrate" is easiest).  I can't find the relevant documentation,
>> but this is just how they work.  The same is true for groups -- the
>> member primitives are never referenced except by the group.  I believe
>> that in most cases, if you try to reference the primitive directly, you
>> will get an error.
>>
>>> Despite the advertisement of consistent hashing with memcache clients,
>>> I've found that they still have long timeouts waiting on connecting to
>>> an IP.  So, keeping the clustered IPs up at all times is more important
>>> than having a seasoned cache behind them.
>>
>> I don't know a whole lot about memcache, but it sounds like you might
>> even want to reduce the colocation score for the IPs on memcache to a
>> large number, but not infinity.  That way, if memcache is broken
>> everywhere, the IPs are still permitted to run.  It might also cover you
>> in the case that a bug in your resource agent makes it think memcache
>> has failed everywhere when it's actually still running fine.  The
>> decision depends on which failure the memcache clients handle better:
>> the IP being down, or the IP being up but with no working memcache
>> server behind it.
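(For reference, a non-infinite colocation score along those lines might be written in the crm shell roughly as follows -- resource names taken from the status output above, constraint IDs and the score of 1000 chosen only as examples:)

  colocation ip1-with-memcached 1000: cluster-ip-mem1 memcached_clone
  colocation ip2-with-memcached 1000: cluster-ip-mem2 memcached_clone

(With a finite score the cluster still strongly prefers to place each IP on a node with a running memcached instance, but it can keep the IPs up even if memcached, or its monitor, fails everywhere.)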