2015-01-27 17:58 GMT+01:00 Dejan Muhamedagic <deja...@fastmail.fm>:

> On Tue, Jan 27, 2015 at 03:18:13PM +0100, Oscar Salvador wrote:
> > Hi,
> >
> > I've checked the resource graphs I have, and the resources were fine,
> > so I don't think it's a problem due to high memory use or anything
> > like that. Unfortunately I don't have a core dump to analyze (I'll
> > enable it for a future case), so the only thing I have are the logs.
> >
> > For the line below, I thought it was the process in charge of
> > monitoring nginx that was killed by a segfault:
> >
> > RA output: (Nginx-rsc:monitor:stderr) Segmentation fault
>
> This is just output captured during the execution of the RA
> monitor action. It could've been anything within the RA (which is
> just a shell script) that segfaulted.
>
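For the archives: core dumps for the RA processes that lrmd spawns can
usually be enabled along these lines (a minimal sketch, assuming a Linux
box where Heartbeat is started from an init script; the dump location is
illustrative):

    # Lift the core file size limit in the environment that starts
    # heartbeat/lrmd (e.g. near the top of its init script), so that
    # child processes such as resource agents inherit it:
    ulimit -c unlimited

    # Have the kernel write cores to a known place, tagged with the
    # executable name (%e) and PID (%p):
    sysctl -w kernel.core_pattern=/var/tmp/core.%e.%p

After that, whatever crashes inside the RA should leave a core file that
gdb can open.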
Hi,

Yes, I see. I've enabled core dumps on the system, so the next time I'll
be able to check what is causing this.

Thank you very much

Oscar Salvador

> Thanks,
>
> Dejan
>
> > I've checked the Nginx logs, and there is nothing of note there;
> > actually there is no activity at all, so I think it has to be
> > something internal that caused the failure.
> > I'll enable core dumps, it's the only thing I can do for now.
> >
> > Thank you very much
> >
> > Oscar
> >
> > 2015-01-27 10:39 GMT+01:00 Dejan Muhamedagic <deja...@fastmail.fm>:
> >
> > > Hi,
> > >
> > > On Mon, Jan 26, 2015 at 06:20:35PM +0100, Oscar Salvador wrote:
> > > > Hi!
> > > >
> > > > I'm writing here because two days ago I experienced a strange
> > > > problem in my Pacemaker cluster.
> > > > Everything was working fine until suddenly a segfault happened in
> > > > the Nginx monitor resource:
> > > >
> > > > Jan 25 03:55:24 lb02 crmd: [9975]: notice: run_graph: ==== Transition 7551 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-90.bz2): Complete
> > > > Jan 25 03:55:24 lb02 crmd: [9975]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> > > > Jan 25 04:00:08 lb02 cib: [9971]: info: cib_stats: Processed 1 operations (0.00us average, 0% utilization) in the last 10min
> > > > Jan 25 04:10:24 lb02 crmd: [9975]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)
> > > > Jan 25 04:10:24 lb02 crmd: [9975]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
> > > > Jan 25 04:10:24 lb02 crmd: [9975]: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
> > > > Jan 25 04:10:24 lb02 pengine: [10028]: WARN: unpack_rsc_op: Processing failed op Ldirector-rsc_last_failure_0 on lb02: not running (7)
> > > > Jan 25 04:10:24 lb02 pengine: [10028]: notice: common_apply_stickiness: Ldirector-rsc can fail 999997 more times on lb02 before being forced off
> > > > Jan 25 04:10:24 lb02 crmd: [9975]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> > > > Jan 25 04:10:24 lb02 pengine: [10028]: notice: process_pe_message: Transition 7552: PEngine Input stored in: /var/lib/pengine/pe-input-90.bz2
> > > > Jan 25 04:10:24 lb02 crmd: [9975]: info: do_te_invoke: Processing graph 7552 (ref=pe_calc-dc-1422155424-7644) derived from /var/lib/pengine/pe-input-90.bz2
> > > > Jan 25 04:10:24 lb02 crmd: [9975]: notice: run_graph: ==== Transition 7552 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-90.bz2): Complete
> > > > Jan 25 04:10:24 lb02 crmd: [9975]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> > > >
> > > > Jan 25 04:10:30 lb02 lrmd: [9972]: info: RA output: (Nginx-rsc:monitor:stderr) Segmentation fault   ******* here it starts
> > >
> > > What exactly did segfault? Do you have a core dump to examine?
> >
> > As you can see, the last line.
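Even without a core dump, the kernel log often answers "what exactly did
segfault": on most Linux systems a fatal signal in user space is logged
with the name of the offending executable. A quick check (a sketch,
assuming the default behaviour of logging unhandled segfaults, which
varies by architecture and kernel configuration):

    # Each logged user-space segfault names the executable, the
    # faulting instruction pointer and the mapping it faulted in:
    dmesg | grep -i segfault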
> > > > And then:
> > > >
> > > > Jan 25 04:10:30 lb02 lrmd: [9972]: info: RA output: (Nginx-rsc:monitor:stderr) Killed
> > > > /usr/lib/ocf/resource.d//heartbeat/nginx: 910: /usr/lib/ocf/resource.d//heartbeat/nginx: Cannot fork
> > >
> > > This could be related to the segfault, or due to some other serious
> > > system error.
> > >
> > > > I guess here Nginx was killed.
> > > >
> > > > And then I get some other errors until Pacemaker decides to move
> > > > the resources to the other node:
> > > >
> > > > Jan 25 04:10:30 lb02 crmd: [9975]: info: process_lrm_event: LRM operation Nginx-rsc_monitor_10000 (call=52, rc=2, cib-update=7633, confirmed=false) invalid parameter
> > > > Jan 25 04:10:30 lb02 crmd: [9975]: info: process_graph_event: Detected action Nginx-rsc_monitor_10000 from a different transition: 5739 vs. 7552
> > > > Jan 25 04:10:30 lb02 crmd: [9975]: info: abort_transition_graph: process_graph_event:476 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=Nginx-rsc_last_failure_0, magic=0:2;4:5739:0:42d1ed53-9686-4174-84e7-d2c230ed8832, cib=3.14.40) : Old event
> > > > Jan 25 04:10:30 lb02 crmd: [9975]: WARN: update_failcount: Updating failcount for Nginx-rsc on lb02 after failed monitor: rc=2 (update=value++, time=1422155430)
> > > > Jan 25 04:10:30 lb02 crmd: [9975]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> > > > Jan 25 04:10:30 lb02 attrd: [9974]: info: log-rotate detected on logfile /var/log/ha-log
> > > > Jan 25 04:10:30 lb02 attrd: [9974]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-Nginx-rsc (1)
> > > > Jan 25 04:10:30 lb02 pengine: [10028]: ERROR: unpack_rsc_op: Preventing Nginx-rsc from re-starting on lb02: operation monitor failed 'invalid parameter' (rc=2)
> > > > Jan 25 04:10:30 lb02 pengine: [10028]: WARN: unpack_rsc_op: Processing failed op Nginx-rsc_last_failure_0 on lb02: invalid parameter (2)
> > > > Jan 25 04:10:30 lb02 pengine: [10028]: WARN: unpack_rsc_op: Processing failed op Ldirector-rsc_last_failure_0 on lb02: not running (7)
> > > > Jan 25 04:10:30 lb02 pengine: [10028]: notice: common_apply_stickiness: Ldirector-rsc can fail 999997 more times on lb02 before being forced off
> > > > Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Stop IP-rsc_mysql (lb02)
> > > > Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Stop IP-rsc_nginx (lb02)
> > > > Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Stop IP-rsc_nginx6 (lb02)
> > > > Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Stop IP-rsc_elasticsearch (lb02)
> > > > Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Move Ldirector-rsc (Started lb02 -> lb01)
> > > > Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Move Nginx-rsc (Started lb02 -> lb01)
> > > > Jan 25 04:10:30 lb02 attrd: [9974]: notice: attrd_perform_update: Sent update 23: fail-count-Nginx-rsc=1
> > > > Jan 25 04:10:30 lb02 attrd: [9974]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-Nginx-rsc (1422155430)
> > > >
> > > > I see that Pacemaker is complaining about some errors like
> > > > "invalid parameter", for example in these lines:
> > >
> > > That error code is what the nginx RA exited with. It's unusual,
> > > but perhaps also due to the segfault.
> > >
> > > Thanks,
> > >
> > > Dejan
> > >
> > > > Jan 25 04:10:30 lb02 crmd: [9975]: info: process_lrm_event: LRM operation Nginx-rsc_monitor_10000 (call=52, rc=2, cib-update=7633, confirmed=false) invalid parameter
> > > >
> > > > Jan 25 04:10:30 lb02 pengine: [10028]: ERROR: unpack_rsc_op: Preventing Nginx-rsc from re-starting on lb02: operation monitor failed 'invalid parameter' (rc=2)
> > > >
> > > > It sounds (to me) like a syntax problem in the resource
> > > > definitions, but I've checked the config with crm_verify and
> > > > there is no error:
> > > >
> > > > root# (S) crm_verify -LVV
> > > > root# (S)
> > > >
> > > > So I'm just wondering why Pacemaker is complaining about an
> > > > invalid parameter.
> > > >
> > > > These are my CIB objects:
> > > >
> > > > node $id="43b2c5a1-9552-4438-962b-6e98a2dd67c7" lb01
> > > > node $id="68328520-68e0-42fd-9adf-062655691643" lb02
> > > > primitive IP-rsc_elasticsearch ocf:heartbeat:IPaddr2 \
> > > >     params ip="xx.xx.xx.xx" nic="eth0" cidr_netmask="255.255.255.224"
> > > > primitive IP-rsc_elasticsearch6 ocf:heartbeat:IPv6addr \
> > > >     params ipv6addr="xxxxxxxxxxxxxxxx" \
> > > >     op monitor interval="10s"
> > > > primitive IP-rsc_mysql ocf:heartbeat:IPaddr2 \
> > > >     params ip="xx.xx.xx.xx" nic="eth0" cidr_netmask="255.255.255.224"
> > > > primitive IP-rsc_mysql6 ocf:heartbeat:IPv6addr \
> > > >     params ipv6addr="xxxxxxxxxxxxxx" \
> > > >     op monitor interval="10s"
> > > > primitive IP-rsc_nginx ocf:heartbeat:IPaddr2 \
> > > >     params ip="xx.xx.xx.xx" nic="eth0" cidr_netmask="255.255.255.224"
> > > > primitive IP-rsc_nginx6 ocf:heartbeat:IPv6addr \
> > > >     params ipv6addr="xxxxxxxxxxxxxx" \
> > > >     op monitor interval="10s"
> > > > primitive Ldirector-rsc ocf:heartbeat:ldirectord \
> > > >     op monitor interval="10s" timeout="30s"
> > > > primitive Nginx-rsc ocf:heartbeat:nginx \
> > > >     op monitor interval="10s" timeout="30s"
> > > > location cli-standby-IP-rsc_elasticsearch6 IP-rsc_elasticsearch6 \
> > > >     rule $id="cli-standby-rule-IP-rsc_elasticsearch6" -inf: #uname eq lb01
> > > > location cli-standby-IP-rsc_mysql IP-rsc_mysql \
> > > >     rule $id="cli-standby-rule-IP-rsc_mysql" -inf: #uname eq lb01
> > > > location cli-standby-IP-rsc_mysql6 IP-rsc_mysql6 \
> > > >     rule $id="cli-standby-rule-IP-rsc_mysql6" -inf: #uname eq lb01
> > > > location cli-standby-IP-rsc_nginx IP-rsc_nginx \
> > > >     rule $id="cli-standby-rule-IP-rsc_nginx" -inf: #uname eq lb01
> > > > location cli-standby-IP-rsc_nginx6 IP-rsc_nginx6 \
> > > >     rule $id="cli-standby-rule-IP-rsc_nginx6" -inf: #uname eq lb01
> > > > colocation hcu_c inf: Nginx-rsc Ldirector-rsc IP-rsc_mysql IP-rsc_nginx IP-rsc_nginx6 IP-rsc_elasticsearch
> > > > order hcu_o inf: IP-rsc_nginx IP-rsc_nginx6 IP-rsc_mysql Ldirector-rsc Nginx-rsc IP-rsc_elasticsearch
> > > > property $id="cib-bootstrap-options" \
> > > >     dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
> > > >     cluster-infrastructure="Heartbeat" \
> > > >     stonith-enabled="false"
> > > >
> > > > Do you have any hints I can follow?
> > > >
> > > > Thanks in advance!
> > > >
> > > > Oscar
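One way to chase the rc=2 ("invalid parameter", OCF_ERR_ARGS) result
further is to run the agent by hand, outside the cluster. A minimal
sketch, assuming the standard OCF layout shown in the logs above (the
Nginx-rsc primitive defines no parameters, so no OCF_RESKEY_* variables
are needed):

    # Trace the monitor action of the nginx RA; lrmd normally sets up
    # the OCF environment, so provide the essentials ourselves:
    export OCF_ROOT=/usr/lib/ocf
    export OCF_RESOURCE_INSTANCE=Nginx-rsc
    sh -x /usr/lib/ocf/resource.d/heartbeat/nginx monitor
    echo "rc=$?"    # 0 = running, 7 = not running, 2 = OCF_ERR_ARGS

    # ocf-tester (shipped with the resource agents) exercises all of
    # an agent's actions in one go:
    ocf-tester -n Nginx-rsc /usr/lib/ocf/resource.d/heartbeat/nginx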
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org