Hello Andrew,

> As I said "The cluster only stops doing this if writing to disk fails
> at some point - but there would have been an error in your logs if
> that were the case."

I grepped the logs and found that there was a write error on 15 July, and probably none of the changes after that went to disk.

(apache-03) [/var/adm/syslog/2013] grep 'Disk write failed' ??/??/*
07/15/daemon:Jul 15 17:55:04 apache-03 cib: [29394]: ERROR: cib_diskwrite_complete: Disk write failed: status=134, signo=6, exitcode=0
07/15/daemon:Jul 15 17:55:04 172.19.0.2 cib: [23106]: ERROR: cib_diskwrite_complete: Disk write failed: status=134, signo=6, exitcode=0
08/04/daemon:Aug 4 19:03:55 apache-03 cib: [3226]: ERROR: cib_diskwrite_complete: Disk write failed: status=134, signo=6, exitcode=0
08/04/daemon:Aug 4 19:03:56 apache-04-intern cib: [3197]: ERROR: cib_diskwrite_complete: Disk write failed: status=134, signo=6, exitcode=0

And it looks like the reason for that was not a bad disk, but a failure in another component:

Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - <cib admin_epoch="0" epoch="19" num_updates="3" >
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - <configuration >
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - <resources >
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - <group id="nfs" >
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - <primitive id="gcl_fs" >
Jul 15 17:55:04 apache-03 crmd: [29398]: info: abort_transition_graph: te_update_diff:126 - Triggered transition abort (complete=1, tag=diff, id=(null), magic=NA, cib=0.20.1) : Non-status change
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - <meta_attributes id="gcl_fs-meta_attributes" >
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - <nvpair id="gcl_fs-meta_attributes-target-role" name="target-role" value="Started" __crm_diff_marker__="removed:top" />
Jul 15 17:55:04 apache-03 crmd: [29398]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - </meta_attributes>
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - </primitive>
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - <meta_attributes id="nfs-meta_attributes" >
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - <nvpair value="Stopped" id="nfs-meta_attributes-target-role" />
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - </meta_attributes>
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - </group>
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - </resources>
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - </configuration>
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - </cib>
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: + <cib epoch="20" num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.6" update-origin="apache-03" update-client="cibadmin" cib-last-written="Mon Jul 15 16:02:23 2013" have-quorum="1" dc-uuid="61e8f424-b538-4352-b3fe-955ca853e5fb" >
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: + <configuration >
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: + <resources >
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: + <group id="nfs" >
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: + <meta_attributes id="nfs-meta_attributes" >
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: + <nvpair id="nfs-meta_attributes-target-role" name="target-role" value="Started" />
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: + </meta_attributes>
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: + </group>
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: + </resources>
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: + </configuration>
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: + </cib>
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib_process_request: Operation complete: op cib_replace for section resources (origin=local/cibadmin/2, version=0.20.1): ok (rc=0)
Jul 15 17:55:04 apache-03 cib: [3583]: ERROR: validate_cib_digest: Digest comparision failed: expected 976068d203615e656547fdf60190ad16 (/var/lib/heartbeat/crm/cib.b9SItG), calculated 3f273f2cf3f97c0c02be83555ecabf0d
Jul 15 17:55:04 apache-03 cib: [3583]: ERROR: retrieveCib: Checksum of /var/lib/heartbeat/crm/cib.p3FraX failed! Configuration contents ignored!
Jul 15 17:55:04 apache-03 cib: [3583]: ERROR: retrieveCib: Usually this is caused by manual changes, please refer to http://clusterlabs.org/wiki/FAQ#cib_changes_detected
Jul 15 17:55:04 apache-03 cib: [3583]: ERROR: crm_abort: write_cib_contents: Triggered fatal assert at io.c:662 : retrieveCib(tmp1, tmp2, FALSE) != NULL
Jul 15 17:55:04 apache-03 pengine: [29464]: notice: LogActions: Start nfs-common (apache-03)
Jul 15 17:55:04 apache-03 pengine: [29464]: notice: LogActions: Start nfs-kernel-server (apache-03)
Jul 15 17:55:04 apache-03 pengine: [29464]: notice: LogActions: Start nfs_ipv4 (apache-03)
Jul 15 17:55:04 apache-03 crmd: [29398]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Jul 15 17:55:04 apache-03 crmd: [29398]: info: do_te_invoke: Processing graph 43 (ref=pe_calc-dc-1373903704-255) derived from /var/lib/pengine/pe-input-30.bz2
Jul 15 17:55:04 apache-03 crmd: [29398]: info: te_rsc_command: Initiating action 34: start nfs-common_start_0 on apache-03 (local)
Jul 15 17:55:04 apache-03 lrmd: [29395]: info: rsc:nfs-common start[75] (pid 3584)
Jul 15 17:55:04 apache-03 cib: [29394]: WARN: Managed write_cib_contents process 3583 killed by signal 6 [SIGABRT - Abort].
Jul 15 17:55:04 apache-03 cib: [29394]: ERROR: Managed write_cib_contents process 3583 dumped core
Jul 15 17:55:04 apache-03 cib: [29394]: ERROR: cib_diskwrite_complete: Disk write failed: status=134, signo=6, exitcode=0
Jul 15 17:55:04 apache-03 cib: [29394]: ERROR: cib_diskwrite_complete: Disabling disk writes after write failure
Jul 15 17:55:04 apache-03 pengine: [29464]: notice: process_pe_message: Transition 43: PEngine Input stored in: /var/lib/pengine/pe-input-30.bz2
Jul 15 17:55:04 apache-03 rpc.statd[3594]: Version 1.2.6 starting
Jul 15 17:55:04 apache-03 rpc.statd[3594]: Flags: TI-RPC
Jul 15 17:55:04 apache-03 sm-notify[3595]: Version 1.2.6 starting
Jul 15 17:55:04 apache-03 sm-notify[3595]: Already notifying clients; Exiting!
Jul 15 17:55:04 apache-03 rpc.statd[3594]: Running as root. chown /var/lib/nfs to choose different user
Jul 15 17:55:04 172.19.0.2 cib: [23106]: info: apply_xml_diff: Digest mis-match: expected a0390a2d12f7339fb595c5894d0106db, calculated 38133aaf7fa362a8756002428700c77f
Jul 15 17:55:04 172.19.0.2 cib: [23106]: notice: cib_process_diff: Diff 0.19.3 -> 0.20.1 not applied to 0.19.3: Failed application of an update diff
Jul 15 17:55:04 172.19.0.2 cib: [23106]: info: cib_server_process_diff: Requesting re-sync from peer
Jul 15 17:55:04 apache-03 lrmd: [29395]: info: operation start[75] on nfs-common for client 29398: pid 3584 exited with return code 0
Jul 15 17:55:04 apache-03 crmd: [29398]: info: process_lrm_event: LRM operation nfs-common_start_0 (call=75, rc=0, cib-update=318, confirmed=true) ok
Jul 15 17:55:04 apache-03 crmd: [29398]: info: te_rsc_command: Initiating action 35: monitor nfs-common_monitor_60000 on apache-03 (local)
Jul 15 17:55:04 apache-03 lrmd: [29395]: info: rsc:nfs-common monitor[76] (pid 3611)
Jul 15 17:55:04 apache-03 crmd: [29398]: info: te_rsc_command: Initiating action 36: start nfs-kernel-server_start_0 on apache-03 (local)
Jul 15 17:55:04 apache-03 lrmd: [29395]: info: rsc:nfs-kernel-server start[77] (pid 3612)
Jul 15 17:55:04 apache-03 lrmd: [29395]: info: operation monitor[76] on nfs-common for client 29398: pid 3611 exited with return code 0
Jul 15 17:55:04 apache-03 crmd: [29398]: info: process_lrm_event: LRM operation nfs-common_monitor_60000 (call=76, rc=0, cib-update=319, confirmed=false) ok
Jul 15 17:55:04 apache-03 rpc.mountd[3650]: Version 1.2.6 starting
Jul 15 17:55:04 apache-03 lrmd: [29395]: info: operation start[77] on nfs-kernel-server for client 29398: pid 3612 exited with return code 0
Jul 15 17:55:04 apache-03 crmd: [29398]: info: process_lrm_event: LRM operation nfs-kernel-server_start_0 (call=77, rc=0, cib-update=320, confirmed=true) ok
Jul 15 17:55:04 apache-03 crmd: [29398]: info: te_rsc_command: Initiating action 37: monitor nfs-kernel-server_monitor_60000 on apache-03 (local)
Jul 15 17:55:04 apache-03 lrmd: [29395]: info: rsc:nfs-kernel-server monitor[78] (pid 3655)
Jul 15 17:55:04 apache-03 crmd: [29398]: info: te_rsc_command: Initiating action 38: start nfs_ipv4_start_0 on apache-03 (local)
Jul 15 17:55:04 apache-03 lrmd: [29395]: info: rsc:nfs_ipv4 start[79] (pid 3656)
Jul 15 17:55:04 apache-03 lrmd: [29395]: info: operation monitor[78] on nfs-kernel-server for client 29398: pid 3655 exited with return code 0
Jul 15 17:55:04 apache-03 crmd: [29398]: info: process_lrm_event: LRM operation nfs-kernel-server_monitor_60000 (call=78, rc=0, cib-update=321, confirmed=false) ok
Jul 15 17:55:04 apache-03 IPaddr2[3656]: [3691]: INFO: ip -f inet addr add 172.19.0.253/24 brd 172.19.0.255 dev bond0
Jul 15 17:55:04 apache-03 IPaddr2[3656]: [3694]: INFO: ip link set bond0 up
Jul 15 17:55:04 apache-03 IPaddr2[3656]: [3697]: INFO: /usr/lib/heartbeat/send_arp -i 200 -r 5 -p /var/run/resource-agents/send_arp-172.19.0.253 bond0 172.19.0.253 auto not_used not_used
Jul 15 17:55:04 apache-03 lrmd: [29395]: info: operation start[79] on nfs_ipv4 for client 29398: pid 3656 exited with return code 0
Jul 15 17:55:04 apache-03 crmd: [29398]: info: process_lrm_event: LRM operation nfs_ipv4_start_0 (call=79, rc=0, cib-update=322, confirmed=true) ok
Jul 15 17:55:04 apache-03 crmd: [29398]: info: te_rsc_command: Initiating action 39: monitor nfs_ipv4_monitor_60000 on apache-03 (local)
Jul 15 17:55:04 apache-03 lrmd: [29395]: info: rsc:nfs_ipv4 monitor[80] (pid 3706)
Jul 15 17:55:04 apache-03 lrmd: [29395]: info: operation monitor[80] on nfs_ipv4 for client 29398: pid 3706 exited with return code 0
Jul 15 17:55:04 apache-03 crmd: [29398]: info: process_lrm_event: LRM operation nfs_ipv4_monitor_60000 (call=80, rc=0, cib-update=323, confirmed=false) ok
Jul 15 17:55:04 apache-03 crmd: [29398]: notice: run_graph: ==== Transition 43 (Complete=8, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-30.bz2): Complete
Jul 15 17:55:04 apache-03 crmd: [29398]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib_process_request: Operation complete: op cib_sync_one for section 'all' (origin=apache-04/apache-04/(null), version=0.20.7): ok (rc=0)
Jul 15 17:55:04 172.19.0.2 cib: [23106]: notice: cib_server_process_diff: Not applying diff 0.20.1 -> 0.20.2 (sync in progress)
Jul 15 17:55:04 172.19.0.2 cib: [23106]: notice: cib_server_process_diff: Not applying diff 0.20.2 -> 0.20.3 (sync in progress)
Jul 15 17:55:04 172.19.0.2 cib: [23106]: notice: cib_server_process_diff: Not applying diff 0.20.3 -> 0.20.4 (sync in progress)
Jul 15 17:55:04 172.19.0.2 cib: [23106]: notice: cib_server_process_diff: Not applying diff 0.20.4 -> 0.20.5 (sync in progress)
Jul 15 17:55:04 172.19.0.2 cib: [23106]: notice: cib_server_process_diff: Not applying diff 0.20.5 -> 0.20.6 (sync in progress)
Jul 15 17:55:04 172.19.0.2 cib: [23106]: info: cib_process_diff: Diff 0.20.6 -> 0.20.7 not applied to 0.19.3: current "epoch" is less than required
Jul 15 17:55:04 172.19.0.2 cib: [23106]: info: cib_server_process_diff: Requesting re-sync from peer
Jul 15 17:55:04 172.19.0.2 cib: [23106]: info: cib_replace_notify: Replaced: -1.-1.-1 -> 0.20.7 from apache-03
Jul 15 17:55:04 172.19.0.2 attrd: [23109]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-apache-03-fencing (1373885111)
Jul 15 17:55:04 172.19.0.2 attrd: [23109]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd0:0 (75)
Jul 15 17:55:04 172.19.0.2 attrd: [23109]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Jul 15 17:55:04 172.19.0.2 attrd: [23109]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-router_ipv6 (1373893390)
Jul 15 17:55:04 172.19.0.2 cib: [10259]: ERROR: validate_cib_digest: Digest comparision failed: expected 976068d203615e656547fdf60190ad16 (/var/lib/heartbeat/crm/cib.HupF60), calculated 3f273f2cf3f97c0c02be83555ecabf0d
Jul 15 17:55:04 172.19.0.2 cib: [10259]: ERROR: retrieveCib: Checksum of /var/lib/heartbeat/crm/cib.97XTAT failed! Configuration contents ignored!
Jul 15 17:55:04 172.19.0.2 cib: [10259]: ERROR: retrieveCib: Usually this is caused by manual changes, please refer to http://clusterlabs.org/wiki/FAQ#cib_changes_detected
Jul 15 17:55:04 172.19.0.2 cib: [10259]: ERROR: crm_abort: write_cib_contents: Triggered fatal assert at io.c:662 : retrieveCib(tmp1, tmp2, FALSE) != NULL
Jul 15 17:55:04 172.19.0.2 cib: [23106]: WARN: Managed write_cib_contents process 10259 killed by signal 6 [SIGABRT - Abort].
Jul 15 17:55:04 172.19.0.2 cib: [23106]: ERROR: Managed write_cib_contents process 10259 dumped core
Jul 15 17:55:04 172.19.0.2 cib: [23106]: ERROR: cib_diskwrite_complete: Disk write failed: status=134, signo=6, exitcode=0
Jul 15 17:55:04 172.19.0.2 cib: [23106]: ERROR: cib_diskwrite_complete: Disabling disk writes after write failure
Jul 15 18:46:33 apache-03 cib: [29394]: info: cib:diff: + <cib epoch="21" num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.6" update-origin="apache-03" update-client="cibadmin" cib-last-written="Mon Jul 15 17:55:04 2013" have-quorum="1" dc-uuid="61e8f424-b538-4352-b3fe-955ca853e5fb" >

I'll upgrade and file a bug report against Debian.

Cheers,
Thomas
