Hello Andrew,

> As I said "The cluster only stops doing this if writing to disk fails
> at some point - but there would have been an error in your logs if
> that were the case."

I grepped the logs and found that there was a write error on 15 July,
so probably none of the changes made after that went to disk.

(apache-03) [/var/adm/syslog/2013] grep 'Disk write failed' ??/??/*
07/15/daemon:Jul 15 17:55:04 apache-03 cib: [29394]: ERROR: 
cib_diskwrite_complete: Disk write failed: status=134, signo=6, exitcode=0
07/15/daemon:Jul 15 17:55:04 172.19.0.2 cib: [23106]: ERROR: 
cib_diskwrite_complete: Disk write failed: status=134, signo=6, exitcode=0
08/04/daemon:Aug  4 19:03:55 apache-03 cib: [3226]: ERROR: 
cib_diskwrite_complete: Disk write failed: status=134, signo=6, exitcode=0
08/04/daemon:Aug  4 19:03:56 apache-04-intern cib: [3197]: ERROR: 
cib_diskwrite_complete: Disk write failed: status=134, signo=6, exitcode=0
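
As a side note, the "status=134, signo=6, exitcode=0" triple in these lines
is consistent with a raw wait(2) status word. The sketch below splits such a
word, assuming the conventional Linux bit layout (low 7 bits are the signal,
bit 7 is the core-dump flag, the next 8 bits are the exit code). The function
name is made up for illustration; it is not part of Pacemaker.

```python
def decode_wait_status(status: int) -> dict:
    """Split a raw wait(2) status word, assuming the usual Linux layout."""
    signo = status & 0x7F              # low 7 bits: terminating signal (0 = none)
    core_dumped = bool(status & 0x80)  # bit 7: core dump flag
    exitcode = (status >> 8) & 0xFF    # next 8 bits: exit code
    return {"signo": signo, "core_dumped": core_dumped, "exitcode": exitcode}


# 134 = 128 + 6, i.e. signal 6 plus the core-dump bit
print(decode_wait_status(134))
# → {'signo': 6, 'core_dumped': True, 'exitcode': 0}
```

Read this way, the writer process was killed by signal 6 and dumped core,
which points at an abort in the process rather than an I/O error from the
disk.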

And it looks like the reason for that was not a bad disk, but a failure
in another component:

Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - <cib admin_epoch="0" 
epoch="19" num_updates="3" >
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: -   <configuration >
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: -     <resources >
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: -       <group id="nfs" 
>
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: -         <primitive 
id="gcl_fs" >
Jul 15 17:55:04 apache-03 crmd: [29398]: info: abort_transition_graph: 
te_update_diff:126 - Triggered transition abort (complete=1, tag=diff, 
id=(null), magic=NA, cib=0.20.1) : Non-status change
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: -           
<meta_attributes id="gcl_fs-meta_attributes" >
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: -             <nvpair 
id="gcl_fs-meta_attributes-target-role" name="target-role" value="Started" 
__crm_diff_marker__="removed:top" />
Jul 15 17:55:04 apache-03 crmd: [29398]: notice: do_state_transition: State 
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
origin=abort_transition_graph ]
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: -           
</meta_attributes>
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: -         </primitive>
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: -         
<meta_attributes id="nfs-meta_attributes" >
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: -           <nvpair 
value="Stopped" id="nfs-meta_attributes-target-role" />
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: -         
</meta_attributes>
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: -       </group>
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: -     </resources>
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: -   </configuration>
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - </cib>
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: + <cib epoch="20" 
num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2" 
crm_feature_set="3.0.6" update-origin="apache-03" update-client="cibadmin" 
cib-last-written="Mon Jul 15 16:02:23 2013" have-quorum="1" 
dc-uuid="61e8f424-b538-4352-b3fe-955ca853e5fb" >
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: +   <configuration >
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: +     <resources >
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: +       <group id="nfs" 
>
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: +         
<meta_attributes id="nfs-meta_attributes" >
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: +           <nvpair 
id="nfs-meta_attributes-target-role" name="target-role" value="Started" />
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: +         
</meta_attributes>
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: +       </group>
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: +     </resources>
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: +   </configuration>
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: + </cib>
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib_process_request: Operation 
complete: op cib_replace for section resources (origin=local/cibadmin/2, 
version=0.20.1): ok (rc=0)
Jul 15 17:55:04 apache-03 cib: [3583]: ERROR: validate_cib_digest: Digest 
comparision failed: expected 976068d203615e656547fdf60190ad16 
(/var/lib/heartbeat/crm/cib.b9SItG), calculated 3f273f2cf3f97c0c02be83555ecabf0d
Jul 15 17:55:04 apache-03 cib: [3583]: ERROR: retrieveCib: Checksum of 
/var/lib/heartbeat/crm/cib.p3FraX failed!  Configuration contents ignored!
Jul 15 17:55:04 apache-03 cib: [3583]: ERROR: retrieveCib: Usually this is 
caused by manual changes, please refer to 
http://clusterlabs.org/wiki/FAQ#cib_changes_detected
Jul 15 17:55:04 apache-03 cib: [3583]: ERROR: crm_abort: write_cib_contents: 
Triggered fatal assert at io.c:662 : retrieveCib(tmp1, tmp2, FALSE) != NULL
Jul 15 17:55:04 apache-03 pengine: [29464]: notice: LogActions: Start   
nfs-common      (apache-03)
Jul 15 17:55:04 apache-03 pengine: [29464]: notice: LogActions: Start   
nfs-kernel-server       (apache-03)
Jul 15 17:55:04 apache-03 pengine: [29464]: notice: LogActions: Start   
nfs_ipv4        (apache-03)
Jul 15 17:55:04 apache-03 crmd: [29398]: notice: do_state_transition: State 
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response ]
Jul 15 17:55:04 apache-03 crmd: [29398]: info: do_te_invoke: Processing graph 
43 (ref=pe_calc-dc-1373903704-255) derived from /var/lib/pengine/pe-input-30.bz2
Jul 15 17:55:04 apache-03 crmd: [29398]: info: te_rsc_command: Initiating 
action 34: start nfs-common_start_0 on apache-03 (local)
Jul 15 17:55:04 apache-03 lrmd: [29395]: info: rsc:nfs-common start[75] (pid 
3584)
Jul 15 17:55:04 apache-03 cib: [29394]: WARN: Managed write_cib_contents 
process 3583 killed by signal 6 [SIGABRT - Abort].
Jul 15 17:55:04 apache-03 cib: [29394]: ERROR: Managed write_cib_contents 
process 3583 dumped core
Jul 15 17:55:04 apache-03 cib: [29394]: ERROR: cib_diskwrite_complete: Disk 
write failed: status=134, signo=6, exitcode=0
Jul 15 17:55:04 apache-03 cib: [29394]: ERROR: cib_diskwrite_complete: 
Disabling disk writes after write failure
Jul 15 17:55:04 apache-03 pengine: [29464]: notice: process_pe_message: 
Transition 43: PEngine Input stored in: /var/lib/pengine/pe-input-30.bz2
Jul 15 17:55:04 apache-03 rpc.statd[3594]: Version 1.2.6 starting
Jul 15 17:55:04 apache-03 rpc.statd[3594]: Flags: TI-RPC
Jul 15 17:55:04 apache-03 sm-notify[3595]: Version 1.2.6 starting
Jul 15 17:55:04 apache-03 sm-notify[3595]: Already notifying clients; Exiting!
Jul 15 17:55:04 apache-03 rpc.statd[3594]: Running as root.  chown /var/lib/nfs 
to choose different user
Jul 15 17:55:04 172.19.0.2 cib: [23106]: info: apply_xml_diff: Digest 
mis-match: expected a0390a2d12f7339fb595c5894d0106db, calculated 
38133aaf7fa362a8756002428700c77f
Jul 15 17:55:04 172.19.0.2 cib: [23106]: notice: cib_process_diff: Diff 0.19.3 
-> 0.20.1 not applied to 0.19.3: Failed application of an update diff
Jul 15 17:55:04 172.19.0.2 cib: [23106]: info: cib_server_process_diff: 
Requesting re-sync from peer
Jul 15 17:55:04 apache-03 lrmd: [29395]: info: operation start[75] on 
nfs-common for client 29398: pid 3584 exited with return code 0
Jul 15 17:55:04 apache-03 crmd: [29398]: info: process_lrm_event: LRM operation 
nfs-common_start_0 (call=75, rc=0, cib-update=318, confirmed=true) ok
Jul 15 17:55:04 apache-03 crmd: [29398]: info: te_rsc_command: Initiating 
action 35: monitor nfs-common_monitor_60000 on apache-03 (local)
Jul 15 17:55:04 apache-03 lrmd: [29395]: info: rsc:nfs-common monitor[76] (pid 
3611)
Jul 15 17:55:04 apache-03 crmd: [29398]: info: te_rsc_command: Initiating 
action 36: start nfs-kernel-server_start_0 on apache-03 (local)
Jul 15 17:55:04 apache-03 lrmd: [29395]: info: rsc:nfs-kernel-server start[77] 
(pid 3612)
Jul 15 17:55:04 apache-03 lrmd: [29395]: info: operation monitor[76] on 
nfs-common for client 29398: pid 3611 exited with return code 0
Jul 15 17:55:04 apache-03 crmd: [29398]: info: process_lrm_event: LRM operation 
nfs-common_monitor_60000 (call=76, rc=0, cib-update=319, confirmed=false) ok
Jul 15 17:55:04 apache-03 rpc.mountd[3650]: Version 1.2.6 starting
Jul 15 17:55:04 apache-03 lrmd: [29395]: info: operation start[77] on 
nfs-kernel-server for client 29398: pid 3612 exited with return code 0
Jul 15 17:55:04 apache-03 crmd: [29398]: info: process_lrm_event: LRM operation 
nfs-kernel-server_start_0 (call=77, rc=0, cib-update=320, confirmed=true) ok
Jul 15 17:55:04 apache-03 crmd: [29398]: info: te_rsc_command: Initiating 
action 37: monitor nfs-kernel-server_monitor_60000 on apache-03 (local)
Jul 15 17:55:04 apache-03 lrmd: [29395]: info: rsc:nfs-kernel-server 
monitor[78] (pid 3655)
Jul 15 17:55:04 apache-03 crmd: [29398]: info: te_rsc_command: Initiating 
action 38: start nfs_ipv4_start_0 on apache-03 (local)
Jul 15 17:55:04 apache-03 lrmd: [29395]: info: rsc:nfs_ipv4 start[79] (pid 3656)
Jul 15 17:55:04 apache-03 lrmd: [29395]: info: operation monitor[78] on 
nfs-kernel-server for client 29398: pid 3655 exited with return code 0
Jul 15 17:55:04 apache-03 crmd: [29398]: info: process_lrm_event: LRM operation 
nfs-kernel-server_monitor_60000 (call=78, rc=0, cib-update=321, 
confirmed=false) ok
Jul 15 17:55:04 apache-03 IPaddr2[3656]: [3691]: INFO: ip -f inet addr add 
172.19.0.253/24 brd 172.19.0.255 dev bond0
Jul 15 17:55:04 apache-03 IPaddr2[3656]: [3694]: INFO: ip link set bond0 up
Jul 15 17:55:04 apache-03 IPaddr2[3656]: [3697]: INFO: 
/usr/lib/heartbeat/send_arp -i 200 -r 5 -p 
/var/run/resource-agents/send_arp-172.19.0.253 bond0 172.19.0.253 auto not_used 
not_used
Jul 15 17:55:04 apache-03 lrmd: [29395]: info: operation start[79] on nfs_ipv4 
for client 29398: pid 3656 exited with return code 0
Jul 15 17:55:04 apache-03 crmd: [29398]: info: process_lrm_event: LRM operation 
nfs_ipv4_start_0 (call=79, rc=0, cib-update=322, confirmed=true) ok
Jul 15 17:55:04 apache-03 crmd: [29398]: info: te_rsc_command: Initiating 
action 39: monitor nfs_ipv4_monitor_60000 on apache-03 (local)
Jul 15 17:55:04 apache-03 lrmd: [29395]: info: rsc:nfs_ipv4 monitor[80] (pid 
3706)
Jul 15 17:55:04 apache-03 lrmd: [29395]: info: operation monitor[80] on 
nfs_ipv4 for client 29398: pid 3706 exited with return code 0
Jul 15 17:55:04 apache-03 crmd: [29398]: info: process_lrm_event: LRM operation 
nfs_ipv4_monitor_60000 (call=80, rc=0, cib-update=323, confirmed=false) ok
Jul 15 17:55:04 apache-03 crmd: [29398]: notice: run_graph: ==== Transition 43 
(Complete=8, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pengine/pe-input-30.bz2): Complete
Jul 15 17:55:04 apache-03 crmd: [29398]: notice: do_state_transition: State 
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS 
cause=C_FSA_INTERNAL origin=notify_crmd ]
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib_process_request: Operation 
complete: op cib_sync_one for section 'all' (origin=apache-04/apache-04/(null), 
version=0.20.7): ok (rc=0)
Jul 15 17:55:04 172.19.0.2 cib: [23106]: notice: cib_server_process_diff: Not 
applying diff 0.20.1 -> 0.20.2 (sync in progress)
Jul 15 17:55:04 172.19.0.2 cib: [23106]: notice: cib_server_process_diff: Not 
applying diff 0.20.2 -> 0.20.3 (sync in progress)
Jul 15 17:55:04 172.19.0.2 cib: [23106]: notice: cib_server_process_diff: Not 
applying diff 0.20.3 -> 0.20.4 (sync in progress)
Jul 15 17:55:04 172.19.0.2 cib: [23106]: notice: cib_server_process_diff: Not 
applying diff 0.20.4 -> 0.20.5 (sync in progress)
Jul 15 17:55:04 172.19.0.2 cib: [23106]: notice: cib_server_process_diff: Not 
applying diff 0.20.5 -> 0.20.6 (sync in progress)
Jul 15 17:55:04 172.19.0.2 cib: [23106]: info: cib_process_diff: Diff 0.20.6 -> 
0.20.7 not applied to 0.19.3: current "epoch" is less than required
Jul 15 17:55:04 172.19.0.2 cib: [23106]: info: cib_server_process_diff: 
Requesting re-sync from peer
Jul 15 17:55:04 172.19.0.2 cib: [23106]: info: cib_replace_notify: Replaced: 
-1.-1.-1 -> 0.20.7 from apache-03
Jul 15 17:55:04 172.19.0.2 attrd: [23109]: notice: attrd_trigger_update: 
Sending flush op to all hosts for: last-failure-apache-03-fencing (1373885111)
Jul 15 17:55:04 172.19.0.2 attrd: [23109]: notice: attrd_trigger_update: 
Sending flush op to all hosts for: master-drbd0:0 (75)
Jul 15 17:55:04 172.19.0.2 attrd: [23109]: notice: attrd_trigger_update: 
Sending flush op to all hosts for: probe_complete (true)
Jul 15 17:55:04 172.19.0.2 attrd: [23109]: notice: attrd_trigger_update: 
Sending flush op to all hosts for: last-failure-router_ipv6 (1373893390)
Jul 15 17:55:04 172.19.0.2 cib: [10259]: ERROR: validate_cib_digest: Digest 
comparision failed: expected 976068d203615e656547fdf60190ad16 
(/var/lib/heartbeat/crm/cib.HupF60), calculated 3f273f2cf3f97c0c02be83555ecabf0d
Jul 15 17:55:04 172.19.0.2 cib: [10259]: ERROR: retrieveCib: Checksum of 
/var/lib/heartbeat/crm/cib.97XTAT failed!  Configuration contents ignored!
Jul 15 17:55:04 172.19.0.2 cib: [10259]: ERROR: retrieveCib: Usually this is 
caused by manual changes, please refer to 
http://clusterlabs.org/wiki/FAQ#cib_changes_detected
Jul 15 17:55:04 172.19.0.2 cib: [10259]: ERROR: crm_abort: write_cib_contents: 
Triggered fatal assert at io.c:662 : retrieveCib(tmp1, tmp2, FALSE) != NULL
Jul 15 17:55:04 172.19.0.2 cib: [23106]: WARN: Managed write_cib_contents 
process 10259 killed by signal 6 [SIGABRT - Abort].
Jul 15 17:55:04 172.19.0.2 cib: [23106]: ERROR: Managed write_cib_contents 
process 10259 dumped core
Jul 15 17:55:04 172.19.0.2 cib: [23106]: ERROR: cib_diskwrite_complete: Disk 
write failed: status=134, signo=6, exitcode=0
Jul 15 17:55:04 172.19.0.2 cib: [23106]: ERROR: cib_diskwrite_complete: 
Disabling disk writes after write failure
Jul 15 18:46:33 apache-03 cib: [29394]: info: cib:diff: + <cib epoch="21" 
num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2" 
crm_feature_set="3.0.6" update-origin="apache-03" update-client="cibadmin" 
cib-last-written="Mon Jul 15 17:55:04 2013" have-quorum="1" 
dc-uuid="61e8f424-b538-4352-b3fe-955ca853e5fb" >
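
To see since when each node has been running without disk writes, something
like the following could be used. It is only a rough sketch: it assumes
classic one-line syslog records ("Mon DD HH:MM:SS host message"), and the
regular expression and function name are my own, not taken from any
Pacemaker tool.

```python
import re

# Assumed syslog layout: "Mon DD HH:MM:SS host rest-of-message"
LINE_RE = re.compile(r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (\S+) (.*)$")
MARKER = "Disabling disk writes after write failure"


def first_disable_per_host(lines):
    """Map each host to the timestamp of its first 'Disabling disk writes' line."""
    first = {}
    for line in lines:
        match = LINE_RE.match(line)
        if match and MARKER in match.group(3):
            # setdefault keeps only the earliest occurrence per host
            first.setdefault(match.group(2), match.group(1))
    return first


# Sample records taken from the log above, unwrapped to one line each
sample = [
    "Jul 15 17:55:04 apache-03 cib: [29394]: ERROR: cib_diskwrite_complete: "
    "Disabling disk writes after write failure",
    "Jul 15 17:55:04 apache-03 rpc.statd[3594]: Version 1.2.6 starting",
    "Jul 15 17:55:04 172.19.0.2 cib: [23106]: ERROR: cib_diskwrite_complete: "
    "Disabling disk writes after write failure",
]
print(first_disable_per_host(sample))
```

Any configuration change with a timestamp later than the one reported for a
node would then be suspect on that node's disk.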

I'll upgrade and file a bug report against Debian.

Cheers,
        Thomas
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
