Hello,

All the following bugs have been reported by Russell Coker. I'm posting them here in case someone knows whether they have already been fixed. I'd appreciate any help, comments, or pointers about these bugs. In any case, I'm working to get them fixed and will submit the patches to this list.

heartbeat should detect and recover from corrupt CIB
====================================================

Under XFS failure modes, a recently created file may end up filled with zeros if there is a power outage (IPMI fence) at an inconvenient time. Heartbeat keeps a backup copy of /var/lib/heartbeat/crm/cib.xml, but if the primary copy is filled with zeros it doesn't use the backup!

I believe that heartbeat should use the backup copy of cib.xml whenever the primary does not conform to the XML schema. I also believe that, given the XFS issue, one backup copy is not enough and that we should have multiple backups, so that if more than one change is made to cib.xml in the 30 seconds before the reboot there will still be a good copy after the reboot. As the files in question are small, there is no reason why we couldn't have 20 backup copies.

Heartbeat has code to handle the situation of different versions of cib.xml on different nodes, but currently does not seem to handle the situation of a corrupt cib.xml.

To cope with the failure conditions of filesystems other than XFS, some of these backups should be stored in different directories.

heartbeat needs all code related to writing cib.xml audited
===========================================================

There is a potential SEGV in the code for writing cib.xml (fopen() followed immediately by fclose() with no error checking, on line 636 of lib/crm/common/xml.c). The code for writing cib.xml also calls fflush() as its only mechanism for ensuring that the data gets to disk.

The above two bugs need to be fixed, and the code in question needs to be audited to ensure that there are no more bugs of the same nature.

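To illustrate the kind of fix the audit would be looking for, here is a minimal sketch of a checked, durable file write. It is not the heartbeat code: write_file_durably() and the ".tmp" suffix are hypothetical names chosen for this example. It assumes a POSIX system and shows three things: checking the result of fopen() before any fclose(), following fflush() with fsync() so the data is pushed from the kernel to disk, and writing to a temporary file that is then renamed over the target so a crash leaves either the old or the new file.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper, not part of heartbeat: write `data` to `path`
 * through a temporary file. Returns 0 on success, -1 on failure with
 * errno describing the first error encountered. */
int write_file_durably(const char *path, const char *data)
{
    char tmp[4096];
    size_t len = strlen(data);
    FILE *fp;
    int n, saved;

    /* Build the temporary name "<path>.tmp" and reject overlong paths. */
    n = snprintf(tmp, sizeof(tmp), "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof(tmp)) {
        errno = ENAMETOOLONG;
        return -1;
    }

    fp = fopen(tmp, "w");
    if (fp == NULL)          /* never hand a NULL stream to fclose() */
        return -1;

    if (fwrite(data, 1, len, fp) != len  /* short write */
        || fflush(fp) != 0               /* stdio buffer -> kernel */
        || fsync(fileno(fp)) != 0) {     /* kernel -> stable storage */
        saved = errno;
        fclose(fp);
        unlink(tmp);
        errno = saved;
        return -1;
    }

    if (fclose(fp) != 0) {
        saved = errno;
        unlink(tmp);
        errno = saved;
        return -1;
    }

    /* Atomically replace the old file with the fully written one. */
    if (rename(tmp, path) != 0) {
        saved = errno;
        unlink(tmp);
        errno = saved;
        return -1;
    }
    return 0;
}
```

A caller would use it as write_file_durably("/var/lib/heartbeat/crm/cib.xml", xml_text) and log strerror(errno) on a -1 return. For full durability of the rename itself, the containing directory would also need an fsync(); that step is left out to keep the sketch short.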
cibadmin -Q should indicate if the data is not available locally
================================================================

Currently, if you have a cib.xml file that is corrupt, it seems that there is no way for heartbeat to recover (neither automatically nor through manual intervention); see the previous bug. When the machine is in this state (which could still happen after the previous bug is fixed, in the case of a disk error affecting the directory which contains the file in question), "cibadmin -Q" should indicate that the data is being obtained from another machine.

If you have a two-node cluster and one node is in such a state, then there is no redundancy, and shutting down the node with the good copy of the CIB will destroy the cluster. The sys-admin should have some way of knowing that a routine operation (rebooting one node of a cluster) is certain to cause a catastrophe.

It could be argued that a node that can't write to its cib.xml file should disable itself and demand that the sys-admin fix the problem. I have no strong feelings on this issue, apart from the fact that the current operation is wrong!

cibadmin -E should verify that the change took place and report errors
======================================================================

The "cibadmin -E" operation should make a minimal effort to verify that the requested change took place. Currently "cibadmin -E" will return 0 and display no error message even in a situation where the local cib.xml file is corrupted (which with current code means that heartbeat won't modify it) and remote nodes are not available. This means silent data loss!

This may require changes to the "cib" daemon, as it may not be returning the status to the cibadmin tool.

ADDITIONAL INFORMATION

If the "-s" option to cibadmin is used for a synchronous operation when the local cib.xml is corrupt and the other node is not running, then cibadmin will hang seemingly indefinitely (I observed it hanging for 20 minutes).
There is no error message, though. So I guess we can run cibadmin -s to get an indication that things are going wrong, but with no idea of what is going wrong.

cibadmin (heartbeat) should flag error conditions on delete
===========================================================

When cibadmin can't complete an operation, it should display an error message on stderr and return an error code to the environment.

cibadmin -s --obj_type status -D -X "<lrm_resource id=\"appman-ui-resource\"/>"

A command such as the above can be run repeatedly and you never know how many instances of it succeeded (if any).

(heartbeat) ha_logd read process goes into an infinite loop
===========================================================

Periodically, when shutting down heartbeat, I see the following error from the read process of ha_logd. It has 1024 open file handles, most of which are socket handles for /var/lib/log_daemon. The process will be in an infinite loop at the time and will be using 100% of one CPU core.

Feb 22 14:57:00 ha-node-1 logd: [5642]: ERROR: socket_accept_connection: accept: Too many open files
Feb 22 14:57:01 ha-node-1 logd: [5642]: ERROR: socket_accept_connection: accept: Too many open files
Feb 22 14:57:01 ha-node-1 logd: [5642]: ERROR: socket_accept_connection: accept: Too many open files
Feb 22 14:57:01 ha-node-1 logd: [5642]: ERROR: socket_accept_connection: accept: Too many open files
Feb 22 14:57:02 ha-node-1 logd: [5642]: ERROR: socket_accept_connection: accept: Too many open files
Feb 22 14:57:02 ha-node-1 logd: [5642]: ERROR: socket_accept_connection: accept: Too many open files

inappropriate warnings about core files logged by heartbeat
===========================================================

When /proc/sys/kernel/core_pattern is set, warning messages such as the following should not be logged.

Mar 1 14:55:28 ha-node-0 logd: [29422]: WARN: Core dumps could be lost if multiple dumps occur
Mar 1 14:55:28 ha-node-0 logd: [29422]: WARN: Consider setting /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability

wrong permissions on cib.xml
============================

Mar 1 15:42:23 ha-node-1 cib: [6010]: WARN: crm_is_writable: /var/lib/heartbeat/crm/cib.xml should be owned and r/w by group haclient

The above message appears in /var/log/heartbeat.log on ha-node-1. The file in question is mode 0600; according to the warning it should be 0660. Either the warning or the permissions of the file (as created by heartbeat) needs to change.

Also, I think that both the xml file and the signature should have the same permissions; currently the .sig file is mode 0644.

Thank you,

Aníbal Monsalve Salazar
--
R&D Software Engineer, SGI Australian Software Group

_______________________________________________________
Linux-HA-Dev: http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
