[Linux-ha-dev] issues I'm working on at the moment

Aníbal Monsalve Salazar Mon, 12 Mar 2007 19:27:12 -0800

Hello,

All of the following bugs were reported by Russell Coker. I'm
posting them here in case someone knows whether they have already
been fixed. I would appreciate any help, comments, or pointers about
these bugs. I'm working on fixes regardless and will submit the
patches to this list.


heartbeat should detect and recover from corrupt CIB
====================================================

With XFS, a recently created file may end up filled with zeros if
power is lost (for example, by an IPMI fence) at an inconvenient
time.

Heartbeat keeps a backup copy of /var/lib/heartbeat/crm/cib.xml but
if the primary copy is filled with zeros it doesn't use the backup!

I believe that heartbeat should use the backup copy of cib.xml
whenever the primary does not conform to the XML schema.
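
The fallback could look roughly like the sketch below. It is only an
illustration: the function names are hypothetical (not taken from the
heartbeat source), and the check is a cheap plausibility test rather
than the schema validation proposed above. It does catch the
zero-filled case, because a NUL byte is neither whitespace nor '<'.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if the file is readable, non-empty,
 * and its first non-whitespace byte is '<'. This is NOT schema
 * validation, only a quick filter for obviously corrupt files. */
int cib_file_looks_plausible(const char *path)
{
    FILE *fp = fopen(path, "rb");
    int c;

    if (fp == NULL) {
        return 0;               /* missing or unreadable */
    }
    while ((c = fgetc(fp)) != EOF && isspace(c)) {
        /* skip leading whitespace */
    }
    fclose(fp);
    return c == '<';            /* EOF and NUL both fail here */
}

/* Hypothetical helper: returns the path that should be loaded, or
 * NULL if neither copy looks usable. */
const char *choose_cib_file(const char *primary, const char *backup)
{
    if (cib_file_looks_plausible(primary)) {
        return primary;
    }
    if (cib_file_looks_plausible(backup)) {
        return backup;
    }
    return NULL;
}
```

A caller that gets NULL back has no usable local copy and would have
to obtain the CIB from another node.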

I also believe that given the XFS issue one backup copy is not
enough and that we should have multiple backups so that if there is
more than one change made to cib.xml in the 30 seconds before the
reboot there will still be a good copy after the reboot. As the
files in question are small there is no reason why we couldn't have
20 backup copies.
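
One way to keep several generations is numbered backups that are
shifted before each update. The sketch below is an assumption about
how that might be done, not existing heartbeat code: the
".1" to ".20" naming and the function name are made up for
illustration. It hard-links the current file to ".1", so it relies on
the primary later being replaced by rename() rather than rewritten in
place.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define CIB_BACKUP_COUNT 20   /* figure suggested in the text above */

/* Hypothetical helper: shift base.1 -> base.2, ..., overwriting the
 * oldest copy, then hard-link the current file to base.1.
 * Returns 0 on success, -1 on error. */
int rotate_backups(const char *base, int keep)
{
    char from[4096];
    char to[4096];
    int i;

    if (keep < 1 || strlen(base) > sizeof from - 16) {
        return -1;
    }
    for (i = keep - 1; i >= 1; i--) {
        snprintf(from, sizeof from, "%s.%d", base, i);
        snprintf(to, sizeof to, "%s.%d", base, i + 1);
        rename(from, to);     /* empty slots are expected; ignore errors */
    }
    snprintf(to, sizeof to, "%s.1", base);
    unlink(to);               /* link() fails if the target exists */
    return link(base, to);    /* primary stays in place during rotation */
}
```

Calling rotate_backups(path, CIB_BACKUP_COUNT) before each update
would keep the 20 most recent versions next to the primary.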

Heartbeat has code to handle the situation of different versions of
the cib.xml on different nodes, but currently does not seem to
handle the situation of a corrupt cib.xml. To cope with the failure
conditions of filesystems other than XFS some of these backups
should be stored in different directories.

heartbeat needs all code related to writing cib.xml audited
===========================================================

There is a potential SEGV in the code for writing the cib.xml
(fopen() followed immediately by fclose() with no error checking on
line 636 of lib/crm/common/xml.c).

The code for writing cib.xml also calls fflush() as its only
mechanism for ensuring that the data gets to disk.

The above two bugs need to be fixed and the code in question needs
to be audited to ensure that there are no more bugs of the same
nature.
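
For reference, a write path that addresses both points might look
like the sketch below. It is not the code from lib/crm/common/xml.c;
the function name and the ".tmp" suffix are assumptions. It checks
the result of fopen() before using the stream, calls fsync() after
fflush() (fflush() only hands the data to the kernel), and renames a
temporary file into place so a crash never leaves a half-written
file under the final name.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: write data to dir/name so that either the old
 * or the new content survives a crash. Returns 0 on success, -1 on
 * error. */
int write_file_durably(const char *dir, const char *name, const char *data)
{
    char tmp_path[4096];
    char final_path[4096];
    FILE *fp;
    int dir_fd;
    int ok;

    snprintf(tmp_path, sizeof tmp_path, "%s/%s.tmp", dir, name);
    snprintf(final_path, sizeof final_path, "%s/%s", dir, name);

    fp = fopen(tmp_path, "w");
    if (fp == NULL) {
        return -1;                    /* never pass NULL to fclose() */
    }
    ok = fputs(data, fp) != EOF
        && fflush(fp) == 0            /* stdio buffer -> kernel */
        && fsync(fileno(fp)) == 0;    /* kernel -> stable storage */
    if (fclose(fp) != 0) {
        ok = 0;
    }
    if (!ok || rename(tmp_path, final_path) != 0) {
        unlink(tmp_path);
        return -1;
    }

    /* Sync the directory so the rename itself is on disk. */
    dir_fd = open(dir, O_RDONLY);
    if (dir_fd < 0) {
        return -1;
    }
    ok = fsync(dir_fd) == 0;
    close(dir_fd);
    return ok ? 0 : -1;
}
```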

cibadmin -Q should indicate if the data is not available locally
================================================================

Currently, if you have a corrupt cib.xml file, it seems that there
is no way for heartbeat to recover, either automatically or through
manual intervention (see the previous bug).

When the machine is in this state (which could happen after the
previous bug is fixed in the case of a disk error affecting the
directory which contains the file in question) "cibadmin -Q" should
indicate that the data is being obtained from another machine. If
you have a two-node cluster and one node is in such a state then
there is no redundancy and shutting down the node with the good
copy of the CIB will destroy the cluster.

The sys-admin should have some way of knowing that a routine
operation (rebooting one node of a cluster) is certain to cause a
catastrophe.

It could be argued that a node that can't write to its cib.xml file
should disable itself and demand that the sys-admin fix the problem.
I have no strong feelings on this issue apart from the fact that the
current operation is wrong!

cibadmin -E should verify that the change took place and report errors
======================================================================

The "cibadmin -E" operation should make a minimal effort to verify
that the requested change took place.

Currently "cibadmin -E" will return 0 and display no error message
even in a situation where the local cib.xml file is corrupted (which
with current code means that heartbeat won't modify it) and remote
nodes are not available - this means silent data loss!

This may require changes to the "cib" daemon as it may not be
returning the status to the cibadmin tool.

ADDITIONAL INFORMATION

If the "-s" option to cibadmin is used for a synchronous operation
when the local cib.xml is corrupt and the other node is not running,
then cibadmin will hang, seemingly indefinitely (I observed it
hanging for 20 minutes), without printing any error message.

So I guess we can run cibadmin -s to get an indication that things
are going wrong, but with no idea of what is going wrong.

cibadmin (heartbeat) should flag error conditions on delete
===========================================================

When cibadmin can't complete an operation it should display an error
message to stderr and return an error code to the environment.

cibadmin -s --obj_type status -D -X "<lrm_resource id=\"appman-ui-resource\"/>"

A command such as the above can be run repeatedly and you never know
how many instances of it succeeded (if any).
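
If cibadmin did return a non-zero exit code on failure, a caller
could count the successes with something like the sketch below. The
helper names are hypothetical, and the sketch only shows the calling
side; it says nothing about how cibadmin itself would detect the
failure.

```c
#include <stdlib.h>
#include <sys/wait.h>

/* Hypothetical helper: run a shell command and return its exit
 * status, or -1 if it could not be run or was killed by a signal. */
int command_exit_status(const char *command)
{
    int raw = system(command);

    if (raw == -1 || !WIFEXITED(raw)) {
        return -1;
    }
    return WEXITSTATUS(raw);
}

/* Hypothetical helper: run the command `attempts` times and count
 * how many runs exited with status 0. */
int count_successes(const char *command, int attempts)
{
    int successes = 0;
    int i;

    for (i = 0; i < attempts; i++) {
        if (command_exit_status(command) == 0) {
            successes++;
        }
    }
    return successes;
}
```

With the current behaviour every run counts as a success, which is
exactly the problem described above.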

(heartbeat) ha_logd read process goes into an infinite loop
===========================================================

Periodically when shutting down heartbeat I see the following error
from the read process of ha_logd. It has 1024 open file handles,
most of which are socket handles for /var/lib/log_daemon. At that
point the process is in an infinite loop and is using 100% of one
CPU core.

Feb 22 14:57:00 ha-node-1 logd: [5642]: ERROR: socket_accept_connection: accept: Too many open files
Feb 22 14:57:01 ha-node-1 logd: [5642]: ERROR: socket_accept_connection: accept: Too many open files
Feb 22 14:57:01 ha-node-1 logd: [5642]: ERROR: socket_accept_connection: accept: Too many open files
Feb 22 14:57:01 ha-node-1 logd: [5642]: ERROR: socket_accept_connection: accept: Too many open files
Feb 22 14:57:02 ha-node-1 logd: [5642]: ERROR: socket_accept_connection: accept: Too many open files
Feb 22 14:57:02 ha-node-1 logd: [5642]: ERROR: socket_accept_connection: accept: Too many open files
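
The root cause has not been confirmed, but the symptoms are
consistent with an accept loop that retries immediately when
accept() fails with EMFILE: the listening socket stays readable, so
the loop spins. The sketch below shows one way to avoid the spin by
backing off briefly. The function names and the 100 ms pause are
assumptions, and this would not fix whatever is leaking the 1024
handles in the first place.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/socket.h>
#include <time.h>

/* Returns 1 if the error means the process or the system has run out
 * of file descriptors. */
int is_fd_exhaustion(int err)
{
    return err == EMFILE || err == ENFILE;
}

/* Hypothetical helper: accept one connection. If descriptors are
 * exhausted, sleep briefly before returning so that the caller's
 * loop does not spin at 100% CPU. Returns the new descriptor, or -1
 * with errno set. */
int accept_or_backoff(int listen_fd)
{
    int fd = accept(listen_fd, NULL, NULL);

    if (fd < 0 && is_fd_exhaustion(errno)) {
        int saved_errno = errno;
        struct timespec pause = { 0, 100 * 1000 * 1000 };  /* 100 ms */

        nanosleep(&pause, NULL);
        errno = saved_errno;    /* nanosleep() may have changed it */
    }
    return fd;
}
```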

inappropriate warnings about core files logged by heartbeat
===========================================================

When /proc/sys/kernel/core_pattern is set, warning messages such as
the following should not be logged.

Mar  1 14:55:28 ha-node-0 logd: [29422]: WARN: Core dumps could be lost if multiple dumps occur
Mar  1 14:55:28 ha-node-0 logd: [29422]: WARN: Consider setting /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability
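
A check along the following lines could suppress the warning. This is
a sketch, not the existing heartbeat logic: the function names are
hypothetical, and treating "contains %p" or "starts with a pipe" as
"dumps will not overwrite each other" is my own reading of what
"set" should mean here. The existing core_uses_pid test would still
be needed when this check fails.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if the pattern already keeps dumps from different
 * processes apart: it either pipes the dump to a helper program or
 * embeds the PID via %p. */
int core_pattern_separates_dumps(const char *pattern)
{
    if (pattern == NULL || pattern[0] == '\0') {
        return 0;
    }
    if (pattern[0] == '|') {
        return 1;
    }
    return strstr(pattern, "%p") != NULL;
}

/* Hypothetical helper: read the pattern from the given file
 * (normally /proc/sys/kernel/core_pattern) and report whether the
 * warning is still warranted. An unreadable file counts as "warn". */
int core_dump_warning_needed(const char *pattern_path)
{
    char pattern[256] = "";
    FILE *fp = fopen(pattern_path, "r");

    if (fp != NULL) {
        if (fgets(pattern, sizeof pattern, fp) == NULL) {
            pattern[0] = '\0';
        }
        fclose(fp);
    }
    return !core_pattern_separates_dumps(pattern);
}
```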

wrong permissions on cib.xml
===========================

Mar  1 15:42:23 ha-node-1 cib: [6010]: WARN: crm_is_writable: /var/lib/heartbeat/crm/cib.xml should be owned and r/w by group haclient

The above message appears in /var/log/heartbeat.log on ha-node-1. The
file in question is mode 0600; according to the warning it should be
0660. Either the warning or the permissions of the file (as created
by heartbeat) need to change.

Also, I think that both the xml file and the signature should have
the same permissions. Currently the .sig file is mode 0644.
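
If the decision is to change the permissions rather than the warning,
the file could be created as in the sketch below. The helper names
are hypothetical and this is not how heartbeat currently creates the
file. The fchmod() call is there because open() applies the process
umask, which can silently strip the group bits.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if the mode grants both read and write to the group. */
int mode_allows_group_rw(mode_t mode)
{
    mode_t wanted = S_IRGRP | S_IWGRP;

    return (mode & wanted) == wanted;
}

/* Hypothetical helper: create (or truncate) a file with mode 0660
 * regardless of the umask. Returns 0 on success, -1 on error. */
int create_group_rw_file(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0660);

    if (fd < 0) {
        return -1;
    }
    if (fchmod(fd, 0660) != 0) {   /* set the exact mode, ignoring umask */
        close(fd);
        return -1;
    }
    return close(fd);
}
```

The same helper could be used for the .sig file so that both end up
with identical permissions.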

Thank you,

Aníbal Monsalve Salazar
--
R&D Software Engineer, SGI Australian Software Group
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
