On 2/23/07, Keisuke MORI <[EMAIL PROTECTED]> wrote:
Hi,

We have found several problems with the pgsql RA through our testing.
It fails to fail over in some scenarios.
I'm proposing a patch to fix them.

Problem description:

1) The first 'monitor' may fail even if the postmaster was
   successfully launched.

   This is because 'start' of pgsql may return before the
   postmaster is ready to answer the psql query issued by
   'monitor', since it only checks for the existence of the
   postmaster process. The postmaster can take a few minutes to
   become ready, particularly when it needs to recover the
   database after a crash. Even when no recovery is necessary, we
   observed that the first 'monitor' sometimes fails in our test cases.

2) The postmaster fails to start up when a 'postmaster.pid' file
   was left over from a previous crash.

3) 'stop' does not execute the fast mode shutdown effectively,
   because it executes the immediate mode shutdown at the very
   next moment.  The fast mode shutdown can take a few minutes
   to finish flushing the database log.

   This isn't a critical problem, but it may make the failover
   take longer to complete (according to our database team).
   It is preferable to wait for the fast mode shutdown to
   complete for as long as possible.


Proposals to fix:

1) In 'start', wait until the postmaster is ready to answer, using
   the same check that 'monitor' does.
   The maximum time to wait for startup to complete can be
   customized by an additional parameter 'start_wait'.

2) Add cleanup code for 'postmaster.pid' on stop and before starting.

3) In 'stop', wait until the postmaster completes the fast
   mode shutdown.
   The maximum time to wait for shutdown to complete can be
   customized by an additional parameter 'stop_wait'.
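
As a rough illustration of the bounded wait in proposals 1) and 3),
here is a minimal sketch of a polling helper. It is not the code from
the attached patch; the name 'wait_until' and its argument layout are
assumptions made only for this example.

```shell
#!/bin/sh
# Hypothetical helper, not part of the pgsql RA.
# wait_until MAX_SECONDS COMMAND [ARGS...]
# Runs COMMAND about once per second until it succeeds or until
# MAX_SECONDS one-second sleeps have elapsed.
# Returns 0 if COMMAND succeeded, 1 if the wait timed out.
wait_until() {
    max_wait=$1
    shift
    waited=0
    while ! "$@"; do
        if [ "$waited" -ge "$max_wait" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

In 'start' this could be called as something like
wait_until "$start_wait" pgsql_monitor, where the variable holding the
'start_wait' parameter is again only a placeholder. In 'stop' the
command would have to be a check that succeeds once the postmaster is
gone.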


The attached patch is for the latest -dev.

I'd be more inclined to go with something like the patch below.

The function of start_wait and stop_wait is just as easily achieved by
setting the action's timeout.  It's also harder to mess up (e.g. by
setting start_wait to longer than the start action's timeout).

diff -r 959f2c429fc3 resources/OCF/pgsql.in
--- a/resources/OCF/pgsql.in    Fri Feb 23 10:59:12 2007 +0100
+++ b/resources/OCF/pgsql.in    Fri Feb 23 12:18:53 2007 +0100
@@ -197,15 +197,12 @@ pgsql_start() {
       return $OCF_ERR_GENERIC
    fi

-    if ! pgsql_status
-    then
-       sleep 5
-       if ! pgsql_status
-       then
-           echo "ERROR: PostgreSQL is not running!"
-            return $OCF_ERR_GENERIC
-       fi
-    fi
+
+    active=1
+    while [ $active != 0 ]; do
+       pgsql_monitor
+       active=$?
+    done

    return $OCF_SUCCESS
}
@@ -227,6 +224,13 @@ pgsql_stop() {
       runasowner "$PGCTL -D $PGDATA stop -m immediate > /dev/null 2>&1"
    fi

+    active=$OCF_SUCCESS
+    while [ $active != $OCF_NOT_RUNNING ]; do
+       pgsql_monitor
+       active=$?
+    done
+
+    rm -f $PIDFILE
    return $OCF_SUCCESS
}
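
For the cleanup before 'start', a more cautious variant would remove
the pid file only when the process it names is gone. A minimal sketch,
assuming the first line of postmaster.pid holds the postmaster's PID
and that the RA runs with enough privilege for 'kill -0' to probe it;
the function name 'pidfile_is_stale' is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper, not part of the pgsql RA.
# pidfile_is_stale PIDFILE
# Returns 0 if PIDFILE exists but the PID on its first line does not
# name a live process; returns 1 if the file is missing or the
# process is alive.
# Caveat: a recycled PID belonging to an unrelated process would make
# the file look live.
pidfile_is_stale() {
    pidfile=$1
    [ -f "$pidfile" ] || return 1
    pid=$(head -n 1 "$pidfile")
    case "$pid" in
        ''|*[!0-9]*) return 0 ;;  # empty or non-numeric: treat as stale
    esac
    if kill -0 "$pid" 2>/dev/null; then
        return 1
    fi
    return 0
}
```

It could then be used as:
if pidfile_is_stale "$PIDFILE"; then rm -f "$PIDFILE"; fi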
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
