I like the idea of the patch, but honestly I don't like how it's
implemented. It should call the "monitor" function (as Andrew suggested)
to check whether pgsql is up or down instead of duplicating the same code
throughout the script. I'd like to review the idea and prepare another
patch if everybody agrees.

On 2/23/07, Keisuke MORI <[EMAIL PROTECTED]> wrote:
Hi,

We have found several problems with the pgsql RA through our testing.
It fails to fail over in some scenarios.
I'm proposing a patch to fix them.

Problem description:

1) The first 'monitor' may fail even if the postmaster was
  successfully launched.

  This is because 'start' of the pgsql RA may return before the
  postmaster is ready to answer the psql query issued by
  'monitor', since 'start' only checks for the existence of the
  postmaster process. The postmaster can take a few minutes to
  become ready, particularly when it needs to recover the database
  after a crash. Even when no recovery is necessary, we observed
  that the first 'monitor' sometimes fails in some of our test cases.

2) The postmaster fails to start up when a 'postmaster.pid' file
  was left over from a previous crash.

3) 'stop' does not execute the fast mode shutdown effectively,
  because it executes the immediate mode shutdown at the very
  next moment.  The fast mode shutdown can take a few minutes
  to finish flushing the database log.

  This isn't a critical problem, but it may make the failover
  take longer to complete (according to our database team).
  It is preferable to wait for the fast mode shutdown to
  complete for as long as possible.
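
For problem 2, one way to tell a leftover 'postmaster.pid' from a live one is to read the PID on the file's first line and probe that process before removing anything. The following is only a minimal sketch in Python (the RA itself is a shell script), not the attached patch. `process_alive`, `pid_file_is_stale` and `remove_stale_pid_file` are hypothetical helper names, and the sketch does not guard against the PID having been reused by an unrelated process.

```python
import os
from pathlib import Path
from typing import Callable


def process_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence; nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def pid_file_is_stale(pid_file: Path,
                      alive: Callable[[int], bool] = process_alive) -> bool:
    """Return True if pid_file names a PID that is no longer running.

    A missing or unparseable file is reported as not stale, so the caller
    never deletes something it could not interpret (an assumption of this
    sketch, chosen to err on the side of caution).
    """
    try:
        pid = int(pid_file.read_text().splitlines()[0])
    except (FileNotFoundError, IndexError, ValueError):
        return False
    if pid <= 0:
        return False  # never probe PID 0 or negative values (process groups)
    return not alive(pid)


def remove_stale_pid_file(pid_file: Path,
                          alive: Callable[[int], bool] = process_alive) -> bool:
    """Delete pid_file only if it is stale; return True if it was removed."""
    if pid_file_is_stale(pid_file, alive):
        pid_file.unlink()
        return True
    return False
```

The `alive` parameter exists only so the check can be exercised without a real postmaster; a caller would normally leave it at its default.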


Proposals to fix:

1) In 'start', wait until the postmaster is ready to answer, using
  the same check that 'monitor' performs.
  The maximum time to wait for startup can be customized with
  an additional parameter 'start_wait'.

2) Add cleanup code for 'postmaster.pid' on stop and before start.

3) In 'stop', wait until the postmaster completes the fast
  mode shutdown.
  The maximum time to wait for shutdown can be customized with
  an additional parameter 'stop_wait'.
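
The waiting behaviour in proposals 1 and 3 can be sketched as one polling helper shared by 'start' and 'stop'. This is a rough illustration in Python rather than the RA's shell, and it is not the attached patch. Only 'start_wait' and 'stop_wait' come from the proposal: `wait_until`, `start`, `stop`, the callables passed in, the `interval` default and the `grace` period after an immediate shutdown are all assumptions made for the example.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout: float, interval: float = 1.0,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll check() until it returns True or `timeout` seconds have elapsed."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def start(launch: Callable[[], None], monitor_ok: Callable[[], bool],
          start_wait: float, **poll_kw) -> bool:
    """Launch the server, then wait until the monitor-style check passes."""
    launch()
    return wait_until(monitor_ok, start_wait, **poll_kw)


def stop(fast_shutdown: Callable[[], None],
         immediate_shutdown: Callable[[], None],
         is_stopped: Callable[[], bool],
         stop_wait: float, grace: float = 5.0, **poll_kw) -> bool:
    """Request a fast shutdown; escalate to immediate only after stop_wait."""
    fast_shutdown()
    if wait_until(is_stopped, stop_wait, **poll_kw):
        return True
    # The fast shutdown did not finish in time, so fall back to immediate mode.
    immediate_shutdown()
    return wait_until(is_stopped, grace, **poll_kw)
```

Passing the readiness check in as a callable means 'start' reuses whatever 'monitor' does instead of carrying its own copy of the check. The `clock` and `sleep` parameters are there only so the loop can be tested without real delays.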


The attached patch is for the latest -dev.

Regards,

Keisuke MORI
NTT DATA Intellilink Corporation


_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


