On Tuesday, August 05 2008, Tom Lane wrote:
> Huh?  The problem case is that the primary server goes down, which would
> certainly mean that a pgbouncer instance on the same machine goes with
> it.  So it seems to me that integrating pgbouncer is 100% backwards.
With all due respect, it seems to me you're missing an important piece of the scheme here; I certainly failed to explain it correctly. I'm not sure that detailing what I have in mind will fully answer your concerns, but still...

What I have in mind is having a pgbouncer listener process at both the master and the slave sites. Your clients can then already connect to the slave for normal operations, and the listener process simply connects them to the master, transparently. When we later provide read-only slaves, some queries could be processed locally instead of being sent to the master. The point is that the client does not have to care whether it's connecting to a master or a slave: -core knows what it can handle for the client and handles it (proxying the connection).

Now, that does not solve client-side automatic failover per se; it's another way to think about it:
 - both master and slave accept connections in any mode,
 - master and slave are able to "speak" to each other (a life link),
 - when the master knows it's crashing (elog(FATAL)), it can say so to the slave,
 - when told so, the slave can switch to master.

This obviously only catches some errors on the master, the ones we're able to log about. So it does nothing on its own to allow HA in case of a master crash. But...

> Failover that actually works is not something we can provide with
> trivial changes to Postgres.  It's really a major project in its
> own right: you need heartbeat detection, STONITH capability,
> IP address redirection, etc.  I think we should be recommending
> external failover-management project(s) instead of offering a
> half-baked home-grown solution.  Searching freshmeat for "failover"
> finds plenty of potential candidates, but not having used any of
> them I'm not sure which are worth closer investigation.

We have worked here with heartbeat, and automating failover is hard.
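To make the handshake concrete, here is a minimal sketch of the idea in Python — hypothetical illustration only, not actual pgbouncer or -core code; the `Node` class, its roles, and the query strings are all invented for the example:

```python
# Sketch of the proposed scheme: both nodes accept connections, the
# slave transparently proxies work to the master over the "life link",
# and a master that knows it is crashing (elog(FATAL)) tells the slave,
# which then switches to master.  All names here are hypothetical.

class Node:
    def __init__(self, name, role, peer=None):
        self.name = name
        self.role = role          # "master", "slave", or "down"
        self.peer = peer          # the "life link" to the other node

    def handle_query(self, sql):
        # Clients may connect to either node; a slave forwards the
        # query to the current master, transparently to the client.
        if self.role == "master":
            return "%s executes: %s" % (self.name, sql)
        if self.role == "slave" and self.peer is not None:
            return self.peer.handle_query(sql)
        raise RuntimeError("%s cannot serve queries (role=%s)"
                           % (self.name, self.role))

    def fatal(self):
        # The master knows it is crashing: say so to the slave first.
        if self.role == "master" and self.peer is not None:
            self.peer.promote()
        self.role = "down"

    def promote(self):
        # When told so, the slave switches to master.
        self.role = "master"

master = Node("pg-a", "master")
slave = Node("pg-b", "slave", peer=master)
master.peer = slave

print(slave.handle_query("UPDATE ..."))  # proxied: "pg-a executes: UPDATE ..."
master.fatal()                           # master signals, then goes down
print(slave.handle_query("UPDATE ..."))  # now "pg-b executes: UPDATE ..."
```

Of course this only models the "clean crash" path where the master still gets to speak; it says nothing about heartbeats, STONITH, or a master that dies silently, which is exactly why automating failover stays hard.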
The difficulty is not only technical; it's also because:
 - current PostgreSQL offers no synchronous replication, so switching means trading away or losing the D in ACID,
 - you do not want to lose any committed data.

If 8.4 resolves this, implementing failover will be a lot easier.

Where I see my proposal fitting is in handling part of the smartness in -core directly, so that the hard parts of STONITH/failover/switchback could be implemented in cooperation with -core rather than by playing tricks against it. For example, switching back when the master comes back online would only mean having the master tell the slave to redirect queries to it again as soon as it's ready --- which is still the hard part: syncing the data back. Having clients able to blindly connect to the master or any slave, with the current cluster topology smartness in -core, would certainly help here, even if it does not fulfill all HA goals.

Of course, in the case of a hard master crash, we still have to make sure it won't restart on its own, and we have to have an external way to make a chosen slave become the master. I'm even envisioning that -core could help STONITH projects by having something like the recovery.conf file for the master to restart in not-up-to-date slave mode. Whether we implement resyncing to the new master in -core or from external scripts is another concern, but -core could certainly help here (even if not in 8.4, of course).

I'm still thinking that this proposal has a place in the scheme of an integrated HA solution and offers interesting bits.

Regards,
-- 
dim