Re: [Pgpool-general] Cannot add node after failure

Fernando Morgenstern Thu, 17 Dec 2009 04:41:38 -0800

Hello,

Sorry for the lack of replies.

Before checking your email, I realized that recovery was only working on the nodes where pgpool had been compiled. I decided to compile it on the others and now it works OK (just compiled; I didn't run pgpool there).

During the last day, I have been simulating recoveries by killing postgres and shutting down servers randomly. Most of the time, recovery completes perfectly, but there are some specific cases where pcp_recovery_node reports that the command is complete but the recovery isn't done (e.g. some databases that I created while the failed node was down are not present when it starts up).
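
One way to confirm what is missing after a recovery is to compare the database list on each backend directly, bypassing pgpool. This is only a minimal sketch, assuming every node accepts direct psql connections as user postgres; the function names and the PSQL override variable are hypothetical and not part of pgpool:

```sh
#! /bin/sh
# Sketch: compare the list of databases on two backends.
# PSQL is a hypothetical override (defaults to psql) so the logic
# can be tried without a running server.

list_dbs() {
    # $1 = backend host; prints one database name per line, sorted
    ${PSQL:-psql} -h "$1" -U postgres -At \
        -c "SELECT datname FROM pg_database ORDER BY datname" postgres
}

same_dbs() {
    # Returns 0 when both lists match, 1 when they differ,
    # and 2 when either backend could not be queried.
    a=$(list_dbs "$1") || return 2
    b=$(list_dbs "$2") || return 2
    [ "$a" = "$b" ]
}
```

Calling same_dbs with the master host and the recovered host right after pcp_recovery_node returns would show whether the recovered node really has the same databases.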

I tried to isolate the log messages from when this behaviour happens, and here they are: http://pastebin.ca/1718115

I really don't see anything different, but the fact is that some data is missing on node 1, which is the one being recovered.

Could you give me some advice on what I should check, or point out anything that looks wrong?

By the way, I am using pgpool_recovery as both the 1st and 2nd stage recovery command in pgpool, and the script looks like this:


#! /bin/sh

if [ $# -ne 3 ]
then
    echo "pgpool_recovery datadir remote_host remote_datadir"
    exit 1
fi

datadir=$1
DEST=$2
DESTDIR=$3

rsync -aurz --delete -e ssh $datadir/global/ $DEST:$DESTDIR/global/ &
rsync -aurz --delete -e ssh $datadir/base/ $DEST:$DESTDIR/base/ &

rsync -aurz --delete -e ssh $datadir/pg_multixact/ $DEST:$DESTDIR/pg_multixact/ &
rsync -aurz --delete -e ssh $datadir/pg_subtrans/ $DEST:$DESTDIR/pg_subtrans/ &

rsync -aurz --delete -e ssh $datadir/pg_clog/ $DEST:$DESTDIR/pg_clog/ &
rsync -aurz --delete -e ssh $datadir/pg_xlog/ $DEST:$DESTDIR/pg_xlog/ &

rsync -aurz --delete -e ssh $datadir/pg_twophase/ $DEST:$DESTDIR/pg_twophase/ &

wait
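
Note that a bare wait returns 0 once all background jobs have finished, so the script above exits successfully even when one of the rsync transfers fails. The following is a hedged sketch of a variant that waits on each job separately and reports failure if any transfer failed. The function names and the SYNC_CMD override are hypothetical; the sketch also drops -u on purpose, since --update skips files that are newer on the receiver, which may or may not be related to the missing data:

```sh
#! /bin/sh
# Sketch: run the transfers in parallel, but fail if any one of them fails.
# SYNC_CMD is a hypothetical override (defaults to rsync) for dry testing.
# Expects datadir, DEST and DESTDIR to be set as in the script above.

sync_dir() {
    # $1 = subdirectory of the data directory to copy
    ${SYNC_CMD:-rsync} -az --delete -e ssh \
        "$datadir/$1/" "$DEST:$DESTDIR/$1/"
}

sync_all() {
    pids=""
    for d in "$@"; do
        sync_dir "$d" &
        pids="$pids $!"        # remember each background job
    done
    rc=0
    for p in $pids; do
        wait "$p" || rc=1      # wait PID returns that job's exit status
    done
    return $rc
}
```

The script body would then become something like: sync_all global base pg_multixact pg_subtrans pg_clog pg_xlog pg_twophase || exit 1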

Regards,
---

Fernando Marcelo
www.consultorpc.com
[email protected]
Tel: +34 902 998971
Fax: +34 91 7903701


On 16/12/2009, at 06:47, Tatsuo Ishii wrote:

Hello,

Thanks for your info!

I was able to make some progress with node recovery when using
pgpool_recovery for both recovery commands.

I am able to recover most of the time, but sometimes it fails with
the following error:

$ pcp_recovery_node  -d 90 localhost 9898 postgres ******* 2
DEBUG: send: tos="R", len=46
DEBUG: recv: tos="r", len=21, data=AuthenticationOK
DEBUG: send: tos="D", len=6
DEBUG: recv: tos="e", len=20, data=recovery failed
DEBUG: command failed. reason=recovery failed
BackendError
DEBUG: send: tos="X", len=4

pgpool log

2009-12-15 20:10:56 DEBUG: pid 8747: pcp_child: received PCP packet type of service 'M'
2009-12-15 20:10:56 DEBUG: pid 8747: pcp_child: salt sent to the client
2009-12-15 20:10:56 DEBUG: pid 8747: pcp_child: received PCP packet type of service 'R'
2009-12-15 20:10:56 DEBUG: pid 8747: pcp_child: authentication OK
2009-12-15 20:10:56 DEBUG: pid 8747: pcp_child: received PCP packet type of service 'O'
2009-12-15 20:10:56 DEBUG: pid 8747: pcp_child: start online recovery
2009-12-15 20:10:56 LOG:   pid 8747: starting recovering node 2
2009-12-15 20:10:56 DEBUG: pid 8747: exec_checkpoint: start checkpoint
2009-12-15 20:10:56 DEBUG: pid 8747: exec_checkpoint: finish checkpoint
2009-12-15 20:10:56 LOG:   pid 8747: CHECKPOINT in the 1st stage done
2009-12-15 20:10:56 LOG:   pid 8747: starting recovery command: "SELECT pgpool_recovery('pgpool_recovery', 'im-pp3', '/usr/local/pgsql/data')"
2009-12-15 20:10:56 DEBUG: pid 8747: exec_recovery: start recovery
2009-12-15 20:10:56 ERROR: pid 8747: exec_recovery: pgpool_recovery command failed at 1st stage
2009-12-15 20:10:56 DEBUG: pid 8747: exec_recovery: finish recovery
2009-12-15 20:10:56 DEBUG: pid 8747: pcp_child: received PCP packet type of service 'X'
2009-12-15 20:10:56 DEBUG: pid 8747: pcp_child: client disconnecting. close connection
2009-12-15 20:11:22 DEBUG: pid 8446: starting health checking

Unfortunately I am not sure what this error means. Did it fail at
"SELECT pgpool_recovery('pgpool_recovery', 'im-pp3', '/usr/local/pgsql/data')"?
How can I find the reason?


The recovery command "pgpool_recovery" failed for some reason. Check the
PostgreSQL log on the master node. If the cause is not clear, try adding -x
to the shell line in your pgpool_recovery script, i.e.

#! /bin/sh -x
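
The trace produced by -x is written to standard error, so it normally lands wherever the PostgreSQL server's stderr is logged. If that is hard to find, the trace can be sent to a separate file instead. A minimal sketch; TRACE_LOG, its default path and run_traced are hypothetical names chosen for illustration:

```sh
#! /bin/sh
# Sketch: run a command with shell tracing on and collect the trace,
# together with the command's own error output, in a dedicated file.
# TRACE_LOG and its default path are examples only.
TRACE_LOG=${TRACE_LOG:-/tmp/pgpool_recovery_trace.log}

run_traced() {
    (
        exec 2>>"$TRACE_LOG"   # stderr of this subshell goes to the log
        set -x                 # print each command before running it
        "$@"
    )
}
```

Inside the recovery script, each rsync invocation could then be prefixed with run_traced, and the file named by TRACE_LOG inspected after a failed pcp_recovery_node.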

--
Tatsuo Ishii
SRA OSS, Inc. Japan

Best Regards,
---

Fernando Marcelo
www.consultorpc.com
[email protected]


On 15/12/2009, at 13:36, Jaume Sabater wrote:

On Tue, Dec 15, 2009 at 4:20 PM, Fernando Morgenstern
<[email protected]> wrote:

While reading the pgpool manual I found this:
Note that there is a restriction about online recovery. If pgpool-II
works on multiple hosts, online recovery does not work correctly,
because pgpool-II stops clients on the 2nd stage of online recovery.
If there are some pgpool hosts, pgpool-II excepted for receiving
online recovery request cannot block connections.

It means running two or more pgpool-II instances simultaneously, which
won't be your case since, with Heartbeat, you'll configure pgpool-II
as a resource, hence it will only be active on one node at a given
time.

--
Jaume Sabater
http://linuxsilo.net/

"Ubi sapientas ibi libertas"


_______________________________________________
Pgpool-general mailing list
[email protected]
http://pgfoundry.org/mailman/listinfo/pgpool-general

