Hello,
Sorry for the lack of replies.
Before checking your email, I realized that recovery was only working
on the nodes where pgpool had been compiled. I compiled it on the
others as well and now it works OK (just compiled; I didn't run pgpool there).
During the last day, I have been simulating failures by killing
postgres and shutting down servers at random. Most of the time
recovery completes correctly, but in some specific cases
pcp_recovery_node reports that the command completed although the
recovery was not actually done (e.g. some databases that I created
while the failed node was down are not present when it starts up).
I tried to isolate the log messages from when this behaviour happens;
here they are: http://pastebin.ca/1718115
I really don't see anything different, but the fact is that some data
is missing on node 1, which is the node being recovered.
Could you advise on what I should check, or point out anything that
looks wrong?
By the way, I am using pgpool_recovery as both the 1st and 2nd stage
recovery command in pgpool, and the script looks like this:
#! /bin/sh
if [ $# -ne 3 ]
then
echo "pgpool_recovery datadir remote_host remote_datadir"
exit 1
fi
datadir=$1
DEST=$2
DESTDIR=$3
rsync -aurz --delete -e ssh $datadir/global/ $DEST:$DESTDIR/global/ &
rsync -aurz --delete -e ssh $datadir/base/ $DEST:$DESTDIR/base/ &
rsync -aurz --delete -e ssh $datadir/pg_multixact/ $DEST:$DESTDIR/pg_multixact/ &
rsync -aurz --delete -e ssh $datadir/pg_subtrans/ $DEST:$DESTDIR/pg_subtrans/ &
rsync -aurz --delete -e ssh $datadir/pg_clog/ $DEST:$DESTDIR/pg_clog/ &
rsync -aurz --delete -e ssh $datadir/pg_xlog/ $DEST:$DESTDIR/pg_xlog/ &
rsync -aurz --delete -e ssh $datadir/pg_twophase/ $DEST:$DESTDIR/pg_twophase/ &
wait
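One thing that may be worth checking in a script like this: a bare `wait` with no arguments returns status 0 no matter how the background jobs exited, so a failed rsync would not make the script itself fail. Whether that is what happens here is not certain, but a minimal sketch that waits on each job individually could look like the following. The function name `sync_all` and the `RSYNC` override variable are illustrative assumptions, not part of pgpool; the subdirectory list and rsync options are taken from the script above.

```sh
#! /bin/sh
# Sketch only: sync_all and RSYNC are hypothetical names, not part of pgpool.
# Runs one rsync per subdirectory in parallel and returns non-zero if any fails.
sync_all() {
    datadir=$1
    dest=$2
    destdir=$3
    pids=""
    for d in global base pg_multixact pg_subtrans pg_clog pg_xlog pg_twophase
    do
        # RSYNC can be overridden (e.g. for a dry test); it defaults to rsync.
        ${RSYNC:-rsync} -aurz --delete -e ssh \
            "$datadir/$d/" "$dest:$destdir/$d/" &
        pids="$pids $!"
    done
    status=0
    for pid in $pids
    do
        # "wait <pid>" returns that job's exit status, unlike a bare "wait".
        wait "$pid" || status=1
    done
    return $status
}

# Intended use inside the recovery script (left commented out here):
# sync_all "$1" "$2" "$3" || exit 1
```

With this shape, a failing rsync makes the script exit non-zero, so the failure should become visible to whatever invoked the script instead of being silently dropped.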
Regards,
---
Fernando Marcelo
www.consultorpc.com
[email protected]
Tel: +34 902 998971
Fax: +34 91 7903701
On 16/12/2009, at 06:47, Tatsuo Ishii wrote:
Hello,
Thanks for your info!
I was able to make some progress with node recovery by using
pgpool_recovery for both recovery commands.
Recovery works most of the time, but sometimes it fails with
the following error:
$ pcp_recovery_node -d 90 localhost 9898 postgres ******* 2
DEBUG: send: tos="R", len=46
DEBUG: recv: tos="r", len=21, data=AuthenticationOK
DEBUG: send: tos="D", len=6
DEBUG: recv: tos="e", len=20, data=recovery failed
DEBUG: command failed. reason=recovery failed
BackendError
DEBUG: send: tos="X", len=4
pgpool log
2009-12-15 20:10:56 DEBUG: pid 8747: pcp_child: received PCP packet type of service 'M'
2009-12-15 20:10:56 DEBUG: pid 8747: pcp_child: salt sent to the client
2009-12-15 20:10:56 DEBUG: pid 8747: pcp_child: received PCP packet type of service 'R'
2009-12-15 20:10:56 DEBUG: pid 8747: pcp_child: authentication OK
2009-12-15 20:10:56 DEBUG: pid 8747: pcp_child: received PCP packet type of service 'O'
2009-12-15 20:10:56 DEBUG: pid 8747: pcp_child: start online recovery
2009-12-15 20:10:56 LOG: pid 8747: starting recovering node 2
2009-12-15 20:10:56 DEBUG: pid 8747: exec_checkpoint: start checkpoint
2009-12-15 20:10:56 DEBUG: pid 8747: exec_checkpoint: finish checkpoint
2009-12-15 20:10:56 LOG: pid 8747: CHECKPOINT in the 1st stage done
2009-12-15 20:10:56 LOG: pid 8747: starting recovery command: "SELECT pgpool_recovery('pgpool_recovery', 'im-pp3', '/usr/local/pgsql/data')"
2009-12-15 20:10:56 DEBUG: pid 8747: exec_recovery: start recovery
2009-12-15 20:10:56 ERROR: pid 8747: exec_recovery: pgpool_recovery command failed at 1st stage
2009-12-15 20:10:56 DEBUG: pid 8747: exec_recovery: finish recovery
2009-12-15 20:10:56 DEBUG: pid 8747: pcp_child: received PCP packet type of service 'X'
2009-12-15 20:10:56 DEBUG: pid 8747: pcp_child: client disconnecting. close connection
2009-12-15 20:11:22 DEBUG: pid 8446: starting health checking
Unfortunately, I am not sure what this error means. Did it fail at
"SELECT pgpool_recovery('pgpool_recovery', 'im-pp3', '/usr/local/pgsql/data')"?
How can I find the reason?
The recovery command "pgpool_recovery" failed for some reason. Check
the PostgreSQL log on the master node. If the cause is not clear from
that, try adding -x to the shell line in your pgpool_recovery script, i.e.
#! /bin/sh -x
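If it is more convenient to read the trace from a dedicated file after a failed recovery, something along these lines might help. This is only a sketch: the helper name `enable_trace` and the log path are assumptions for illustration, and `set -x` has the same effect as the -x on the interpreter line.

```sh
#! /bin/sh
# Sketch only: enable_trace and the log path are hypothetical.
enable_trace() {
    # Append everything written to stderr, including the trace, to the file.
    exec 2>>"$1"
    # Print each command before it runs (same effect as "#! /bin/sh -x").
    set -x
}

# Intended use at the top of pgpool_recovery (left commented out here):
# enable_trace /tmp/pgpool_recovery.trace
```

Each command, with its arguments expanded, is then appended to the file, which should show which step of the script fails.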
--
Tatsuo Ishii
SRA OSS, Inc. Japan
Best Regards,
---
Fernando Marcelo
www.consultorpc.com
[email protected]
On 15/12/2009, at 13:36, Jaume Sabater wrote:
On Tue, Dec 15, 2009 at 4:20 PM, Fernando Morgenstern
<[email protected]> wrote:
While reading the pgpool manual I found this:
Note that there is a restriction about online recovery. If pgpool-II
works on multiple hosts, online recovery does not work correctly,
because pgpool-II stops clients on the 2nd stage of online recovery.
If there are some pgpool hosts, pgpool-II excepted for receiving
online recovery request cannot block connections.
That refers to running two or more pgpool-II instances simultaneously,
which won't be your case: with Heartbeat you'll configure pgpool-II
as a resource, so it will only be active on one node at any given
time.
--
Jaume Sabater
http://linuxsilo.net/
"Ubi sapientas ibi libertas"
_______________________________________________
Pgpool-general mailing list
[email protected]
http://pgfoundry.org/mailman/listinfo/pgpool-general