I have a follow_master_command set which calls the pcp_recovery_node script after editing a config file on the replica which has been degenerated. Occasionally this fails with BackendError. It looks like the node doesn't get degenerated fast enough and so it's marked as still alive, thus cancelling the recovery that I just requested:
BTW, my backend hostnames are: backend_hostname0: pgpool backend_hostname1: node1 backend_hostname2: node2 2011-10-05 14:30:02 LOG: pid 31743: execute command: /var/lib/postgresql/scripts/failover 0 0 node1 Warning: Permanently added 'node1,216.121.61.228' (RSA) to the list of known hosts. Warning: Permanently added 'node1,216.121.61.228' (RSA) to the list of known hosts. 2011-10-05 14:30:14 LOG: pid 31743: find_primary_node_repeatedly: waiting for finding a primary node 2011-10-05 14:30:14 LOG: pid 31743: find_primary_node: primary node id is 1 2011-10-05 14:30:14 LOG: pid 31743: starting follow degeneration. shutdown host pgpool(5432) 2011-10-05 14:30:14 LOG: pid 31743: starting follow degeneration. shutdown host node2(5432) 2011-10-05 14:30:14 LOG: pid 31743: failover: 2 follow backends have been degenerated 2011-10-05 14:30:14 LOG: pid 31743: failover: set new primary node: 1 2011-10-05 14:30:14 LOG: pid 31743: failover: set new master node: 1 2011-10-05 14:30:14 LOG: pid 2941: start triggering follow command. 2011-10-05 14:30:14 LOG: pid 2941: execute command: /var/lib/postgresql/scripts/pgpool_remaster_replica pgpool node1 0 2011-10-05 14:30:14 LOG: pid 31743: failover done. shutdown host pgpool(5432) DEBUG: send: tos="R", len=44 DEBUG: recv: tos="r", len=21, data=AuthenticationOK 2011-10-05 14:30:15 LOG: pid 2536: starting recovering node 0 2011-10-05 14:30:15 ERROR: pid 2536: start_recovery: backend node 0 is alive DEBUG: send: tos="D", len=6 DEBUG: recv: tos="e", len=20, data=recovery failed DEBUG: command failed. reason=recovery failed DEBUG: send: tos="X", len=4 BackendError Notice that the 'shutdown host pgpool(5432)' message was logged AFTER my follow command was executed, so does this: ERROR: pid 2536: start_recovery: backend node 0 is alive mean that the node is not detached or that postgresql is still running? or something else I'm not thinking of? Looks like this is the bit of code in question: pool_log("starting recovering node %d", recovery_node); if (VALID_BACKEND(recovery_node)) { pool_error("start_recovery: backend node %d is alive", recovery_node); return 1; } So, how does it define a VALID_BACKEND and how can I modify my follow_master_command to allow the pcp_recovery_node to finish successfully? If I run the pcp_recovery_node again by hand afterward, it always works fine. I thought perhaps adding a pcp_detach_node before the pcp_recovery_node would do the trick, but it does not. /var/lib/postgresql/scripts/pgpool_remaster_replica: #!/bin/bash CONF_FILE=/var/lib/postgresql/scripts/pitr-replication.conf . ${CONF_FILE} SCRIPT_DIR=$(dirname ${CONF_FILE}) echo "$0 $* run on $(hostname)" >> ${REPLICATION_LOG} REPLICA=$1 MASTER=$2 NODEID=$3 # Update the MASTER value in the replication config file ${SSH} ${SSH_OPT} ${REPLICA} /bin/sh -c "'sed -e \"s/MASTER=.*$/MASTER=${MASTER}/\" ${CONF_FILE} > ${CONF_FILE}.swap && mv ${CONF_FILE}.swap ${CONF_FILE}'" ${SSH} ${SSH_OPT} -T ${REPLICA} $PGCTL ${PGCTLOPTIONS} -m fast -w -D ${PGDATA} stop 2>/dev/null 1>/dev/null < /dev/null & #/usr/local/bin/pcp_detach_node --debug ${PCPTIMEOUT} ${PGPOOLHOST} ${PCPPORT} ${PGPOOLUSER} ${PGPOOLPASS} ${NODEID} /usr/local/bin/pcp_recovery_node --debug ${PCPTIMEOUT} ${PGPOOLHOST} ${PCPPORT} ${PGPOOLUSER} ${PGPOOLPASS} ${NODEID} -- Jeff Frost <j...@pgexperts.com> CTO, PostgreSQL Experts, Inc. Phone: 1-888-PG-EXPRT x506 FAX: 415-762-5122 http://www.pgexperts.com/ _______________________________________________ Pgpool-general mailing list Pgpool-general@pgfoundry.org http://pgfoundry.org/mailman/listinfo/pgpool-general