On 09/03/2011 10:19, Tatsuo Ishii wrote:
> By further testing, it seems the error occurs when online recovery
> repeats two or more times. This time I got:
>
> 2011-03-09 18:13:04 ERROR: pid 13569: health check failed. 1 th host /tmp at
> port 5434 is down
> 2011-03-09 18:13:04 LOG: pid 13569: set 1 th backend down status
> 2011-03-09 18:13:04 LOG: pid 13569: starting degeneration. shutdown host
> /tmp(5434)
> 2011-03-09 18:13:04 LOG: pid 13569: execute command:
> /usr/local/etc/failover.sh 1 "/tmp" 5434 /usr/local/pgsql/standby 0 1 "/tmp" 1
> 2011-03-09 18:13:05 LOG: pid 13569: find_primary_node: 0 node is standby
> 2011-03-09 18:13:05 LOG: pid 13569: find_primary_node: no primary node found
> 2011-03-09 18:13:05 LOG: pid 13569: Primary node id saved: -1
> 2011-03-09 18:13:05 LOG: pid 13569: failover done. shutdown host /tmp(5434)
> 2011-03-09 18:13:18 LOG: pid 13604: starting recovering node 1
> 2011-03-09 18:13:18 ERROR: pid 13604: start_recover: could not connect master
> node.
>
> I did the testing in following sequences:
>
> 1) node 0 down, node 1 primary
> 2) recover node 0 (fine)
> 3) node 0 standby, node 1 primary
> 4) node 1 down, node 0 promotes to primary
> 5) recover node 1 and got above errors

OK, I was able to reproduce the problem. It occurs when the newly promoted node starts too slowly after the trigger file is created, so find_primary_node() cannot connect to it.
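The race described above (the lookup runs once, before the promoted standby is ready, and records "Primary node id saved: -1") is often handled by polling for a bounded time instead of probing only once. The Python sketch below illustrates only that general idea; `wait_for_primary`, `probe`, `max_attempts` and `interval_sec` are hypothetical names, and this is not pgpool-II's code or the fix referred to in this thread.

```python
import time
from typing import Callable, Optional


def wait_for_primary(probe: Callable[[], Optional[int]],
                     max_attempts: int = 10,
                     interval_sec: float = 1.0) -> Optional[int]:
    """Poll `probe` until it reports a primary node id or attempts run out.

    `probe` is a hypothetical stand-in for a lookup like find_primary_node():
    it returns the id of the node that reports itself as primary, or None
    if no node answers as primary yet (e.g. the promoted standby is still
    starting up after the trigger file was created).
    """
    for attempt in range(1, max_attempts + 1):
        node_id = probe()
        if node_id is not None:
            return node_id
        if attempt < max_attempts:
            time.sleep(interval_sec)  # give the promoting node time to come up
    return None  # analogous to "Primary node id saved: -1"


# Simulated probe: node 0 only answers as primary on the 3rd call.
calls = 0


def slow_promotion_probe() -> Optional[int]:
    global calls
    calls += 1
    return 0 if calls >= 3 else None


primary = wait_for_primary(slow_promotion_probe, max_attempts=5, interval_sec=0)
print(primary, calls)
```

The retry budget (max_attempts times interval_sec) bounds how long failover can stall when no node ever becomes primary.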
Forget this patch for now; I don't have time to work on it at the moment. I'm also fairly sure I've already fixed this somewhere. I'll check and fix it as soon as possible. Sorry for the noise.

Regards,

--
Gilles Darold
http://dalibo.com - http://dalibo.org

_______________________________________________
Pgpool-hackers mailing list
[email protected]
http://pgfoundry.org/mailman/listinfo/pgpool-hackers
