On 10/03/2011 01:03, Tatsuo Ishii wrote:
>>> By further testing, it seems the error occurs when online recovery
>>> repeats two or more times. This time I got:
>>>
>>> 2011-03-09 18:13:04 ERROR: pid 13569: health check failed. 1 th host /tmp
>>> at port 5434 is down
>>> 2011-03-09 18:13:04 LOG: pid 13569: set 1 th backend down status
>>> 2011-03-09 18:13:04 LOG: pid 13569: starting degeneration. shutdown host
>>> /tmp(5434)
>>> 2011-03-09 18:13:04 LOG: pid 13569: execute command:
>>> /usr/local/etc/failover.sh 1 "/tmp" 5434 /usr/local/pgsql/standby 0 1
>>> "/tmp" 1
>>> 2011-03-09 18:13:05 LOG: pid 13569: find_primary_node: 0 node is standby
>>> 2011-03-09 18:13:05 LOG: pid 13569: find_primary_node: no primary node
>>> found
>>> 2011-03-09 18:13:05 LOG: pid 13569: Primary node id saved: -1
>>> 2011-03-09 18:13:05 LOG: pid 13569: failover done. shutdown host
>>> /tmp(5434)
>>> 2011-03-09 18:13:18 LOG: pid 13604: starting recovering node 1
>>> 2011-03-09 18:13:18 ERROR: pid 13604: start_recover: could not connect
>>> master node.
>>>
>>> I did the testing in the following sequence:
>>>
>>> 1) node 0 down, node 1 primary
>>> 2) recover node 0 (fine)
>>> 3) node 0 standby, node 1 primary
>>> 4) node 1 down, node 0 promotes to primary
>>> 5) recover node 1 and got the above errors
>> OK, I was able to reproduce the problem. It occurs when the newly promoted
>> node starts too slowly after the trigger file is created, so that
>> find_primary_node() cannot connect to it.
>>
>> Forget this patch for the moment, I don't have time to work on it for
>> now. I'm also pretty sure I've already fixed that somewhere. I will
>> check and fix that asap, sorry for the noise.
> Hum.
> In your patches you changed the condition that checks whether the node
> is a standby from:
>
> SELECT pg_is_in_recovery() AND pgpool_walrecrunning()
>
> to this:
>
> not (SELECT not pg_is_in_recovery() AND not pgpool_walrecrunning())
>
> which is logically equal to:
>
> SELECT pg_is_in_recovery() OR pgpool_walrecrunning()
>
> The problem is that pg_is_in_recovery() returns true even while the
> node is promoting to primary. So find_primary_node() cannot find the
> primary node if the promotion is too slow.
>
> However, this one:
>
> SELECT pg_is_in_recovery() AND pgpool_walrecrunning()
>
> returns true only if the node is a standby *AND* not promoting. If the
> node is promoting, the WAL receiver process is not running, which is
> what pgpool_walrecrunning() checks (otherwise we wouldn't need
> pgpool_walrecrunning() at all).
>
> In summary, I think you need to revert the patches for
> find_primary_node().
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese: http://www.sraoss.co.jp
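To make the difference between the two conditions concrete, here is a minimal sketch in Python of the truth table described in the quoted message. The three node states and their boolean values are assumptions taken from that description (a promoting node is still in recovery but its WAL receiver has stopped); the names STATES, is_standby_and, is_standby_or and classify are hypothetical and are not part of pgpool-II.

```python
# Hypothetical model of what the two SQL functions would report in each
# node state, as described in the message above (not queried from a server).
STATES = {
    "primary":   {"in_recovery": False, "walrec_running": False},
    "standby":   {"in_recovery": True,  "walrec_running": True},
    "promoting": {"in_recovery": True,  "walrec_running": False},
}


def is_standby_and(state):
    """Original check: pg_is_in_recovery() AND pgpool_walrecrunning()."""
    return state["in_recovery"] and state["walrec_running"]


def is_standby_or(state):
    """Patched check: not (not pg_is_in_recovery() AND not pgpool_walrecrunning())."""
    return not (not state["in_recovery"] and not state["walrec_running"])


def classify(check):
    """Return, for each modelled state, whether `check` calls the node a standby."""
    return {name: check(state) for name, state in STATES.items()}


if __name__ == "__main__":
    for name in STATES:
        print(name,
              "AND:", is_standby_and(STATES[name]),
              "OR:", is_standby_or(STATES[name]))
```

Under these assumptions the two checks differ only for the promoting node: the OR form reports it as a standby, while the AND form does not, which matches the "no primary node found" symptom when promotion is slow.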
Yes, I agree, but it doesn't cover all cases either. Please take a look
at the following bug report:

http://pgfoundry.org/pipermail/pgpool-hackers/2011-January/000525.html

We need to fix that. Any ideas? I attached a video demonstrating the
problem in the last response of that thread.

-- 
Gilles Darold
http://dalibo.com - http://dalibo.org
