Hi Lia,
I found a similar issue while testing a 2-node configuration with LB and
replicator.
I was using cybercluster, but that doesn't change anything with respect to
pgcluster.
When I started the crashed node with -R (hot-recovery mode), no connection
was allowed through the LB,
even though the surviving node was (obviously) up and running.
Moreover, local connections to that database were allowed, and every DML got
replicated on the second node at the end of the recovery.
So, looking at the source code, I found that in case of recovery, the LB's
set_recovery() function calls PGRset_status_on_cluster_tbl(), passing the
TBL_STOP flag as the first argument. In this case,
PGRset_status_on_cluster_tbl() decrements the number of cluster members.
This leads the LB to assume that no cluster members are available at all,
even though one is acting as master in the pg_dump-based recovery and is
definitely available for any kind of operation.
I resolved this by commenting out the call to
PGRset_status_on_cluster_tbl(TBL_STOP,ptr); within the following block (file
src/cybercluster/pglb/recovery.c):
    case RECOVERY_PGDATA_ANS:

        /***********************************************************************
         * aoggianu 20080104
         *
         * Modified this case in order to allow connections
         * through the lb even if we are in a 2-node configuration.
         * The actual change does NOT stop_db (as the original cybercluster did)
         * BUT will set the status of the cluster to TBL_INIT.
         * This really allows clients to connect to the already opened db
         * (which is acting as MASTER) and continue to work.
         * This should not have any side effect, as access to the surviving node
         * is allowed (bypassing lb) during recovery phase.
         *
         ***********************************************************************/
        /* DO NOT REALLY stop cluster db */
        ptr = PGRsearch_cluster_tbl(&key);
        if (ptr != NULL)
        {
    #ifdef PRINT_DEBUG
            show_debug("%s:DO_NOT_stop_db_aoggianu host:%s port:%d max:%d",
                       func,
                       packet->hostName,ntohs(packet->port),ntohs(packet->max_connect));
    #endif

            /********************************************************
             *
             * aoggianu 20080103
             * Modified the following set_status
             * in order to allow connections to cluster
             * even in a 2-node conf
             * Now passing TBL_INIT instead of TBL_STOP
             *
             ********************************************************/
            /********************************************************
             *
             * aoggianu 20080104
             * Commenting out entirely the following call to
             * PGRset_status_on_cluster_tbl.
             * This way, lb should see just ONE active member, which is
             * really what happens during -U recovery
             *
             * PGRset_status_on_cluster_tbl(TBL_STOP,ptr);
             ********************************************************/

        }
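To make the reasoning above concrete, here is a toy model of the counting problem. It is NOT the pgcluster/cybercluster code: the struct layout, the enum and every lb_* function name are made up for illustration, and the idea that the master is the node being marked stopped at recovery start is my reading of the behaviour described above, not something checked against the real table handling.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes. The names echo TBL_INIT / TBL_STOP from the
 * post, but the values and semantics here are assumptions. */
enum node_status { NODE_INIT, NODE_STOP };

struct node {
    const char *host;
    enum node_status status;
};

/* Hypothetical balancer table with a counter of usable nodes. */
struct lb_table {
    struct node *nodes;
    size_t n_nodes;
    int usable;   /* consulted before a client is dispatched */
};

/* Put every node in service and reset the usable counter. */
void lb_init(struct lb_table *t, struct node *nodes, size_t n)
{
    size_t i;
    assert(t != NULL && nodes != NULL);
    t->nodes = nodes;
    t->n_nodes = n;
    t->usable = (int)n;
    for (i = 0; i < n; i++)
        nodes[i].status = NODE_INIT;
}

/* Mark one node as stopped; the counter drops once per in-service node. */
void lb_mark_stopped(struct lb_table *t, size_t idx)
{
    assert(idx < t->n_nodes);
    if (t->nodes[idx].status != NODE_STOP) {
        t->nodes[idx].status = NODE_STOP;
        t->usable--;
    }
}

/* Model of the start of a recovery: the failed node is already out of
 * service; stop_master selects the unpatched (1) or patched (0) behaviour.
 * Returns the number of nodes the balancer still considers usable. */
int lb_usable_after_recovery_start(struct lb_table *t, size_t failed,
                                   size_t master, int stop_master)
{
    lb_mark_stopped(t, failed);
    if (stop_master)
        lb_mark_stopped(t, master);
    return t->usable;
}

/* Returns 1 if the balancer would still accept a client connection. */
int lb_can_dispatch(const struct lb_table *t)
{
    return t->usable > 0;
}
```

With two nodes, calling lb_usable_after_recovery_start() with stop_master set to 1 leaves the counter at 0, so lb_can_dispatch() refuses clients; with stop_master set to 0 the counter stays at 1 and the master remains reachable through the LB.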
Hope this helps
Regards
--alessandro
On Jan 16, 2008 3:43 PM, Lia Domide <[EMAIL PROTECTED]> wrote:
> Hi everybody,
>
>
>
> I am trying to set up a highly available DB solution using PostgreSQL
> (8.2.5) and pgCluster (1.7.7rc7).
>
> I use 2 Ubuntu 7.04 (x32) machines, currently running as virtual machines.
>
>
>
> Pg1 (192.168.123.31)
>
> Rep1
>
> Lb1
>
>
>
> Pg2(193.168.123.29)
>
> Rep2
>
> Lb2
>
>
>
> - I managed to get replication working;
>
> - I checked the etc/hosts file;
>
> - When one node is recovering from failure (-R), any operation executed
> on the other node is correctly replicated;
>
>
>
> But the load balancers seem to work in a wrong way (at least not the way I
> am expecting them to work).
>
> - first all DB nodes are initialized, as the *pglb.sts* file
> shows
>
> - immediately after, the X in "PGRscan_cluster:X ClusterDB can be used"
> decreases by one (X - 1)
>
> - I tried to add 3 DB nodes in cluster, and I have the same
> problem: at the beginning "3 ClusterDB nodes can be used" and immediately
> after that "2 ClusterDB..", even if all three DB nodes are running.
>
> - In the 3-node scenario, when only the last DB node is up, the
> cluster is unreachable, but when either of the first two DB nodes is alive,
> the cluster is running.
>
> - In the 2 nodes scenario, when the first DB node is down the
> cluster is unreachable, even if the second DB node is alive.
>
>
>
> Does anyone know why the X in "PGRscan_cluster:X ClusterDB can be used"
> decreases, and when the X number is updated?
>
> Must a supplementary node always be kept for safety reasons (e.g. from a
> 3-node cluster only 2 may be used)?
>
>
>
> Below, some logs from load balancers, in the 2 nodes scenario:
>
> On PG2, LB2 log: (PG1 DB node stopped):
>
> 2008-01-16 15:30:30 [13087] DEBUG:PGRset_status_on_cluster_tbl():host:pg1
> port:5432 max:32 use:0 status1
> 2008-01-16 15:30:30 [13087] DEBUG:PGRset_status_on_cluster_tbl():host:pg2
> port:5432 max:32 use:0 status1
> 2008-01-16 15:30:30 [13087] DEBUG:init_pglb():Child_Tbl size is[49536]
> 2008-01-16 15:31:07 [13087] DEBUG:PGRscan_cluster:2 ClusterDB can be used
> 2008-01-16 15:31:07 [13087] DEBUG:PGRscan_cluster:pg1 [5432],useFlag->1
> max->32 use_num->0
>
> 2008-01-16 15:31:07 [13087] DEBUG:PGRset_status_on_cluster_tbl():host:pg1
> port:5432 max:32 use:1 status2
> 2008-01-16 15:31:07 [13116] DEBUG:PGRdo_child():I am 13116
> 2008-01-16 15:31:07 [13116] DEBUG:do_accept():I am 13116 accept fd 6
> 2008-01-16 15:31:07 [13116] DEBUG:read_startup_packet():Protocol Major: 3
> Minor: 0 database: TEST user: postgres
> 2008-01-16 15:31:07 [13116] ERROR:connect_inet_domain_socket(): connect()
> failed: Connection refused
> 2008-01-16 15:31:07 [13116] DEBUG:PGRset_status_on_cluster_tbl():host:pg1
> port:5432 max:32 use:2 status98
> 2008-01-16 15:31:09 [13087] DEBUG:PGRscan_cluster:1 ClusterDB can be used
> 2008-01-16 15:31:09 [13087] DEBUG:PGRscan_cluster:pg1 [5432],useFlag->98
> max->32 use_num->1
>
> 2008-01-16 15:31:09 [13087] DEBUG:PGRscan_cluster:pg2 [5432],useFlag->1
> max->32 use_num->0
>
> 2008-01-16 15:31:09 [13087] DEBUG:PGRset_status_on_cluster_tbl():host:pg2
> port:5432 max:32 use:1 status2
> 2008-01-16 15:31:09 [13117] DEBUG:PGRdo_child():I am 13117
> 2008-01-16 15:31:09 [13117] DEBUG:do_accept():I am 13117 accept fd 6
> 2008-01-16 15:31:09 [13117] DEBUG:read_startup_packet():Protocol Major: 3
> Minor: 0 database: TEST user: postgres
> 2008-01-16 15:31:09 [13117] DEBUG:create_cp():[pg2] [pg2] is same
> 2008-01-16 15:31:09 [13117] DEBUG:connect_unix_domain_socket():postmaster
> Unix domain socket: /tmp/.s.PGSQL.5432
> 2008-01-16 15:31:09 [13117] DEBUG:connect_unix_domain_socket():connected
> to postmaster Unix domain socket: /tmp/.s.PGSQL.5432 fd: 7
> ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; ### HERE I created a JDBC connection
> from another host…………………………………
> 2008-01-16 15:31:09 [13117] DEBUG:ReadyForQuery(): message length: 5
> 2008-01-16 15:31:09 [13117] DEBUG:ReadyForQuery(): transaction state: I
> 2008-01-16 15:31:09 [13117] DEBUG:ProcessFrontendResponse():read kind from
> frontend X(58)
> 2008-01-16 15:33:32 [13087] DEBUG:PGRscan_cluster:0 ClusterDB can be used
> 2008-01-16 15:33:32 [13087] DEBUG:PGRscan_cluster:pg1 [5432],useFlag->99
> max->32 use_num->1
>
> 2008-01-16 15:33:32 [13087] DEBUG:PGRscan_cluster:0 ClusterDB can be used
> 2008-01-16 15:33:32 [13087] DEBUG:PGRscan_cluster:pg1 [5432],useFlag->99
> max->32 use_num->1
>
> 2008-01-16 15:33:32 [13087] DEBUG:PGRscan_cluster:0 ClusterDB can be used
> 2008-01-16 15:33:32 [13087] DEBUG:PGRscan_cluster:pg1 [5432],useFlag->99
> max->32 use_num->1
>
> 2008-01-16 15:33:32 [13087] DEBUG:PGRscan_cluster:0 ClusterDB can be used
> 2008-01-16 15:33:32 [13087] DEBUG:PGRscan_cluster:pg1 [5432],useFlag->99
> max->32 use_num->1
>
> 2008-01-16 15:33:32 [13087] DEBUG:PGRscan_cluster:0 ClusterDB can be used
> 2008-01-16 15:33:32 [13087] DEBUG:PGRscan_cluster:pg1 [5432],useFlag->99
> max->32 use_num->1
>
> 2008-01-16 15:33:32 [13087] ERROR:PGRload_balance():no cluster available
> 2008-01-16 15:33:32 [13087] ERROR:load_balance_main():load balance process
> failed
> 2008-01-16 15:33:32 [13087] DEBUG:PGRscan_cluster:0 ClusterDB can be used
> 2008-01-16 15:33:32 [13087] DEBUG:PGRscan_cluster:pg1 [5432],useFlag->99
> max->32 use_num->1
>
> 2008-01-16 15:33:32 [13087] DEBUG:PGRscan_cluster:0 ClusterDB can be used
> 2008-01-16 15:33:32 [13087] DEBUG:PGRscan_cluster:pg1 [5432],useFlag->99
> max->32 use_num->1
>
> 2008-01-16 15:33:32 [13087] ERROR:PGRload_balance():no cluster available
> 2008-01-16 15:33:32 [13087] ERROR:load_balance_main():load balance process
> failed
> 2008-01-16 15:33:32 [13087] DEBUG:PGRscan_cluster:0 ClusterDB can be used
> 2008-01-16 15:33:32 [13087] DEBUG:PGRscan_cluster:pg1 [5432],useFlag->99
> max->32 use_num->1
>
> 2008-01-16 15:33:32 [13087] DEBUG:PGRscan_cluster:0 ClusterDB can be used
> 2008-01-16 15:33:32 [13087] DEBUG:PGRscan_cluster:pg1 [5432],useFlag->99
> max->32 use_num->1
> ……………………………………………….
> 2008-01-16 15:33:32 [13087] ERROR:PGRload_balance():no cluster available
> 2008-01-16 15:33:32 [13087] ERROR:load_balance_main():load balance process
> failed
> 2008-01-16 15:33:32 [13087] ERROR:load_balance_main():no cluster available
> 2008-01-16 15:33:32 [13087] DEBUG:do_accept():I am 13087 accept fd 6
> 2008-01-16 15:33:32 [13087] DEBUG:read_startup_packet():Protocol Major: 3
> Minor: 0 database: TEST user: postgres
> 2008-01-16 15:33:32 [13087] DEBUG:PGRscan_cluster:0 ClusterDB can be used
> 2008-01-16 15:33:32 [13087] DEBUG:PGRscan_cluster:pg1 [5432],useFlag->99
> max->32 use_num->1
>
>
> On PG1, LB1 log (both PG1 and PG2 DB services were previously started on
> 5432 port, with postgres user):
>
> 2008-01-16 16:13:27 [29688] DEBUG:PGRset_status_on_cluster_tbl():host:pg1
> port:5432 max:42 use:0 status1
> 2008-01-16 16:13:27 [29688] DEBUG:PGRset_status_on_cluster_tbl():host:pg2
> port:5432 max:42 use:0 status1
> 2008-01-16 16:13:27 [29688] DEBUG:init_pglb():Child_Tbl size is[65016]
> 2008-01-16 16:13:28 [29688] DEBUG:PGRscan_cluster:2 ClusterDB can be used
> 2008-01-16 16:13:28 [29688] DEBUG:PGRscan_cluster:pg1 [5432],useFlag->1
> max->42 use_num->0
>
> 2008-01-16 16:13:28 [29688] DEBUG:PGRset_status_on_cluster_tbl():host:pg1
> port:5432 max:42 use:1 status2
> 2008-01-16 16:13:28 [29695] DEBUG:PGRdo_child():I am 29695
> 2008-01-16 16:13:28 [29695] DEBUG:do_accept():I am 29695 accept fd 6
> 2008-01-16 16:13:28 [29695] ERROR:pool_read: read failed (Connection reset
> by peer)
> 2008-01-16 16:13:30 [29688] DEBUG:PGRscan_cluster:1 ClusterDB can be used
> 2008-01-16 16:13:30 [29688] DEBUG:PGRscan_cluster:pg1 [5432],useFlag->2
> max->42 use_num->0
>
> ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
>
> 2008-01-16 16:24:58 [31526] DEBUG:PGRdo_child():I am 31526
> 2008-01-16 16:24:58 [31526] DEBUG:do_accept():I am 31526 accept fd 6
> 2008-01-16 16:24:58 [31526] ERROR:pool_read: read failed (Connection reset
> by peer)
> 2008-01-16 16:25:00 [29688] DEBUG:PGRscan_cluster:1 ClusterDB can be used
> 2008-01-16 16:25:00 [29688] DEBUG:PGRscan_cluster:pg1 [5432],useFlag->2
> max->42 use_num->0
>
> 2008-01-16 16:25:00 [31531] DEBUG:PGRdo_child():I am 31531
> 2008-01-16 16:25:00 [31531] DEBUG:do_accept():I am 31531 accept fd 6
> 2008-01-16 16:25:00 [31531] ERROR:pool_read: read failed (Connection reset
> by peer)
> 2008-01-16 16:25:02 [29688] DEBUG:PGRscan_cluster:1 ClusterDB can be used
> 2008-01-16 16:25:02 [29688] DEBUG:PGRscan_cluster:pg1 [5432],useFlag->2
> max->42 use_num->0
>
>
> Thanks in advance,
>
> Lia Domide.
>
> _______________________________________________
> Pgcluster-general mailing list
> [email protected]
> http://pgfoundry.org/mailman/listinfo/pgcluster-general
>
>