Hi,

As per my previous mail, I'm testing with 4.99.7 on ubuntu 13.04, Postgres 9.1.9

I have 4 source databases and one target, with a very simple test schema, and am testing the scenario when one db goes offline.

For now, I test this using:

 - sudo /etc/init.d/postgres stop # (kill the pg server)


I repeat the test a few times, and sometimes after taking the node offline and inserting into an active node to break replication, status incorrectly reports good. I expect to see 'Bad' here. This is unpredictable, sometimes it reports correctly and sometimes it does not.

 Name       State    Last good    Time    Last I/D    Last bad Time
==========+========+============+=======+===========+===========+=======
 testsync | Good   | 16:33:28   | 25s   | 0/4       | 16:32:55  | 58s


In the log I can see it has trapped the connection error:

(12265) [Wed May  8 16:34:17 2013] MCP DB "testa" Postgres version: 90109
(12265) [Wed May  8 16:34:17 2013] MCP DB "testa" Database port: 5432
(12265) [Wed May 8 16:34:17 2013] MCP Connecting to database "testa_remote" (source) (12265) [Wed May 8 16:34:17 2013] MCP Warning: Killed (line 100): DBI connect('dbname=testa;host=192.168.97.93','bucardo',...) failed: could not connect to server: Connection refused
    Is the server running on host "192.168.97.93" and accepting
TCP/IP connections on port 5432? at /usr/local/share/perl/5.14.2/Bucardo.pm line 4941. (12265) [Wed May 8 16:34:17 2013] MCP Database problem, will respawn after a short sleep: 15 (12265) [Wed May 8 16:34:18 2013] MCP End of cleanup_mcp. Sys time: Wed May 8 16:34:18 2013. Database time: 2013-05-08 16:34:18.900294+01
(12265) [Wed May  8 16:34:18 2013] MCP Sleep time: 15
(12270) [Wed May 8 16:34:19 2013] KID New kid, sync "testsync" alive=1 Parent=11974 PID=12270 kicked=1
(12270) [Wed May  8 16:34:19 2013] KID Missing testa_remote database handle
(12270) [Wed May  8 16:34:19 2013] KID Missing testb database handle
(12270) [Wed May  8 16:34:19 2013] KID Missing testc database handle
(12270) [Wed May  8 16:34:19 2013] KID Missing testd database handle
(12270) [Wed May 8 16:34:19 2013] KID Kid 12270 exiting at cleanup_kid. Sync "testsync" Reason: DBI connect('dbname=testa;host=192.168.97.93','bucardo',...) failed: could not connect to server: Connection refused Is the server running on host "192.168.97.93" and accepting TCP/IP connections on port 5432? at /usr/local/share/perl/5.14.2/Bucardo.pm line 4941. Line: 2718 (11974) [Wed May 8 16:34:19 2013] CTL Created new kid 12270 for sync "testsync" (12275) [Wed May 8 16:34:30 2013] KID New kid, sync "testsync" alive=1 Parent=11974 PID=12275 kicked=1
(12275) [Wed May  8 16:34:30 2013] KID Missing testa_remote database handle
(12275) [Wed May  8 16:34:30 2013] KID Missing testb database handle
(12275) [Wed May  8 16:34:30 2013] KID Missing testc database handle
(12275) [Wed May  8 16:34:30 2013] KID Missing testd database handle
(12275) [Wed May 8 16:34:30 2013] KID Kid 12275 exiting at cleanup_kid. Sync "testsync" Reason: DBI connect('dbname=testa;host=192.168.97.93','bucardo',...) failed: could not connect to server: Connection refused Is the server running on host "192.168.97.93" and accepting TCP/IP connections on port 5432? at /usr/local/share/perl/5.14.2/Bucardo.pm line 4941. Line: 2718 (11974) [Wed May 8 16:34:30 2013] CTL Created new kid 12275 for sync "testsync"


But status is still reporting good:

 Name       State    Last good    Time     Last I/D    Last bad Time
==========+========+============+========+===========+===========+=======
 testsync | Good   | 16:33:28   | 1m 33s | 0/4       | 16:32:55  | 2m 6s



I try inserting a few more rows into the online node for luck, but status is not affected.

Now I bring the offline node back up. Replication catches up correctly, but the 'Last Bad' was not registered'

 Name       State    Last good    Time    Last I/D    Last bad Time
==========+========+============+=======+===========+===========+========
 testsync | Good   | 16:36:20   | 4s    | 0/12      | 16:32:55  | 3m 29s


and the log file ...

(12381) [Wed May  8 16:36:01 2013] MCP Activating sync "testsync"
(12381) [Wed May  8 16:36:01 2013] MCP Running validate_sync on "testsync"
(12381) [Wed May 8 16:36:01 2013] MCP Connecting to database "testa" (source) (12381) [Wed May 8 16:36:01 2013] MCP DB "testa" Local epoch: 1368027361.55505 DB epoch: 1368027361.55507 (12381) [Wed May 8 16:36:01 2013] MCP DB "testa" Local time: Wed May 8 16:36:01 2013 DB time: 2013-05-08 16:36:01.555066+01 (12381) [Wed May 8 16:36:01 2013] MCP DB "testa" Local timezone: BST (+0100) DB timezone: GB
(12381) [Wed May  8 16:36:01 2013] MCP DB "testa" Postgres version: 90109
(12381) [Wed May  8 16:36:01 2013] MCP DB "testa" Database port: 5432
(12381) [Wed May 8 16:36:01 2013] MCP Connecting to database "testa_remote" (source) (12381) [Wed May 8 16:36:01 2013] MCP Warning: Killed (line 100): DBI connect('dbname=testa;host=192.168.97.93','bucardo',...) failed: could not connect to server: Connection refused
    Is the server running on host "192.168.97.93" and accepting
TCP/IP connections on port 5432? at /usr/local/share/perl/5.14.2/Bucardo.pm line 4941. (12381) [Wed May 8 16:36:01 2013] MCP Database problem, will respawn after a short sleep: 15 (12381) [Wed May 8 16:36:03 2013] MCP End of cleanup_mcp. Sys time: Wed May 8 16:36:03 2013. Database time: 2013-05-08 16:36:03.076622+01
(12381) [Wed May  8 16:36:03 2013] MCP Sleep time: 15
(12387) [Wed May 8 16:36:09 2013] KID New kid, sync "testsync" alive=1 Parent=11974 PID=12387 kicked=1
(12387) [Wed May  8 16:36:09 2013] KID Missing testa_remote database handle
(12387) [Wed May  8 16:36:09 2013] KID Missing testb database handle
(12387) [Wed May  8 16:36:09 2013] KID Missing testc database handle
(12387) [Wed May  8 16:36:09 2013] KID Missing testd database handle
(12387) [Wed May 8 16:36:09 2013] KID Kid 12387 exiting at cleanup_kid. Sync "testsync" Reason: DBI connect('dbname=testa;host=192.168.97.93','bucardo',...) failed: could not connect to server: Connection refused Is the server running on host "192.168.97.93" and accepting TCP/IP connections on port 5432? at /usr/local/share/perl/5.14.2/Bucardo.pm line 4941. Line: 2718 (11974) [Wed May 8 16:36:09 2013] CTL Created new kid 12387 for sync "testsync" (12381) [Wed May 8 16:36:18 2013] MCP Respawn attempt: /usr/local/bin/bucardo start 'Attempting automatic respawn after MCP death'
(12394) [Wed May  8 16:36:18 2013] MCP Starting Bucardo version 4.99.7
(12394) [Wed May  8 16:36:18 2013] MCP Log level: normal
(12395) [Wed May 8 16:36:18 2013] MCP Master DB Local epoch: 1368027378.27916 DB epoch: 1368027378.27921 (12395) [Wed May 8 16:36:18 2013] MCP Master DB Local time: Wed May 8 16:36:18 2013 DB time: 2013-05-08 16:36:18.279208+01 (12395) [Wed May 8 16:36:18 2013] MCP Master DB Local timezone: BST (+0100) DB timezone: GB
(12395) [Wed May  8 16:36:18 2013] MCP Master DB Postgres version: 90109
(12395) [Wed May  8 16:36:18 2013] MCP Master DB Database port: 5432
(12395) [Wed May  8 16:36:18 2013] MCP PID: 12395
(12395) [Wed May  8 16:36:18 2013] MCP Postgres backend PID: 12396
(12395) [Wed May  8 16:36:18 2013] MCP Postgres library version: 90106
(12395) [Wed May  8 16:36:18 2013] MCP bucardo: /usr/local/bin/bucardo
(12395) [Wed May 8 16:36:18 2013] MCP Bucardo.pm: /usr/local/share/perl/5.14.2/Bucardo.pm
(12395) [Wed May  8 16:36:18 2013] MCP OS: linux  Perl: /usr/bin/perl 5.14.2
(12395) [Wed May 8 16:36:18 2013] MCP DBI version: 1.622 DBD::Pg version: 2.19.2 (21902) DBIx::Safe version: 1.2.5
(12395) [Wed May  8 16:36:18 2013] MCP Bucardo object:
 batch               => '0'
 bcverbose           => '1'
 created             => 'Wed May  8 16:36:18 2013'
 dbdpgversion        => '21902'
 dbhlist             => 'HASH(0x1adaf48)'
 dbhost              => '127.0.0.1'
 dbname              => 'bucardo'
 dbpass              => '<not shown>'
 dbport              => '5432'
 dbuser              => 'bucardo'
 dryrun              => '0'
 exit_on_nosync      => '0'
 extraname           => ''
 listening           => 'HASH(0x1eedf98)'
 logclean            => '0'
 logcodes            => 'ARRAY(0x19eb378)'
 logdest             => 'ARRAY(0x1630218)'
 logextension        => ''
 logpid              => '12395'
 logprefix           => 'MCP'
 logseparate         => '0'
 masterdbh           => 'DBI::db=HASH(0x1c633d0)'
 mcp_backend         => '12396'
 mcp_clock_timestamp => 'clock_timestamp()'
 mcppid              => '12381'
 pidfile             => '/var/run/bucardo/bucardo.mcp.pid'
 pidmap              => 'HASH(0x17b4378)'
 sendmail            => '0'
 sendmail_file       => ''
 sqlprefix           => '/* Bucardo 4.99.7 */'
 stopfile            => '/var/run/bucardo/fullstopbucardo'
 verbose             => '1'
 version             => '4.99.7'
 warning_file        => ''
(12395) [Wed May  8 16:36:18 2013] MCP Bucardo config:
 autosync_ddl              => 'newcol'
 bucardo_current_version   => '4.99.7'
 bucardo_vac               => '1'
 bucardo_version           => '4.99.6'
 ctl_checkonkids_time      => '10'
 ctl_createkid_time        => '0.5'
 ctl_sleep                 => '0.2'
 default_conflict_strategy => 'bucardo_latest'
 default_email_from        => 'root'
 default_email_host        => 'localhost'
 default_email_to          => 'root'
 email_debug_file          => ''
 endsync_sleep             => '1.0'
 flatfile_dir              => '.'
 host_safety_check         => ''
 kid_deadlock_sleep        => '0.5'
 kid_nodeltarows_sleep     => '0.5'
 kid_pingtime              => '60'
 kid_restart_sleep         => '1'
 kid_serial_sleep          => '0.5'
 kid_sleep                 => '0.5'
 log_conflict_file         => '/var/log/bucardo/conflict.log'
 log_level                 => 'normal'
 log_level_number          => '2'
 log_microsecond           => '0'
 log_showlevel             => '0'
 log_showline              => '0'
 log_showpid               => '1'
 log_showtime              => '3'
 mcp_dbproblem_sleep       => '15'
 mcp_loop_sleep            => '0.2'
 mcp_pingtime              => '60'
 mcp_vactime               => '60'
 piddir                    => '/var/run/bucardo'
 reason_file               => '/var/log/bucardo/restart.reason'
 semaphore_table           => 'bucardo_status'
 statement_chunk_size      => '10000'
 stats_script_url          => 'http://www.bucardo.org/'
 stopfile                  => 'fullstopbucardo'
 syslog_facility           => 'log_local1'
 tcp_keepalives_count      => '0'
 tcp_keepalives_idle       => '0'
 tcp_keepalives_interval   => '0'
 vac_run                   => '30'
 vac_sleep                 => '120'
 warning_file              => ''
(12395) [Wed May  8 16:36:18 2013] MCP Activating sync "testsync"
(12395) [Wed May  8 16:36:18 2013] MCP Running validate_sync on "testsync"
(12395) [Wed May 8 16:36:18 2013] MCP Connecting to database "testa" (source) (12395) [Wed May 8 16:36:18 2013] MCP DB "testa" Local epoch: 1368027378.91922 DB epoch: 1368027378.91924 (12395) [Wed May 8 16:36:18 2013] MCP DB "testa" Local time: Wed May 8 16:36:18 2013 DB time: 2013-05-08 16:36:18.919237+01 (12395) [Wed May 8 16:36:18 2013] MCP DB "testa" Local timezone: BST (+0100) DB timezone: GB
(12395) [Wed May  8 16:36:18 2013] MCP DB "testa" Postgres version: 90109
(12395) [Wed May  8 16:36:18 2013] MCP DB "testa" Database port: 5432
(12395) [Wed May 8 16:36:18 2013] MCP Connecting to database "testa_remote" (source) (12395) [Wed May 8 16:36:18 2013] MCP DB "testa_remote" Local epoch: 1368027378.94052 DB epoch: 1368027378.62312 (12395) [Wed May 8 16:36:18 2013] MCP DB "testa_remote" Local time: Wed May 8 16:36:18 2013 DB time: 2013-05-08 16:36:18.623115+01 (12395) [Wed May 8 16:36:18 2013] MCP DB "testa_remote" Local timezone: BST (+0100) DB timezone: GB (12395) [Wed May 8 16:36:18 2013] MCP DB "testa_remote" Postgres version: 90109
(12395) [Wed May  8 16:36:18 2013] MCP DB "testa_remote" Database port: 5432
(12395) [Wed May 8 16:36:18 2013] MCP Connecting to database "testb" (source) (12395) [Wed May 8 16:36:18 2013] MCP DB "testb" Local epoch: 1368027378.94788 DB epoch: 1368027378.94789 (12395) [Wed May 8 16:36:18 2013] MCP DB "testb" Local time: Wed May 8 16:36:18 2013 DB time: 2013-05-08 16:36:18.94789+01 (12395) [Wed May 8 16:36:18 2013] MCP DB "testb" Local timezone: BST (+0100) DB timezone: GB
(12395) [Wed May  8 16:36:18 2013] MCP DB "testb" Postgres version: 90109
(12395) [Wed May  8 16:36:18 2013] MCP DB "testb" Database port: 5432
(12395) [Wed May 8 16:36:18 2013] MCP Connecting to database "testc" (source) (12395) [Wed May 8 16:36:18 2013] MCP DB "testc" Local epoch: 1368027378.95366 DB epoch: 1368027378.95371 (12395) [Wed May 8 16:36:18 2013] MCP DB "testc" Local time: Wed May 8 16:36:18 2013 DB time: 2013-05-08 16:36:18.953713+01 (12395) [Wed May 8 16:36:18 2013] MCP DB "testc" Local timezone: BST (+0100) DB timezone: GB
(12395) [Wed May  8 16:36:18 2013] MCP DB "testc" Postgres version: 90109
(12395) [Wed May  8 16:36:18 2013] MCP DB "testc" Database port: 5432
(12395) [Wed May 8 16:36:18 2013] MCP Connecting to database "testd" (source) (12395) [Wed May 8 16:36:18 2013] MCP DB "testd" Local epoch: 1368027378.96011 DB epoch: 1368027378.96012 (12395) [Wed May 8 16:36:18 2013] MCP DB "testd" Local time: Wed May 8 16:36:18 2013 DB time: 2013-05-08 16:36:18.960125+01 (12395) [Wed May 8 16:36:18 2013] MCP DB "testd" Local timezone: BST (+0100) DB timezone: GB
(12395) [Wed May  8 16:36:18 2013] MCP DB "testd" Postgres version: 90109
(12395) [Wed May  8 16:36:18 2013] MCP DB "testd" Database port: 5432
(12395) [Wed May 8 16:36:19 2013] MCP Inspecting source table "public.test" on database "testa"
(12395) [Wed May  8 16:36:19 2013] MCP Active syncs: 1
(12395) [Wed May  8 16:36:19 2013] MCP Entering main loop
(12405) [Wed May  8 16:36:19 2013] VAC New VAC daemon. PID=12405
(12405) [Wed May 8 16:36:19 2013] VAC Connected to database "testd" with backend PID of 12407 (12405) [Wed May 8 16:36:19 2013] VAC Connected to database "testb" with backend PID of 12408 (12405) [Wed May 8 16:36:19 2013] VAC Connected to database "testa" with backend PID of 12409 (12405) [Wed May 8 16:36:19 2013] VAC Connected to database "testc" with backend PID of 12410 (12405) [Wed May 8 16:36:19 2013] VAC Connected to database "testa_remote" with backend PID of 6085 NOTICE: Rows deleted from delta_public_test: 3 Rows deleted from track_public_test: 3
(12395) [Wed May  8 16:36:19 2013] MCP Created VAC 12405
NOTICE: Rows deleted from delta_public_test: 0 Rows deleted from track_public_test: 0 NOTICE: Rows deleted from delta_public_test: 0 Rows deleted from track_public_test: 0 NOTICE: Rows deleted from delta_public_test: 0 Rows deleted from track_public_test: 0 NOTICE: Rows deleted from delta_public_test: 0 Rows deleted from track_public_test: 0 (12411) [Wed May 8 16:36:19 2013] CTL New controller for sync "testsync". Relgroup is "testherd", dbs is "testgroup". PID=12411
(12411) [Wed May  8 16:36:19 2013] CTL   stayalive: 1 checksecs: 0 kicked: 1
(12411) [Wed May 8 16:36:19 2013] CTL kidsalive: 1 onetimecopy: 0 lifetimesecs: 0 (NULL) maxkicks: 0
(12411) [Wed May  8 16:36:19 2013] CTL Database "testa" backend PID: 12413
(12395) [Wed May 8 16:36:19 2013] MCP Created controller 12411 for sync "testsync". Kick is 1 (12411) [Wed May 8 16:36:19 2013] CTL Database "testa_remote" backend PID: 6086
(12411) [Wed May  8 16:36:19 2013] CTL Database "testb" backend PID: 12414
(12411) [Wed May  8 16:36:19 2013] CTL Database "testc" backend PID: 12415
(12411) [Wed May  8 16:36:19 2013] CTL Database "testd" backend PID: 12416
(12417) [Wed May 8 16:36:19 2013] KID New kid, sync "testsync" alive=1 Parent=12411 PID=12417 kicked=1
(11974) [Wed May  8 16:36:19 2013] CTL Got a stop sync request, so exiting
(11974) [Wed May 8 16:36:19 2013] CTL Warning! Controller for "testsync" was killed at line 1779: Stop sync request at /usr/local/share/perl/5.14.2/Bucardo.pm line 1779.
(12417) [Wed May  8 16:36:19 2013] KID Got a stop sync request, so exiting
(12417) [Wed May 8 16:36:19 2013] KID Kid 12417 exiting at cleanup_kid. Sync "testsync" public.test Reason: Stop sync request at /usr/local/share/perl/5.14.2/Bucardo.pm line 2754. Line: 4762 (12411) [Wed May 8 16:36:19 2013] CTL Created new kid 12417 for sync "testsync" (11974) [Wed May 8 16:36:19 2013] CTL Controller 11974 exiting at cleanup_controller. Reason: Stop sync request at /usr/local/share/perl/5.14.2/Bucardo.pm line 1779. (12425) [Wed May 8 16:36:20 2013] KID New kid, sync "testsync" alive=1 Parent=12411 PID=12425 kicked=1 (12411) [Wed May 8 16:36:20 2013] CTL Created new kid 12425 for sync "testsync" (12425) [Wed May 8 16:36:20 2013] KID Delta count for testa.public.test : 3 (12425) [Wed May 8 16:36:20 2013] KID Totals: deletes=0 inserts=12 conflicts=0


Thanks for any help with this, I hope it is useful for your beta testing.

Regards,

Mike Tonks

_______________________________________________
Bucardo-general mailing list
[email protected]
https://mail.endcrypt.com/mailman/listinfo/bucardo-general

Reply via email to