Re: [Slony1-general] slony1 drop node failure
Checked the logs for our other replicas. Details below. With a 6-7 second gap, it’s less clear how the problem could have come from those. Reproducing this would probably be hard; this is the first time it’s ever happened for us. On the theory that there is a race condition, one way to reproduce would be to hack the code to delay the SYNC processing thread at just the wrong time, so the SYNC doesn’t land until disableNode processing is complete (a DB-side sketch of that idea follows at the end of this message).

2018-02-21 19:19:47 UTC [32652] INFO   remoteWorkerThread_8: SYNC 5002075962 done in 0.479 seconds
...
2018-02-21 19:19:50 UTC [32652] CONFIG disableNode: no_id=8

2018-02-21 19:19:46 UTC [9360] INFO   remoteWorkerThread_8: SYNC 5002075962 done in 3.979 seconds
...
2018-02-21 19:19:52 UTC [9360] CONFIG disableNode: no_id=8

2018-02-21 19:19:48 UTC [7420] INFO   remoteWorkerThread_8: SYNC 5002075962 done in 2.445 seconds
...
2018-02-21 19:19:55 UTC [7420] CONFIG disableNode: no_id=8

Tom(

On 2/28/18, 10:28 PM, "Steve Singer" wrote:

On Mon, 26 Feb 2018, Tignor, Tom wrote:

> In the slony1 log of our primary host (the same one which later showed the bug) we had:
>
> 2018-02-21 19:19:49 UTC [22582] INFO   remoteWorkerThread_8: SYNC 5002075962 done in 1.725 seconds
> 2018-02-21 19:19:50 UTC [22582] CONFIG disableNode: no_id=8
>
> The timestamp of the problem event was somewhat earlier: 2018-02-21 19:19:41.
> To me it looks like there is a race condition: despite the order of the log
> events, the DROP NODE was processed first and the SYNC was accepted later,
> and thus created an sl_event record. Of course, that’s simply a guess.

This might be possible. If the node being dropped is 8 and the event node for
the DROP NODE is 4, some third node, say 3, might have a sequence like this:

remoteWorkerThread_8: starts processing the SYNC
remoteWorkerThread_4: processes the DROP NODE, updates the runtime config
remoteWorkerThread_8: finishes processing the SYNC and COMMITs

I'd feel more comfortable if we had a way of reproducing this, though. One
option might be to have the remoteWorkerThread recheck the runtime config
before the COMMIT, to make sure that the node is still active.

> Tom(
>
> [...]
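[Editor's sketch] A possible way to create that delay from the database side, without patching slon: an untested sketch. It assumes that logApplySaveStats() writes to sl_apply_stats (so the lock stalls the SYNC transaction after its sl_event insert but before its COMMIT) and it reuses the node IDs and cluster name from this thread.

  -- Untested repro sketch. Session 1 (psql on the node that later
  -- shows the leftover row):
  BEGIN;
  -- EXCLUSIVE mode conflicts with the ROW EXCLUSIVE lock that
  -- logApplySaveStats() needs, so the in-flight SYNC transaction
  -- stalls here after it has already inserted its sl_event row.
  LOCK TABLE "_ams_cluster".sl_apply_stats IN EXCLUSIVE MODE;
  -- While the SYNC is stalled, submit the DROP NODE via slonik and
  -- wait for "disableNode: no_id=8" in this node's slon log.
  -- dropNode_int's DELETE on sl_event cannot see the still-uncommitted
  -- SYNC row, so it deletes nothing for it.
  COMMIT;  -- release the lock; the stalled SYNC now commits, and its
           -- sl_event row materializes after the drop: the leftover.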
Re: [Slony1-general] slony1 drop node failure
On Mon, 26 Feb 2018, Tignor, Tom wrote:

> In the slony1 log of our primary host (the same one which later showed the bug) we had:
>
> 2018-02-21 19:19:49 UTC [22582] INFO   remoteWorkerThread_8: SYNC 5002075962 done in 1.725 seconds
> 2018-02-21 19:19:50 UTC [22582] CONFIG disableNode: no_id=8
>
> The timestamp of the problem event was somewhat earlier: 2018-02-21 19:19:41.
> To me it looks like there is a race condition: despite the order of the log
> events, the DROP NODE was processed first and the SYNC was accepted later,
> and thus created an sl_event record. Of course, that’s simply a guess.

This might be possible. If the node being dropped is 8 and the event node for
the DROP NODE is 4, some third node, say 3, might have a sequence like this:

remoteWorkerThread_8: starts processing the SYNC
remoteWorkerThread_4: processes the DROP NODE, updates the runtime config
remoteWorkerThread_8: finishes processing the SYNC and COMMITs

I'd feel more comfortable if we had a way of reproducing this, though. One
option might be to have the remoteWorkerThread recheck the runtime config
before the COMMIT, to make sure that the node is still active.

> Tom(
>
> [...]
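[Editor's sketch] A minimal sketch of what that pre-COMMIT recheck could query, assuming sl_node's no_active flag reflects disableNode; this illustrates the idea only and is not the actual bug375 patch:

  -- Hypothetical pre-COMMIT guard in the remote worker's transaction:
  -- is the origin of the in-flight SYNC (node 8 here) still active?
  SELECT no_active
    FROM "_ams_cluster".sl_node
   WHERE no_id = 8;
  -- Zero rows, or no_active = false: the node was dropped or disabled
  -- while the SYNC was being applied, so the worker should ROLLBACK
  -- rather than COMMIT and re-create rows dropNode_int just deleted.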
Re: [Slony1-general] slony1 drop node failure
In the slony1 log of our primary host (the same one which later showed the bug) we had:

2018-02-21 19:19:49 UTC [22582] INFO   remoteWorkerThread_8: SYNC 5002075962 done in 1.725 seconds
2018-02-21 19:19:50 UTC [22582] CONFIG disableNode: no_id=8

The timestamp of the problem event was somewhat earlier: 2018-02-21 19:19:41. To me it looks like there is a race condition: despite the order of the log events, the DROP NODE was processed first and the SYNC was accepted later, and thus created an sl_event record. Of course, that’s simply a guess.

Tom(

On 2/26/18, 12:01 PM, "Steve Singer" wrote:

On Mon, 26 Feb 2018, Tignor, Tom wrote:

> Thanks. I see the deletes added for sl_seqlog and sl_log_script. The
> constraint violation appearing in the errors was for sl_event. Do we
> expect these changes to fully remove all state, including sl_event? The
> checkNodeDeleted function doesn’t look at sl_event.

It could be that this is a different issue, then. That function (the
dropNode_int) should have removed the rows from sl_event. The question then
becomes: did it not remove them for some reason, or did they get added back
later, and if so, how?

> Tom(
>
> [...]
Re: [Slony1-general] slony1 drop node failure
On Mon, 26 Feb 2018, Tignor, Tom wrote:

> Thanks. I see the deletes added for sl_seqlog and sl_log_script. The
> constraint violation appearing in the errors was for sl_event. Do we
> expect these changes to fully remove all state, including sl_event? The
> checkNodeDeleted function doesn’t look at sl_event.

It could be that this is a different issue, then. That function (the
dropNode_int) should have removed the rows from sl_event. The question then
becomes: did it not remove them for some reason, or did they get added back
later, and if so, how?

> Tom(
>
> [...]
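[Editor's sketch] One way to probe the "added back later" half of that question is a diagnostic on PostgreSQL's xmin system column (hypothetical, and only a rough ordering hint given xid wraparound):

  -- Which transaction inserted the leftover row? If its xmin is newer
  -- than the transaction that ran dropNode_int (compare against the
  -- xmin of a row the drop is known to have written), the row was
  -- re-created after the DELETE, which points at the race.
  SELECT xmin AS inserting_xid, ev_origin, ev_seqno, ev_timestamp
    FROM "_ams_cluster".sl_event
   WHERE ev_origin = 8;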
Re: [Slony1-general] slony1 drop node failure
Thanks. I see the deletes added for sl_seqlog and sl_log_script. The constraint violation appearing in the errors was for sl_event. Do we expect these changes to fully remove all state, including sl_event? The checkNodeDeleted function doesn’t look at sl_event.

Tom(

On 2/26/18, 11:03 AM, "Steve Singer" wrote:

On Mon, 26 Feb 2018, Tignor, Tom wrote:

You can get it from the github branch (latest commit) at
https://github.com/ssinger/slony1-engine/tree/bug375

> Steve,
> The patch link actually goes to a page which says “Bugzilla is down for
> maintenance.” Is there a way to see the patch currently? Does it exist, or
> is it scheduled for some Slony-I release?
>
> [...]
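[Editor's sketch] For completeness, an audit in the spirit of checkNodeDeleted but covering sl_event as well. This is a sketch: the con_* columns appear earlier in this thread, while seql_origin is assumed from the Slony-I 2.2 schema.

  -- Hypothetical post-DROP NODE audit: every count should be zero once
  -- node 8's state is fully gone.
  SELECT 'sl_event'   AS tab, count(*) AS leftover
    FROM "_ams_cluster".sl_event   WHERE ev_origin = 8
  UNION ALL
  SELECT 'sl_confirm', count(*)
    FROM "_ams_cluster".sl_confirm WHERE con_origin = 8 OR con_received = 8
  UNION ALL
  SELECT 'sl_seqlog',  count(*)
    FROM "_ams_cluster".sl_seqlog  WHERE seql_origin = 8
  UNION ALL
  SELECT 'sl_node',    count(*)
    FROM "_ams_cluster".sl_node    WHERE no_id = 8;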
Re: [Slony1-general] slony1 drop node failure
On Mon, 26 Feb 2018, Tignor, Tom wrote:

You can get it from the github branch (latest commit) at
https://github.com/ssinger/slony1-engine/tree/bug375

> Steve,
> The patch link actually goes to a page which says “Bugzilla is down for
> maintenance.” Is there a way to see the patch currently? Does it exist, or
> is it scheduled for some Slony-I release?
>
> Tom(
>
> [...]
Re: [Slony1-general] slony1 drop node failure
Steve,
The patch link actually goes to a page which says “Bugzilla is down for maintenance.” Is there a way to see the patch currently? Does it exist, or is it scheduled for some Slony-I release?

Tom(

On 2/22/18, 6:06 PM, "Steve Singer" wrote:

On Thu, 22 Feb 2018, Tignor, Tom wrote:

Looks like this?
http://lists.slony.info/pipermail/slony1-general/2016-September/013331.html

I can't remember if that was what prompted
http://lists.slony.info/pipermail/slony1-hackers/2016-December/000560.html

https://github.com/ssinger/slony1-engine/tree/bug375

I can't seem to find a reason why this wasn't committed.

> Hello slony1 community,
>
> [...]
Re: [Slony1-general] slony1 drop node failure
Wow. Yes, this looks a lot like the tmblue 9/10/16 event. I’ll be very interested to take a look at that patch. I haven’t forgotten the 2.2.6 patches I have in flight; I need to finish some other work and pull those off the back burner…

Thanks,

Tom(

On 2/22/18, 6:06 PM, "Steve Singer" wrote:

On Thu, 22 Feb 2018, Tignor, Tom wrote:

Looks like this?
http://lists.slony.info/pipermail/slony1-general/2016-September/013331.html

I can't remember if that was what prompted
http://lists.slony.info/pipermail/slony1-hackers/2016-December/000560.html

https://github.com/ssinger/slony1-engine/tree/bug375

I can't seem to find a reason why this wasn't committed.

> Hello slony1 community,
>
> [...]
Re: [Slony1-general] slony1 drop node failure
On Thu, 22 Feb 2018, Tignor, Tom wrote:

Looks like this?
http://lists.slony.info/pipermail/slony1-general/2016-September/013331.html

I can't remember if that was what prompted
http://lists.slony.info/pipermail/slony1-hackers/2016-December/000560.html

https://github.com/ssinger/slony1-engine/tree/bug375

I can't seem to find a reason why this wasn't committed.

> Hello slony1 community,
>
> We have a head scratcher here. It appears a DROP NODE command was not fully
> processed. The command was issued and confirmed on all our nodes at
> approximately 2018-02-21 19:19:50 UTC. When we went to restore it over two
> hours later, all replication stopped on an sl_event constraint violation.
>
> [...]
[Slony1-general] slony1 drop node failure
Hello slony1 community,

We have a head scratcher here. It appears a DROP NODE command was not fully processed. The command was issued and confirmed on all our nodes at approximately 2018-02-21 19:19:50 UTC. When we went to restore it over two hours later, all replication stopped on an sl_event constraint violation. Investigation showed a SYNC event for the dropped node with a timestamp of just a few seconds before the drop. I believe this is a first for us. The DROP NODE command is supposed to remove all state for the dropped node. Is that right? Is there a potential race condition somewhere which could leave behind state?

Thanks in advance,

master log replication freeze error

2018-02-21 21:38:52 UTC [5775] ERROR  remoteWorkerThread_8: "insert into "_ams_cluster".sl_event (ev_origin, ev_seqno, ev_timestamp, ev_snapshot, ev_type ) values ('8', '5002075962', '2018-02-21 19:19:41.958719+00', '87044110:87044110:', 'SYNC'); insert into "_ams_cluster".sl_confirm (con_origin, con_received, con_seqno, con_timestamp) values (8, 1, '5002075962', now()); select "_ams_cluster".logApplySaveStats('_ams_cluster', 8, '0.139 s'::interval); commit transaction;" PGRES_FATAL_ERROR ERROR:  duplicate key value violates unique constraint "sl_event-pkey"
DETAIL:  Key (ev_origin, ev_seqno)=(8, 5002075962) already exists.
2018-02-21 21:38:52 UTC [13649] CONFIG slon: child terminated signal: 9; pid: 5775, current worker pid: 5775
2018-02-21 21:38:52 UTC [13649] CONFIG slon: restart of worker in 10 seconds

master log replication freeze error

master DB leftover event

a...@ams6.cmb.netmgmt:~$ psql -U akamai -d ams
psql (9.1.24)
Type "help" for help.

ams=# select * from sl_event_bak;
 ev_origin |  ev_seqno  |         ev_timestamp          |    ev_snapshot     | ev_type | ev_data1 | ev_data2 | ev_data3 | ev_data4 | ev_data5 | ev_data6 | ev_data7 | ev_data8
-----------+------------+-------------------------------+--------------------+---------+----------+----------+----------+----------+----------+----------+----------+----------
         8 | 5002075962 | 2018-02-21 19:19:41.958719+00 | 87044110:87044110: | SYNC    |          |          |          |          |          |          |          |
(1 row)

ams=#

master DB leftover event

master log drop node record

2018-02-21 19:19:50 UTC [22582] CONFIG disableNode: no_id=8
2018-02-21 19:19:50 UTC [22582] CONFIG storeListen: li_origin=4 li_receiver=1 li_provider=4
2018-02-21 19:19:50 UTC [22582] CONFIG storeListen: li_origin=7 li_receiver=1 li_provider=7
2018-02-21 19:19:50 UTC [22582] CONFIG storeListen: li_origin=3 li_receiver=1 li_provider=3
2018-02-21 19:19:50 UTC [22582] CONFIG remoteWorkerThread_4: update provider configuration
2018-02-21 19:19:50 UTC [22582] CONFIG remoteWorkerThread_4: connection for provider 4 terminated
2018-02-21 19:19:50 UTC [22582] CONFIG remoteWorkerThread_4: disconnecting from data provider 4
2018-02-21 19:19:50 UTC [22582] CONFIG remoteWorkerThread_4: connection for provider 7 terminated

master log drop node record

replica log drop node record

2018-02-21 19:19:51 UTC [22650] WARN   remoteWorkerThread_1: got DROP NODE for local node ID
NOTICE:  Slony-I: Please drop schema "_ams_cluster"
2018-02-21 19:19:53 UTC [22650] INFO   remoteWorkerThread_7: SYNC 5001868819 done in 2.153 seconds
NOTICE:  drop cascades to 243 other objects
DETAIL:  drop cascades to table _ams_cluster.sl_node
drop cascades to table _ams_cluster.sl_nodelock
drop cascades to table _ams_cluster.sl_set
drop cascades to table _ams_cluster.sl_setsync
drop cascades to table _ams_cluster.sl_table
drop cascades to table _ams_cluster.sl_sequence

replica log drop node record

Tom☺

___
Slony1-general mailing list
Slony1-general@lists.slony.info
http://lists.slony.info/mailman/listinfo/slony1-general
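[Editor's sketch] A general sweep for the "leave behind state" question, usable on any node. This sketch relies only on the sl_event and sl_node columns shown above and lists events whose origin node no longer exists, which is exactly the kind of state a DROP NODE should have removed.

  -- Find sl_event rows whose origin node is gone from sl_node.
  SELECT e.ev_origin,
         count(*)            AS leftover_events,
         min(e.ev_timestamp) AS oldest,
         max(e.ev_timestamp) AS newest
    FROM "_ams_cluster".sl_event e
    LEFT JOIN "_ams_cluster".sl_node n ON n.no_id = e.ev_origin
   WHERE n.no_id IS NULL
   GROUP BY e.ev_origin;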